M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following...

14
Under review as a conference paper at ICLR 2016 S HERLOCK : MODELING S TRUCTURED K NOWLEDGE IN I MAGES Mohamed Elhoseiny 1,2 , Scott Cohen 1 , Walter Chang 1 , Brian Price 1 , Ahmed Elgammal 2 1 Adobe Research 2 Department of Computer Science, Rutgers University ABSTRACT How to build a machine learning method that can continuously gain structured visual knowledge by learning structured facts? Our goal in this paper is to address this question by proposing a problem setting, where training data comes as struc- tured facts in images with different types including (1) objects(e.g., <boy>), (2) attributes (e.g., <boy,tall>), (3) actions (e.g., <boy, playing>), (4) interactions (e.g., <boy, riding, a horse >). Each structured fact has a semantic language view (e.g., < boy, playing>) and a visual view (an image with this fact). A human is able to efficiently gain visual knowledge by learning facts in a never ending process, and as we believe in a structured way (e.g., understanding “playing” is the action part of < boy, playing>, and hence can generalize to recognize <girl, playing > if just learn <girl> additionally). Inspired by human visual percep- tion, we propose a model that is (1) able to learn a representation, we name as wild-card, which covers different types of structured facts, (2) could flexibly get fed with structured fact language-visual view pairs in a never ending way to gain more structured knowledge, (3) could generalize to unseen facts, and (4) allows retrieval of both the fact language view given the visual view (i.e., image) and vice versa. We also propose a novel method to generate hundreds of thousands of structured fact pairs from image caption data, which are necessary to train our model and can be useful for other applications. 1 I NTRODUCTION It is a capital mistake to theorize in advance of the facts. -Sherlock Holmes (A Scandal in Bohemia) Let us imagine a scene with the following facts: <man>, <baby>, <toy>, <man, smiling>, <baby, smiling>, <baby, sitting on, chair>, <man, sitting on, chair>, <baby, sitting on, chair>, <baby, holding, toy>, <man, feeding, baby>. We might expect that the imagined scene will be very close to the image in Fig. 1 due to the precise structured description. On the other hand, if we were given the same image and asked to describe it, we might expect only a short title “man feeding a baby”. Providing the given structured facts from this image assumes a detective’s eye that look for structured details that we aim to model. State-of-the-art captioning methods (e.g., Karpathy & Fei-Fei (2015); Vinyals et al. (2015); Xu et al. (2015); Mao et al. (2015)) rely on the idea of generating a sequence of words given an image of a scene, inspired by the success of sequence to sequence training of neural nets in machine translation systems (e.g., Cho et al. (2014)). While it is an impressive step, the mechanism of these captioning systems makes them incapable of conveying structured information in an image and providing a confidence of the generated caption given the facts in the image. It might also provide a limited description like “man feeding a baby”, which makes the image search difficult on the other direction due to lack of representation. Captions and unstructured tags are mainly a vehicle to communicate facts with humans. However, they may not be the best way to represent that knowledge in a way that is searchable for the machine. There are advantages to having explicit, structured knowledge for image search. If one searches for images of a “red flower”, a bag-of-words approach that considers ”red” and “flower” separately may return images of flowers that are not red but have red elsewhere in the image. It is important to know that a user is looking for the fact <flower,red>. Modeling the connection between the provided structured facts in its language form and its visual view (i.e., an image containing it) facilitates gaining richer visual knowledge, which is our focus in this paper. Several applications can make use of modeling 1 arXiv:1511.04891v1 [cs.CV] 16 Nov 2015

Transcript of M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following...

Page 1: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

SHERLOCK: MODELING STRUCTURED KNOWLEDGEIN IMAGES

Mohamed Elhoseiny1,2, Scott Cohen1, Walter Chang1, Brian Price1, Ahmed Elgammal21Adobe Research2Department of Computer Science, Rutgers University

ABSTRACT

How to build a machine learning method that can continuously gain structuredvisual knowledge by learning structured facts? Our goal in this paper is to addressthis question by proposing a problem setting, where training data comes as struc-tured facts in images with different types including (1) objects(e.g., <boy>), (2)attributes (e.g., <boy,tall>), (3) actions (e.g., <boy, playing>), (4) interactions(e.g., <boy, riding, a horse >). Each structured fact has a semantic language view(e.g., < boy, playing>) and a visual view (an image with this fact). A humanis able to efficiently gain visual knowledge by learning facts in a never endingprocess, and as we believe in a structured way (e.g., understanding “playing” isthe action part of < boy, playing>, and hence can generalize to recognize <girl,playing > if just learn <girl> additionally). Inspired by human visual percep-tion, we propose a model that is (1) able to learn a representation, we name aswild-card, which covers different types of structured facts, (2) could flexibly getfed with structured fact language-visual view pairs in a never ending way to gainmore structured knowledge, (3) could generalize to unseen facts, and (4) allowsretrieval of both the fact language view given the visual view (i.e., image) andvice versa. We also propose a novel method to generate hundreds of thousandsof structured fact pairs from image caption data, which are necessary to train ourmodel and can be useful for other applications.

1 INTRODUCTION

It is a capital mistake to theorize in advance of the facts. -Sherlock Holmes (AScandal in Bohemia)

Let us imagine a scene with the following facts: <man>, <baby>, <toy>, <man, smiling>,<baby, smiling>, <baby, sitting on, chair>, <man, sitting on, chair>, <baby, sitting on, chair>,<baby, holding, toy>, <man, feeding, baby>. We might expect that the imagined scene will bevery close to the image in Fig. 1 due to the precise structured description. On the other hand, if wewere given the same image and asked to describe it, we might expect only a short title “man feedinga baby”. Providing the given structured facts from this image assumes a detective’s eye that lookfor structured details that we aim to model. State-of-the-art captioning methods (e.g., Karpathy& Fei-Fei (2015); Vinyals et al. (2015); Xu et al. (2015); Mao et al. (2015)) rely on the idea ofgenerating a sequence of words given an image of a scene, inspired by the success of sequence tosequence training of neural nets in machine translation systems (e.g., Cho et al. (2014)). While it isan impressive step, the mechanism of these captioning systems makes them incapable of conveyingstructured information in an image and providing a confidence of the generated caption given thefacts in the image. It might also provide a limited description like “man feeding a baby”, whichmakes the image search difficult on the other direction due to lack of representation. Captions andunstructured tags are mainly a vehicle to communicate facts with humans. However, they may notbe the best way to represent that knowledge in a way that is searchable for the machine. There areadvantages to having explicit, structured knowledge for image search. If one searches for imagesof a “red flower”, a bag-of-words approach that considers ”red” and “flower” separately may returnimages of flowers that are not red but have red elsewhere in the image. It is important to know that auser is looking for the fact <flower,red>. Modeling the connection between the provided structuredfacts in its language form and its visual view (i.e., an image containing it) facilitates gaining richervisual knowledge, which is our focus in this paper. Several applications can make use of modeling

1

arX

iv:1

511.

0489

1v1

[cs

.CV

] 1

6 N

ov 2

015

Page 2: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

that connection, such as structured fact tagging, high precision image search from text, generatingcomprehensive descriptions of complicated scenes, and making higher level reasoning about a scene.

How to model structured facts to enable machines to gain structured visual knowledge?

Figure 1: Sherlock Problem: Gaining Structured Visual Knowledge

Another motivating per-spective of modeling struc-tured facts is to look atgaining visual knowledgeas learning more and morestructured facts since it isboth richer and allows bet-ter generalization to unseenfacts. For example if amodel learned during train-ing about <girl, smiling>,<boy, petting, a dog>, <girl, riding, a horse>, and <boy>, it should be able to recognize the fact<boy, petting, a horse> even though it did not see that fact before. In our work, we aim to coverthree types of facts by the same model, which we define as follows. First order facts in images aredefined by the set of objects and the scene in an image (e.g,. <baby>, <toy>, <man>). Secondorder facts are defined by attributes and single-frame action that might be performed by the objects(e.g., < baby, smiling>, <man, dancing>). An object interacting with another object defines a thirdorder fact in the image (e.g., <man, feeding, baby >, <baby, sitting on, chair>). We denote thefirst, second, and third order facts by <Subject>, <Subject, Predicate>, and <Subject, Predicate,Object> respectively, abbreviated as <S>, <S,P>, and <S,P,O>; see Fig. 1. Inspired by the mod-ifier concept in language grammar, we model higher order facts as visual modifiers of the low orderfacts. For example, <baby, smiling> is applying <smiling> visual modifier of <baby>. Based onthis notion, we propose a model for learning representation of structured fact covering its differentorders, and can continuously be fed with these facts to gain structured knowledge. Specifically, bothlanguage and visual views of a structured fact inhabit in a continuous space. Modeling structuredfacts in a continuous space allows us to extend the gained knowledge from the training facts to un-seen facts. Since the proposed setting is aiming for a model that is an eye for details and potentiallyallows higher order reasoning (Fig. 1), we denote this problem by the Sherlock Problem.

There are indeed other research problems that aim to explicity understand facts about an image. Themain paradigm in object (e.g., Simonyan & Zisserman (2015)), scene (e.g., Zhou et al. (2014)), andactivity categorization (e.g., Gkioxari & Malik (2015)) methods is to have a set of discrete categoriesthat the system can recognize. This faces a scalability problem since adding a new category meanschanging the architecture of the model by adding thousands or parameters and re-training the model(e.g., needed for adding a new output node). This is also shared with recent attribute based meth-ods(e.g., Chen & Grauman (2014)) that realizes the idea that attribute appearance is dependent onthe class as opposed to earlier models (e.g., Lampert et al. (2009)), which motivates them to jointlylearn a classifier for every class attribute pair suffering from scalability problems as the number ofthe classes and attributes grows. Attributed classes is a subset of the second order facts that we aimto cover together with other types of facts. Our goal is to model structured facts in a way that isscalable, i.e., to avoid the need of changing the model while adding more facts. Furthermore, weaim that the model can gain structured visual knowledge by being continuously fed with instancesof structured facts, that could be of different types (e.g., <woman, smiling>, or <person, playing,soccer>, < man, tall>, <kid>) with example image(s). At any point the machine does not havea fixed dictionary of trained object, scene, activity categories. From this point, we might refer to astructured fact as simply a fact for simplicity.

Recent works in language and vision involves using unannotated text to improve object recognitionand/or achieve zero-shot learning. In Frome et al. (2013); Norouzi et al. (2014) and Socher et al.(2013), word embedding language models (e.g., Mikolov et al. (2013)) was adopted to representclass names as vectors. Their framework is based on mapping images into the learned languagemodel then perform classification in that space. Other works models the opposite direction by map-ping unstructured text descriptions for classes into a visual classifier Elhoseiny et al. (2013); Ba et al.(2015). In contrast to these works, our goal is to gain visual knowledge for not only first order facts(i.e., objects<S>) but also for second<S,P>, and third order facts<S,P,O>, and also maintain thestructure of the facts. Let’s assume that |S|, |P|, and |O| denotes the number of subjects, predicates,

2

Page 3: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

objects respectively. The scale of unique second and third order facts is bounded by |S| × |P|, and|S| × |P| × |O| possibilities respectively that could easily reach hundreds of thousands or millionsof unique facts and needs careful attention while designing a model maintaining the structure weaim at. Some of the earlier captioning systems involves learning an intermediate object, verb, sub-ject representation of images prior to generating captions, e.g. Kulkarni et al. (2013); Yu & Siskind(2013) however, they fall into the aforementioned scalability limitation.

Very recently, Johnson et al. (2015) proposed an interesting work to retrieve images using scenegraphs, which captures similar object attribute and relationship information to our representation.But the Scene Graph work does not provide a method to automatically extract a scene graph foran image, but rather manually obtained a dataset for 5000 scenes. It can only solve for the bestgrounding of a given a manually annotated scene graph to an image, which is key difference to ourwork together with our learning representation. In addition, our work is on a much larger scale andmodels two-way retrieval.

How to collect structured fact annotations needed to train a Sherlock Model?: In order to train amodel for our setting, we needed to collect structured fact annotations in the form of language view,visual view pairs(i.e., <baby, sitting on, chair> as the language a view and an image with this factas a visual view), which is a challenging task. We started by manually annotating and mining severalexisting datasets to extract structured fact annotations, which we found limiting for both coveringdifferent types of facts and number of image as detailed later in Sec. 3. One of the most interestingrelevant works is Never Ending Image Learner (NEIL) Chen et al. (2013), where they showed thatvisual concepts predefined in an ontology can be learnt by collecting its training data from the web.In a follow-up work, Divvala et al. (2014) similarly web-collected images for concepts related toa predefined object using Google-N-gram data. This opens the question of whether we can collectstructured fact annotations from the web. There are two issue that we face for our setting. First,it is hard to define the space of structured visual knowledge and then search for it. Second, usingGoogle image search is not reliable to collect data for concepts with less images in the web. Themain assumption for this method depends on both the likelihood that the top retrieved image belongsto the searched concept, and the availability of image annotated with the searched concept. Theseproblems motivated us to propose a novel method to automatically annotate structured facts byprocessing images-caption data. Our Sherlock Automatic Fact Annotation (SAFA) started by pro-cessing challenging unconstrained noisy captions to extract fact language view by multiple methods.Then, we grounded the facts extracted from captions to image regions as fact visual view. There aretwo advantages of collecting structured fact annotations by the proposed method. First, it assumesthat structured facts in the captions is likely to be located in the given image, which is expected tooccur in image-caption data with a very high probability compared to the assumptions in Chen et al.(2013); Divvala et al. (2014) and makes it likely to produce accurate annotations as we show in ourresults. Second, it could be used to collect knowledge very cheaply for several tasks. Our automaticannotations resulted in hundreds of thousands of images and tens of thousands of unique knowledgeannotations, needed to feed our model, in just several hours with our current implementation whosespeed is far from optimalThe contributions of this paper are as follows:• We introduce the Sherlock problem of structured knowledge modeling in an image and propose

one model that can learn structured facts of different types and perform both-view retrieval (retrieve structured fact language view (i.e. <S:people, P:walking , >) given the visual view (i.e.image) and vice versa).

• We propose an automatic stuctured fact annotation method based on sophisticated Natural Lan-guage Processing methods for acquiring high quality structured fact annotation pairs at large scalefrom free-form image descriptions. We applied the pipeline to MS COCO Lin et al. (2014) andFlickr30K Entities Plummer et al. (2015); Young et al. (2014) image caption datasets. In total, webuild a structured fact dataset of more than 816, 000 language&image-view fact pairs coveringmore than 202, 000 unique facts in the language view.

• We develop a novel learning representation network architecture to model the jointly model struc-tured fact language and visual views by mapping both views into a common space and using awild card loss to uniformly represent first, second, and third order facts.

Our modeling approach is scalable to new facts without any change to the network architecture.

2 PROBLEM DEFINITION

3

Page 4: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

Figure 2: Problem Definition

We aim at modeling structured knowledge inimages as a problem that comes with views, onein the visual domain V and one in the languagedomain L. We start by introducing some no-tation. Let f be a structured fact, fv ∈ V de-notes the view of f in the visual domain, andfl ∈ L denotes the view of f in the languagedomain. For instance, an annotated fact, withlanguage view fl =<S:girl, P:riding, O:bike>would have a corresponding visual view fv asan image where this fact occurs; see example inFig. 2.

Our goal is to learn a representation that coversfirst-order facts <S> (objects), second-order facts <S,P> (actions and attributes), and third-orderfacts <S,P,O> (interaction and positional facts). We represent all types of facts as an embeddingproblem into what we call “structured fact space”. We define “structured fact space” as a learningrepresentation of three hyper-dimensions that we denote as φS ∈ RdS , φP ∈ RdP , and φO ∈ RdO(Fig. 2). We denote the embedding functions from a visual view of a fact fv to φS , φP , and φO asφVS (fv), φVP (fv), and φVO(fv), respectively. Similarly, we denote the embedding functions of from alanguage view of a fact fl to φS , φP , and φO as φLS(fl), φLP (fl), and φLO(fl), respectively. We denotethe concatenation of the visual view hyper-dimensions’ embedding as φV(fv), and the languageview hyper-dimensions’ embedding as φL(fl), where φV(fv) and φL(fl) are visual embedding andlanguage embedding of f , respectively:

φV(fv) = [φVS (fv), φVP (fv), φ

VO(fv)], φ

L(fl) = [φLS(fl), φLP (fl), φ

LO(fl)] (1)

It is not hard to see that third-order facts <S,P,O> can be directly embedded to the structured factspace by φV(fv) for the image view and φL(fl) for the language view. A question remains how first-and second-order facts be represented in Eq. 1, so that we have a unified fact learning model thatcovers facts of all orders. In the rest of this section, we present the notion of fact modifiers and howthey are reflected in our proposed learning representation.

2.1 HIGH ORDER FACTS AS MODIFIERS

First-order facts are facts that indicate an object like <S: person>. Second-order facts gets morespecific about the subject (e.g. <S: person, P: playing>). Third-order facts get even more specific(<S: person, P: playing, O: piano>). Inspired by the concept of modifiers in language grammar,we propose to define higher order facts as lower order facts with an additional modifier applied toit. For example, adding the modifier P: eating to the fact <S: kid>, constructs the fact <S: kid, P:eating>. Further applying the modifier O: ice cream to the fact <S: kid, P: eating>, construct thefact <S: kid, P: eating, O: ice cream>. Similarly, attributes could be seen as modifiers to a subject(e.g., applying P: smiling to fact <S: baby> constructs the fact <S: baby, P: smiling>).

2.2 WILD-CARD REPRESENTATION

Based on our “fact modifier” observation, we propose to represent both first- and second-order factsas wild cards, as illustrated in Eq. 2 and 3 for first-order and second-order facts, respectively. Wedenote the wild-card modifier by “∗”.

First-Order Facts wild-card representation <S>

φV(fv) = [φVS (fv), φVP (fv) = ∗, φVO(fv) = ∗], φL(fl) = [φLS(fl), φ

LP (fl) = ∗, φLO(fl) = ∗] (2)

Second-Order Facts wild-card representation <S, P>

φV(fv) = [φVS (fv), φVP (fv), φ

VO(fv) = ∗], φL(fl) = [φLS(fl), φ

LP (fl), φ

LO(fl) = ∗] (3)

Setting φP and φO to ∗ for first-order facts is interpreted to mean that the P and O modifiers are notof interest for first-order facts, which is intuitive. Similarly, setting φO to ∗ for second-order factsindicates that the O modifier is not of interest for single-frame actions and attributes.

4

Page 5: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

Lower order facts do not necessarily mean that a higher order fact does not exist in it. For example,<person> fact in an image does not mean that he is not performing an action or has a particularattribute like tall. It rather means that we don’t know. Hence, the wild cards (i.e. ∗) of the lowerorder facts are not penalized during training in our loss, as we illustrate later at the end of Sec. 4.We name both first and second-order facts as wild-card fact.

3 DATA COLLECTION OF STRUCTURED FACTS

In order to train a machine learning model that connects the structured fact language view in Lwith its visual view in V , we need to collect large scale data in the form of (fv , fl) pairs. Datacollection especially for large scale problems has become an increasingly challenging task. It isfurther challenging in our setting since our knowledge model relies on the localized association ofa structured language fact fl with an image fv when such facts occur. In particular, it is a complextask to collect annotations especially for second-order facts < S,P > and third-order facts <S, P,O>. Also, multiple structured language facts could be assigned to the same image (e.g., <S: man,P: smiling>, and < S :man, P: wearing, O: glass>. If these facts refer to the same man, the sameimage example could be used to learn about both facts.

Table 1: Our fact augmentation of six existing datasetsUnique Facts Number of Images

< S > . < S,P >. < S,P,O > . total facts < S > < S,P > < S,P,O > total imagesINTERACT 0 0 60 60 0 0 3171 3171

VisualPhrases 11 4 17 32 3594 372 1745 5711Stanford40 0 11 29 40 0 2886 6646 9532

PPMI 0 0 24 24 0 0 4209 4209SPORT 14 0 6 20 398 0 300 698

Pascal Actions 0 5 5 10 0 2640 2663 5303Union 25 20 141 186 3992 5898 18734 28624

We began our data collection byaugmenting existing datasets withfact language view labels (i.e.,fl):PPMI Yao & Fei-Fei (2010),Stanford40 Yao et al. (2011), Pas-cal Actions Everingham et al.,Sports Gupta (2009), Visual Phrases Sadeghi & Farhadi (2011), INTERACT Antol et al. (2014)datasets. The union of these 6 datasets resulted in 186 facts with 28,624 images as broken out inTable 1.

We also extracted structured facts from the Scene Graph dataset Johnson et al. (2015) which hasmanually annotated 5000 images by MTurkers in a graph structure from which first-, second-,and third-order relationship can be easily extracted. We extracted 110,000 second-order facts and112,000 third-order facts. The majority of these are positional relationships, and covers only 5000scenes.

We further propose a method to automatically collect structured fact annotations from datasets thatcome in the form of image-caption pairs, which can more quickly provide useful facts (Sec. 3.1).Our proposed method opens the door for easy continual fact collection in the future from captiondatasets and even the web or in general any naturally occurring documents with captioned images.

3.1 SHERLOCK AUTOMATIC FACT ANNOTATION (SAFA) FROM IMAGES WITH CAPTIONS

While crowd sourcing can be used to provide additional data, our task of labelling second- andthird-order facts is complex, and acquiring enough data would be expensive.

Instead, we automatically obtain a large quantity of high quality facts from caption datasets usingnatural language processing methods. Since caption writing is free-form and an easy task for crowd-sourcing workers, such free-form descriptions are readily available in existing image caption datasets(e.g. MS COCO Lin et al. (2014)) and more data can be efficiently collected in the future for

Figure 3: Sherlock Automatic Fact Annotation (SAFA)

5

Page 6: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

improved learning. We can also mine existing tagged images from sources such as Flickr thattypically have single word, short phrase, or single sentence image descriptions.

In our work, we focused on the MS COCO image caption dataset Lin et al. (2014) and the newlycollected Flickr30K entities Plummer et al. (2015) to both collect and automatically generate newstructured fact annotations. In contrast to having annotations for only 5000 scenes (e.g., in themanually annotated Scene graph dataset), we automatically extracted facts from more than 600,000captions associated with 120,000 MS COCO images, and facts from 150,000 captions associatedwith 30,000 scenes for the Flickr30K dataset; this provided us with 30 times more additional cover-age over the initial structured facts that were associated with the provided scene data.

We propose SAFA as a two step automatic annotation process: (i) fact extraction from captions, (ii)fact localization in images.

First, the captions associated with the given image are analyzed to extract sets of clauses that areconsidered as candidate < S,P >, and < S,P,O > facts in the image. We extract clauses usingtwo state-of-the-art methods: Sedona sed (2015) and Clausie Del Corro & Gemulla (2013). Clausesform facts but are not necessarily facts by themselves.

Captions can provide a tremendous amount of information to image understanding systems. How-ever, developing NLP systems to accurately and completely extract structured knowledge from free-form text is an open research problem. We addressed several challenging linguistic issues by evolv-ing our NLP pipeline to: 1) correct many common spelling and punctuation mistakes, 2) resolveword sense ambiguity within clauses, and 3) learn both a common spatial preposition lexicon (e.g.,”next to”, ”on top of”, ”in front of”) that consists of over 110 such terms, as well as a lexicon ofover two dozen collection phrase adjectives (e.g., ”group of”, ”bunch of”, ”crowd of”, ”herd of”).These strategies allowed us to extract more and more interesting structured knowledge for learningto understand our images.

Second, we try to localize these clauses within the image (see Fig. 3). The subset of clauses that aresuccessfully located in the image are saved as additional fact annotations for training our model. Wecollected 146,515, 157,122, and 76,772 annotations from Flickr30K Entities, MS COCO training,and validation sets, respectively. While we ignored some types of clauses that were likely to produceincorrect annotations, we achieved a total of 380,409 second- and third-order fact annotations. Withfuture improvement and by carefully considering more clauses, SAFA can potentially collect manymore such annotations.

We note that the process of localizing facts in an image is constrained by information in the dataset.For MS COCO, the dataset contains object annotations for about 80 different objects as providedby the training and validation sets. This allowed us to localize first-order facts for objects usingbounding box information. In order to locate higher-order facts in images, we started by definingvisual entities. In case of the MS COCO dataset Lin et al. (2014), we define a visual entity as anynoun that is either (1) one of Microsoft COCO dataset objects, (2) a noun in the WordNet ontologyMiller (1995); Leacock & Chodorow (1998) that is an immediate or indirect hyponym of one of theMS COCO objects. Since, wordNet is searchable by a sense not a word, we performed word sensedisambiguation on the sentences using a State-of-the-art method Zhong & Ng (2010), or (3) one ofthe SUN dataset scenes Xiao et al. (2010). We expect visual entities to be appear either in the Sor the O part (if it exists) of a candidate fact fl. This allows us to then localize facts for images inthe MS COCO dataset. Given a candidate third-order fact, we first try to assign each S and O toone of the visual entities. If S and O elements are not visual entities, then the clause is ignored.Otherwise, the clauses are processed by several heuristics, detailed in the supplementary materials.Our heuristics take into account whether the subject or the object is singular or plural, or a scene.E.g.,instance, for clauses in the fact< S :men, P : chasing, O : soccer ball>, our method takes intoaccount that “men” may require the union of multiple candidate bounding boxes, while for “soccerball”, it is expected that there is a single bounding box.

In Flickr30K Entities dataset Plummer et al. (2015), the bounding box annotations are presented asphrase labels for sentences (for each phrase in a caption that refers to an entity in the scene). Adifference in processing occurs due to the definition of a visual entity in Flickr30K. In this dataseta visual entity is considered to be a phrase with a bounding box annotation or one of the SUNscenes. Several heuristics were developed and applied to collect these fact annotations; see oursupplementary materials for details.

6

Page 7: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

Figure 4: Sherlock Models. See Fig. 2 for the full system.

4 SHERLOCK MODELS

Our goal is to propose a two-view structured fact embedding model with four properties: (1) canbe continuously fed with new facts without changing the architecture, (2) is able learn with wildcard to support all types of facts, (3) could generalize to unseen facts, (4) allows two way retrieval(i.e., retrieve relevant facts in language view given an image, and retrieve relevant images given afact in a language view). Satisfying these properties can be achieved by using a generative modelp(fv, fl) that connects the visual and the language views of f , where more importantly fv and fvinhabit in a continuous space. Our method is to model p(fv, fl) ∝ s(φV(fv), φ

L(fl)), where s(·, ·) isa similarity function defined over the structured fact space denoted by S, where S is a discriminativespace of facts. Our objective is that two views of the same fact should be embedded so that they areclose to each other. The question now is how to model and train φV(fv), and φL(fl). We choose tomodel φV(fv) as a CNN encoder (e.g., Krizhevsky et al. (2012); Simonyan & Zisserman (2015)),and φL(fl) as RNN encoder (e.g., Mikolov et al. (2013); Pennington et al. (2014)), due to theirrecent success as encoders for images and words, respectively. We propose two models for learningfacts, denoted by Model 1 and Model 2. Model 1 and 2 shares the same structured fact languageembedding/encoder but differ in the structured fact image encoder.

We start by defining an activation operator ψ(θ, a), where a is an input, and θ is a series of one ormore neural network layers (may include different layer types, e.g., convolution, one pooling, thenanother convolution and pooling). The operator ψ(θ, a) applies θ parameters layer by layer to finallycompute the activation of θ subnetwork given a. We will use the operator ψ(·, ·) to define Model 1and Model 2 structured fact image encoders.

Model 1 (structured fact CNN image encoder): In Model 1, a structured fact is visually encodedby sharing Convolutional layer parameters (denoted by θvc ), and fully connected layer parameters(denoted by θuv ); see Fig. 4(a). Then W v

S , W vP , and W v

O transformation matrices are applied toproduce φVS (fv),φVP (fv) , and φVO(fv):

φVS (fv) = WSv ψ(θuv , ψ(θcv, fv)), φ

VP (fv) = WP

v ψ(θuv , ψ(θcv, fv)),

φVO(fv) = WOv ψ(θuv , ψ(θcv, fv))

(4)

Model 2 (structured fact CNN image encoder): In contrast to Model 1, we use different convolu-tional layers for S that for P and O, inspired by the idea that P and O are modifiers to S (Fig. 4(b)).Starting from fv , there is a common set of convolutional layers, denoted by θc0v , then the networksplits into two branches, producing two sets of convolutional layers θcSv and θcPO

v , followed by twosets of fully connected layers θuS

v and θuPOv . Finally φVS (fv),φVP (fv) , and φVO(fv) are computed by

applying W vS , W v

P , and W vO transformation matrices:

φVS (fv) = WSv ψ(θuS

v , ψ(θcSv , ψ(θvc0 , fv))), φVP (fv) = WP

v ψ(θuPOv , ψ(θcPO

v , ψ(θvc0 , fv))),

φVO(fv) = WOv ψ(θuPO

v , ψ(θcPOv , ψ(θvc0 , fv)))

(5)

Structured fact RNN language encoder: Structured fact language view is encoded using RNNword embedding vectors for S, P and, O. Hence, in our case φLS(fl) = RNNθl(f

Sl ), φLP (fl) =

RNNθl(fPl ), φLO(fl) = RNNθl(f

Ol ), where fSl , fPl , and fOl are the Subject, Predicate, and Object

7

Page 8: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

parts of fl ∈ L. For each of them, the literals are dropped and if any of fSl , fPl , or fOl containmultiple words, we compute the average vector as the representation of that part. We denote theRNN language encoder parameters by θL. In our experiments, θl is fixed to a pretrained wordvector embedding model (e.g. Mikolov et al. (2013); Pennington et al. (2014)) for fSl , fPl , and fOl ;see Fig 4(c).

Loss function: One way to model p(fv, fl) for Model 1 and Model 2 is to assume that p(fv, fl) ∝=exp(−lossw(fv, fl)), and minimize lossw(fv, fl) distance loss, which we define as follows

lossw(fv, fl) = wfS · ‖φVS (fv)− φLS(fl)‖2 + wf

P · ‖φVP (fv)− φLP (fl)‖2 + wfO · ‖φVO(fv)− φLO(fl)‖2

(6)

which minimizes the distances between the embedding of the visual view and the language view.Our solution to penalize wild-card facts is to ignore the wild-card modifiers in the loss. Herewf

S = 1,wfP = 1, wf

O = 1 for <S,P,O> facts , wfS = 1, wf

P = 1, wfO = 0 for <S,P> facts, and wf

S = 1,wfP = 0, wf

O = 0 for <S> facts. Hence lossw does not penalize the O modifier for second-orderfacts or the P andO modifiers for first-order facts, which follows our definition a wild-card modifier.

Testing (Two-view retrieval): After a Sherlock Model is trained (either Model 1 or 2), we embedall the test fvs by φV(fv), and all the testing fls with φL(fl). For language view retrieval given animage, we compute the cosine similarity between a given φV(fv) in the test set and all the φL(fl),which indicates relevance for each fl for the given fv . For image retrieval given an fl, we compute thecosine similarity between the given φL(fl) with all φV(fv) in the test set, which indicates relevancefor each fv for the given fl. For wild card facts, the wild-card part is ignored in the similarity.

5 EXPERIMENTS

5.1 SAFA EVALUATION

We start with our evaluation of the automatically collected annotation by the two-step SAFA process,where facts’ language view are first extracted from the caption, then the facts are located in theimage. We propose three questions to evaluate each annotation:

(Q1) Is the extracted fact correct (Yes/No)? The purpose of this question is to evaluate errors cap-tured by the first step, which extract facts by Sedona or Clausie.

(Q2) Is the fact located in the image (Yes/No)? In some cases, there might be a fact mentioned inthe caption that does not exist in the image and mistakenly considered as an annotation.

(Q3) How accurate is the box assigned to a given fact (a to g)? a (about right), b (a bit big), c (a bitsmall), d (too small), e (too big), f (totally wrong box), g (fact does not exist or other). Instructionsof these questions as we illustrated to the participants could be found in this url1

We evaluate these three question for the facts that were successfully assigned a box (i.e. grounded)in the image, since the main purpose of this evaluation is to measure the usability of the collectedannotations as training data for our model. We created an Amazon Mechanical Turk form to askthese three questions to MTurk workers. So far, we collected a total of 8,535 evaluation responses,which are an evaluation of 2845 (fv, fl) pairs, each pair is evaluated by three workers each (3654,2242, and 2639 responses for COCO train, COCO validation, and Flickr30K Entities respectively).

Table 2 shows the evaluation results collected by MTurk workers so far. The results indicate thatthe collected data is useful for training, since 79.0% of the collected data are correct facts withbounding boxes that are either about right, a bit small or a bit big.

Table 2: SAFA Evaluation by MTurk workersQ1 Q2 Q3

Yes No Yes No a b c d e f gMSCOCO validation 88.82 11.18 87.46 12.54 65.28 11.63 3.18 5.32 0.71 1.76 12.12

MSCOCO train 91.49 8.51 90.47 9.53 66.83 8.17 2.06 3.54 0.49 0.36 5.76Flickr30K Entities 88.29 11.71 87.40 12.60 69.58 8.17 2.03 2.08 0.55 0.36 8.80

1https://dl.dropboxusercontent.com/u/479679457/Sherlock_SAFA_eval_Instructions.html

8

Page 9: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

5.2 KNOWLEDGE MODELING EXPERIMENTS

Since our Sherlock models is continuous on both language and visual views of facts, we can per-form two way retrieval from the visual view given the language view and vice versa. We start bypresenting evaluation metrics used in our small, medium scale, and large scale experiments for bothview retrieval

Metrics for visual view retrieval (retrieving fv given fl ): To retrieve an image ( visual view) givena language view like (e.g. <S: person, P: riding, O: horse>), we measure the performance by mAP(Mean Average Precision) and ROC AUC performance on the test set of each designated dataset inthis section. An image fv is considered positive only if there is a pair (fl, fv in the annotations. Evenif the retrieved image is relevant but such pair does not exist, it is considered not correct. We alsouse mAP10, mAP100 variants of the mAP metric that computes the mAP the evaluation based ononly the top 10 or 100 retrieved images, which is useful for evaluating large scale experiments.

Metrics for language view retrieval (retrieving fl given fv ): To retrieve fact language view givenan image. we use top 1, top 5, top 10 accuracy for evaluation. We also used MRR (mean reciprocalranking) metric which is basically 1/r where r is the rank of the correct class. An important issuewith our setting is that there might be multiple facts in the same images. Given that there are Lcorrect facts in the given image to achieve top 1 performance these L facts must all be in the topL retrieved facts. Accordingly, top K retrieved facts means the L facts are in the top L + K − 1retrieved facts. Similar to visual-view retrieval, fact language view fl is considered correct only ifthere is a pair (fl, fv in the annotations.

Evaluation metrics for the Sherlock Problem and especially for scale of several tens of thousandsunique facts and near a million image, which is the scale of our largest experiment. It is not hard tosee that the aforementioned metrics are very harsh especially in the large scale setting. For instance,if the correct fact <S:man,P: jumping> in an image, and our model return <S:person, P:jumping>,these metrics gives a Sherlock model zero credit for this result. Also, the evaluation is limitedto the ground truth fact annotations. There might be several facts in an image but the annotationis only provided one and miss the several others(e.g., an image with <S:man,P: walking>, and<S:man,P: wearing, O: hat>). While these these metrics were used for our quantitative evaluation,we qualitatively found it is harsh for our large scale experiment and we think defining metrics forthe Sherlock problem setting is an interesting area to explore.

Structured fact language Encoder: In all the following experiments, θl is defined as byGloVE840B RNN model Pennington et al. (2014), which is used for encoding structured fact inthe language view for both Model 1 and Model 2 (i.e., φLS(fl), φLP (fl), and φLO(fl)). We used Caffeframework Jia et al. (2014) to implement our models.

Experiments Overview: We performed small& medium experiments in Sec 5.2.1 an 5.2.2, andlarge scale experiments in Sec 5.2.3 to evaluate our work. In the small and the medium scale exper-iments, each fact language view fl has corresponding tens of visual views fv (i.e., images) where asubset is used for training and the other set is used for testing. So, each image we test on belong toa fact that was seen by other images in the training set. The purpose of these experiments is mainlyto contrast against some existing methods with fixed dictionary of facts, and also to compare to witha recent version of the well-known CCA multiview method Gong et al. (2014) that could be alsoapplied in our setting. In the large scale experiment, the collected data form is more than 816,000(fv, fl) pairs, covering more than 202,000 unique facts in the language view fl. The training testingsplit is performed by randomly splitting all the pairs into 80% training pairs and 20% testing pairs.This results in 168,691 testing (fv, fl) pairs with 58,417 unique fl and at test time, where 31,677 outof them are unseen during training. We contrast Model 1 and 2 on small,medium and large scaleexperiments and show that Model 2 is better, as detailed later.

5.2.1 EXPERIMENTS ON PASCAL ACTIONS EVERINGHAM ET AL.

The purpose of this experiment is to show that generative models like our proposed model candiscriminate between classes without having a discrete label assigned. Also, we show howModel 1 compares to a recent version of CCA as a multiview method Gong et al. (2014). Inparticular, we applied four experiments on Pascal Actions dataset using CCA with two differ-ent features, CNN classification by fine-tuning AlexNet Krizhevsky et al. (2012) on Pascal Ac-

9

Page 10: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

tions, and our proposed Model 1. To explore the behavior of CCA on different visual view fea-tures for fv , we performed two CCA experiments using pool5 layer activation features fromAlexNet Krizhevsky et al. (2012) for the first experiment and fc6 activation features for the sec-ond experiment from the same AlexNet. As language view features for fl, we applied ΦL(fl) usingGloVE as language view features. Since CCA Gong et al. (2014), does not support wildcards,we fill the wild-card parts of ΦL(fl) with zeros. For Model 1 in this experiment, we constructed

Table 3: Pascal Actions ExperimentsLanguage View retrieval% Visual View Retrieval

Top 1 Acc Top 5 Acc MRR AP% AP10% AUCCCA- pool5 Gong et al. (2014) 31.34 68.4 48.84 11 24.5 0.53

CCA-fc6 Gong et al. (2014) 20.99 70.2 41.12 13.0 24.1% 0.5619Model 1 (AlexNet as image encoder) 66.52 93.38 78.26 69.47 94.2 0.9077

Top 1 Acc Top 2 Acc Top 3 AccCNN Classification ( Krizhevsky et al. (2012)) 68.7 83.26 89.28 - - -

Chance 10 - -

in θcv as the first five convolutionallayers in AlexNet Krizhevsky et al.(2012) and θuv as the two follow-ing fully connected layers fc6 andfc7. θcv and θvu are initialized withan AlexNet model pretrained on theImageNet dataset Deng et al. (2009).WSv , WP

v , and WOv are initialized randomly.

Table 3 shows the performance of CCA-pool5, CCA-fc6, and Model 1 for retrieval from both thelanguage and the visual views. It is not hard to see that CCA-pool5 outperforms CCA-fc6 forlanguage view retrieval by a margin and CCA-fc6 is slightly worse than CCA-pool5 for visual-view retrieval. Both are clearly better than Chance perfomance for the 10 classes on Pascal12 whichis 10%. We think this behavior is due that pool5 contains more spatial information comparedto fc6 where spatial information is collapsed, and spatial information is important in recognizingactivities. Comparing Model 1 to CCA results, Model 1 results significantly outperforms both CCApool5 and fc6. Our intuition for this result is that Model 1 learns spatial convolution filters whichis not available in CCA, and adapt it to discriminate between actions. Finally, Table 3 also showsthat CNN classification(e.g., Krizhevsky et al. (2012)) performs only slightly better than Model 1.However, CNN classification is not applicable for the setting where facts are unseen, and does notsupport two-view retrieval. We also think that using a more discriminative loss (e.g., ranking loss)compared to the Euclidean loss, used in our experiments, would fill this small gap. Since there are≈100 billion pairs in our large scale experiments (our goal), minimizing the ranking loss is not trivialfor our setting. However, it is an interesting future work to further improve discriminative power ofSherlock Models.

5.2.2 SMALL AND MID-SCALE EXPERIMENTS

We performed several experiments to compare between Model 1 and Model 2 on severaldatasets, which are Stanford40 Yao et al. (2011), Pascal Actions Yao & Fei-Fei (2010), VisualPhrases Sadeghi & Farhadi (2011), and the union of six datasets described earlier in Table 1 inSec. 3. We used the training and test splits defined on the annotations that came of those datasets.For the union of six datasets, we unioned the training and testing annotations to get the final split.

Model 1 and Model 2 setup: For Model 2, θc0v is constructed by the convolutional layers andpooling layer in VGG-16 named conv_1_1 until pool3 layer, which has seven convolution layers.θcSv and θcPO

v are two branches of convolution and pooling layers that have the same architectures asVGG-16 layers named conv_4_1 until pool5 layer which makes six convolution-pooling layersin each branch. Finally, θuS

v and θuPOv are constructed as two instances of fc6 and fc7 layers in

VGG-16 network. While, the two branches of layers share same construction but they are optimizedover different losses as detailed in Sec 4 and will be different when the model gets trained. In contrastto the experiment in the previous section, Model 1 is constructed here from VGG-16 by constructingθcv as the layer conv_1_1 to pool5, and θuv as the two following fully connected layers fc6 andfc7 in VGG-16. WS

v , WPv , and WO

v in both Model 1 and 2 are initialized randomly, and the restare initialized from VGG-16 trained on ImageNet dataset.

Table 4: Small and Medium Scale ExperimentsLanguage View retrieval Visual View retrieval

Top1 Acc Top 5 Acc MRR AP AP10 AP100 AUCStandord40 (40 facts) Model 1 66.29 88.77 76.17 74.45 99.47 92.64 0.97

Model 2 68.80 89.41 77.87 72.81 98.18 91.93 0.97Chance 2.5 - - - - - -

VisualPhrases (31 facts) Model 1 24.90 43.70 35.70 29.55 55.50 36.14 0.92Model 2 25.30 43.80 36.48 30.27 50.43 36.77 0.93Chance 3.2 - - - - - -

Pascal Actions (10 facts) Model 1 64.75 93.49 76.81 80.12 100.00 96.94 0.94Model 2 64.97 93.79 77.04 80.68 100.00 97.21 0.95Chance 10 - - - - - -

6DS (186 facts) Model 1 47.07 64.4 54.18 28.2 51.32 43.13 0.890Model 2 48.5 64.59 54.32 32.40 54.44 45.07 0.94

VGG-16 CNN 51.5 - - - - - -Chance 0.54 - - - - - -

Table 4 show the performance of bothModel 1 and Model 2 on these fourdatasets for both-view retrieval tasks.We may notice that Model 2 worksrelatively better as the dataset size in-creases. The performances of Model1 and 2 are very similar in smalldatasets like Pascal Actions. In the

10

Page 11: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

6DS experiment, we also performed the CNN classification but using VGG-Net for this experiment,which leads to the same conclusion we discussed in the previous experiment in Sec. 5.2.1.

Why Model 2 works better than Model 1? Our intuition behind this result is that Model 2 learnsa different set of convolutional filters θcPO

v to understand the PO branch as visual modifiers. Thismakes a separate bank of filters θcPO

v to learn action/attributes and interaction related concepts, whchis different from the filter bank learnt to discriminate between different subjects for the S branch θcSv .In contrast, Model 1 is trained by optimizing the same bank of filters θcv for SPO altogether, whichmight contradicting to optimize for both S and PO together; see Fig 4.

5.2.3 LARGE SCALE EXPERIMENT

In this experiment, we used all the data described in Sec. 3 including the automatically annotateddata. This data consists mainly of second- and third-order facts. We further augmented this datawith 2000 images for each MSCOCO object (80 classes) as first-order facts. We also used objectannotations in sceneGraph dataset as first-order fact annotations with a maximum of 2000 images perobject. We ignored the facts with spelling mistakes. Finally, we randomly split all the annotation into

Table 5: Coverage of facts in the Large Scale Set-ting

S SP SPO TotalTraining All facts 6116 57681 107472 171269Testing All facts 2733 22237 33447 58417

Training/Test Intersection 1923 13043 11774 26740Test unseen facts 810 9194 21673 31677

Total facts 6,926 66,875 129,145 202,946

80%-20% split, constructing sets of 647,746(fv, fl) training pairs (with 171,269 unique factlanguage views fl) and 168,691 (fv, fl) test-ing pairs (with 58,417 unique fl), for a total of(fv, fl) 816,436 pairs, 202,946 unique fl. Ta-ble 5 shows the coverage of different types offacts in the training and the test split and the in-tersection between them that there is a total of31,677 unique unseen facts out of the 58,417 testing facts in the language view. The majority of thefacts have only one example; see Fig 5 and 6. Model 1 and Model 2 setup is the exactly the sameas defined in Sec. 5.2.2.

500 1000 1500unique facts

0

200

400

600

800

imag

es

1923 <S> facts (total 43477 images)

2000 4000 6000 8000 10000 12000unique facts

0

200

400

600

800

13043 <S,P> facts (total 61734 images)

2000 4000 6000 8000 10000unique facts

0

50

100

150

200

250

11774 <S,P,O> facts (total 30829 images)

Figure 5: 26,740 unique testing facts that have at least one training example (majority of trainingfacts have one of few examples), total of 136,040 images

200 400 600 800unique facts

0

0.5

1

1.5

2

2.5

3

imag

es

810 <S> facts (total 851 images)

2000 4000 6000 8000unique facts

0

1

2

3

49194 <S,P> facts (total 9596 images)

0.5 1 1.5 2unique facts ×10 4

0

1

2

3

421673 <S,P,O> facts (total 22204 images)

Figure 6: 31,677 unique Unseen testing facts (majority of training facts have one of few examples),total of 32,651 images

Since we perform retrieval in both directions, we computed all pairs of similarity between facts andimages. In this setting, it is computationally expensive to compute the similarity between all pairs.We used KD-Trees databases for Approximate Nearest Neighbor (ANN) search. We used FLANNlibrary to create the ANN databases Muja & Lowe (2009), and we restrict to compute the 100 nearestneighbors for fl given fv , and vice versa.

Table 6 show the performance of Model 1 and Model 2. The results indicate that Model 2 is betterthan Model 1 for retrieval from both-view, which is consistent with our medium scale results andour intuition. Model 2 is also multiple orders of magnitude better than chance. Figure 7 and Figure 8show two qualitative examples for both language view and image view retrieval; red facts in lan-guage view means it is not seen in the training data. Figure 7 (right) show <S:airplane, P: flying>example, where the retrieval results works well but the Average Precision performance metric giveus zero score since none of these examples and annotated as <S: airplane, P:flying>, which indi-cates the need to design better metrics for Sherlock problem. One severe problem in the visual view

11

Page 12: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

Figure 7: Examples for Language View retrieval and Visual View Retrieval ( red color facts meansthat fact is not seen during the training)

Figure 8: Examples for Language View retrieval (left) and Visual View Retrieval (right)

retrieval metrics is that the majority of the testing fact language views have only one positive an-notation. Figure 7 (left) shows that <S:dog,P:riding, O:wave> has the closest distance to the givenimage. However, <S:dog,P:riding, O:wave> is never seen in the training data. Figure 8 (right)shows several examples that shows hoe the trained model understand the difference between <man,eats, slice >, and <girl, eating, slice > (i.e., gender). It also understands what “group” means in the< S:group, P: covered, O: mountain> example. Our model is also able to retrieve facts like <girl,using, racket >, which is not seen in the training data.

Table 6: Large Scale Experiment Model 1 and 2

Language View retrieval % Visual view Retrieval %Top1 Acc Top 5 Acc Top 10 Acc MRR mAP100 mAP10

Model 1 13.27 14.19 14.80 7.88 0.61 0.62Model 2 15.41 16.45 17.1 9.60 0.77 0.77Chance 0.0017 - - - - -

6 CONCLUSION

We introduce the Sherlock problem, the problem of associating high-order visual and languagefacts. We present a novel neural network approach for mapping visual facts and language factsinto a common, continuous structured fact space that allows us to associate natural language factswith images and images with natural language structured descriptions. In future work, we plan toimprove upon this model, and well as explore its applications toward high-precision image taggingand search, caption generation, and image knowledge abstraction.

12

Page 13: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

REFERENCES

Sedona. In Sedona Clause Extraction From Sentences. Supplementary Materials, 2015.

Antol, Stanislaw, Zitnick, C Lawrence, and Parikh, Devi. Zero-shot learning via visual abstraction.In ECCV. 2014.

Ba, Jimmy, Swersky, Kevin, Fidler, Sanja, and Salakhutdinov, Ruslan. Predicting deep zero-shotconvolutional neural networks using textual descriptions. In ICCV, 2015.

Chen, Chao-Yeh and Grauman, Kristen. Inferring analogous attributes. In CVPR, 2014.

Chen, Xinlei, Shrivastava, Ashish, and Gupta, Arpan. Neil: Extracting visual knowledge from webdata. In ICCV, 2013.

Cho, Kyunghyun, Van Merrienboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi,Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, 2014.

Del Corro, Luciano and Gemulla, Rainer. Clausie: clause-based open information extraction. InWWW, 2013.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. Imagenet: A large-scalehierarchical image database. In CVPR. IEEE, 2009.

Divvala, Santosh K, Farhadi, Alireza, and Guestrin, Carlos. Learning everything about anything:Webly-supervised visual concept learning. In CVPR, 2014.

Elhoseiny, Mohamed, Saleh, Burhan, and Elgammal, Ahmed. Write a classifier: Zero-shot learningusing purely textual descriptions. In ICCV, 2013.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. ThePASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeff, Mikolov, Tomas, et al.Devise: A deep visual-semantic embedding model. In NIPS, 2013.

Gkioxari, Georgia and Malik, Jitendra. Finding action tubes. In CVPR, 2015.

Gong, Yunchao, Ke, Qifa, Isard, Michael, and Lazebnik, Svetlana. A multi-view embedding spacefor modeling internet images, tags, and their semantics. International journal of computer vision,106(2):210–233, 2014.

Gupta, Abhinav. Sports Dataset. http://www.cs.cmu.edu/˜abhinavg/Downloads.html, 2009. [Online; accessed 15-July-2015].

Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross,Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature em-bedding. In ACM Multimedia, 2014.

Johnson, Justin, Krishna, Ranjay, Stark, Michael, Li, Li-Jia, Shamma, David, Bernstein, Michael,and Fei-Fei, Li. Image retrieval using scene graphs. In CVPR, 2015.

Karpathy, Andrej and Fei-Fei, Li. Deep visual-semantic alignments for generating image descrip-tions. In CVPR. 2015.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep con-volutional neural networks. In NIPS, 2012.

Kulkarni, Gaurav, Premraj, Visruth, Ordonez, Vicente, Dhar, Sudipta, Li, Siming, Choi, Yejin, Berg,Alexander C, and Berg, Tamara. Babytalk: Understanding and generating simple image descrip-tions. TPAMI, 35(12):2891–2903, 2013.

Lampert, Christoph H, Nickisch, Hannes, and Harmeling, Stefan. Learning to detect unseen objectclasses by between-class attribute transfer. In CVPR, 2009.

13

Page 14: M S K I - arXiv · 2015-11-17 · Scandal in Bohemia) Let us imagine a scene with the following facts:

Under review as a conference paper at ICLR 2016

Leacock, Claudia and Chodorow, Martin. Combining local context and wordnet similarity for wordsense identification. WordNet: An electronic lexical database, 1998.

Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva,Dollar, Piotr, and Zitnick, C Lawrence. Microsoft coco: Common objects in context. In ECCV.Springer, 2014.

Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, Alan. Deep captioning with multimodalrecurrent neural networks (m-rnn). ICLR, 2015.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed represen-tations of words and phrases and their compositionality. In NIPS, 2013.

Miller, George A. Wordnet: a lexical database for english. Communications of the ACM, 1995.

Muja, Marius and Lowe, David. Flann-fast library for approximate nearest neighbors user manual.Computer Science Department, University of British Columbia, Vancouver, BC, Canada, 2009.

Norouzi, Mohammad, Mikolov, Tomas, Bengio, Samy, Singer, Yoram, Shlens, Jonathon, Frome,Andrea, Corrado, Greg S, and Dean, Jeffrey. Zero-shot learning by convex combination of se-mantic embeddings. In ICLR, 2014.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. Glove: Global vectors for wordrepresentation. EMNLP, 2014.

Plummer, Bryan, Wang, Liwei, Cervantes, Chris, Caicedo, Juan, Hockenmaier, Julia, and Lazebnik,Svetlana. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.

Sadeghi, Mohammad Amin and Farhadi, Ali. Recognition using visual phrases. In CVPR, 2011.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale imagerecognition. In ICLR, 2015.

Socher, Richard, Ganjoo, Milind, Sridhar, Hamsa, Bastani, Osbert, Manning, Christopher D., andNg, Andrew Y. Zero shot learning through cross-modal transfer. In NIPS, 2013.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neuralimage caption generator. 2015.

Xiao, Jianxiong, Hays, James, Ehinger, Krista, Oliva, Aude, Torralba, Antonio, et al. Sun database:Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Courville, Aaron, Salakhutdinov, Ruslan, Zemel, Richard, andBengio, Yoshua. Show, attend and tell: Neural image caption generation with visual attention. InICML. 2015.

Yao, Bangpeng and Fei-Fei, Li. Grouplet: A structured image representation for recognizing humanand object interactions. In CVPR, 2010.

Yao, Bangpeng, Jiang, Xiaoye, Khosla, Aditya, Lin, Andy Lai, Guibas, Leonidas, and Fei-Fei, Li.Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.

Young, Peter, Lai, Alice, Hodosh, Micah, and Hockenmaier, Julia. From image descriptions tovisual denotations: New similarity metrics for semantic inference over event descriptions. TACL,pp. 67–78, 2014.

Yu, Haonan and Siskind, Jeffrey Mark. Grounded language learning from video described withsentences. In ACL, 2013.

Zhong, Zhi and Ng, Hwee Tou. It makes sense: A wide-coverage word sense disambiguation systemfor free text. In ACL, 2010.

Zhou, Bolei, Lapedriza, Agata, Xiao, Jianxiong, Torralba, Antonio, and Oliva, Aude. Learning deepfeatures for scene recognition using places database. In NIPS, 2014.

14