Multi modal retrieval and generation with deep distributed models

Roelof Pieters (@graphific)

26 April 2016, KTH

www.csc.kth.se/~roelof/ [email protected]

https://twitter.com/graphific


Creative AI > a “brush” > rapid experimentation

human-machine collaboration

Multi-modal retrieval

Modalities

[Karlgren 2014, NLP Sthlm Meetup]

Digital Media Deluge: text

http://www.apple.com

[ http://lexicon.gavagai.se/lookup/en/lol ]

Digital Media Deluge: text

lol ?

…

[YouTube Blog, 2010]

Digital Media Deluge: video

https://youtube.googleblog.com/2010/11/great-scott-over-35-hours-of-video.html

[Reelseo, 2015]

Digital Media Deluge: video

http://www.reelseo.com/hours-minute-uploaded-youtube/

[Reelseo, 2015]

Digital Media Deluge: audio

Challenges


• Volume

• Velocity

• Variety

Can we make it searchable?


Language

Language: Compositionality

Principle of compositionality:

the “meaning (vector) of a complex expression (sentence) is determined by:

- the meanings of its constituent expressions (words) and

- the rules (grammar) used to combine them”

(Gottlob Frege, 1848-1925)

Word Representation

• NLP treats words mainly (rule-based/statistical approaches at least) as atomic symbols: Love, Candy, Store

• or in vector space: [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …]

• also known as “one hot” representation.

• Its problem?

Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] AND Store [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …] = 0 !
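A minimal sketch of the one-hot problem, using a made-up four-word vocabulary (the words and their order are assumptions for illustration only). Any two different words have a dot product of 0, so related words look just as dissimilar as unrelated ones.

```python
# Toy vocabulary; the words and their order are illustrative assumptions.
vocab = ["love", "candy", "store", "shop"]

def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

candy, store, shop = one_hot("candy"), one_hot("store"), one_hot("shop")

# Distinct words never share a non-zero position, so every pair scores 0,
# whether the words are related ("store"/"shop") or not ("candy"/"store").
sim_candy_store = dot(candy, store)
sim_store_shop = dot(store, shop)
sim_store_store = dot(store, store)  # a word only matches itself
```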


Distributional semantics

Distributional meaning as co-occurrence vector:
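A rough sketch of how such co-occurrence vectors can be counted. The three-sentence corpus, the window size of 2 and the function name `cooccurrence_vectors` are assumptions made for this example; real systems use large corpora and usually reweight the raw counts.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = [
    "i love the candy store",
    "i love the candy shop",
    "the store sells candy",
]

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

vectors = cooccurrence_vectors(corpus)
store_context = vectors["store"]
shop_context = vectors["shop"]

# Words used in similar contexts end up with overlapping context counts.
shared = set(store_context) & set(shop_context)
```

Unlike the one-hot case, "store" and "shop" now share context words, so their vectors overlap.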

Deep Distributional representations

• Taking it further:

• Continuous word embeddings

• Combine vector space semantics with the prediction of probabilistic models

• Words are represented as a dense vector:

Candy =


Neural Networks for NLP

• Can theoretically (given enough units) approximate “any” function

• and fit to “any” kind of data

• Efficient for NLP: hidden layers can be used as word lookup tables

• Dense distributed word vectors + efficient NN training algorithms:

• Can scale to billions of words!
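The "lookup table" point can be illustrated with a small NumPy sketch: multiplying a one-hot vector by a weight matrix just selects one row, so the layer can be implemented as an index lookup. The vocabulary, the embedding size and the random weights are all assumptions; in a trained model the rows of `W` would be learned.

```python
import numpy as np

vocab = ["love", "candy", "store"]   # toy vocabulary (assumption)
embedding_dim = 4                    # arbitrary size for the sketch

rng = np.random.default_rng(0)
# One row per word; in a real model these weights are learned.
W = rng.normal(size=(len(vocab), embedding_dim))

def one_hot(index, size):
    """One-hot row vector with a 1.0 at `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

idx = vocab.index("candy")
via_matmul = one_hot(idx, len(vocab)) @ W   # what the layer computes
via_lookup = W[idx]                         # the cheap equivalent
```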

Word Embeddings: Socher

Vector Space Model

Figure (edited), adapted from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA

In a perfect world:

the country of my birth

the place where I was born ?

…

Word Embeddings: Turian (2010)

Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning

code & info: http://metaoptimize.com/projects/wordreprs/

Word Embeddings: Collobert & Weston (2011)

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011). Natural Language Processing (almost) from Scratch

http://arxiv.org/pdf/1103.0398v1.pdf

Multi-embeddings: Stanford (2012)

Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012). Improving Word Representations via Global Context and Multiple Word Prototypes

http://www.socher.org/index.php/Main/ImprovingWordRepresentationsViaGlobalContextAndMultipleWordPrototypes

Linguistic Regularities: Mikolov (2013)

Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations

code & info: https://code.google.com/p/word2vec/

http://research.microsoft.com/pubs/189726/rvecs.pdf
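The kind of regularity reported in that paper is usually probed with vector arithmetic: a : b :: c : ? is answered by the word nearest to b - a + c. The sketch below uses hand-made 3-dimensional vectors (an assumption; real embeddings are learned and have hundreds of dimensions), and the helper names `cosine` and `analogy` are illustrative.

```python
import numpy as np

# Hand-made 3-d vectors (royalty, gender, plural); real embeddings are learned.
vectors = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "kings": np.array([1.0,  1.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Solve a : b :: c : ? by nearest cosine neighbour of b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Exclude the query words themselves, as is commonly done.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "woman", "king")
```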

Word Embeddings for MT: Mikolov (2013)

Mikolov, T., Le, Q. V., Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation

http://arxiv.org/pdf/1309.4168.pdf
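The core idea of that paper is a linear map between the embedding spaces of two languages, fitted on known translation pairs. The sketch below is only an illustration under strong assumptions: the vectors are synthetic, the relation between the two spaces is exactly linear by construction, and the map is fitted with closed-form least squares rather than the gradient-based training a real system might use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for source- and target-language word vectors.
n_pairs, src_dim, tgt_dim = 50, 5, 4
X = rng.normal(size=(n_pairs, src_dim))         # source vectors, one per row
true_map = rng.normal(size=(src_dim, tgt_dim))  # hidden map used to build the toy data
Z = X @ true_map                                # target vectors of the translations

# Least-squares fit of W so that X @ W approximates Z.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(x, target_vectors):
    """Map x into the target space; return the index of the nearest target vector."""
    projected = x @ W
    sims = target_vectors @ projected / (
        np.linalg.norm(target_vectors, axis=1) * np.linalg.norm(projected)
    )
    return int(np.argmax(sims))

# On this noise-free toy data every word should map back to its own translation.
hits = sum(translate(X[i], Z) == i for i in range(n_pairs))
```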

Word Embeddings for MT: Kiros (2014)


Recursive Embeddings for Sentiment: Socher (2013)

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.

code & demo: http://nlp.stanford.edu/sentiment/index.html

http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf

Paragraph Vectors: Dai et al. (2014)


Can we make it searchable?


Other modalities

• Image -> vector -> embedding ? ?

• Video -> vector -> embedding ? ?

• Audio -> vector -> embedding ? ?


Other modalities: Embeddings?

Deep Learning?

• A host of statistical machine learning techniques

• Enables the automatic learning of feature hierarchies

• Generally based on artificial neural networks

Deep Learning?

• Manually designed features are often over-specified, incomplete and take a long time to design and validate

• Learned features are easy to adapt, fast to learn

• Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual and linguistic information.

• Deep learning can learn unsupervised (from raw text/audio/images/whatever content) and supervised (with specific labels like positive/negative)

(as summarised by Richard Socher 2014)

2006+ : The Deep Learning Conspirators

• Image -> vector -> embedding

• Video -> vector -> embedding ? ?


Image Embeddings

Convolutional Neural Nets for Images

classification demo

http://ml4a.github.io/dev/demos/mnist_confusion.html



http://ml4a.github.io/dev/demos/demo_convolution.html



Zeiler and Fergus 2013, Visualizing and Understanding Convolutional Networks

Deep Nets

Convolutional Neural Nets: Embeddings?

[-0.34, 0.28, …] 4096-dimensional fc7 AlexNet CNN (Karpathy)

http://cs.stanford.edu/people/karpathy/cnnembed/

http://ml4a.github.io/dev/demos/tsne-viewer.html
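Once every image is a feature vector, "making it searchable" reduces to nearest-neighbour search. A minimal sketch follows, assuming random vectors as stand-ins for real fc7 activations (extracting those needs a trained network, which is not shown); the function names `normalize` and `search` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for CNN features; real ones would be fc7 activations.
n_images, dim = 100, 4096
features = rng.normal(size=(n_images, dim))

def normalize(rows):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

index = normalize(features)

def search(query, k=5):
    """Return indices of the k indexed images most similar to `query`."""
    q = query / np.linalg.norm(query)
    scores = index @ q
    return np.argsort(-scores)[:k]

# A slightly perturbed copy of image 42 should retrieve image 42 first.
query = features[42] + 0.1 * rng.normal(size=dim)
top = search(query)
```

For large collections the brute-force matrix product would typically be replaced by an approximate nearest-neighbour index.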

• Image -> vector -> embedding ??

• Video -> vector -> embedding


Video Embeddings

Convolutional Neural Nets for Video

3D Convolutional Neural Networks for Human Action Recognition, Ji et al., 2010

Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011

Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014

[Le et al. '11]

vs classic 2d convnet:

Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al., 2015

Beyond Short Snippets: Deep Networks for Video Classification, Ng et al., 2015

Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016

• Image -> vector -> embedding ??

• Video -> vector -> embedding ??

• Audio -> vector -> embedding

Audio Embeddings

Zero-shot Learning

[Sander Dieleman, 2014]

http://benanne.github.io/2014/08/05/spotify-cnns.html


demo

http://everynoise.com/engenremap.html

• Can we take this further?


Multi Modal Embeddings?

Zero-shot Learning

• unsupervised pre-training (on many images)

• in parallel, train a neural network language model

• train a linear mapping between (image) representations and (word) embeddings, representing the different “classes”
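The three steps above can be sketched on toy data. Everything here is an assumption made for illustration: the hand-made 3-d "word embeddings", the random projection standing in for a pre-trained image encoder, and the closed-form least-squares fit of the mapping. The point is only that a mapping trained on seen classes can place images of unseen classes near the right word vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hand-made "word embeddings" (animal, vehicle, size); purely illustrative.
word_vecs = {
    "cat":   np.array([1.0, 0.0, 0.2]),
    "dog":   np.array([1.0, 0.0, 0.5]),
    "truck": np.array([0.0, 1.0, 1.0]),
    "horse": np.array([1.0, 0.0, 1.0]),   # never seen with images
    "car":   np.array([0.0, 1.0, 0.5]),   # never seen with images
}
seen, unseen = ["cat", "dog", "truck"], ["horse", "car"]

# Stand-in for a pre-trained image encoder: a fixed random projection plus noise.
img_dim = 8
A = rng.normal(size=(3, img_dim))

def fake_image_features(label, n):
    """Generate n noisy toy feature vectors for images of class `label`."""
    return word_vecs[label] @ A + 0.01 * rng.normal(size=(n, img_dim))

# Training data uses seen classes only.
X = np.vstack([fake_image_features(c, 50) for c in seen])
Y = np.vstack([np.tile(word_vecs[c], (50, 1)) for c in seen])

# Linear mapping from image space to word space (least squares).
M, *_ = np.linalg.lstsq(X, Y, rcond=None)

def zero_shot_label(features, candidates):
    """Project image features into word space; pick the nearest class vector."""
    z = features @ M
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(candidates, key=lambda c: cos(z, word_vecs[c]))

pred_horse = zero_shot_label(fake_image_features("horse", 1)[0], unseen)
pred_car = zero_shot_label(fake_image_features("car", 1)[0], unseen)
```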

DeViSE model (Frome et al. 2013)

• skip-gram text model on Wikipedia corpus of 5.7 million documents (5.4 billion words) - approach from (Mikolov et al. ICLR 2013)

Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., Ranzato, M.A. (2013). DeViSE: A Deep Visual-Semantic Embedding Model

http://arxiv.org/abs/1301.3781


Encoder-Decoder pipeline (Kiros et al 2014)

Encoder: A deep convolutional network (CNN) and long short-term memory recurrent network (LSTM) for learning a joint image-sentence embedding.

Decoder: A new neural language model that combines structure and content vectors for generating words one at a time in sequence.

• matches state-of-the-art performance on Flickr8K and Flickr30K without using object detections

• new best results when using the 19-layer Oxford convolutional network.

• linear encoders: learned embedding space captures multimodal regularities (e.g. *image of a blue car* - "blue" + "red" is near images of red cars)

Kiros, R., Salakhutdinov, R., Zemel, R. S. (2014). Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models


Image-Text Embeddings


Socher et al (2013) Zero Shot Learning Through Cross-Modal Transfer (info)

http://www.socher.org/index.php/Main/Zero-ShotLearningThroughCross-ModalTransfer

Image-Captioning

• Andrej Karpathy, Li Fei-Fei, 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions (pdf: http://cs.stanford.edu/people/karpathy/cvpr2015.pdf) (info: http://cs.stanford.edu/people/karpathy/deepimagesent/) (code: https://github.com/karpathy/neuraltalk2)

• Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, 2015. Show and Tell: A Neural Image Caption Generator (arxiv)

• Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (arxiv) (info: http://kelvinxu.github.io/projects/capgen.html) (code: https://github.com/kelvinxu/arctic-captions)

“A person riding a motorcycle on a dirt road.”???

“Two hockey players are fighting over the puck.”???

• Let’s turn it around!

• Generative Models

• (we won't cover, but common architectures):

• Autoencoders (AE), variational variants: VAE

• Generative Adversarial Nets (GAN)

• Variational Recurrent Neural Net (VRNN)

Generative Models

Wanna Play ?

Text generation (RNN)

Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks (blog)

http://karpathy.github.io/2015/05/21/rnn-effectiveness/



Caption -> Image generation

“A stop sign is flying in blue skies.”

“A herd of elephants flying in the blue skies.”

Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov, 2015. Generating Images from Captions with Attention (arxiv) (examples: http://www.cs.toronto.edu/~emansim/cap2im.html)

Turn Convnet Around: “Deep Dream”

Image -> NN -> What do you (think) you see -> What's the (text) label

Image -> NN -> What do you (think) you see -> feed back activations -> optimize image to “fit” to the ConvNet's “hallucination” (iteratively)
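The iterative "optimize the image" step is gradient ascent on the input rather than on the weights. The sketch below shows only that mechanism, under heavy assumptions: a single random tanh layer stands in for a trained ConvNet, the "image" is a flat vector of 64 values, and the step size of 0.1 is arbitrary. Practical Deep Dream code adds further tricks not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one network layer: a random linear map followed by tanh.
# A real Deep Dream run would use a trained ConvNet instead.
n_pixels, n_units = 64, 16
W = rng.normal(scale=1.0 / np.sqrt(n_pixels), size=(n_units, n_pixels))

def activations(image):
    """Layer response to the input image."""
    return np.tanh(W @ image)

def objective(image):
    """How strongly the layer responds: half the squared activation norm."""
    return 0.5 * np.sum(activations(image) ** 2)

def gradient(image):
    """Gradient of the objective with respect to the input pixels."""
    a = activations(image)
    return W.T @ (a * (1.0 - a ** 2))

image = 0.1 * rng.normal(size=n_pixels)   # start from faint noise
before = objective(image)

for _ in range(100):
    # Gradient ascent on the image itself; the weights W stay fixed.
    image = image + 0.1 * gradient(image)

after = objective(image)
```

The weights never change; only the input is nudged toward whatever the layer already responds to.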

see also: http://www.csc.kth.se/~roelof/deepdream/

Roelof Pieters 2015 (code: https://github.com/graphific/DeepDreamVideo) (youtube: https://www.youtube.com/watch?v=oyxSerkkP4o)

Single Units

https://www.flickr.com/photos/graphific/albums/72157657250972188

Inter-modal: “Style Net”

Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, 2015. A Neural Algorithm of Artistic Style (GitXiv)

http://gitxiv.com/posts/jG46ukGod8R7Rdtud/a-neural-algorithm-of-artistic-style

Neural Doodle

https://github.com/alexjc/neural-doodle

Gene Kogan, 2015. Why is a Raven Like a Writing Desk? (vimeo)

https://vimeo.com/139123754

“Style Transfer” papers

• Image Analogies, 2001, A. Hertzmann, C. Jacobs, N. Oliver, B. Curless, D. Salesin

• A Neural Algorithm of Artistic Style, 2015, Leon A. Gatys, Alexander S. Ecker, Matthias Bethge

• Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis, 2016, Chuan Li, Michael Wand

• Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artworks, 2016, Alex J. Champandard

• Texture Networks: Feed-forward Synthesis of Textures and Stylized Images, 2016, Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, Victor Lempitsky

• Perceptual Losses for Real-Time Style Transfer and Super-Resolution, 2016, Justin Johnson, Alexandre Alahi, Li Fei-Fei

• Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks, 2016, Chuan Li, Michael Wand

• @DeepForger

Audio Generation

• https://soundcloud.com/graphific/neural-music-walk

• https://soundcloud.com/graphific/pyotr-lstm-tchaikovsky

• https://soundcloud.com/graphific/neural-remix-net

A Recurrent Latent Variable Model for Sequential Data, 2016, J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, Y. Bengio

Wanna be Doing Deep Learning?

Deep Learning with Python

Python has a wide range of deep learning-related libraries available

Low level

http://deeplearning.net/software/theano/

http://caffe.berkeleyvision.org/

tensorflow.org/

High level

lasagne.readthedocs.org/en/latest

http://deeplearning.net/software/pylearn2/

and of course:

http://keras.io/

Questions? love letters? existential dilemmas? academic questions? gifts?

find me at: www.csc.kth.se/~roelof/

[email protected]

@graphific

Code & Papers?

Collaborative Open Computer Science

.com

Generative “creative” AI “stuff”?

.net


human-machine collaboration


(YouTube: https://www.youtube.com/watch?v=ob1y8mJ6rfk, Paper: http://meyumer.com/pdfs/SemanticEditing.pdf)

(YouTube: https://www.youtube.com/watch?v=7FQrJ6sScbk, Paper: http://www.meyumer.com/pdfs/PmAutoencoder.pdf)

(Vimeo: https://vimeo.com/33408708, Paper: http://graphics.stanford.edu/~lfyg/cds.pdf)

Generative Adversarial Nets

Emily Denton, Soumith Chintala, Arthur Szlam, Rob Fergus, 2015. Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks (GitXiv)



Alec Radford, Luke Metz, Soumith Chintala, 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (GitXiv)



“turn” vector created from four averaged samples of faces looking left vs looking right.

walking through the manifold
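A small sketch of both ideas, assuming a generator that takes a 100-dimensional latent vector (the size and the random codes are stand-ins; in practice the codes would be chosen by looking at generated faces, and each point on the path would be passed through the trained generator, which is not included here).

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 100   # assumed size of the generator's input vector

# Stand-ins for latent codes whose generated faces look left / right.
# In practice these would be picked by inspecting generator samples.
z_left = rng.normal(size=(4, latent_dim))
z_right = rng.normal(size=(4, latent_dim))

# "Turn" direction: difference between the two averaged groups.
turn = z_right.mean(axis=0) - z_left.mean(axis=0)

def walk(z_start, direction, steps=5):
    """Evenly spaced latent codes from z_start to z_start + direction."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([z_start + a * direction for a in alphas])

# Walking from the averaged "left" code along the turn vector ends at the
# averaged "right" code; intermediate points are the in-between poses.
path = walk(z_left.mean(axis=0), turn)
```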


top: unmodified samples

bottom: same samples dropping out “window” filters

