Multi modal retrieval and generation with deep distributed models

@graphific Roelof Pieters, Multi-modal Retrieval and Generation with Deep Distributed Models, 26 April 2016, KTH, www.csc.kth.se/~roelof/ [email protected]

Transcript of Multi modal retrieval and generation with deep distributed models

Page 1: Multi modal retrieval and generation with deep distributed models

@graphific Roelof Pieters

Multi-modal Retrieval and Generation with Deep Distributed Models

26 April 2016, KTH

www.csc.kth.se/~roelof/ [email protected]

Page 2: Multi modal retrieval and generation with deep distributed models

Creative AI > a “brush” > rapid experimentation

human-machine collaboration

Page 3: Multi modal retrieval and generation with deep distributed models

Multi-modal retrieval


Page 4: Multi modal retrieval and generation with deep distributed models

Modalities


Page 5: Multi modal retrieval and generation with deep distributed models

[Karlgren 2014, NLP Sthlm Meetup]

Digital Media Deluge: text

Page 6: Multi modal retrieval and generation with deep distributed models

[ http://lexicon.gavagai.se/lookup/en/lol ]

Digital Media Deluge: text

lol ?

Page 7: Multi modal retrieval and generation with deep distributed models

[YouTube Blog, 2010]

Digital Media Deluge: video

Page 8: Multi modal retrieval and generation with deep distributed models

[Reelseo, 2015]

Digital Media Deluge: video

Page 9: Multi modal retrieval and generation with deep distributed models

[Reelseo, 2015]

Digital Media Deluge: audio

Page 10: Multi modal retrieval and generation with deep distributed models

[Reelseo, 2015]

Digital Media Deluge: audio

Page 11: Multi modal retrieval and generation with deep distributed models

Challenges


• Volume

• Velocity

• Variety

Page 12: Multi modal retrieval and generation with deep distributed models

Can we make it searchable?


Language

Page 13: Multi modal retrieval and generation with deep distributed models

Language: Compositionality

Principle of compositionality:

the “meaning (vector) of a complex expression (sentence) is determined by:

- the meanings of its constituent expressions (words) and

- the rules (grammar) used to combine them”

Gottlob Frege (1848 - 1925)

Page 14: Multi modal retrieval and generation with deep distributed models

Word Representation

• NLP treats words mainly (rule-based/statistical approaches at least) as atomic symbols:

Love Candy Store

• or in vector space:

[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …]

• also known as “one hot” representation.

• Its problem?

Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] AND Store [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …] = 0 !
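A minimal Python sketch of the one-hot problem, assuming a made-up three-word vocabulary (the names `vocab`, `one_hot` and `dot` are illustrative, not from the slides). Any two distinct one-hot vectors have a dot product of 0, so the representation carries no notion of similarity between words.

```python
# Toy vocabulary (hypothetical); real vocabularies are far larger.
vocab = ["love", "candy", "store"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's vocabulary index."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

candy = one_hot("candy", vocab)
store = one_hot("store", vocab)

print(dot(candy, store))  # 0: distinct words never overlap
print(dot(candy, candy))  # 1: a word only matches itself
```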

Page 15: Multi modal retrieval and generation with deep distributed models

Word Representation


Page 16: Multi modal retrieval and generation with deep distributed models

Distributional semantics

Distributional meaning as co-occurrence vector:
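As a rough illustration of a co-occurrence vector, the sketch below counts which words appear within a small window around each word. The two-sentence corpus, the window size and the function name `cooccurrence_vectors` are assumptions made for this example.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words that appear within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

# Hypothetical toy corpus.
corpus = ["I love the candy store", "I love the toy store"]
vectors = cooccurrence_vectors(corpus)

# "candy" and "toy" occur in the same contexts, so their count vectors match.
print(vectors["candy"] == vectors["toy"])
```

Words used in similar contexts end up with similar count vectors, which is the property one-hot vectors lack.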

Page 17: Multi modal retrieval and generation with deep distributed models

Deep Distributional representations

• Taking it further:

• Continuous word embeddings

• Combine vector space semantics with the prediction of probabilistic models

• Words are represented as a dense vector:

Candy =
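A small sketch of why dense vectors help, using made-up 4-dimensional vectors (learned embeddings are much longer, and the values here are purely hypothetical). Unlike one-hot vectors, dense vectors allow a graded similarity, here measured with cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand for illustration only.
candy = [0.8, 0.1, 0.6, 0.0]
chocolate = [0.7, 0.2, 0.7, 0.1]
store = [0.1, 0.9, 0.0, 0.4]

print(cosine(candy, chocolate))  # high: related words
print(cosine(candy, store))      # lower: less related words
```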


Page 18: Multi modal retrieval and generation with deep distributed models

Neural Networks for NLP

• Can theoretically (given enough units) approximate “any” function

• and fit to “any” kind of data

• Efficient for NLP: hidden layers can be used as word lookup tables

• Dense distributed word vectors + efficient NN training algorithms:

• Can scale to billions of words!
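The "lookup table" point can be sketched as follows: multiplying a one-hot input by a weight matrix simply selects one row of that matrix. The weights and names (`W`, `row_times_matrix`) are hypothetical values chosen for illustration.

```python
vocab = ["love", "candy", "store"]

# Hypothetical weight matrix of a first layer: one row per word, 2 dimensions.
W = [
    [0.1, 0.9],
    [0.8, 0.3],
    [0.7, 0.4],
]

def one_hot(index, size):
    """One-hot row vector with a 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]

def row_times_matrix(x, W):
    """Compute the row vector x multiplied by matrix W."""
    dim = len(W[0])
    return [sum(xi * row[k] for xi, row in zip(x, W)) for k in range(dim)]

idx = vocab.index("candy")
via_matmul = row_times_matrix(one_hot(idx, len(vocab)), W)
via_lookup = W[idx]  # same result without any multiplication

print(via_matmul, via_lookup)
```

In practice the multiplication is skipped and the row is read directly, which is what makes the layer efficient for large vocabularies.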

Page 19: Multi modal retrieval and generation with deep distributed models
Page 20: Multi modal retrieval and generation with deep distributed models

Word Embeddings: Socher

Vector Space Model

adapted from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA

In a perfect world:

Page 21: Multi modal retrieval and generation with deep distributed models

Word Embeddings: Socher

Vector Space Model

adapted from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA

In a perfect world:

the country of my birth

the place where I was born

Page 22: Multi modal retrieval and generation with deep distributed models

Word Embeddings: Socher

Vector Space Model

Figure (edited) from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA

In a perfect world:

the country of my birth

the place where I was born ?

Page 23: Multi modal retrieval and generation with deep distributed models

Word Embeddings: Turian (2010)

Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning

code & info: http://metaoptimize.com/projects/wordreprs/

Page 24: Multi modal retrieval and generation with deep distributed models

Word Embeddings: Turian (2010)

Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning

code & info: http://metaoptimize.com/projects/wordreprs/

Page 25: Multi modal retrieval and generation with deep distributed models

Word Embeddings: Collobert & Weston (2011)

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011). Natural Language Processing (almost) from Scratch

Page 26: Multi modal retrieval and generation with deep distributed models

Multi-embeddings: Stanford (2012)

Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012). Improving Word Representations via Global Context and Multiple Word Prototypes

Page 27: Multi modal retrieval and generation with deep distributed models

Linguistic Regularities: Mikolov (2013)

Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations

code & info: https://code.google.com/p/word2vec/

Page 28: Multi modal retrieval and generation with deep distributed models

Word Embeddings for MT: Mikolov (2013)

Mikolov, T., Le, Q. V., Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation

Page 29: Multi modal retrieval and generation with deep distributed models

Word Embeddings for MT: Kiros (2014)


Page 30: Multi modal retrieval and generation with deep distributed models

Recursive Embeddings for Sentiment: Socher (2013)

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.

code & demo: http://nlp.stanford.edu/sentiment/index.html

Page 31: Multi modal retrieval and generation with deep distributed models

Paragraph Vectors: Dai et al. (2014)


Page 32: Multi modal retrieval and generation with deep distributed models

Paragraph Vectors: Dai et al. (2014)


Page 33: Multi modal retrieval and generation with deep distributed models

Can we make it searchable?


Other modalities

Page 34: Multi modal retrieval and generation with deep distributed models

• Image -> vector -> embedding ? ?

• Video -> vector -> embedding ? ?

• Audio -> vector -> embedding ? ?


Other modalities: Embeddings?

Page 35: Multi modal retrieval and generation with deep distributed models

• A host of statistical machine learning techniques

• Enables the automatic learning of feature hierarchies

• Generally based on artificial neural networks

Deep Learning?

Page 36: Multi modal retrieval and generation with deep distributed models

• Manually designed features are often over-specified, incomplete and take a long time to design and validate

• Learned Features are easy to adapt, fast to learn

• Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual and linguistic information.

• Deep learning can learn unsupervised (from raw text/audio/images/whatever content) and supervised (with specific labels like positive/negative)

(as summarised by Richard Socher 2014)

Deep Learning?

Page 37: Multi modal retrieval and generation with deep distributed models


2006+ : The Deep Learning Conspirators

Page 38: Multi modal retrieval and generation with deep distributed models
Page 39: Multi modal retrieval and generation with deep distributed models

• Image -> vector -> embedding

• Video -> vector -> embedding ? ?

• Audio -> vector -> embedding ? ?


Image Embeddings

Page 40: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Images

classification demo

Page 41: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Images

http://ml4a.github.io/dev/demos/demo_convolution.html

Page 42: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Images

Zeiler and Fergus 2013, Visualizing and Understanding Convolutional Networks

Page 43: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Images

Page 44: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Images

Page 45: Multi modal retrieval and generation with deep distributed models


Deep Nets

Page 46: Multi modal retrieval and generation with deep distributed models


Deep Nets

Page 47: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets: Embeddings?

[-0.34, 0.28, …]

4096-dimensional fc7 AlexNet CNN
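Once every image is reduced to such a feature vector, retrieval becomes a nearest-neighbour search. The sketch below assumes the vectors have already been extracted; the 4-dimensional values, file names and the `search` function are hypothetical stand-ins for real 4096-dimensional activations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query, index, k=2):
    """Return the k image names whose vectors are most similar to `query`."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]

# Hypothetical image index: name -> feature vector.
index = {
    "cat_1.jpg": [0.9, 0.1, 0.0, 0.2],
    "cat_2.jpg": [0.8, 0.2, 0.1, 0.1],
    "car_1.jpg": [0.0, 0.1, 0.9, 0.7],
}

query = [0.85, 0.15, 0.05, 0.15]  # features of a new (hypothetical) cat photo
print(search(query, index))
```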

Page 49: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets: Embeddings?

http://ml4a.github.io/dev/demos/tsne-viewer.html

Page 50: Multi modal retrieval and generation with deep distributed models

• Image -> vector -> embedding ??

• Video -> vector -> embedding

• Audio -> vector -> embedding ? ?


Video Embeddings

Page 51: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Video

3D Convolutional Neural Networks for Human Action Recognition, Ji et al., 2010

Page 52: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Video

Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011

Page 53: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Video

Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014

Page 54: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Video

Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014

Page 55: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Video

[Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]

[Le et al. '11]

vs classic 2d convnet:

Page 56: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Video

[Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]

Page 57: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Video

Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011

Page 58: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Video

Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al., 2015

Page 59: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Video

Beyond Short Snippets: Deep Networks for Video Classification, Ng et al., 2015

Page 60: Multi modal retrieval and generation with deep distributed models


Convolutional Neural Nets for Video

Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016

Page 61: Multi modal retrieval and generation with deep distributed models

• Image -> vector -> embedding ??

• Video -> vector -> embedding ??

• Audio -> vector -> embedding


Audio Embeddings

Page 62: Multi modal retrieval and generation with deep distributed models


Zero-shot Learning

[Sander Dieleman, 2014]

Page 63: Multi modal retrieval and generation with deep distributed models


Audio Embeddings

[Sander Dieleman, 2014]

Page 65: Multi modal retrieval and generation with deep distributed models

• Can we take this further?


Multi Modal Embeddings?

Page 66: Multi modal retrieval and generation with deep distributed models

Zero-shot Learning

• unsupervised pre-training (on many images)

• in parallel train a neural network (Language) Model

• train linear mapping between (image) representations and (word) embeddings, representing the different “classes”
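A toy sketch of the linear-mapping step, under stated assumptions: the image features and word embeddings are made-up 2-dimensional vectors, and the mapping is fitted by gradient descent on squared error (the slides do not state the training objective). Because labels live in word-embedding space, an image can be assigned a label ("zebra") for which no training image existed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def apply_map(W, x):
    """Matrix-vector product for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fit_linear_map(X, Z, lr=0.5, steps=100):
    """Fit W so that W x is close to z, by batch gradient descent on squared error."""
    d_out, d_in = len(Z[0]), len(X[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(steps):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for x, z in zip(X, Z):
            err = [p - t for p, t in zip(apply_map(W, x), z)]
            for r in range(d_out):
                for c in range(d_in):
                    grad[r][c] += err[r] * x[c]
        W = [[W[r][c] - lr * grad[r][c] for c in range(d_in)] for r in range(d_out)]
    return W

# Hypothetical word embeddings for class labels; "zebra" has no training image.
label_emb = {"horse": [0.9, 0.1], "car": [0.0, 1.0], "zebra": [0.8, 0.3]}

# Hypothetical image features (stand-ins for CNN activations) of the seen classes.
train_images = [[1.0, 0.0], [0.0, 1.0]]
train_labels = ["horse", "car"]

W = fit_linear_map(train_images, [label_emb[lab] for lab in train_labels])

def classify(image_features):
    """Map the image into word space and return the nearest label, seen or unseen."""
    projected = apply_map(W, image_features)
    return max(label_emb, key=lambda lab: cosine(projected, label_emb[lab]))

print(classify([0.9, 0.25]))  # features of an image from an unseen class
```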

Page 67: Multi modal retrieval and generation with deep distributed models

DeViSE model (Frome et al. 2013)

• skip-gram text model trained on a Wikipedia corpus of 5.7 million documents (5.4 billion words), following the approach of Mikolov et al. (ICLR 2013)

Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., Ranzato, M.A. (2013) Devise: A deep visual-semantic embedding model

Page 68: Multi modal retrieval and generation with deep distributed models

Encoder-Decoder pipeline (Kiros et al 2014)

Encoder: A deep convolutional network (CNN) and long short-term memory recurrent network (LSTM) for learning a joint image-sentence embedding.

Decoder: A new neural language model that combines structure and content vectors for generating words one at a time in sequence.

Kiros, R., Salakhutdinov, R., Zemel, R. S. (2014) Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Page 69: Multi modal retrieval and generation with deep distributed models

Encoder-Decoder pipeline (Kiros et al 2014)

• matches state-of-the-art performance on Flickr8K and Flickr30K without using object detections

• new best results when using the 19-layer Oxford convolutional network.

• linear encoders: learned embedding space captures multimodal regularities (e.g. *image of a blue car* - "blue" + "red" is near images of red cars)

Kiros, R., Salakhutdinov, R., Zemel, R. S. (2014) Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
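The "blue car" regularity can be sketched with vector arithmetic in a shared space. Everything here is hypothetical: a hand-made 3-dimensional space with axes (car-ness, blue-ness, red-ness), invented file names, and helper names `shift_query` and `nearest_image`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical joint embedding space shared by words and images.
word_emb = {"blue": [0.0, 1.0, 0.0], "red": [0.0, 0.0, 1.0]}
image_emb = {
    "blue_car.jpg": [1.0, 0.9, 0.0],
    "red_car.jpg": [1.0, 0.0, 0.9],
    "blue_sky.jpg": [0.0, 1.0, 0.0],
    "red_apple.jpg": [0.1, 0.0, 1.0],
}

def shift_query(image, minus_word, plus_word):
    """Compute image - minus_word + plus_word in the joint space."""
    return [
        i - m + p
        for i, m, p in zip(image_emb[image], word_emb[minus_word], word_emb[plus_word])
    ]

def nearest_image(query, exclude):
    """Return the image closest to `query`, ignoring the starting image."""
    candidates = [name for name in image_emb if name != exclude]
    return max(candidates, key=lambda name: cosine(query, image_emb[name]))

query = shift_query("blue_car.jpg", "blue", "red")
print(nearest_image(query, exclude="blue_car.jpg"))  # red_car.jpg
```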

Page 70: Multi modal retrieval and generation with deep distributed models

Image-Text Embeddings


Socher et al (2013) Zero Shot Learning Through Cross-Modal Transfer (info)

Page 71: Multi modal retrieval and generation with deep distributed models

Image-Captioning

• Andrej Karpathy, Li Fei-Fei, 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions (pdf) (info) (code)

• Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, 2015. Show and Tell: A Neural Image Caption Generator (arxiv)

• Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (arxiv) (info) (code)

Page 72: Multi modal retrieval and generation with deep distributed models

“A person riding a motorcycle on a dirt road.”???

Image-Captioning

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron

Page 73: Multi modal retrieval and generation with deep distributed models

“Two hockey players are fighting over the puck.”???

Image-Captioning

Page 74: Multi modal retrieval and generation with deep distributed models

• Let’s turn it around!

• Generative Models

• (we won't cover these, but common architectures include):

• Autoencoders (AE) and their variational variants (VAE)

• Generative Adversarial Nets (GAN)

• Variational Recurrent Neural Net (VRNN)


Generative Models

Page 75: Multi modal retrieval and generation with deep distributed models

Wanna Play ?

Text generation (RNN)


Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks (blog)

Page 76: Multi modal retrieval and generation with deep distributed models

Wanna Play ?

Text generation


Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks (blog)

Page 77: Multi modal retrieval and generation with deep distributed models
Page 78: Multi modal retrieval and generation with deep distributed models
Page 79: Multi modal retrieval and generation with deep distributed models

Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks (blog)

Page 80: Multi modal retrieval and generation with deep distributed models

“A stop sign is flying in blue skies.”

“A herd of elephants flying in the blue skies.”

Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov, 2015. Generating Images from Captions with Attention (arxiv) (examples)

Caption -> Image generation

Page 81: Multi modal retrieval and generation with deep distributed models

Turn Convnet Around: “Deep Dream”

Image -> NN -> What do you (think) you see -> What's the (text) label

Image -> NN -> What do you (think) you see -> feed back activations -> optimize image to “fit” the ConvNet's “hallucination” (iteratively)
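The feedback loop can be sketched as gradient ascent on the input rather than on the weights. This is a toy stand-in, not the Deep Dream implementation: the "network" is a single tanh unit with hypothetical fixed weights `w`, and the "image" is a 3-element list.

```python
import math

# Hypothetical fixed weights of a single unit standing in for a ConvNet layer.
w = [0.5, -0.3, 0.8]

def activation(x):
    """How strongly the unit responds to input x."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def gradient(x):
    """Derivative of the activation with respect to each input element."""
    a = activation(x)
    return [(1.0 - a * a) * wi for wi in w]

def dream(x, steps=20, lr=0.5):
    """Repeatedly nudge the input in the direction that increases the activation."""
    x = list(x)  # work on a copy; the weights stay untouched
    for _ in range(steps):
        g = gradient(x)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
    return x

image = [0.1, 0.2, -0.1]
dreamed = dream(image)
print(activation(image), activation(dreamed))
```

The weights never change; only the input is optimized, which is the sense in which the ConvNet is "turned around".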

Page 82: Multi modal retrieval and generation with deep distributed models

see also: www.csc.kth.se/~roelof/deepdream/

Turn Convnet Around: “Deep Dream”

Page 83: Multi modal retrieval and generation with deep distributed models

Turn Convnet Around: “Deep Dream”

see also: www.csc.kth.se/~roelof/deepdream/

Page 84: Multi modal retrieval and generation with deep distributed models

see also: www.csc.kth.se/~roelof/deepdream/ (code, youtube)

Roelof Pieters 2015

Turn Convnet Around: “Deep Dream”

Page 85: Multi modal retrieval and generation with deep distributed models

https://www.flickr.com/photos/graphific/albums/72157657250972188

Single Units

Page 86: Multi modal retrieval and generation with deep distributed models

Inter-modal: “Style Net”

Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, 2015. A Neural Algorithm of Artistic Style (GitXiv)

Page 87: Multi modal retrieval and generation with deep distributed models
Page 88: Multi modal retrieval and generation with deep distributed models


Page 89: Multi modal retrieval and generation with deep distributed models


Page 90: Multi modal retrieval and generation with deep distributed models


+

+

=

Page 91: Multi modal retrieval and generation with deep distributed models

https://github.com/alexjc/neural-doodle

Neural Doodle

Page 92: Multi modal retrieval and generation with deep distributed models

Gene Kogan, 2015. Why is a Raven Like a Writing Desk? (vimeo)

Page 93: Multi modal retrieval and generation with deep distributed models

• Image Analogies, 2001, A. Hertzmann, C. Jacobs, N. Oliver, B. Curless, D. Sales

• A Neural Algorithm of Artistic Style, 2015. Leon A. Gatys, Alexander S. Ecker, Matthias Bethge

• Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis, 2016, Chuan Li, Michael Wand

• Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artworks, 2016, Alex J. Champandard

• Texture Networks: Feed-forward Synthesis of Textures and Stylized Images, 2016, Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, Victor Lempitsky

• Perceptual Losses for Real-Time Style Transfer and Super-Resolution, 2016, Justin Johnson, Alexandre Alahi, Li Fei-Fei

• Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks, 2016, Chuan Li, Michael Wand

• @DeepForger


“Style Transfer” papers

Page 94: Multi modal retrieval and generation with deep distributed models

• https://soundcloud.com/graphific/neural-music-walk

• https://soundcloud.com/graphific/pyotr-lstm-tchaikovsky

• https://soundcloud.com/graphific/neural-remix-net


Audio Generation

A Recurrent Latent Variable Model for Sequential Data, 2016, J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, Y. Bengio

Page 95: Multi modal retrieval and generation with deep distributed models

Wanna be Doing Deep Learning?

Page 96: Multi modal retrieval and generation with deep distributed models

Python has a wide range of deep learning-related libraries available

Deep Learning with Python

Low level

High level

deeplearning.net/software/theano

caffe.berkeleyvision.org

tensorflow.org/

lasagne.readthedocs.org/en/latest

and of course:

keras.io

Page 97: Multi modal retrieval and generation with deep distributed models

Questions? Love letters? Existential dilemmas? Academic questions? Gifts?

find me at: www.csc.kth.se/~roelof/

[email protected]

Code & Papers?

Collaborative Open Computer Science

.com

@graphific

Page 98: Multi modal retrieval and generation with deep distributed models
Page 99: Multi modal retrieval and generation with deep distributed models

Questions? Love letters? Existential dilemmas? Academic questions? Gifts?

find me at: www.csc.kth.se/~roelof/

[email protected]

Generative “creative” AI “stuff”?

.net

@graphific

Page 100: Multi modal retrieval and generation with deep distributed models
Page 101: Multi modal retrieval and generation with deep distributed models

Creative AI > a “brush” > rapid experimentation

human-machine collaboration

Page 102: Multi modal retrieval and generation with deep distributed models

Creative AI > a “brush” > rapid experimentation

(YouTube, Paper)

Page 103: Multi modal retrieval and generation with deep distributed models

Creative AI > a “brush” > rapid experimentation

(YouTube, Paper)

Page 104: Multi modal retrieval and generation with deep distributed models

Creative AI > a “brush” > rapid experimentation

(Vimeo, Paper)

Page 105: Multi modal retrieval and generation with deep distributed models

Generative Adversarial Nets

Emily Denton, Soumith Chintala, Arthur Szlam, Rob Fergus, 2015. Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks (GitXiv)

Page 106: Multi modal retrieval and generation with deep distributed models

Generative Adversarial Nets

Alec Radford, Luke Metz, Soumith Chintala, 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (GitXiv)

Page 107: Multi modal retrieval and generation with deep distributed models

Generative Adversarial Nets

Alec Radford, Luke Metz, Soumith Chintala, 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (GitXiv)

Page 108: Multi modal retrieval and generation with deep distributed models

Generative Adversarial Nets

Alec Radford, Luke Metz, Soumith Chintala, 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (GitXiv)

“turn” vector created from four averaged samples of faces looking left vs looking right.
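A hedged sketch of how such a direction vector might be computed: average the latent codes of the samples in each group and subtract the averages. The 3-dimensional codes below are invented for illustration (real latent codes are much longer), and `apply_turn` is a hypothetical helper.

```python
def mean(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical latent codes of four generated faces looking left, four looking right.
looking_left = [[-1.0, 0.2, 0.5], [-0.8, 0.1, 0.4], [-1.2, 0.3, 0.6], [-1.0, 0.2, 0.5]]
looking_right = [[1.0, 0.2, 0.5], [0.8, 0.1, 0.4], [1.2, 0.3, 0.6], [1.0, 0.2, 0.5]]

# Direction in latent space that separates the two groups.
turn = [r - l for r, l in zip(mean(looking_right), mean(looking_left))]

def apply_turn(z, amount):
    """Move latent code z along the turn direction; the generator would then render it."""
    return [zi + amount * ti for zi, ti in zip(z, turn)]

print(turn)
print(apply_turn([0.0, 0.0, 0.0], 0.5))
```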

Page 109: Multi modal retrieval and generation with deep distributed models

walking through the manifold

Generative Adversarial Nets
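One simple way to "walk" between two points in latent space is linear interpolation, sketched below. The function name `interpolate` and the 2-dimensional codes are assumptions for illustration; each intermediate code would be passed to a trained generator to produce a frame.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` latent codes evenly spaced from z_start to z_end (steps >= 2)."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # goes from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

# Hypothetical latent codes of two generated images.
z_a = [0.0, 1.0]
z_b = [1.0, -1.0]

for z in interpolate(z_a, z_b, steps=5):
    print(z)
```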

Page 110: Multi modal retrieval and generation with deep distributed models

top: unmodified samples

bottom: same samples dropping out “window” filters

Generative Adversarial Nets