Intelligent Thumbnail Selection
Kamil Sindi, Lead Data Scientist
JW Player
1. Company
a. Open-source video player
b. Hosting platform
c. 5% of global internet video traffic
d. 150+ team
2. Data Team
a. Handling 5MM events per minute
b. Storing 1TB+ per day
c. Stack: Storm (Trident), Kafka, Luigi, Elasticsearch, Spark, AWS, MySQL
Thumbnails are Important
● Your video's first impression
● Types: Upload, Manual, Auto (default)
● Manual >> Auto in Play Rate
● Current Auto default is the frame at the 10th second
● Many big publishers only use Manual
● 90% of Thumbnails are Auto! :-(
source: tastingtable.com (2016-10-12)
What’s a “Good” Thumbnail?
It’s subjective to the viewer!
Common themes:
● Not blurry
● Balanced brightness
● Centered objects
● Large text overlay
● Relevant to subject
vs
Source: Big Buck Bunny, Blender Studios
Manually Creating a Model is Hard
● Which features to extract?
● How to describe those features?
● How to weight features?
● How to penalize overfitting of models?
● Many techniques: SIFT, SURF, HOG?
Need to be an expert in Computer Vision :-(
So Many Image Features...
Edge Detection, Color Histogram, Pixel Segmentation
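To make the hand-crafted route concrete, here is a minimal Sobel edge detector (one of the classic feature extractors above) in plain NumPy. This is an illustrative sketch, not code from the talk; the input is a synthetic vertical-edge image:

```python
import numpy as np

def sobel_edges(img):
    """Gradient magnitude of a 2-D grayscale image via Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()  # horizontal gradient
            gy[i, j] = (patch * ky).sum()  # vertical gradient
    return np.hypot(gx, gy)

# A synthetic vertical edge: left half dark, right half bright.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = sobel_edges(img)
```

The detector responds only where brightness changes, which is exactly the kind of feature engineering the next slides argue deep learning replaces.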
Deep Learning
● Learn features implicitly
● Learn from examples
● Techniques to avoid overfitting
● Success in a lot of applications:
○ Image classification
○ Image captioning
○ Machine translation
○ Speech-to-Text
Inception
● Learn multiple models in parallel; concatenate
their outputs (“modules”)
● Factoring convolutions (“towers”): e.g. 1x1
convs followed by 3x3
● Parameter reduction: GoogLeNet (5MM) vs.
AlexNet (60MM), VGG (200MM)
● Auxiliary classifiers for regularization
● Residual connections (Inception-v4)
● Depthwise separable convolutions (Xception)
https://www.udacity.com/course/deep-learning--ud730
https://arxiv.org/abs/1409.4842
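A back-of-the-envelope sketch of the parameter savings from factoring convolutions with 1x1 "towers". The channel counts below are illustrative, not the exact Inception figures:

```python
def conv_params(in_ch, out_ch, k):
    """Number of weights in a k x k convolution layer (biases ignored)."""
    return in_ch * out_ch * k * k

# Direct 3x3 convolution, 256 channels in and out:
direct = conv_params(256, 256, 3)
# Factored "tower": 1x1 reduces 256 -> 64 channels, then 3x3 maps back to 256:
tower = conv_params(256, 64, 1) + conv_params(64, 256, 3)
```

The factored tower needs roughly a quarter of the weights while keeping the 3x3 receptive field, which is the mechanism behind the GoogLeNet-vs-AlexNet parameter gap on the slide.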
Source: Rethinking the Inception Architecture for Computer Vision
1x1 Convolutions: what’s the point?
1. Dimensionality reduction: fewer channels, strides, feature pooling
2. Parameter reduction: faster, less overfitting
3. “Cheap” nonlinearity: 1x1 + 3x3 is non-linear
4. Cross-channel ⊥ spatial correlations
1x1 convolution with strides Pooling with 1x1 convolution
Source: http://iamaaditya.github.io/2016/03/one-by-one-convolution/
In Convolutional Nets, there is no such thing as
“fully-connected layers”. There are only
convolution layers with 1x1 convolution kernels. –
Yann LeCun
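LeCun's point can be checked numerically: a 1x1 convolution applies the same channels-in-to-channels-out linear map at every pixel, which is exactly a fully-connected layer over each pixel's channel vector. A NumPy sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
fmap = rng.normal(size=(5, 5, 8))   # H x W x C_in feature map
w = rng.normal(size=(8, 3))         # 1x1 conv kernel: C_in -> C_out

# 1x1 convolution: same linear map over channels at every spatial position.
out_conv = np.einsum('hwc,cd->hwd', fmap, w)

# "Fully-connected" view: flatten pixels, multiply, reshape back.
out_fc = (fmap.reshape(-1, 8) @ w).reshape(5, 5, 3)
```

The two results are identical, which is why the distinction between convolutional and fully-connected layers dissolves at kernel size 1x1.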
InceptionV3 Architecture
https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html
Dog (0.80), Cat (0.05), Rat (0.01), ...
Transfer Learning
1,000,000 images, 1,000 categories
● Use pre-trained model
○ Cheaper (no GPU required)
○ Faster
○ Prevents overfitting
● Penultimate (“Bottleneck”) layer contains
image’s “essence” (CNN codes); acts as a
feature extractor
● Just add a linear classifier (Softmax; lin-SVM)
to Bottleneck
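The transfer-learning recipe above can be sketched end to end in NumPy. The 2048-dimensional vectors here are synthetic stand-ins for the InceptionV3 bottleneck features (in the real pipeline they come from the pre-trained network's penultimate layer); only the small linear classifier on top is trained:

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 2048  # size of InceptionV3's bottleneck ("CNN codes") layer

# Synthetic stand-ins for bottleneck features of Manual (label 1)
# and Auto (label 0) thumbnails.
pos = rng.normal(loc=0.3, size=(200, dim))
neg = rng.normal(loc=-0.3, size=(200, dim))
X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)

# Train a linear (logistic) classifier on the frozen features.
w = np.zeros(dim)
b = 0.0
lr = 0.1
for _ in range(200):
    z = np.clip(X @ w + b, -30, 30)   # clip to keep exp() stable
    p = 1.0 / (1.0 + np.exp(-z))      # sigmoid
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * (p - y).mean()

acc = (((X @ w + b) > 0).astype(int) == y).mean()
```

Because the heavy feature extraction is frozen, only `dim + 1` parameters are learned, which is why this route is cheap, fast, and hard to overfit.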
Fine Tuning + Tips
● Replace the classification layer and
backpropagate into the layers further back
● Idea:
Early layers do basic filters; later
layers more dataset specific
● Generally use a pre-trained model
regardless of data size or similarity
Data Size (per class)   < 500       > 500                  > 5,000
Similar to original     Too small   TL                     TL + FT earlier layers
Not similar             Too small   TL on earlier layers   TL + FT entire network
Other Applications of Transfer Learning
Image Captioning: Google “Show and Tell”
https://github.com/tensorflow/models/tree/master/im2txt
Image Search
http://www.slideshare.net/ScottThompson90/applying-transfer-learning-in-tensorflow
Training: Thesis
Train to differentiate between Manual and Auto
● Manual thumbnails are (usually) better than Auto
● Select Manual thumbnails with high views and play rate;
pick Auto negatives at random from low-play videos
● We have a lot of examples: 10K+ manual
● We used InceptionV3 pre-trained on ImageNet
Training: Examples
Positive (Manual)
Negative (Auto)
Video Pre-Filter
Use FFmpeg to select the top 100 frame
candidates
Methods:
● Color histogram changes to avoid
dupes
● Coded macroblock information
● Remove “black” frames
● Measure motion vectors
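The histogram-change dedupe step can be sketched in NumPy. The frames and threshold below are made up for illustration; the real pre-filter operates on frames decoded by FFmpeg:

```python
import numpy as np

def hist(frame, bins=16):
    """Normalized grayscale histogram of a frame with values in [0, 1]."""
    counts, _ = np.histogram(frame, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

def keep_on_change(frames, threshold=0.5):
    """Keep a frame only if its histogram differs enough from the last kept one."""
    kept, last = [], None
    for i, frame in enumerate(frames):
        h = hist(frame)
        if last is None or np.abs(h - last).sum() > threshold:
            kept.append(i)
            last = h
    return kept

# Two pairs of duplicate frames: a dark scene cut followed by a bright one.
rng = np.random.default_rng(0)
dark = rng.uniform(0.0, 0.2, size=(10, 10))
bright = rng.uniform(0.8, 1.0, size=(10, 10))
candidates = keep_on_change([dark, dark, bright, bright])
```

Only one frame per scene survives, so the expensive scoring model sees each distinct look of the video once instead of many near-identical frames.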
Motion Vectors
Source: Sintel, Blender Studios
Engineering
Demo: Evaluation Tool
Demo: Examples
Original Auto (10th second frame) vs. top scored frames from the new model
What’s Next
● Refinements:
○ Fine tuning to earlier layers
○ Other models: ResnetV2, Xception
○ Pre-Filtering: adaptive, hardware accel.
● Products:
○ New auto thumbnails
○ Thumbstrips
Resources
Blog Posts:
● https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html
● https://github.com/tensorflow/models/tree/master/inception
● http://iamaaditya.github.io/2016/03/one-by-one-convolution/
● http://www.slideshare.net/ScottThompson90/applying-transfer-learning-in-tensorflow
● https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
● http://cs231n.github.io/transfer-learning/
● https://research.googleblog.com/2015/10/improving-youtube-video-thumbnails-with.html
● https://pseudoprofound.wordpress.com/2016/08/28/notes-on-the-tensorflow-implementation-of-inception-v3/
● https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html
Papers:
● Rethinking the Inception Architecture for Computer Vision. https://arxiv.org/abs/1512.00567
● Xception: Deep Learning with Depthwise Separable Convolutions. https://arxiv.org/abs/1610.02357
● CNN Features off-the-shelf: an Astounding Baseline for Recognition. https://arxiv.org/abs/1403.6382
● DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. https://arxiv.org/abs/1310.1531
● How transferable are features in deep neural networks? https://arxiv.org/abs/1411.1792