Multimedia Information Retrieval: Bytes and pixels meet the challenges of human media interpretation


Martha Larson
Delft University of Technology and Radboud University Nijmegen
29 June 2016, Communication Science, Radboud University Nijmegen

About me

● Where do I work?
  ○ TU Delft: Multimedia Computing Group
  ○ Radboud University: Multimedia Information Technology

● What do I do?
  ○ Background: Speech and language,
  ○ Research: Multimedia retrieval and recommender systems,
  ○ Emphasis: How people interpret and use multimedia.

● What am I doing today?
  ○ Sharing with you potential and open issues.

Today’s topics

● Introducing intelligent information systems
  ○ Multimedia information retrieval (user is active)
  ○ Recommender systems (user is passive)

● Computer Science and Multimedia
  ○ The “love” relationship: lots of data
  ○ The “hate” relationship: people’s interpretation of media is not “neat”!

● How to move forward?
  ○ Benchmarking challenges

Intelligent Information Systems

● Connect users with information,
● Information: digital content, facts, products, services,
● Include search engines and recommender systems,
● Success is judged by satisfaction of user needs.

Information retrieval

Definition: Information retrieval (IR) is finding material of an unstructured nature that satisfies an information need from within large collections.

http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html

Recommender Systems

Definition: A recommender system tries to identify sets of items that are likely to be of interest to a certain user given some information from that user’s profile.

“Multimedia Clues” for the computer scientist

● Text: Things people write about images and videos.
● User interactions: What people click on, how long they watch.
● Pixel statistics: Colors, lines, textures, shot change patterns.
● Concept detection: Entities that can be detected in images and videos (faces can be detected well).
● Speech recognition: What is said in a video.
● Sound detection: Sounds that can be detected (laughter and gunshots can be detected well).

Visual Geo-location prediction

● Combine evidence from multiple images (e) taken in an area (Eg).

● Upweight elements that are distinctive for that particular area (WGeo).

Xinchao Li, Alan Hanjalic, Martha Larson. Geo-distinctive Visual Element Matching for Location Estimation of Images, Under review. http://arxiv.org/pdf/1601.07884v1.pdf

Good match: Lots of what’s unique
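The idea of combining evidence per area and upweighting distinctive elements can be sketched with an IDF-style weight. This is a minimal illustration and not the method of the cited paper: the element IDs, the toy areas, and the names `geo_weights` and `score_locations` are assumptions made for the example.

```python
import math
from collections import defaultdict

def geo_weights(area_elements):
    """Weight each visual element by how few areas it appears in (IDF-style)."""
    n_areas = len(area_elements)
    areas_with = defaultdict(int)
    for elements in area_elements.values():
        for e in set(elements):
            areas_with[e] += 1
    return {e: math.log(n_areas / count) for e, count in areas_with.items()}

def score_locations(query_elements, area_elements):
    """Score each area by the summed weight of the elements it shares with the query."""
    weights = geo_weights(area_elements)
    scores = {}
    for area, elements in area_elements.items():
        shared = set(query_elements) & set(elements)
        scores[area] = sum(weights[e] for e in shared)
    return scores

# Hypothetical visual-element IDs pooled from all images taken in each area.
areas = {
    "amsterdam": ["canal", "bike", "gable", "sky"],
    "venice": ["canal", "gondola", "sky"],
    "delft": ["bike", "tower", "sky"],
}
scores = score_locations(["canal", "gable", "sky"], areas)
best = max(scores, key=scores.get)
```

In this toy setup "sky" occurs in every area and so contributes nothing, while "gable" occurs in only one area and dominates the score.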


Conventional search engine finds “what”

Alan Hanjalic, Christoph Kofler, and Martha Larson. 2012. Intent and its discontents: the user at the wheel of the online video search engine. In Proceedings of the 20th ACM international conference on Multimedia (MM '12). ACM, New York, NY, USA, 1239-1248.

I want a song called “koi pond”.

I’m interested in garden koi ponds.

Intent-aware search responds to “why”


I am interested in the significance of koi ponds.

I want to build a koi pond.

User intent in video search

Our study identified five major reasons why people search for videos online:

● Information (declarative knowledge)
● Experience for Learning (performative knowledge)
● Experience for Exposure (“being there”)
● Affect (change of mood)
● Object (video as video)


Why are video moments important?

R. Vliegendhart, M. Larson, B. Loni and A. Hanjalic, "Exploiting the Deep-Link Commentsphere to Support Non-Linear Video Access," in IEEE Transactions on Multimedia, vol. 17, no. 8, pp. 1372-1384, Aug. 2015.

Viewer Expressive Reactions

R. Vliegendhart, M. Larson, B. Loni and A. Hanjalic, "Exploiting the Deep-Link Commentsphere to Support Non-Linear Video Access," in IEEE Transactions on Multimedia, vol. 17, no. 8, pp. 1372-1384, Aug. 2015.

Expressive reactions are not emotional in the classic sense.

They are also not completely personal...but..

The way people take a picture reflects what they are taking a picture of.

Pixel statistics reveal very simple information on how people take pictures.

We need people to judge if the computer guesses right.

Michael Riegler, Martha Larson, Mathias Lux, and Christoph Kofler. 2014. How 'How' Reflects What's What: Content-based Exploitation of How Users Frame Social Images. In Proceedings of the 22nd ACM international conference on Multimedia (MM '14).

Fashion and framing

Characterize the trend...

Jacket types are already very difficult for computers!

Crowdsourcing

People interpret images in exchange for micropayments.

Example: Amazon Mechanical Turk

MediaEval 2016
Multimedia Benchmark Initiative

moving forward with benchmarking

MediaEval Multimedia Evaluation Benchmark

● offers tasks on multimedia access and retrieval,
● exploits features derived from multiple modalities: speech, audio, visual content, tags, users, context,
● solutions may or may not involve machine learning.

multimediaeval.org

This year: the MediaEval workshop is right after ACM Multimedia 2016 in Amsterdam.

Example MediaEval Tasks

● Predicting Media Interestingness: Infer interesting frames and segments of movies (using audio, visual features, text).
● Retrieving Diverse Social Images: Diversify image results lists (text, visual features).
● Context of Multimedia Experience: Predict multimedia content suitable for watching in stressful situations.
● Person Discovery: finding people in broadcast content.
● Placing: geo-location estimation for social multimedia.


Publications arising from MediaEval
http://www.citeulike.org/group/16499

2015 Workshop Participants
80 participants from 25 countries


MediaEval Proceedings Papers


What sets MediaEval apart?

• … emphasizes the "multi" in multimedia: speech, audio, visual content, tags, users, context.

• … innovates new tasks and techniques focusing on the human and social aspects of multimedia content.

• … community driven.


Predicting Media Interestingness Task

Automatically select frames or portions of movies which are the most interesting for a common viewer.

● Goal: Make use of the visual, audio and text content (features provided).

● Data: approximately 100 movie trailers, together with human annotations.

● Metric: System performance is to be evaluated using standard Mean Average Precision.
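Mean Average Precision can be sketched as follows. This is a minimal illustration of the standard definition; the slide does not say whether the task applies a rank cutoff, so none is used here, and the frame and segment IDs are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# One entry per trailer: the system's ranking and the annotated interesting items.
runs = [
    (["f3", "f1", "f7", "f2"], {"f3", "f2"}),  # AP = (1/1 + 2/4) / 2 = 0.75
    (["s5", "s9", "s4"], {"s9"}),              # AP = (1/2) / 1 = 0.5
]
map_score = mean_average_precision(runs)       # (0.75 + 0.5) / 2 = 0.625
```

A ranking that places the interesting items first scores higher, which is why the metric rewards ordering and not only selection.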


Retrieving Diverse Social Images Task

This task addresses the problem of image search result diversification in the context of social media:

● Goal: refine a ranked list of Flickr photos retrieved with general purpose multi-topic queries using provided visual, textual and user tagging credibility information.

● Metrics: results are evaluated with respect to their relevance to the query and the diverse representation of it.

● Data: ~40k images, social metadata, text models, CNN descriptors, user tagging credibility dataset, etc

Three data sets have been published at the MMSys dataset track.
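Relevance and diversity are often measured with precision at k and cluster recall at k. The slide does not name the exact measures, so this pairing is an assumption, and the photo IDs and cluster labels below are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    top = ranked[:k]
    return sum(1 for item in top if item in relevant) / k

def cluster_recall_at_k(ranked, cluster_of, k):
    """Fraction of all known clusters represented among the top-k results."""
    all_clusters = set(cluster_of.values())
    seen = {cluster_of[item] for item in ranked[:k] if item in cluster_of}
    return len(seen) / len(all_clusters)

# Hypothetical ground truth: each relevant photo mapped to the aspect it shows.
cluster_of = {"p1": "front", "p2": "front", "p3": "interior", "p4": "night"}
relevant = set(cluster_of)

initial = ["p1", "p2", "x9", "p3", "p4"]      # near-duplicates at the top
diversified = ["p1", "p3", "p4", "p2", "x9"]  # one photo per aspect first

for name, ranking in [("initial", initial), ("diversified", diversified)]:
    p = precision_at_k(ranking, relevant, 3)
    cr = cluster_recall_at_k(ranking, cluster_of, 3)
    print(f"{name}: P@3={p:.2f} CR@3={cr:.2f}")
```

Both lists contain the same photos; only the diversified ordering covers all three aspects within the first three results.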

Retrieving Diverse Social Images Task (cont.)

[Figure: initial retrieval results compared with diversified results]

Context of Multimedia Experience Task

Develops multimodal techniques for automatically predicting multimedia suited to a particular consumption context.

● Goal: Predict movies that are suitable to watch on airplanes.

● Data: Input to the prediction methods is movie trailers, and metadata from IMDb, Rotten Tomatoes and Metacritic.

● Metric: Output is evaluated using the Weighted F1 score, with expert labels as ground truth.
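A weighted F1 score can be sketched as the per-class F1 averaged by class frequency in the ground truth. The slide does not define the weighting, so this common definition is an assumption, and the labels "good" and "bad" are hypothetical stand-ins for the expert labels.

```python
from collections import Counter

def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * n / total
               for c, n in support.items())

# Expert labels versus system predictions for five movies.
y_true = ["good", "good", "good", "bad", "bad"]
y_pred = ["good", "good", "bad", "bad", "good"]
score = weighted_f1(y_true, y_pred)  # (2/3 * 3 + 1/2 * 2) / 5 = 0.6
```

Weighting by support keeps a rare class from dominating the score while still counting its errors.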

This year: Task is offered at the MediaEval workshop and at a joint-challenge workshop at http://www.icpr2016.org

Context of Multimedia Experience Task

Different contexts can lead to different preferences...

...on an airplane, people like to watch different movies than they would at home or in the cinema.

Multimodal Person Discovery in Broadcast TV Task

● Goal: Given raw TV broadcasts, each shot must be automatically tagged with the name(s) of people who can be both seen and heard in the shot.

● The list of people is not known a priori and their names must be discovered in an unsupervised way from provided text overlay or speech transcripts.

● Data: Multilingual corpus from INA (French), DW (German & English) and UPC (Catalan)

● Metric: standard information retrieval metrics based on a posteriori collaborative annotation of the corpus by the participants themselves.

Person Discovery Task

Person names must be discovered in speech track and/or sub-titles. Models cannot be trained on external data.

Slide credit: Johann Poignant, Hervé Bredin, Claude Barras, Person Discovery Task Organizers MediaEval 2015

Tackling the Person Discovery Task

Slide credit: Johann Poignant, Hervé Bredin, Claude Barras, Person Discovery Task Organizers MediaEval 2015

Wrap Up

● We want to connect users with information, in order to satisfy information needs.
● CS Love: Lots of data!
● CS Hate: How do people really see multimedia, what do they want?
● Way forward: Continue to define new challenges and build algorithms to address them.

Beyond the user-item matrix

CrowdRec project

● Exploiting multiple sources of information,
● Leveraging the Crowd (crowdworkers, users, curators),
● Evaluating at large scale.

Context-driven Recommender systems:

“People have more in common with other people in the same situation than they do with past versions of themselves”

Roberto Pagano, Paolo Cremonesi, Martha Larson, Balazs Hidasi, Domonkos Tikk, Alexandros Karatzoglou, and Massimo Quadrana. The Contextual Turn: from Context-aware to Context-driven Recommender Systems. ACM RecSys 2016, to appear.

Turn from personalization

• Context has been taken into account by coupling it with personalization, in context-aware recommender systems.

• However, being aware of the context is not enough for some domains: recommendations should be driven by the context.
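The difference between a profile-driven and a context-driven recommendation can be sketched with simple counts. This is a toy illustration and not the approach of the cited paper: the listening log, the context labels, and both function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical listening log: (user, context, genre).
log = [
    ("ann", "leisure", "heavy metal"),
    ("ann", "leisure", "heavy metal"),
    ("ann", "leisure", "heavy metal"),
    ("bob", "working", "ambient"),
    ("cat", "working", "ambient"),
    ("cat", "working", "classical"),
    ("bob", "leisure", "pop"),
]

def recommend_by_profile(user, log, n=1):
    """Personalized baseline: the user's own most played genres."""
    counts = Counter(g for u, _, g in log if u == user)
    return [g for g, _ in counts.most_common(n)]

def recommend_by_context(context, log, n=1):
    """Context-driven: what people in the same situation play, whoever they are."""
    counts = Counter(g for _, c, g in log if c == context)
    return [g for g, _ in counts.most_common(n)]

# Ann usually plays heavy metal, but right now she is working.
from_profile = recommend_by_profile("ann", log)
from_context = recommend_by_context("working", log)
```

Here the profile-based call returns Ann's long-term favorite, while the context-based call returns what listeners play while working, even though Ann has no history in that context.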

Traditional recommender systems follow the Immutable Preference paradigm (ImP):

• User tastes do not evolve

• Goals and needs are static

• Item catalog is static

• Trendiness, seasonality, capacity and life-cycle are addressed by tweaks to existing models

Slide credit: Roberto Pagano


Music

I usually like heavy metal music, but now I have to work and I want to listen to some soft music.

Recommended for you:


Jaeyoung Choi, Eungchan Kim, Martha Larson, Gerald Friedland, and Alan Hanjalic. 2015. Evento 360: Social Event Discovery from Web-scale Multimedia Collection. ACM Multimedia 2015, pp. 193-196.

Thank you

Mohammad Soleymani, Guillaume Gravier, Bogdan Ionescu, Gareth Jones, Claire-Helene Demarty, Ngoc Duong, Frédéric Lefebvre, Yu-Gang Jiang, Bogdan Ionescu, Mats Sjöberg, Hanli Wang, Toan Do, Richard Sutcliffe, Chris Fox, Richard Lewis, Tom Collins, Eduard Hovy, Deane L. Root, Igor Szoke, Xavier Anguera, Claude Barras, Hervé Bredin, Camille Guinaudeau, Jean Carrive, Yannick Estève, Javier Hernando, Juliette Kahn, Nam Le, Sylvain Meignier, Ramon Morros, Johann Poignant, Satoshi Tamura, Bart Thomee, Olivier Van Laere, Claudia Hauff, Jaeyoung Choi, Emmanuel Dellandréa, Liming Chen, Yoann Baveye, Mats Sjöberg, Christina Boididou, Symeon Papadopoulos, Stuart E. Middleton, Michael Riegler, Duc Tien, Dang Nguyen, Giulia Boato, Andreas Petlund, Michael Riegler, Concetto Spampinato, Bogdan Ionescu, Alexandru Lucian Gînscă, Maia Zaharieva, Mihai Lupu, Henning Müller, Adrian Popescu, Bogdan Boteanu, Alan Woodley, Shlomo Geva, Timothy Chappell, Richi Nayak, Gabi Constantin, Roberto Pagano, Paolo Cremonesi, Martha Larson, Balazs Hidasi, Domonkos Tikk, Alexandros Karatzoglou, Massimo Quadrana, Xinchao Li, Alan Hanjalic, Andreas Lommatzsch, Benjamin Kille, Fabian Abel, Daniel Kohlsdorf, Jonas Seiler, Róbert Pálovics, Andras Benczur...

Links

● Challenges (Benchmarks)
  ○ MediaEval Multimedia Evaluation (http://multimediaeval.org),
  ○ CLEF NewsREEL News Recommendation challenge (http://www.clef-newsreel.org),
  ○ ACM RecSys 2016 Job Recommendation challenge (http://2016.recsyschallenge.com).

● Acknowledgements
  ○ Multimedia Commons (http://www.multimediacommons.org),
  ○ EC-funded CrowdRec project (http://crowdrec.eu).
