Tell me why! Ain't nothin' but a mistake? Describing Media Item Differences with Media Fragments URI and Speech Synthesis

Thomas Steiner (tomac@google.com, @tomayac)

Raphaël Troncy ([email protected], @rtroncy)

http://www.ourprg.com/wp-content/uploads/2013/03/wallpapers ru corvuscorax 2560x1440 chelyabinskiy meteor.jpg

Introduction

Context of this work:

● Event summarization based on multimedia data shared publicly on social networks.

● Developed an application that auto-generates media galleries.

Media gallery creation steps

1) Extract media items from multiple social networks

[Rizzo2012] G. Rizzo, T. Steiner, R. Troncy, R. Verborgh, J.-L. Redondo García, R. Van de Walle. What fresh media are you looking for?: retrieving media items from multiple social networks. In Proceedings of the 2012 international workshop on Socially-aware multimedia, pp. 15–20, 2012

Media gallery creation steps (cont.)

2) Deduplicate visually similar media items

[Steiner2013_1] Thomas Steiner, Ruben Verborgh, Joaquim Gabarró Vallés, and Rik Van de Walle. Near-duplicate Photo Deduplication in Event Media Shared on Social Networks. In Proceedings of the International Conference on Advanced IT, Engineering and Management, 2013

Media gallery creation steps (cont.)

3) Rank media item clusters

[Steiner2013_2] Thomas Steiner. A Meteoroid on Steroids: Ranking Media Items Stemming from Multiple Social Networks. In Companion Publication of the IW3C2 WWW 2013 Conference, May 13–17, 2013, Rio de Janeiro, Brazil.

Media gallery creation steps (cont.)

4) Compile media galleries

[Steiner2012_1] T Steiner, R Verborgh, J Gabarro, R Van de Walle. Defining aesthetic principles for automatic media gallery layout for visual and audial event summarization based on social networks. In Quality of Multimedia Experience (QoMEX), 2012 Fourth International Workshop on, 2012

[Steiner2013_3] Thomas Steiner and Christopher Chedeau. To Crop, Or Not to Crop: Compiling Online Media Galleries. In Companion Publication of the IW3C2 WWW 2013 Conference, May 13–17, 2013, Rio de Janeiro, Brazil

Research Question

"Given a complex algorithm like a media item clustering algorithm, can we use Media Fragments URIs together with speech synthesis to describe the algorithm's results rationales?"

● Human raters that evaluate algorithm results are non-experts.

● Can help algorithm developers improve the algorithms.

● Generalization potential for the proof-of-concept.

Media Fragments URIs

A media item tile is a spatial media fragment

xywh.js—Polyfill for spatial media fragments

<img src="kitten.jpg#xywh=100,100,50,50"/>
<img src="kitten.jpg#xywh=pixel:100,100,50,50"/>
<img src="kitten.jpg#xywh=percent:25,25,50,50"/>

Available as open source on GitHub:

https://github.com/tomayac/xywh.js
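
To illustrate what the three `xywh` forms above encode, here is a minimal Python sketch that parses the spatial dimension of a media fragment. It is not the xywh.js implementation; the function name `parse_xywh` is hypothetical, and the sketch assumes integer values and treats a missing unit prefix as pixels.

```python
from urllib.parse import urlsplit, parse_qs


def parse_xywh(uri):
    """Return (unit, x, y, w, h) for a spatial media fragment, or None.

    Hypothetical helper for illustration; assumes integer coordinates.
    """
    fragment = urlsplit(uri).fragment
    values = parse_qs(fragment).get("xywh")
    if not values:
        return None
    value = values[-1]  # if the dimension is repeated, this sketch uses the last one
    unit = "pixel"  # no prefix is treated as pixels in this sketch
    if ":" in value:
        unit, value = value.split(":", 1)
    if unit not in ("pixel", "percent"):
        return None
    parts = value.split(",")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        return None
    x, y, w, h = (int(part) for part in parts)
    return unit, x, y, w, h
```

For the first example above, `parse_xywh("kitten.jpg#xywh=100,100,50,50")` yields the pixel region starting at (100, 100) with a size of 50 by 50.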

Media Fragments URIs (cont.)

Using a tile-wise average-histogram-based media item deduplication algorithm with face detection.

Makes use of Media Fragments URIs [Troncy2012] to make semantic statements about fragments of media items:

@base <http://example.org/> .
@prefix ma: <http://www.w3.org/ns/ma-ont#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix db: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix col: <http://purl.org/colors/rgb/> .

<video> a ma:MediaResource .
<video#t=,10&xywh=0,0,30,40> a ma:MediaFragment ;
    foaf:depicts db:Face .
<video#t=,10&xywh=0,0,10,10> a ma:MediaFragment ;
    dbo:colour col:f00 .

[Troncy2012] R. Troncy, E. Mannens, S. Pfeiffer, D. Van Deursen, M. Hausenblas, P. Jagenstedt, J. Jansen, Y. Lafon, C. Parker, and T. Steiner, “Media Fragments URI 1.0 (basic),” Recommendation, W3C, 2012

Deduplicating media items

Each tile of a media item has its unique URI:

● http://example.org/image.png#xywh=0,0,10,10

We can leverage this fact to make semantic statements about media item similarity, for example, to debug the deduplication algorithm.
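
As a sketch of how such per-tile URIs could be produced, the following Python function enumerates the tiles of a regular grid. The function name `tile_uris`, the grid layout, and the image dimensions are assumptions for illustration; the slides do not state how the tiles are laid out.

```python
def tile_uris(base_uri, width, height, cols, rows):
    """Return one spatial Media Fragments URI per tile, row by row.

    Hypothetical helper: assumes a regular cols x rows grid. Pixels left
    over when the size is not evenly divisible are ignored in this sketch.
    """
    tile_w = width // cols
    tile_h = height // rows
    uris = []
    for row in range(rows):
        for col in range(cols):
            x = col * tile_w
            y = row * tile_h
            uris.append(f"{base_uri}#xywh={x},{y},{tile_w},{tile_h}")
    return uris


uris = tile_uris("http://example.org/image.png", 100, 100, 10, 10)
print(uris[0])  # http://example.org/image.png#xywh=0,0,10,10
```

Under these assumptions, a 100 by 100 pixel image split into a 10 by 10 grid yields 100 tile URIs, the first of which is the one shown above.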

Deduplicating media items (cont.)

Algorithm Matching Conditions

Cond. 1: Out of m tiles of a media item with n tiles (m <= n), the average color of at most tiles_threshold tiles may differ by no more than similarity_threshold from their counterpart tiles.

Cond. 2: The numbers f1 and f2 of detected faces in both media items have to be the same. We note that the algorithm does not recognize faces, but only detects them.

Cond. 3: If the average colors of a tile and its counterpart tile are within the black-and-white tolerance bw_tolerance, these tiles are not considered and tiles_threshold is decreased accordingly.
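
The three conditions can be sketched in Python as follows. This is one possible reading, not the authors' implementation: the function names, the color distance (largest per-channel difference of RGB averages), the near-black/near-white test, and the decision rule (a match requires the number of similar tiles to reach the decreased threshold) are all assumptions made for illustration.

```python
def color_distance(color_a, color_b):
    # Assumption: distance is the largest per-channel difference (RGB, 0-255).
    return max(abs(a - b) for a, b in zip(color_a, color_b))


def near_black_or_white(color, bw_tolerance):
    # Assumption: "within the black-and-white tolerance" means every channel
    # lies within bw_tolerance of 0 (black) or of 255 (white).
    return (all(c <= bw_tolerance for c in color)
            or all(c >= 255 - bw_tolerance for c in color))


def compare_items(tiles_a, tiles_b, faces_a, faces_b,
                  tiles_threshold, similarity_threshold, bw_tolerance):
    """Compare two media items given the average colors of their tiles."""
    similar = 0
    not_considered = 0
    for color_a, color_b in zip(tiles_a, tiles_b):
        # Cond. 3: skip tile pairs that are both nearly black or nearly white.
        if (near_black_or_white(color_a, bw_tolerance)
                and near_black_or_white(color_b, bw_tolerance)):
            not_considered += 1
            continue
        # Cond. 1: count tile pairs whose average colors are close enough.
        if color_distance(color_a, color_b) <= similarity_threshold:
            similar += 1
    # Cond. 3: the threshold is decreased by the number of skipped tiles.
    effective_threshold = tiles_threshold - not_considered
    # Cond. 2: the numbers of detected faces have to be the same.
    faces_match = faces_a == faces_b
    return {
        "similar_tiles": similar,
        "not_considered_tiles": not_considered,
        "effective_tiles_threshold": effective_threshold,
        "match": faces_match and similar >= effective_threshold,
    }
```

Returning the intermediate counts alongside the decision keeps the values available for a later explanation step.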

Deduplicating media items (cont.)

Using a speech synthesizer and speech generation to make spoken statements based on RDF statements about visual similarity of media item tiles.

Based on Speak.js (https://github.com/kripken/speak.js)

Deduplicating media items (cont.)

Human Rater Decisions

Clustering Consent: Two or more media items are clustered by the algorithm and the human rater agrees. The human rater wants to understand why they were clustered.

Clustering Dissent: Two or more media items are clustered by the algorithm, but the human rater thinks that they should not have been clustered. The human rater wants to understand why they were incorrectly clustered.

Non-Clustering Dissent: Two or more media items are not clustered by the algorithm, but the human rater thinks that they should have been clustered. The human rater wants to understand why they were not clustered.
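
The three decisions follow from two yes/no facts: whether the algorithm clustered the media items and whether the rater thinks they belong together. A small Python sketch, with a hypothetical function name, makes the mapping explicit. The fourth combination (not clustered, and the rater agrees) is not named on the slide, so the sketch returns `None` for it.

```python
def rater_decision(algorithm_clustered, rater_expects_cluster):
    """Map the algorithm's result and the rater's opinion to a decision."""
    if algorithm_clustered and rater_expects_cluster:
        return "Clustering Consent"
    if algorithm_clustered and not rater_expects_cluster:
        return "Clustering Dissent"
    if not algorithm_clustered and rater_expects_cluster:
        return "Non-Clustering Dissent"
    # Not clustered and the rater agrees: no category is given for this case.
    return None
```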

Deduplicating media items (cont.)

Low-level debug output

- Similarity threshold: 15 (Cond. 1)
- Tiles threshold: 67 (Cond. 1)
- Similar tiles: 52 (Cond. 1)
- Faces left: 0. Faces right: 0 (Cond. 2)
- BW tolerance: 1 (Cond. 3)
- Not considered tiles: 22 (Cond. 3)
- Effective tiles threshold: 45 (Cond. 3)

This output needs to be lifted to plain human language in order to be understandable by non-domain experts.
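
The values in the debug output are related by simple arithmetic, which the following Python sketch checks. The dictionary keys are hypothetical names for the debug fields, and the final comparison assumes that a match requires the number of similar tiles to reach the effective threshold, which the slides do not state explicitly.

```python
# Values copied from the debug output above; key names are illustrative.
debug = {
    "similarity_threshold": 15,
    "tiles_threshold": 67,
    "similar_tiles": 52,
    "faces_left": 0,
    "faces_right": 0,
    "bw_tolerance": 1,
    "not_considered_tiles": 22,
    "effective_tiles_threshold": 45,
}

# Cond. 3: the tiles threshold is decreased by the tiles not considered.
effective = debug["tiles_threshold"] - debug["not_considered_tiles"]

# Assumed decision rule: enough similar tiles and equal face counts.
reaches_threshold = debug["similar_tiles"] >= effective
faces_match = debug["faces_left"] == debug["faces_right"]

print(effective, reaches_threshold, faces_match)  # 45 True True
```

The computed value of 45 agrees with the effective tiles threshold reported in the debug output.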

Natural Speech Generation

Reiter and Dale [Reiter2000] differentiate three phases of speech generation:

Document planning determines the content and structure of a document.

Microplanning decides which words, syntactic structures, etc. are used to communicate the chosen content and structure.

Realization maps the abstract representations used by microplanning into text.

[Reiter2000] E. Reiter and R. Dale, Building Natural Language Generation Systems, Studies in Natural Language Processing. Cambridge University Press, 2000.

Natural Speech Generation (cont.)

Document Planning: We need to convey the currently selected tiles_threshold and similarity_threshold, the number of detected faces f1 and f2 in each media item, and the number of tiles not considered given the bw_tolerance parameter.

Microplanning: We need to decide on a matching condition aspect of the algorithm that will be first highlighted. Afterwards, we need to elaborate on secondary matching conditions such as detected faces and black-and-white tolerance. The grammatical number (plural or singular) needs to be taken into account. The microplanner needs to decide when exactness (e.g., “99% of all tiles”) and when approximation of calculated values (e.g., “roughly 50%”) better suits the human evaluators’ needs.

Realization: We need to map the abstract representations used by the microplanning step into text.
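
A minimal Python sketch of these three phases, using simple string templates, could look as follows. It is not the authors' generator: the function names, the dictionary keys, the `total_tiles` parameter, the cut-off for switching between exact and approximate percentages, and the sentence wording (apart from the closing sentence, which follows the example on the next slide) are assumptions for illustration.

```python
def count_phrase(count, singular, plural):
    # Microplanning: take the grammatical number into account.
    if count == 0:
        return f"no {plural}"
    return f"{count} {singular if count == 1 else plural}"


def share_phrase(part, whole):
    # Microplanning: exact near the extremes, approximate otherwise.
    # The 5% / 95% cut-offs are an assumption of this sketch.
    percent = 100 * part / whole
    if percent <= 5 or percent >= 95:
        return f"{round(percent)}% of all tiles"
    return f"roughly {round(percent / 10) * 10}% of all tiles"


def explain(debug, total_tiles):
    """Turn low-level debug values into a short explanation."""
    # Document planning: select the facts that need to be conveyed.
    similar = debug["similar_tiles"]
    skipped = debug["not_considered_tiles"]
    left, right = debug["faces_left"], debug["faces_right"]

    # Microplanning: lead with tile similarity, then secondary conditions.
    sentences = []
    share = share_phrase(similar, total_tiles)
    sentences.append(f"{share[0].upper()}{share[1:]} are similar in both media items.")
    if left == right:
        sentences.append(f"Both media items contain {count_phrase(left, 'face', 'faces')}.")
    else:
        sentences.append(
            f"The left media item contains {count_phrase(left, 'face', 'faces')}, "
            f"but the right one contains {count_phrase(right, 'face', 'faces')}."
        )
    if skipped > 0:
        verb = "tile was" if skipped == 1 else "tiles were"
        pronoun = "it is" if skipped == 1 else "they are"
        sentences.append(
            f"However, {skipped} {verb} not considered, as {pronoun} either too "
            "bright or too dark, which is a common source of clustering issues."
        )

    # Realization: map the planned messages into running text.
    return " ".join(sentences)
```

The resulting string could then be handed to a speech synthesizer; that step is left out here.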

Natural Speech Generation (cont.)

“However, 22 tiles were not considered, as they are either too bright or too dark, which is a common source of clustering issues.”