SUMMA H2020–688139 D2.1 Data Management Plan

Scalable Understanding of Multilingual MediA (SUMMA)
http://www.summa-project.eu
H2020 Research and Innovation Action Number: 688139

D2.1 – Data Management Plan

Nature: Report
Work Package: WP2
Due Date: 31/07/2016
Submission Date: 29/07/2016
Main authors: Peggy van der Kreeft (DW), David Sheppey (BBC)
Co-authors: Andreas Giefer (DW), Susanne Weber (BBC), Guntis Barzdins (LETA)
Reviewers: Hervé Bourlard (IDIAP), Steve Renals (UEDIN)
Keywords: data, social media, monitoring, metadata, access

Version Control:
v0.1 Draft 25/07/2016
v0.2 Draft 26/07/2016
v1.0 Final 27/07/2016
v1.1 Final 29/07/2016



Contents

1 Introduction
2 Types of Data Collected
  2.1 Overview
  2.2 Data Types
  2.3 Requirements for Monitoring Data
  2.4 Requirements for Data for Specific Technologies
    2.4.1 Transcribed Data
    2.4.2 Translated Data
    2.4.3 Annotated Data
  2.5 Provision of Monitoring Data
    2.5.1 Customised Batches and API Access
    2.5.2 Other Sources
  2.6 Provision of Data for Specific Technologies
3 Types of Data Generated
4 Data and Metadata Standards
  4.1 Metadata
  4.2 Dataset Identifiers and Descriptions
  4.3 Video Material
  4.4 Audio Material
  4.5 Text Articles Material
  4.6 Social Media Data
  4.7 RSS Feeds, Podcasting
5 Data Storage, Preservation and Re-Use
6 Policies for Data Access and Sharing
  6.1 Different Levels for Access and Sharing
  6.2 Planned Measurements for the Protection of Personal Data
7 Conclusion


List of Figures

1 BBC textual content workflow
2 BBC AV workflow
3 DW Multilingual Transcription Spanish-German-English
4 DW JSON Teaser Example
5 Some DW Twitter Feeds
6 DW RSS Feeds


1 Introduction

SUMMA (Scalable Understanding of Multilingual Media) is a three-year H2020 project, running from February 2016 to January 2019, under the Research and Innovation Action grant agreement number 688139. SUMMA participates in the H2020 Pilot on Open Research Data. This Data Management Plan (DMP) provides an analysis of the main elements of the data management policy that will be used by the SUMMA consortium with regard to all the datasets collected for or generated by the project. This deliverable is the initial version of the report; two subsequent iterations will elaborate on the issues covered. The DMP will address issues such as collection of data, dataset identifiers and descriptions, standards and metadata used in the project, data sharing, property rights and privacy protection, and long-term preservation and re-use, complying with national and EU legislation. This initial version covers primarily the areas of data requirements and collection, types of datasets, metadata standards and formats, direct access to broadcast data through APIs, intended use of broadcast data provided, initial output formats, and measures to protect property rights and privacy. Please note that the technical specification and proposed solution for Work Package 2, as well as the diagrams included in this document, have not yet been finalised at the time of writing. They are work in progress and will be modified throughout the course of Year 1.

The following structure will be adhered to in this DMP:

• Types of data that will be collected

• Types of data that will be generated

• Data and metadata standards

• Data storage, preservation and re-use

• Policies for data access and sharing


2 Types of Data Collected

2.1 Overview

SUMMA develops an open-source platform for dealing with large volumes of data across many languages and different media types. It has a range of technologies that will be implemented, including automated speech recognition, machine translation, topic clustering, summarisation and semantic analysis.

Data is being collected in the nine SUMMA languages: English, German, Spanish, Portuguese, Arabic, Persian, Russian, Ukrainian, Latvian.

The project includes three data providers: BBC and Deutsche Welle (DW), world broadcasters with a wide range of languages, act primarily as user partners and content providers in SUMMA; LETA, the Latvian Information Agency, has the double role of integrator and content provider. In addition, the Qatar Computing Research Institute (QCRI) also provides content, in particular for Arabic.

Three use cases implement the applications and put the data to use:

• External monitoring: intelligent tools for global news monitoring of up to 200 broadcast channels
• Internal monitoring: cross-lingual exchange, enabling awareness and re-use of data across languages
• Data journalism: a use case for year three, in which measurable data is extracted and translated into images.

BBC targets the external monitoring use case, with a complicated workflow simultaneously monitoring up to 200 external news channels with streaming content in four SUMMA languages (Russian, Arabic, Persian, Ukrainian). The tool combines monitoring, transcription and translation, as well as summarisation and clustering of data.

DW focuses on the internal monitoring use case, involving 8 SUMMA languages (English, German, Spanish, Portuguese, Russian, Arabic, Persian, Ukrainian). In this use case, DW content (and content from other news providers) – primarily on-demand video, audio and text articles, but also streaming content – published in the above languages is continuously monitored, transcribed, translated, compared, clustered and summarised. The result is a tool which keeps editors and journalists up to date on the trending news stories and on what has been published in those languages, allowing them to obtain either a full translation into English or a summary of the stories. It thus improves monitoring capacity and quality, and reduces workload through automated translation support and summarisation.

This is a broadcaster-focused project, with involvement of world broadcasters with coverage of up to 30 languages, and thus with a key role for data.

“Collection of data” in this report refers to the acquisition of data by the consortium, primarily through data provision by the participating SUMMA broadcasters.

Coverage of the different languages by the user/content partners (BBC, DW, LETA) is summarised in the following table:


Language     Provider
English      BBC, DW, LETA
German       DW
Spanish      DW
Portuguese   DW
Arabic       BBC, DW
Russian      BBC, DW, LETA
Ukrainian    BBC, DW
Persian      BBC, DW
Latvian      LETA

2.2 Data Types

Data for SUMMA is being collected at several levels:

• By project target use
  – Ingestion data
  – Training data
  – Test data

• By targeted technology
  – Data monitoring
  – Automated transcription
  – Automated translation
  – Automated summarisation
  – Sentiment analysis
  – Keyword annotation

• By type of data
  – Metadata
  – Video material
  – Audio material
  – Text articles
  – Social media
  – Ontologies

• By delivery type
  – Streaming data
  – Batch data

• By language
  – All nine SUMMA languages

• By content provider/user partner
  – BBC
  – DW
  – LETA
  – Others, including the Qatar Computing Research Institute

For practical purposes, data requirements and provision are divided into two main groups based on targeted technology: on the one hand, regular content for monitoring, and on the other hand, specific data for transcription, translation, summarisation and sentiment analysis. The other types and levels are described within these two broad categories.

2.3 Requirements for Monitoring Data

As SUMMA deals primarily with data monitoring, such data is essential for prototype development, assessment, user validation and scalability testing. Different types of data are involved, as well as different types of delivery. These aspects and challenges are detailed further in this report.

• Different types of data
  – Metadata
  – Video material
  – Audio material
  – Social media

• Different types of delivery
  – Streaming data
  – Batch data

2.4 Requirements for Data for Specific Technologies

The participating broadcasters are directly supporting the technology partners by providing training and/or test data for the different components and technologies, whenever possible. The provision depends on the availability of such data, and on the required manpower for preparation and adaptation.

Requirement specifications for such data have been gathered within WP2, detailing what type of data is needed, and how much. It is in the interest of the technology developers and the user partners alike to arrive at powerful tools providing high-quality output, and all participants realise that training and test data is needed to make this happen.

2.4.1 Transcribed Data

Ideally, for Automatic Speech Recognition (ASR), 200 hours of transcribed data is collected per SUMMA language. This data is being complemented by “found” data, usually available to the community (e.g., GlobalPhone, TED lectures), either to increase the amount (and variety) of training data or to develop first baseline systems before all necessary transcribed data from our broadcast partners is made available.

A minimum of 100 hours of such data is required per language to have a valuable training dataset. All SUMMA languages will be covered, but the initial focus for transcription is on German, Arabic, and Persian. Different levels of transcription are handled, i.e. verbatim transcription with all details including pauses and hesitations; correct transcription of spoken text; and finally, raw, unedited transcription as it comes from the content provider. Data with timecodes is preferred, but data without timecodes is also useful, as they can be added automatically. Transcripts are to be stored in the SUMMA Box repository together with their related media files. The source type is audio or video with audio. The requested machine-readable format for the transcripts is .txt (UTF-8). The individual format for each provider is being regularly cleared with the technical partners. Finally, subtitled data (i.e., loose transcription) will also be made available whenever it becomes available, to further improve the multilingual ASR systems (exploiting semi-supervised training approaches).

All available data, to be used/shared by all the partners, is stored in a specific SUMMA Box repository (https://www.box.com).
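As an illustration of this pairing, a short script could check that every transcript in a repository folder has a matching media file and decodes as UTF-8. This is a sketch only; the folder layout, file naming and media suffixes are assumptions, not the agreed SUMMA Box structure:

```python
from pathlib import Path

# Assumed media suffixes; the actual repository may use others.
MEDIA_SUFFIXES = (".mp4", ".mp3", ".wav")

def pair_transcripts(folder: Path) -> dict:
    """Map each UTF-8 .txt transcript to a media file with the same stem.

    Returns {transcript_path: media_path_or_None}. Transcripts that are
    not valid UTF-8 raise UnicodeDecodeError, so encoding problems
    surface early.
    """
    pairs = {}
    for txt in sorted(folder.glob("*.txt")):
        txt.read_text(encoding="utf-8")  # fails loudly on wrong encoding
        media = next(
            (txt.with_suffix(s) for s in MEDIA_SUFFIXES
             if txt.with_suffix(s).exists()),
            None,
        )
        pairs[txt] = media
    return pairs
```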

2.4.2 Translated Data

Ideally, for Machine Translation (MT), 10,000,000 parallel sentences are needed per language combination for a valuable training set.

In all the SUMMA scenarios, the target language will always be English, so SUMMA deals with 8 language combinations (German-English, Spanish-English, Portuguese-English, Russian-English, Arabic-English, Farsi-English, Ukrainian-English, Latvian-English).

Low-resourced languages in particular are sought after for the creation of a training set. For example, there is already a lot of material from German, Spanish, Arabic and Russian into English, but not from Portuguese, Ukrainian, Latvian, and Farsi. For the well-resourced languages, a smaller in-domain test set may suffice, but for the lower-resourced languages a larger training set needs to be built. Different levels of translation quality will be provided, i.e., fully parallel translations, semi-parallel translations, and similar texts (with similar content, covering the same topic, without being real translations). Translation sets are to be stored in a specific SUMMA Box repository. The source type is text (from broadcast articles or transcripts). The target format is .txt (UTF-8).
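The distinction between fully parallel pairs and merely similar texts matters for corpus preparation. As a minimal, hypothetical sketch of a first clean-up step for a parallel set (real MT data preparation also involves length-ratio filtering, deduplication and tokenisation, which are out of scope here):

```python
def clean_parallel(src_lines, tgt_lines):
    """Keep only aligned, non-empty sentence pairs from two line lists.

    Lines are paired by position; pairs where either side is blank
    after stripping whitespace are dropped.
    """
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs
```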

2.4.3 Annotated Data

Further levels of annotation are also requested for topic clustering, summarisation and sentiment analysis. This includes regular broadcast material, but also in particular social media. The focus for the initial stages is German, English and Arabic. One form of annotation requested is highlights; for this, preferably a limited number of manually written summaries are provided. These will be used for topic detection and story clustering. Annotations for sentiment analysis are also needed for social media. Another form of required annotation is the creation of a set of 500 documents with IPTC classification (limited to the 4 highest levels of the IPTC classification). The focus in the first stages will be on German, English and Arabic.
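The restriction to the 4 highest levels of the IPTC classification can be pictured as truncating a hierarchical label path. The slash-separated notation below is purely illustrative; it is not the IPTC wire format:

```python
def truncate_label(path: str, max_levels: int = 4) -> str:
    """Cut a hierarchical subject label down to its top levels.

    `path` uses an assumed slash-separated notation for illustration;
    actual IPTC Media Topic codes are expressed differently.
    """
    return "/".join(path.split("/")[:max_levels])
```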

2.5 Provision of Monitoring Data

Described below is the envisaged process for content provision by the broadcasters for the purpose of data monitoring. This is based on the available infrastructure, content requirements and planned prototyping. As the project is still in its early stages, it must be understood that changes in the envisaged processes and in actual implementation are still possible.

As will be clear from the sections below, the descriptions for BBC data management and information flow within the project are more elaborate than those for Deutsche Welle. The reason is that DW focuses on the internal monitoring use case, which uses internal channels fully controlled by Deutsche Welle. The BBC's primary use case, on the other hand, is external monitoring: BBCM collects and processes content – primarily streaming content – from other broadcasters, monitoring up to 200 channels simultaneously in live operation. It is therefore by nature more complicated than the internal use case and requires more time to implement. Much of Deutsche Welle's content provision from internal sources through APIs was implemented in the first months of the project and was therefore available for early use within the SUMMA platform. This process will be further enhanced and streaming content will be added. BBCM content is being implemented according to the provision plan detailed below.

2.5.1 Customised Batches and API Access

Thematic Data Dumps

Thematic data batches can also be requested by the consortium partners throughout the project duration. These datasets can be used to test a topical range, or certain technologies or components. They will be focused on a specific theme or topic, and provide a batch of “super stories”. A specific data collection on one big news event will regularly be agreed upon to ensure a consistent thematic data collection.

In the first six months, three sets of customised batches have been supplied by DW and BBC, supplemented by data from LETA and QCRI:

• At the start of the project: an initial set of sample records for each language covered
• 15 March 2016: a specific data collection on the topic of the 5th anniversary of the Syrian uprising – 24-hour coverage of the news
• 12 July 2016: a specific data collection on the topic of the 10th anniversary of the Second Lebanon War – 24-hour coverage of the news.

General Data Dumps

In addition, general, non-thematic data dumps are foreseen as additional training and test datasets.

In the (ongoing) BBC Phase One data collection, the initial general data dump will contain A/V material provided by BBC Monitoring in Arabic, Russian, Persian and Ukrainian, consisting of both audio and video material. There will also be material in Persian and Arabic from the BBC's Persian and Arabic services respectively. The media data dump will be a collection of media files placed in a SUMMA repository on Box (https://www.box.com). It will be made up of data collected from all of the monitored sources over a 24-hour period, so should be around 200 hours in total. It will consist of a maximum of 8 video streams covering the full period of the data dump time window.

For the initial data dump, the BBC is providing material in textual form from various sources:

• Twitter (English, Arabic, Russian, Persian and Ukrainian)
• Facebook (English, Arabic, Russian, Persian and Ukrainian)
• Blogs (English, Arabic, Russian, Persian and Ukrainian)
• Webpages (English, Arabic, Russian, Persian and Ukrainian)

Webpages are also planned to be scraped and their content extracted where semantic tagging is used within the page; otherwise the entire document may need to be provided.


Figure 1: BBC textual content workflow

The following are considered to be out of scope for the data dump, but may be considered for the live system:

• Facebook A/V content
• Twitter A/V content
• YouTube
• Instagram
• Podcasts

In BBC Phase Two data collection, an on-demand media dump (through an API) is planned for the end of 2016. The BBC will store AV files within the BBC's secure area, where they may be accessed by the consortium partners for streaming via HLS. A data dump can be created “on the fly” through the use of a RESTful API. In this phase, the BBC will be creating very small chunks of media (possibly around 10 seconds) in preparation for the implementation of the live system. As these files are too small to be used conveniently as part of a data dump, an API will be provided to consolidate these small chunks into larger, more manageable files. The format of these files will be similar to that described above for the initial data dump.

Deutsche Welle has its API access in place and has made the content of its mobile site (m.dw.com) available to the consortium, together with instructions. It contributes material in English, German, Arabic, Spanish, Russian, Persian, Portuguese and Ukrainian.

LETA provides direct access to content in Latvian, and is the only content provider for that language within the consortium, although it may also deliver material in Russian and English.

QCRI has access to external content in Arabic, and has started providing such material upon request.

Content is supplied via each partner's respective APIs. Using API access ensures that data can be gathered freely for the platform and its separate components without the need to request it from the content providers. Instructions for API access and use will be provided by each content provider.
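The mechanics of such API-based gathering might look like the following paginated iteration. Here `fetch_page` is a stand-in for a provider-specific HTTP call, since each provider documents its own endpoints and paging scheme:

```python
def fetch_all(fetch_page, start=0, page_size=50):
    """Iterate over every item of a paginated content API.

    `fetch_page(offset, limit)` is an assumed callable standing in for
    a provider-specific HTTP request that returns a list of items; an
    empty list ends the iteration.
    """
    offset = start
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        offset += len(page)
```

A component can then consume items lazily, e.g. `for item in fetch_all(my_provider_call): ...`, without knowing the paging details.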

Data dumps may be created on demand by the partners at any time. All recorded media will be available for download from the media store until the point of purging – which is yet to be determined.

Streaming Content

BBC focuses on monitoring external sources, so streaming is the primary targeted mode. The BBC “as-live” media system builds on the on-demand system from Phase 2 and incorporates a mechanism for creating JSON sidecar files. These will be generated for each chunk of media and pushed into the SUMMA API provided by LETA. They will provide links to the media residing on the AV store within the BBC's AV file storage system.

The BBC will provide an interface whereby text sources can be selected and the “data dump” and “as-live” functionality can be switched on and off. Note that both modes can run at the same time if desired, so that a data dump can be created while also feeding the SUMMA system. Social media and other text-based content will be ‘scraped’ as far as possible to reduce the amount of noise present in the data, although this is a difficult process and it may prove impossible to completely remove unwanted data in all cases.

Figure 2: BBC AV workflow

Even though on-demand content is more important for DW's primary use case, DW will also include its streaming content. The process for this is being set up.

2.5.2 Other Sources

The participating broadcasters may add RSS feeds and podcasting to the channels to be provided. DW has granted SUMMA access to all of its RSS feeds and podcasting channels.


2.6 Provision of Data for Specific Technologies

DW and BBC have started collecting data to train and test the specific modules and technologies based on the requirements from the technical partners.

The provision will happen gradually and the collection will be built up, depending on the availability of the sources and the feasibility of processing and preparing the data. Scripts have been or are being written to automate this process.

Deutsche Welle will provide items with teasers which can be used to train the summarisation tool.

Deutsche Welle is in the process of identifying and locating multilingual datasets with transcriptions, thus combining the effort of providing translation material and transcriptions. These sets are, however, only available for well-resourced languages (English, German, Spanish, and some Arabic). A process to convert the original transcript into a machine-readable format has been set up and is currently being optimised while processing a first set of transcriptions. A script has been written to automate the process.

Figure 3: DW Multilingual Transcription Spanish-German-English


3 Types of Data Generated

“Generation of data” in this report refers to the production of data by the SUMMA platform, or any of its components or technologies. This can range from plain audio transcripts of (multilingual) broadcast material (ready for indexing and direct search), to translated broadcast text (applied to multilingual speech-to-text/ASR system outputs, from any of the SUMMA languages to English as the target language), to automated annotations or graphical enhancements.

We distinguish between six main categories of data that will be generated during the project:

• Content data generated during media monitoring. This is typical broadcast data that remains copyright-protected.
• Specific output formats following a particular step in the SUMMA processing chain. This includes transcriptions, translations, summaries, annotations, graphics and statistical data. This usually also includes broadcast content.
• Software, models, algorithms, lexicons and ontologies, annotations, etc.
• Personalised data generated during field testing and prototype testing
• Social media data: Twitter, Facebook, YouTube, etc.
• Academic-type research publications

Of course, all this generated data will directly exploit and enrich the data collected (Section 2).


4 Data and Metadata Standards

4.1 Metadata

Available metadata at the content provider side is considered for inclusion in SUMMA. Metadata formats differ according to content provider.

The original broadcaster metadata format is provided to the integrator, and overall preferences and settings are discussed. Within SUMMA, it is agreed that the JSON format is preferred; the BBC metadata uses Dublin Core, whereas DW metadata does not.

LETA integrates the different incoming metadata formats. This is a realistic scenario, as such platforms need to take into account the fact that different content providers use different schemas, so mapping and ingestion should be made easy and allow for maximum automation after initial setting of mapping schemas.

4.2 Dataset Identifiers and Descriptions

Within the SUMMA project, datasets are divided into two main groups: (1) regular content for monitoring, and (2) specific data for training and testing transcription, translation, summarisation, and sentiment analysis NLP tools.

In Group (1), “regular content” data files are used within the SUMMA Prototype system and are always accompanied by JSON sidecar files providing a unique content identifier and a complete description as part of metadata fields such as title, source, author and datetime. The text-based “regular content” is stored directly within the JSON sidecar files, while the A/V content is linked from the JSON sidecar files by its URL in the A/V storage system. Special treatment is applied to the ‘as-live’ A/V content delivered over an HLS stream – for storage purposes each such stream is artificially split into segments, with each segment having its own sidecar JSON file. The duration of a segment can vary from 10 seconds to 1 hour. For streams, the JSON sidecar file identifier is extended with GMT <date> / <time> fields uniquely identifying the stream segment in time. The actual structure and content of the JSON sidecar files varies by provider (BBC, DW, LETA, QCRI) and will be described in the Swagger editor (http://editor.swagger.io) as part of the RESTful API documentation. A sample sidecar JSON file is shown below:


{
  "@context": {
    "dc": "http://purl.org/dc/elements/1.1/",
    "summa": "http://www.summa-project.eu/vocab#"
  },
  "@id": "",
  "@type": "summa:article",
  "summa:text": "Claire Brisebois Starnes enlisted in the Signal Corps of the US Army in 1963...",
  "summa:date": "2016-07-06T11:28:28.8817057Z",
  "copyright": "US Army",
  "dc:date": "None",
  "dc:identifier": "http://www.bbc.co.uk/news/in-pictures-36574698",
  "dc:language": "en",
  "dc:publisher": "BBC News - Home",
  "dc:source": "http://www.bbc.co.uk/news/",
  "dc:title": "'Tuning out emotionally'",
  "dc:type": "Text"
}
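For stream segments, the identifier extension with GMT date/time fields could be built along these lines. The separator and field order here are assumptions; the actual scheme will be defined in the Swagger documentation:

```python
from datetime import datetime, timezone

def segment_id(stream_id: str, start: datetime) -> str:
    """Extend a stream identifier with GMT date/time fields so that
    each stream segment is uniquely addressable in time.

    The path-like "<stream>/<date>/<time>" layout is an illustrative
    assumption, not the agreed SUMMA identifier format.
    """
    start = start.astimezone(timezone.utc)  # normalise to GMT/UTC
    return f"{stream_id}/{start:%Y-%m-%d}/{start:%H-%M-%S}"
```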

In Group (2), “specific data” for training and testing the NLP tools within the SUMMA project is generated as part of data dumps and stored in the shared Box.com repository. The file path within the SUMMA Box repository serves as the dataset identifier. Dataset descriptions are provided as readme files within the folders containing the actual datasets. A sample Box repository dataset identifier is given here: /SUMMAData/TrainingDataOneNewsDay/Spanish/OnlineText/DW SyriaDWCOM-SP 1939 20160315 a-19118941.json

The datasets generated in the SUMMA Platform NLP processing chain, such as transcripts, translations and summaries, are stored as additional fields within the JSON sidecar files; therefore their storage, identification and descriptions are identical to the Group (1) dataset treatment described above.
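Storing an NLP output as an additional sidecar field might be sketched as follows. The field names are assumptions modelled on the sample record above, not an agreed schema:

```python
def attach_outputs(sidecar: dict, **outputs) -> dict:
    """Return a copy of a sidecar record with NLP outputs added as
    extra "summa:"-prefixed fields, leaving the original untouched.

    The "summa:" prefix mirrors the sample sidecar record; the exact
    field names for transcripts, translations etc. are assumptions.
    """
    enriched = dict(sidecar)
    for name, value in outputs.items():
        enriched[f"summa:{name}"] = value
    return enriched
```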

4.3 Video Material

Although video material itself will not be processed in the context of SUMMA, it is important to store it, with proper links to the audio and other metadata material, which will be used to indirectly index and search the relevant videos.

Video data will include metadata and at least the link to the video file, or the media file itself (especially if there is a chance the file will be taken offline in the foreseeable future). Video material will include streaming data as well as on-demand data. The preferred video file format is mp4 (DW has mp4; BBC has MPEG2). The metadata format may differ according to content provider. A 2-minute latency is allowed to start processing streaming content.

Deutsche Welle provides on-demand video content from DW internal sources in the first instance (mp4), and the provision of such content is already in place via API access. Procedures for supplying streaming content from DW sources (HLS stream) are being established.

The BBC provides primarily streaming content from external sources. The AV sources will originate from BBC Monitoring and will be selectable by the BBC Monitoring team. It will initially consist


of a maximum of 8 video streams, but this may be scaled upwards as the project progresses. The preview videos supplied by the system are intended to be reconstituted into an HLS stream, so that they may be used as part of the interface being designed and developed in the work packages on Requirements (WP1) and Integration (WP6). It should be noted that it may become necessary to change the chunk sizes, bitrates or codecs to accommodate this at a later date.

When the system is running in “as-live” mode, BBC JSON data is pushed into the SUMMA system via the RESTful API supplied by LETA. The exact JSON format is yet to be agreed with LETA and will be defined in Swagger. All JSON data will be encoded using UTF-8.
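The UTF-8 requirement can be illustrated with a minimal sketch. The payload fields below are hypothetical, since the actual JSON format is still to be agreed with LETA; only the encoding rule is fixed.

```python
import json

# Hypothetical payload: the actual JSON format is still to be agreed with
# LETA and will be defined in Swagger; only the UTF-8 rule is fixed.
payload = {"source": "BBC Monitoring", "dc:title": "Über uns", "mode": "as-live"}

# ensure_ascii=False keeps non-ASCII characters as real UTF-8 bytes in the
# request body instead of \uXXXX escapes.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# The encoded bytes round-trip losslessly back to the original payload.
assert json.loads(body.decode("utf-8")) == payload
```

Encoding on the producer side and decoding as UTF-8 on the consumer side guarantees that multilingual titles and text survive the API exchange intact.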

4.4 Audio Material

The audio is separated from the video and will be present in the media store, alongside the video, as separate MPEG-2 .TS files in AAC-HE 64 kb/s format, chunked into 10-second segments. This is similar to the data dump, except that the chunk size is much smaller, allowing the files to be reconstituted into an HLS stream for language processing by the academic partners. As well as the audio files, there will also be preview video files encoded as MPEG-2 .TS files, with audio in AAC-HE 64 kb/s format and video as H.264 VBR at bitrates between approximately 1 Mb/s and 12 Mb/s, depending on the original source. These will also be chunked at 10-second intervals to time-align with the audio files described above.
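As an illustration of the reconstitution step, the sketch below assembles 10-second chunk names into a minimal HLS media playlist. The segment file names and the playlist builder are illustrative assumptions, not the actual BBC naming scheme or tooling.

```python
# Illustrative sketch: reconstituting 10-second MPEG-2 TS chunks into a
# minimal HLS media playlist. Segment names are hypothetical.
def build_hls_playlist(segments, target_duration=10):
    """Build an HLS media playlist (m3u8 text) from an ordered list of TS chunks."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target_duration}",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for name in segments:
        lines.append(f"#EXTINF:{float(target_duration):.1f},")  # per-segment duration
        lines.append(name)
    lines.append("#EXT-X-ENDLIST")  # mark the playlist as complete (on-demand)
    return "\n".join(lines)

playlist = build_hls_playlist(["audio_000.ts", "audio_001.ts", "audio_002.ts"])
print(playlist)
```

Because every chunk is exactly 10 seconds, the audio and preview-video playlists stay time-aligned segment for segment.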

Besides the transcription mentioned earlier, audio data will also include, whenever possible, additional metadata (topic, speaker, story, genre, etc.), together with the link to the audio file, or the media file itself (especially if there is a chance the file will be taken offline in the foreseeable future). Audio material will include streaming data as well as on-demand data. The preferred audio file format is MP3. The metadata format may differ according to the content provider (and not all of them will be Dublin Core compatible).

All A/V media files supplied by the “as-live” system are kept in storage provided by the BBC, and these media files are referenced in the JSON supplied to the LETA API.

4.5 Text Articles Material

Text data should include metadata and the full text. The format may differ according to the content provider. The format agreed upon is JSON. Data will primarily be provided via direct API access.

Deutsche Welle provides two levels of JSON files for articles: a teaser format providing a list overview of items, and a detail format with a full-text version containing all details of the assets.
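A minimal sketch of how the two levels might relate is given below; the field names and the resolution helper are hypothetical, not the actual DW API schema.

```python
from dataclasses import dataclass

# Illustrative model of the two DW article levels; field names and the
# resolution helper are assumptions, not the actual DW API schema.
@dataclass
class Teaser:
    item_id: str
    title: str

@dataclass
class Detail:
    item_id: str
    title: str
    full_text: str

def resolve_details(teasers, details_by_id):
    """Map a teaser-list overview to the corresponding full-text detail records."""
    return [details_by_id[t.item_id] for t in teasers if t.item_id in details_by_id]

teasers = [Teaser("a-0000001", "Example headline")]
details = {"a-0000001": Detail("a-0000001", "Example headline", "Full article text ...")}
resolved = resolve_details(teasers, details)
```

The teaser level is cheap to scan for relevant items; only the selected items then need the heavier full-text detail fetch.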


Figure 4: DW JSON Teaser Example

4.6 Social Media Data

Social media data (Twitter, Facebook, YouTube, and possibly other sources) will be provided with some annotation. The broadcast partners will provide identifiers or links to social media posts relevant to their organisation.

Figure 5: Some DW Twitter Feeds


4.7 RSS Feeds, Podcasting

RSS feeds and podcasts are provided when available and appropriate. The metadata format may differ according to content provider.

DW has provided access to all its RSS and podcasting feeds via API.

Figure 6: DW RSS Feeds


5 Data Storage, Preservation and Re-Use

Data collected by the content providers is currently stored in a SUMMA repository on Box (https://www.box.com), managed by the BBC and used by all consortium partners. It holds all selected and downloaded broadcast content for experimenting, testing, training, trialling and demos within SUMMA. That repository was selected because it meets internal BBC data storage security requirements, and it was agreed upon by the rest of the consortium.

The platform and technical partners retrieve the data from the Box repository (or from the broadcaster directly, via API), after which it is stored in the SUMMA platform on LETA servers.

Technology partners also retrieve selected data, either from the Box repository or through API access, and store it on their organisation’s servers for specific training and testing of models and technologies, e.g. for MT or ASR.

Preservation options after the project will be discussed in the next iteration of this report, when the final form of the outputs becomes clearer. However, it is envisaged that the platform will be usable, and customisable, after the project ends, as it is especially geared towards Use Case 1 and Use Case 2. The data produced during the course of the project will be available in accordance with the Consortium Agreement and licence agreements.

Re-use will be ensured as much as possible and will primarily apply to software, lexicons, algorithms, dashboards, etc.


6 Policies for Data Access and Sharing

6.1 Different Levels for Access and Sharing

There are different categories of data that will be collected or generated during the project, with different levels and conditions for access and sharing:

• Original broadcast data is copyright-protected and, as stipulated in the Consortium Agreement, is provided only for use by the consortium partners for the duration of the project. It can therefore not be shared outside the consortium or after the project. Some demo material will be selected for public viewing in agreement with the broadcasters.

• Data generated during media monitoring. This data is typically owned by the broadcaster; therefore, the consortium does not have the rights to share it as open research data. However, negotiations will be opened with broadcasters with the aim of releasing data sets for specific research use, as has been done in the past by the BBC for the MediaEval and MGB Challenge evaluation campaigns.

• Specific output formats following a particular step in the SUMMA processing chain. This includes transcriptions, translations, summaries, annotations, graphics and statistical data. It usually also includes broadcast content.

• Software, models, algorithms, lexicons and ontologies, annotations, etc. These will be made available as open source as much as possible. We shall endeavour to publish and make open access derived data, such as phrase dictionaries for MT, when such publication is not in breach of copyright. In cases where data is available but copyright restrictions do not allow us to publish it, we shall release tools to reconstruct it (cf. Kaldi recipes and the WikiLinks project).

• Personalised user-specific data generated during field testing and prototype testing. This is data relating to people’s use of SUMMA prototype systems. As this is very specific to the systems being evaluated, and also personal to the evaluation subjects, we do not anticipate releasing this data.

• Social media data. Content from Twitter, Facebook, YouTube, etc. is very time-sensitive, quickly outdated and strongly bound up with privacy issues. However, social media is an essential part of the content used and analysed. In order to ensure the protection of private individuals’ personal data (contained in tweets or Facebook posts), the data providers (User Partners) will not transfer any third-party social media content to other consortium partners. Of course, social media content owned by the content provider (e.g. @dwnews) or the consortium (@SummaEu) can be used if the account holder agrees to this. User Partners will take personal data protection legislation into account while developing the platform for the processing of social media. The potential value of the social media output, anonymisation/pseudonymisation, and the retention of social media user-related data will be discussed further once a more advanced platform and its output formats take shape. Following recommendations from the Ethics Committee, special care shall be taken in handling social media content, ensuring privacy protection. The SUMMA content providers will differentiate between categories of social media posters, e.g. public figures (politicians), political organisations (political parties, NGOs) and private citizens.


• Academic-type research publications. Academic publications will be made available as “green” open access via institutional repositories and the OpenAIRE system.

6.2 Planned Measures for the Protection of Personal Data

This section deals with ethical issues addressed by the Ethics Advisory Board. The consortium will specify in its data management procedures how it intends to identify where personal data is involved and how such personal data is protected.

As the SUMMA architecture and information flow are being established, security and privacy issues are being taken into consideration, and procedures will be set before implementation. All consortium partners dealing with data, including its provision, use, processing and storage, will look into the data protection regulations for their organisation and country.

• The consortium will decide on the process for dealing with and retaining images in which people could potentially be identified. This relates to social media, but also to regular audiovisual broadcast media in which individuals could be identified.

• For the protection of personal data in social media, names will be replaced by hashtags.

• Pseudonymisation will be considered as an alternative to anonymisation, in particular for clustering. Special attention will be given to the control of the key.

• The consortium will compile a Data Protection Impact Plan, which includes a privacy impact assessment. The UK-based partners, in their role as data controllers, will for instance seek advice from the Information Commissioner’s Office (ICO). All consortium partners are responsible for seeking advice from their respective local data protection authorities (Germany, Latvia, Portugal, Switzerland). It is understood that the consortium as a whole acts as joint data controller in this project.

• Security procedures will be established for each partner dealing with data. For instance, all communication by any third parties with the BBC Monitoring web server and storage will be secured using SSL via the HTTPS protocol and will require a security certificate. The BBC will issue these certificates to each of the partners. This is to meet the needs of BBC Information Security and UK data protection regulations.

• The consortium is striving for transparency, to make clear the purpose of including social media content and other content containing personal data in the SUMMA project. It will be clearly stated which partner is responsible for the relevant data. The details will be established in the Data Protection Impact Plan.

• The consortium is looking into access procedures and restrictions during dissemination, which is targeted at approaching potential new users. A solution must be found that ensures privacy protection while still allowing the SUMMA platform and some of its data to be demonstrated.

• The Data Protection Impact Plan will also address the duration of storing personal data after its delivery to the platform by the content providers, i.e. storage within the SUMMA platform. This will contribute to the requirement of Privacy by Design.


• The next iteration of the Data Management Plan, due in M18, will include the Data Protection Impact Plan and describe all aspects involved in detail.
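The name-replacement and pseudonymisation measures listed above could be sketched as keyed hashing: the same name always yields the same hashtag-style token (so clustering still works), while reversing the mapping requires the secret key, whose control the plan highlights. The key and names below are hypothetical.

```python
import hashlib
import hmac

# Sketch of keyed pseudonymisation: a stable token per name, reversible
# only with the secret key. Key and names here are hypothetical.
SECRET_KEY = b"replace-with-a-securely-managed-key"

def pseudonymise(name: str) -> str:
    """Replace a personal name with a stable hashtag-style token."""
    digest = hmac.new(SECRET_KEY, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "#user_" + digest[:12]

tag_a = pseudonymise("Jane Doe")
tag_b = pseudonymise("Jane Doe")
assert tag_a == tag_b        # stable: same input, same token
assert "Jane" not in tag_a   # the original name does not appear in the token
```

Unlike plain anonymisation, this keeps posts by the same person linkable for clustering; destroying or rotating the key later removes even that linkability.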

7 Conclusion

The present deliverable D2.1, the Data Management Plan (DMP), sets out the basis for the SUMMA project’s data management strategy and planning, as extensively discussed and agreed by all partners.

D2.1 addresses most identified issues related to the collection and generation of data, data set identifiers and descriptions, standards, data sharing, property rights and privacy protection, and long-term preservation and re-use. Standards and metadata formats have also been agreed upon. Mapping was also successfully done, and a first data set has been provided and processed in the platform. Finally, a specific SUMMA Box repository (https://www.box.com) has been established and is being used to exchange all relevant media data between the partners (while LETA provides the Docker integration infrastructure).

This initial version is the first of three iterations; the next is due in M18 and the final one at the end of the project in M36. Data collection, generation and processing are key areas in this monitoring project and will be discussed, elaborated upon, and further specified throughout the project.

Striking a balance between providing results to potential users and the open data platform, on the one hand, and protecting the copyright and privacy of content providers and end users, on the other, will be a continuous endeavour of the consortium.
