
SUMMA H2020–688139 D2.3 Updated Data Management Plan

Scalable Understanding of Multilingual MediA (SUMMA)

http://www.summa-project.eu

H2020 Research and Innovation Action
Number: 688139

D2.3 – Updated Data Management Plan

Nature: Report
Work Package: WP2
Due Date: 31/07/2017
Submission Date: 31/07/2017

Main authors: Chris Hernon (BBC)
Co-authors: Andrew Secker (BBC), Peggy van der Kreeft (DW), Steve Renals (UEdin)
Reviewers: Alexandra Birch (UEdin)
Keywords: data, social media, monitoring, metadata, access, data protection, privacy

Version Control
v0.1  Status: Draft for Ethics Board  16/06/2017
v0.2  Status: Draft for Review        25/07/2017
v0.3  Status: Reviewed                27/07/2017
v1.0  Status: Final Version           31/07/2017


Contents

1 Introduction

2 Types of Data Collected
2.1 Overview
2.2 Data Types
2.3 Requirements for Monitoring Data
2.4 Requirements for Data for Specific Technologies
2.4.1 Transcribed Data
2.4.2 Translated Data
2.4.3 Annotated Data
2.5 Provision of Monitoring Data
2.5.1 Customised Batches and API Access
2.5.2 Other Sources
2.6 Provision of Data for Specific Technologies

3 Types of Data Generated

4 Data and Metadata Standards
4.1 Metadata
4.2 Dataset Identifiers and Descriptions
4.3 Video Material
4.4 Text Articles Material
4.5 Social Media Data
4.6 RSS Feeds, Podcasting

5 Data Storage, Preservation and Re-Use

6 Policies for Data Access and Sharing
6.1 Different Levels for Access and Sharing
6.2 Planned Measurements for the Protection of Personal Data

7 Privacy by Design

8 Scenario Models
8.1 Scenario 1
8.2 Scenario 2

9 Ethics recommendations

10 Conclusion

A Personal Data

B EU/National data protection laws

C Roles of SUMMA Partners


List of Figures

1 BBC AV workflow
2 DW Multilingual Transcription Spanish-German-English
3 DW JSON Teaser Example
4 Some DW Twitter Feeds
5 DW RSS Feeds


Abstract

The Data Management Plan provides an analysis of the main elements of the data management policy that will be used by the SUMMA consortium with regard to all the datasets collected for or generated by the project. It addresses issues such as collection of data, data set identifiers and descriptions, standards and metadata used in the project, data sharing, property rights and privacy protection, and long-term preservation and re-use, complying with national and EU legislation.

This is an update to the initial data management plan published as D2.1 of the project. A further update to the data management plan will be provided at the end of the project (D2.5).


1 Introduction

SUMMA (Scalable Understanding of Multilingual Media) is a three-year H2020 project, running from February 2016 to January 2019, under the Research and Innovation Action grant agreement number 688139. SUMMA participates in the H2020 Pilot on Open Research Data. This Data Management Plan (DMP) provides an analysis of the main elements of the data management policy that will be used by the SUMMA consortium with regard to all the datasets collected for or generated by the project.

This deliverable is the second version of the DMP, and there will be one further iteration which will elaborate on the issues covered. It supersedes the first version of the deliverable, D2.1 Data Management Plan. The DMP addresses issues such as collection of data, data set identifiers and descriptions, standards and metadata used in the project, data sharing, property rights and privacy protection, and long-term preservation and re-use, complying with national and EU legislation.

This version covers the areas reported on in the first version (D2.1), which addressed data requirements and collection, types of datasets, metadata standards and formats, direct access to broadcast data through APIs, intended use of broadcast data provided, initial output formats, and measures to protect property rights and privacy.

In this version we have extended the description of our data protection requirements and describe the actions we have agreed on to address them. We have added two new sections on data protection: Section 7, Privacy by Design, and Section 8, Scenario Models. We have also added three appendices: Appendix A describing Personal Data, Appendix B on EU and national data protection laws, and Appendix C describing the roles of the SUMMA partners. All the partners in SUMMA are addressing data security issues for accessing, processing and storing SUMMA data, and each partner’s information security policies are described in Deliverable D10.1.

For more details on how data dumps and streams were provided in the first 18 months of the project, see deliverable D2.2 ‘Initial Data Provision’.


2 Types of Data Collected

2.1 Overview

SUMMA develops an open-source platform for dealing with large volumes of data across many languages and different media types. It implements a range of technologies, including automated speech recognition, machine translation, topic clustering, summarisation and semantic analysis.

Data is being collected in the nine SUMMA languages: English, German, Spanish, Portuguese, Arabic, Persian, Russian, Ukrainian, Latvian.

The project includes three data providers. BBC and Deutsche Welle (DW) are world broadcasters covering a wide range of languages, and act primarily as user partners and content providers in SUMMA. LETA, the Latvian Information Agency, has a double role as integrator and content provider. In addition, the Qatar Computing Research Institute (QCRI) also provides content, in particular for Arabic.

Three use cases implement the applications and put the data to use:

• External monitoring: intelligent tools for global news monitoring of up to 200 broadcast channels
• Internal monitoring: cross-lingual exchange, enabling awareness and re-use of data across languages
• Data journalism: a use case for year three, in which measurable data is extracted and used for creating narrative journalism, including with visual representations.

BBC targets the external monitoring use case, with a complicated workflow simultaneously monitoring up to 200 external news channels with streaming content in at least four SUMMA languages (Russian, Arabic, Persian, Ukrainian). The tool combines the functionalities of monitoring, transcribing and translation, as well as summarising and clustering data.

DW focuses on the internal monitoring use case, involving eight SUMMA languages (English, German, Spanish, Portuguese, Russian, Arabic, Persian, Ukrainian). In this use case, DW content (and content from other news providers) published in the above languages, primarily on-demand video, audio and text articles, but also streaming content, is continuously monitored, transcribed, translated, compared, clustered and summarised. The result is a tool which keeps editors and journalists up to date on what the trending news stories are and what has been published in those languages, allowing them to obtain either a full translation into English or a summary of the stories. It thus improves monitoring capacity and quality and reduces workload through automated translation support and summarisation.

This is a broadcaster-focused project, with involvement of world broadcasters with coverage of several languages and thus with a key role for data.

“Collection of data” in this report refers to the acquisition of data by the consortium, primarily through data provision by the participating SUMMA broadcasters.

Coverage of the different languages by the user/content partners (BBC, DW, LETA) is summarised in the following table:


Language     Provider
English      BBC, DW, LETA
German       DW
Spanish      DW
Portuguese   DW
Arabic       BBC, DW
Russian      BBC, DW, LETA
Ukrainian    BBC, DW
Persian      BBC, DW
Latvian      LETA

2.2 Data Types

Data for SUMMA is being collected at several levels:

• By project target use
  – Ingestion data
  – Training data
  – Test data

• By targeted technology
  – Data monitoring
  – Automated transcription
  – Automated translation
  – Knowledge base creation
  – Automated summarisation
  – Sentiment analysis
  – Keyword annotation

• By type of data
  – Metadata
  – Video material
  – Audio material
  – Text articles
  – Social media
  – Ontologies

• By delivery type
  – Streaming data
  – Batch data

• By language
  – All nine SUMMA languages

• By content provider/user partner
  – BBC
  – DW
  – LETA
  – Others, including the Qatar Computing Research Institute

For practical purposes, data requirements and provision are being divided into two main groups based on targeted technology: on the one hand, regular content for monitoring, and on the other hand, specific data for transcription, translation, summarisation and sentiment analysis. The other types and levels are described within these two broad categories.

2.3 Requirements for Monitoring Data

As SUMMA deals primarily with data monitoring, such data is essential for prototype development, assessment, user validation and scalability testing. Different types of data are involved, as well as different types of delivery. These aspects and challenges are detailed further in this report.

• Different types of data
  – Metadata
  – Video material
  – Audio material
  – Social media

• Different types of delivery
  – Streaming data
  – Batch data

2.4 Requirements for Data for Specific Technologies

The participating broadcasters are directly supporting the technology partners by providing training and/or test data for the different components and technologies, whenever possible. The provision depends on the availability of such data, and on the required manpower for preparation and adaptation.

Requirement specifications for such data have been gathered within WP2, detailing what type of data is needed, and how much. It is in the interests of the technology developers and the user partners alike to arrive at powerful tools providing a high-quality output, and all participants realise that training and test data is needed to make this happen.

2.4.1 Transcribed Data

Ideally, for Automatic Speech Recognition (ASR), at least 200 hours of transcribed data is collected per SUMMA language. This data is being complemented by “found” data, usually available to the community (e.g., Globalphone, TED lectures, etc.), either to increase the amount (and variety) of training data or to develop first baseline systems before all necessary transcribed data from our broadcast partners is made available.

A minimum of 100 hours of such data is required per language to have a valuable training dataset. All SUMMA languages will be covered, but the focus during the initial stage for transcription is on German, Arabic, and Persian. Different levels of transcription are handled: verbatim transcription with all details, including pauses and hesitations; correct transcription of spoken text; and raw, unedited transcription as it comes from the content provider. Data with timecodes is preferred, but data without timecodes is also useful, as timecodes can be added automatically. Initially, transcripts and their related media files were stored in a repository in a BBC instance of the Box file-sharing platform. Later, data dumps were provided on hard drives. Currently, live streams of BBC sources are available via AWS. The source type was audio or video with audio. The requested machine-readable format for the transcripts was .txt (UTF-8). The individual format for each provider is being regularly cleared with the technical partners. Finally, subtitled data (i.e., loose transcription) will also be provided whenever available to further improve our multilingual ASR systems (exploiting semi-supervised training approaches).
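The two volume thresholds mentioned above (a 100-hour minimum and a 200-hour ideal per language) can be checked mechanically once the hours of transcribed audio per language are known. The following is a minimal sketch only; the function names and the example figures are hypothetical and do not describe actual project data or tooling.

```python
# Thresholds taken from the text: 100 hours minimum, 200 hours ideal per language.
MIN_HOURS = 100.0
IDEAL_HOURS = 200.0


def coverage_status(hours: float) -> str:
    """Classify a language's transcribed audio volume against the two thresholds."""
    if hours >= IDEAL_HOURS:
        return "ideal"
    if hours >= MIN_HOURS:
        return "minimum met"
    return "insufficient"


def report(hours_by_language: dict[str, float]) -> dict[str, str]:
    """Return one status per language; the input figures are supplied by the caller."""
    return {lang: coverage_status(h) for lang, h in hours_by_language.items()}


if __name__ == "__main__":
    # Hypothetical figures, for illustration only (not project data).
    example = {"German": 210.0, "Arabic": 120.5, "Persian": 40.0}
    for lang, status in report(example).items():
        print(f"{lang}: {status}")
```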

2.4.2 Translated Data

Ideally, for Machine Translation (MT), 10,000,000 parallel sentences are needed per language combination for a valuable training set.

In all the SUMMA scenarios, the target language will always be English, so SUMMA deals with 8 language combinations (German-English, Spanish-English, Portuguese-English, Russian-English, Arabic-English, Farsi-English, Ukrainian-English, Latvian-English).

Low-resourced languages in particular are sought after for the creation of a training set. For example, there is already a lot of material from German, Spanish, Arabic and Russian into English, but not from Portuguese, Ukrainian, Latvian, and Farsi. For the well-resourced languages, a smaller in-domain test set may suffice, but for the lower-resourced languages a larger training set needs to be built. Different levels of translation quality will be provided, i.e., fully parallel translations, semi-parallel translations, and similar texts (with similar content, covering the same topic, without being real translations). Translation sets were initially stored in a specific SUMMA Box repository. The source type was text (from broadcast articles or transcripts). The target format is .txt (UTF-8).
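As an illustration of how a fully parallel set in .txt (UTF-8) form might be checked against the 10,000,000-sentence target, the sketch below pairs two texts line by line. It assumes one sentence per line with source and target aligned by line number; this is a common convention but is not stated in this plan, and the function names are hypothetical.

```python
# Target size from the text: 10,000,000 parallel sentences per language combination.
TARGET_PAIRS = 10_000_000


def pair_sentences(source_text: str, target_text: str) -> list[tuple[str, str]]:
    """Pair line-aligned source and target sentences, skipping empty lines.

    Assumption: one sentence per line, aligned by line number. File contents
    can be obtained with pathlib.Path(...).read_text(encoding="utf-8").
    """
    src = source_text.splitlines()
    tgt = target_text.splitlines()
    if len(src) != len(tgt):
        raise ValueError(f"line count mismatch: {len(src)} vs {len(tgt)}")
    return [(s.strip(), t.strip()) for s, t in zip(src, tgt) if s.strip() and t.strip()]


def shortfall(pairs: list[tuple[str, str]]) -> int:
    """Number of sentence pairs still missing to reach the stated target."""
    return max(0, TARGET_PAIRS - len(pairs))
```

Semi-parallel and merely similar texts would need a separate alignment step and are outside the scope of this sketch.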

2.4.3 Annotated Data

Further levels of annotation are also requested for topic clustering, summarisation and sentiment analysis. This covers regular broadcast material and, in particular, social media. The focus for the initial stages is German, English and Arabic. One form of annotation requested is highlights. For this, preferably a limited number of manually written summaries are provided. These will be used for topic detection and story clustering. Annotations for sentiment analysis are also needed for social media. Another form of required annotation is the creation of a set of 500 documents with IPTC classification (limited to the 4 highest levels of the IPTC classification). The focus in the first stages will be on German, English and Arabic.

2.5 Provision of Monitoring Data

Described below is the envisaged process for content provision by the broadcasters for the purpose of data monitoring. This is based on the available infrastructure, content requirements and planned prototyping. As the project is still in its early stages, it must be understood that changes in the envisaged processes and in actual implementation are still possible.

As will be clear from the sections below, the descriptions for BBC data management and information flow within the project are more elaborate than those for Deutsche Welle. The reason for this is that DW focuses on the internal monitoring use case, which uses internal channels, fully controlled by Deutsche Welle. The BBC’s primary use case, on the other hand, is that of external monitoring: BBC Monitoring (BBCM) collects and processes content (primarily streaming content) from other broadcasters. BBCM monitors up to 200 channels simultaneously in real life. It is therefore by nature more complicated than the internal use case and requires more time to implement. Much of Deutsche Welle’s content provision from internal sources through APIs was implemented in the first months of the project and was therefore available for early use within the SUMMA platform. This process will be further enhanced and streaming content will be added. BBCM content is being implemented according to the provision plan as detailed below.

2.5.1 Customised Batches and API Access

Thematic Data Dumps

Thematic data batches could also be requested by the consortium partners throughout the project. These datasets can be used to test a topical range, or certain technologies or components. They will be focused on a specific theme or topic, and provide a batch of “super stories”. A specific data collection on one big news event will regularly be agreed upon to ensure a consistent thematic data collection.

In the first six months, three sets of customised batches were supplied by DW and BBC, supplemented by data from LETA and QCRI:

• At the start of the project: an initial set of sample records for each language covered
• 15 March 2016: a specific data collection on the topic of the 5th anniversary of the Syrian uprising (24-hour coverage of the news)
• 12 July 2016: a specific data collection on the topic of the 10th anniversary of the Second Lebanon War (24-hour coverage of the news).

General Data Dumps

In addition, general, non-thematic data dumps are foreseen as additional training and test datasets. See D2.2 for more detail.

For the initial data dump, the BBC provided textual material from various sources:

• Twitter (English, Arabic, Russian, Persian and Ukrainian)
• Facebook (English, Arabic, Russian, Persian and Ukrainian)
• Blogs (English, Arabic, Russian, Persian and Ukrainian)
• Webpages (English, Arabic, Russian, Persian and Ukrainian)

The following were considered to be out of scope for the data dump, but may be considered for the live system:

• Facebook A/V content
• Twitter A/V content
• YouTube
• Instagram
• Podcasts

Deutsche Welle has API access to its content and has made the content of its mobile site (m.dw.com) available to the consortium, together with instructions. It contributes material in English, German, Arabic, Spanish, Russian, Persian, Portuguese and Ukrainian.

LETA provides direct access to content in Latvian, and is the only content provider for that language within the consortium, although it may also deliver material in Russian and English.

QCRI has access to external content in Arabic, and started providing such material upon request.

Content is supplied via each partner’s respective API. Using API access ensures that data can be gathered freely for the platform and separate components without the need to request it from the content providers. Instructions for API access and use will be provided by each content provider.


Data dumps may be created on demand by the partners at any time. All recorded media will be available for download from the media store until the point of purging, which is yet to be determined.

Streaming Content

BBC focuses on monitoring external sources, so streaming is the primary targeted mode. The BBC “as-live” media system builds on the on-demand system from Phase 2 and incorporates a mechanism for creating JSON sidecar files. These will be generated for each chunk of media and pushed into the SUMMA API provided by LETA. They will provide links to the media, which reside on the AV store within the BBC’s AV file storage system.

The BBC will provide an interface whereby text sources can be selected and the “data dump” and “as-live” functionality can be switched “on” and “off”. Note that both modes can be run at the same time if desired, so that a data dump can be created while also feeding the SUMMA system. Social media and other text-based content will be ‘scraped’ as far as possible to reduce the amount of noise present in the data, although this is a difficult process and so it might prove impossible for us to completely remove unwanted data in all cases.

Figure 1: BBC AV workflow

Even though for DW the on-demand content is more important for its primary use case, DW will also include its streaming content and has created a process to provide it.


2.5.2 Other Sources

The participating broadcasters may add RSS feeds and podcasting to the channels to be provided. DW has granted SUMMA access to all of its RSS feeds and podcasting channels.

2.6 Provision of Data for Specific Technologies

DW and BBC have provided data to train and test the specific modules and technologies based on the requirements from the technical partners.

Deutsche Welle is providing items with teasers which can be used to train the summarisation tool.

Deutsche Welle is in the process of identifying and locating multilingual datasets with transcriptions, thus combining the effort of providing translation material and transcriptions. These sets are, however, only available for well-resourced languages (English, German, Spanish, and some Arabic). A process to convert the original transcript into a machine-readable format has been set up and is currently being optimised while processing a first set of transcriptions. A script has been written to automate the process.

Figure 2: DW Multilingual Transcription Spanish-German-English


3 Types of Data Generated

“Generation of data” in this report refers to the production of data by the SUMMA platform, or any of its components or technologies. This can range from plain audio transcripts of (multilingual) broadcast material (ready for indexing and direct search), translated broadcast text (applied to multilingual speech-to-text/ASR system outputs, from any of the SUMMA languages to English as the target language), to automated annotations or graphical enhancements.

We distinguish between six main categories of data that will be generated during the project:

• Content data generated during media monitoring. This is typical broadcast data that remains copyright-protected.
• Specific output formats following a particular step in the SUMMA processing chain. This includes transcriptions, translations, summaries, annotations, graphics and statistical data. This usually also includes broadcast content.
• Software, models, algorithms, lexicons and ontologies, annotations, etc.
• Personalised data generated during field testing and prototype testing
• Social Media Data: Twitter, Facebook, YouTube, etc.
• Academic-type research publications

Of course, all this generated data will directly exploit and enrich the data collected (Section 2).


4 Data and Metadata Standards

4.1 Metadata

Metadata available on the content provider side is considered for inclusion in SUMMA. Metadata formats differ according to content provider.

The original broadcaster metadata format is provided to the integrator, and overall preferences and settings are discussed. Within SUMMA, it is agreed that JSON format is preferred; the BBC metadata uses Dublin Core; DW metadata does not use Dublin Core.

LETA integrates the different incoming metadata formats. This is a realistic scenario, as such platforms need to take into account the fact that different content providers use different schemas, so mapping and ingestion should be made easy and allow for maximum automation after initial setting of mapping schemas.
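One way to picture the mapping step is a per-provider table that translates each incoming field name into a common field name, so that ingestion is automatic once the table has been set up. The sketch below is illustrative only: the common field names and the DW field names are hypothetical, and only the `dc:` keys correspond to the Dublin Core elements used in BBC metadata. It does not describe LETA's actual integration code.

```python
from typing import Any

# Hypothetical mapping schemas: provider field name -> common field name.
# Only the "dc:" keys reflect real Dublin Core element names; the DW keys
# are placeholders, since DW's schema is not specified in this section.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "BBC": {"dc:title": "title", "dc:language": "language", "dc:source": "source"},
    "DW": {"name": "title", "lang": "language", "origin": "source"},
}


def to_common(provider: str, record: dict[str, Any]) -> dict[str, Any]:
    """Translate a provider record into the common schema using its mapping."""
    mapping = FIELD_MAPS[provider]
    common = {target: record[src] for src, target in mapping.items() if src in record}
    common["provider"] = provider  # keep track of where the record came from
    return common
```

With this layout, adding a new content provider only requires a new entry in the mapping table rather than new ingestion code.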

4.2 Dataset Identifiers and Descriptions

Within the SUMMA project, data sets are divided into two main groups: (1) regular content for monitoring, and (2) specific data for training and testing transcription, translation, summarisation, and sentiment analysis NLP tools.

In Group (1), “regular content” data files are used within the SUMMA Prototype system and are always accompanied by the JSON sidecar files providing a unique content identifier and a complete description as part of metadata fields such as title, source, author and datetime. The text-based “regular content” is stored directly within the JSON sidecar files, while the A/V content is linked from the JSON sidecar files by its URL in the A/V storage system. Special treatment is applied to the ‘as-live’ A/V content delivered over an HLS stream: for storage purposes each such stream is artificially split into segments, with each segment having its own sidecar JSON file. The duration of a segment can vary from 10 seconds to 1 hour. For streams, the JSON sidecar file identifier is extended with the GMT <date> / <time> fields uniquely identifying the stream segment in time. The actual structure and content of the JSON sidecar files varies by provider (BBC, DW, LETA, QCRI) and will be described in the Swagger editor (http://editor.swagger.io) as part of the RESTful API documentation. A sample sidecar JSON file is shown below:


{
  "@context": {
    "dc": "http://purl.org/dc/elements/1.1/",
    "summa": "http://www.summa-project.eu/vocab#"
  },
  "@id": "",
  "@type": "summa:article",
  "summa:text": "Claire Brisebois Starnes enlisted in the Signal Corps of the US Army in 1963...",
  "summa:date": "2016-07-06T11:28:28.8817057Z",
  "copyright": "US Army",
  "dc:date": "None",
  "dc:identifier": "http://www.bbc.co.uk/news/in-pictures-36574698",
  "dc:language": "en",
  "dc:publisher": "BBC News - Home",
  "dc:source": "http://www.bbc.co.uk/news/",
  "dc:title": "’Tuning out emotionally’",
  "dc:type": "Text"
}
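For ‘as-live’ streams, the segment identifier described earlier in this section (the sidecar identifier extended with GMT date and time fields) could be derived along the following lines. This is a sketch under stated assumptions: the `stream/date/time` layout, the separators and the function name are hypothetical, as the plan does not specify the exact identifier syntax. Only the 10-second to 1-hour duration range comes from the text.

```python
from datetime import datetime, timezone

MIN_SEGMENT_S = 10        # shortest segment duration mentioned in the text
MAX_SEGMENT_S = 60 * 60   # longest segment duration mentioned in the text


def segment_id(stream_id: str, start: datetime, duration_s: int) -> str:
    """Build a segment identifier from a stream id and the segment start time.

    The "<stream>/<date>/<time>" layout is an assumption for illustration.
    `start` must be timezone-aware; it is converted to GMT/UTC first.
    """
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    if not MIN_SEGMENT_S <= duration_s <= MAX_SEGMENT_S:
        raise ValueError("segment duration outside the 10 s to 1 h range")
    utc = start.astimezone(timezone.utc)
    return f"{stream_id}/{utc:%Y-%m-%d}/{utc:%H:%M:%S}"
```

Converting to UTC before formatting keeps identifiers unique and comparable even when segments are recorded by machines in different time zones.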

In Group (2), “specific data” for training and testing the NLP tools within the SUMMA project were generated as part of data dumps and stored in the shared Box.com repository. The file path within the SUMMA Box repository served as the data set identifier. Data set descriptions were provided as readme files within the folders containing the actual data sets. A sample Box repository data set identifier is given here: /SUMMAData/TrainingDataOneNewsDay/Spanish/OnlineText/DWSyria DWCOM-SP 1939 20160315 a-19118941.json
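Because the file path doubles as the identifier, its folder levels can be read back as descriptive fields. The sketch below splits an identifier of the shape shown in the sample; interpreting the levels as root, collection, language and media type is an assumption inferred from that single example, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def parse_dataset_id(identifier: str) -> dict[str, str]:
    """Split a Box-style path identifier into its folder levels.

    Assumed layout (inferred from one sample path, not a documented scheme):
    /<root>/<collection>/<language>/<media type>/<file name>
    """
    parts = PurePosixPath(identifier).parts
    if len(parts) < 6 or parts[0] != "/":
        raise ValueError(f"unexpected identifier layout: {identifier!r}")
    _, root, collection, language, media_type, *rest = parts
    return {
        "root": root,
        "collection": collection,
        "language": language,
        "media_type": media_type,
        "file": "/".join(rest),
    }
```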

The data sets generated in the SUMMA Platform NLP processing chain, such as transcripts, translations and summaries, were stored as additional fields within the JSON sidecar files; therefore their storage, identification, and descriptions are identical to the Group (1) data set treatment described above.
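Storing generated outputs as additional sidecar fields can be sketched as follows. The field names used here (for example `summa:translation`) are hypothetical placeholders in the style of the `summa:` vocabulary prefix; the plan does not list the actual field names.

```python
def add_outputs(sidecar: dict, **outputs: str) -> dict:
    """Return a copy of a sidecar record with NLP outputs added as extra fields.

    Each keyword argument becomes a "summa:<name>" field. These field names
    are illustrative assumptions, not the documented SUMMA schema.
    """
    enriched = dict(sidecar)  # shallow copy, so the input record is left unchanged
    for name, value in outputs.items():
        enriched[f"summa:{name}"] = value
    return enriched


if __name__ == "__main__":
    record = {"dc:identifier": "http://www.bbc.co.uk/news/in-pictures-36574698", "dc:language": "en"}
    print(add_outputs(record, summary="A short placeholder summary."))
```

Since the outputs travel with the sidecar file, they inherit the identifier and description of the content they were derived from.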

4.3 Video Material

Although video material itself will not be processed in the context of SUMMA, it is important to store it, with proper links to the audio and other metadata material which will be used to indirectly index and search the relevant videos.

Video data will include metadata and at least the link to the video file, or the media file itself (especially if there is a chance the file will be taken offline in the foreseeable future). Video material will include streaming data as well as on-demand data. The preferred video file format is mp4 (DW provides mp4; BBC provides MPEG2). The metadata format may differ according to content provider. A 2-minute latency is allowed to start processing streaming content.

Deutsche Welle provided on-demand video content from DW internal sources in the first instance (mp4), and the provision of such content is already in place via API access. Procedures for supplying streaming content from DW sources (HLS stream) are being established.

The BBC primarily provides streaming content from external sources. The AV sources in the final platform will originate from BBC Monitoring and will be selectable by the BBC Belfast team in conjunction with the BBC Monitoring team. They will initially consist of a maximum of 10 video streams, but this may be scaled upwards as the project progresses. BBC Northern Ireland is providing sources from the BBC system in London up to Amazon Web Services (AWS), from where partners can integrate them into their instance of the platform. Although the system initially provided preview videos for the interface, as specified in WP1 and WP6, this has now been deprecated in favour of a local cache within the SUMMA platform itself. The AV material is being re-wrapped as HLS but remains in its original quality; hence each feed is potentially encoded at a different bitrate.

4.4 Text Articles Material

For text-based media the BBC has provided a component which has now been integrated into the SUMMA platform by LETA. There is currently no GUI for the configuration of the feeds to be monitored; instead, these can be configured by importing OPML files which are dropped into a folder. OPML is the most common way to share lists of feeds; see https://en.wikipedia.org/wiki/OPML. The software component can also import CSV files in the format used by BBC Newslab’s Juicer system. For testing we are using 4 OPML files (2 Arabic, 1 Russian and 1 General), plus 2 large CSV files containing English-language and Spanish news sources. During testing, approximately 1,100 feeds are defined in these files and being consumed. The SUMMA GUI allows RSS feeds to be added, edited and deleted individually through the interface. Thus, the participating broadcasters may add RSS feeds and podcasts to the channels to be provided. Deutsche Welle has granted SUMMA access to all of its RSS feeds and podcasting channels. Please see D2.2 SUMMA Initial Data Provision Report for more details.
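To illustrate what an OPML import involves, the sketch below reads feed titles and URLs from an OPML document using the Python standard library. It is not the BBC component itself: the sample document, the example.org URLs and the function name `feeds_from_opml` are assumptions made for illustration. The only OPML conventions relied on are that feeds are `outline` elements carrying an `xmlUrl` attribute and that outlines may be nested to form groups.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up OPML document with one group containing two feeds.
SAMPLE_OPML = """<opml version="2.0">
  <head><title>Example feeds</title></head>
  <body>
    <outline text="News">
      <outline type="rss" text="Example feed A" xmlUrl="https://example.org/a/rss.xml"/>
      <outline type="rss" text="Example feed B" xmlUrl="https://example.org/b/rss.xml"/>
    </outline>
  </body>
</opml>
"""


def feeds_from_opml(opml_text: str) -> list[tuple[str, str]]:
    """Return (title, feed URL) pairs for every outline that has an xmlUrl."""
    root = ET.fromstring(opml_text)
    feeds = []
    # iter() walks nested outlines too, so grouped feeds are found as well.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url:  # group outlines have no xmlUrl and are skipped
            feeds.append((outline.get("text", ""), url))
    return feeds


for title, url in feeds_from_opml(SAMPLE_OPML):
    print(title, url)
```

A folder-based import, as described above, would apply a function of this kind to each file dropped into the watched folder.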

Figure 3: DW JSON Teaser Example

4.5 Social Media Data

Social media data (Twitter, Facebook, YouTube – and possibly other sources) will be provided with some annotation. The broadcast partners will provide identifiers or links to social media posts relevant for their organisation.

Figure 4: Some DW Twitter Feeds

4.6 RSS Feeds, Podcasting

RSS feeds and podcasts are provided when available and appropriate. The metadata format may differ according to content provider.

DW has provided access to all its RSS and podcasting feeds via API.

Figure 5: DW RSS Feeds

5 Data Storage, Preservation and Re-Use

Data collected by the content providers was initially stored in a SUMMA repository on Box (https://www.box.com), managed by the BBC and used by all consortium partners. It held all selected and downloaded broadcast content for experimenting, testing, training, trialling and demo within SUMMA. That repository was selected as it meets internal BBC data storage security requirements and was agreed upon by the rest of the consortium.

The platform and technical partners retrieved the data from the Box repository (or from the broadcaster directly, via API), after which it was stored in the SUMMA platform on LETA servers. Ingested data from BBC sources and Deutsche Welle is now stored on servers assigned to an instance of the platform, currently at LETA and the University of Edinburgh.

Technology partners also retrieve selected data either from the Box repository or through API access and store it on their organisation’s server for specific training and testing on models and technologies, e.g. for MT or ASR.

Preservation options after the project will be discussed in the next iteration of this report, when the final output form becomes visible. However, it is envisaged that the platform will be usable and customisable after project end, as it is especially geared towards Use Case 1 and Use Case 2. The data produced during the course of the project will be available in accordance with the Consortium Agreement and licence agreements.

Re-use will be ensured as much as possible and will primarily apply to software, lexicons, algorithms, dashboards, etc.

6 Policies for Data Access and Sharing

6.1 Different Levels for Access and Sharing

There are different categories of data that will be collected or generated during the project, with different levels and conditions for access and sharing:

• Original broadcast data is copyright-protected and, as stipulated in the Consortium Agreement, is provided only for use by the consortium partners for the duration of the project. It can therefore not be shared outside the consortium or after the project. Some demo material will be selected for public viewing in agreement with the broadcasters.

• Data generated during media monitoring. This data is typically owned by the broadcaster; therefore, the consortium does not have the rights to share this as open research data. However, negotiations will be opened with broadcasters with the aim of releasing data sets for specific research use, as has been done in the past by the BBC for the MediaEval and MGB Challenge evaluation campaigns.

• Specific output formats following a particular step in the SUMMA processing chain. This includes transcriptions, translations, summaries, annotations, graphics and statistical data. This usually also includes broadcast content.

• Software, models, algorithms, lexicons and ontologies, annotations, etc. These will be made available as open source as much as possible. We shall endeavour to publish and make open access derived data such as phrase dictionaries for MT, when such publication is not in breach of copyright. In cases where data is available, but copyright restrictions do not allow us to publish it, we shall release tools to reconstruct it (cf. Kaldi recipes and the WikiLinks project).

• Personalised user-specific data generated during field testing and prototype testing. This is data relating to people’s use of SUMMA prototype systems. As this is very specific to the systems being evaluated, and also personal to evaluation subjects, we do not anticipate releasing this data.

• Social Media Data. Content from Twitter, Facebook, YouTube, etc. is very time-sensitive, quickly outdated and also strongly related to privacy issues. However, social media is an essential part of the content used and analysed. In order to ensure the protection of private individuals’ personal data (contained in Tweets or Facebook posts), the data providers (User Partners) will not transfer any third-party social media content to other consortium partners. Of course, social media content owned by the content provider (e.g. @dwnews) or the consortium (@SummaEu) can be used if the account holder agrees to this. User partners will take protection of personal data legislation into account while developing the platform for the processing of social media. The potential value of the social media output, anonymisation/pseudonymisation and retention of social media user-related data will be further discussed as we see a more advanced platform and output formats. Following recommendations from the Ethics Committee, special care shall be taken in handling social media content, ensuring privacy protection. The SUMMA content providers will differentiate between different categories of social media posters, e.g. public figures (politicians), political organisations (political parties, NGOs) and private citizens.

• Academic-type research publications. Academic publications will be made available as “green” open access via institutional repositories and the OpenAIRE system.

6.2 Planned Measurements for the Protection of Personal Data

This section deals with ethical issues addressed by the Ethics Advisory Board. The consortium will specify in its data management procedures how it intends to identify where personal data is involved and how such personal data is protected. See D9.4 ’Updated ethics review and recommendations’ for more detail on measures to ensure privacy.

As the SUMMA architecture and information flow are being established, security and privacy issues are being taken into consideration and procedures will be set before implementation. All consortium partners dealing with data, including provision, use, processing and storing, will look into data protection regulations for their organisation and country.

• The consortium will decide on the process of dealing with and retaining images in which people could potentially be identified. This relates to social media, but also to regular audiovisual broadcast media in which individuals could be identified.

• For the protection of personal data in social media, names will be replaced by hashtags.

• Pseudonymisation will be considered as an alternative to anonymisation, in particular for clustering. Special attention will be given to the control of the key.

• All consortium partners are responsible for seeking advice from their respective local data protection authorities (Germany, Latvia, Portugal, Switzerland). It is understood that the consortium as a whole are joint data controllers in this project.

• Security procedures will be established for each partner dealing with data. For instance, all communication by any third parties with the BBC Monitoring web server and storage will be secured using SSL via the HTTPS protocol and will require a security certificate. The BBC will issue these certificates to each of the partners. This is to meet the needs of BBC Information Security and UK Data Protection regulations.

• The consortium is striving for transparency to make the purpose of including social media content and other content containing personal data in the SUMMA project clear. It will be clearly stated which partner is responsible for the relevant data. The details will be established in the data protection impact plan.

• The consortium is looking into access procedures and restrictions during dissemination, which is targeted at approaching potential new users. A solution must be found to ensure privacy protection and at the same time allow the SUMMA platform and some of its data to be demonstrated.
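The pseudonymisation measure listed above, including the emphasis on control of the key, can be sketched with a keyed hash. This is an illustration under stated assumptions rather than the SUMMA implementation: the choice of HMAC-SHA256, the `user_` prefix, the truncation to 12 hexadecimal characters and the function names are all hypothetical. The same account always maps to the same pseudonym, which is what makes clustering possible, while anyone without the key cannot recompute the mapping from a list of known account names.

```python
import hashlib
import hmac
import re

# Matches @handles in the text of a post.
MENTION = re.compile(r"@(\w+)")


def pseudonymise_handle(handle: str, key: bytes) -> str:
    """Map an account name to a stable pseudonym using a secret key."""
    digest = hmac.new(key, handle.lower().encode("utf-8"), hashlib.sha256)
    # Truncated for readability; a shorter token raises the collision risk.
    return "user_" + digest.hexdigest()[:12]


def pseudonymise_text(text: str, key: bytes) -> str:
    """Replace every @mention in the text with its pseudonym."""
    return MENTION.sub(
        lambda match: "@" + pseudonymise_handle(match.group(1), key), text
    )


# The key must be generated and stored securely; this literal is a placeholder.
secret_key = b"replace-with-a-securely-stored-key"
print(pseudonymise_text("Thanks @Alice and @alice for the report", secret_key))
```

Deleting the key afterwards prevents anyone from recomputing the pseudonyms from account names, which moves the data closer to anonymisation; keeping it under restricted access allows authorised re-identification.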

7 Privacy by Design

According to the EU Data Protection Regulation Datasheet [1], “privacy by design means that each new service or business process that makes use of personal data must take the protection of such data into consideration. An organisation needs to be able to show that they have adequate security in place and that compliance is monitored. In practice this means that an IT department must take privacy into account during the whole life cycle of the system or process development.”

According to the EU Data Protection Big Data Factsheet [2], “The new ‘data protection by design’ principle motivates architects of Big Data analytics to use techniques like anonymization, pseudonymisation, encryption, and protocols for anonymous communications. The Commission will work with Member States and in particular the supervisory authorities and stakeholders to ensure that businesses receive adequate guidance on these techniques.”

SUMMA uses secure data storage and processing at all stages. Our privacy by design strategy is detailed in the table below.

| Requirement | Details | Protective Measures |
|---|---|---|
| Explicit consent | Data controllers must be able to demonstrate that their customer has agreed to the processing of their personal data by a statement or a clear affirmative action. | Only previously broadcast material is used, and thus previous consent is assumed. Further processing within SUMMA is for research purposes only. |
| Storing personal data | Data controllers have the responsibility to keep data safe and secure during the storing process, using firewalls, passwords, encryption, etc. | Data is only accessible by SUMMA partners. All parts of the SUMMA platform use secured access and storage. SUMMA minimizes personal data storage by the use of anonymization and pseudonymization. Data in SUMMA is securely stored. More details available in D10.1 – POPD - Requirement No. 2. |
| Transmitting personal data | Personal data must be protected at all times, also during transfer and communication processes. | Data is securely transmitted in the SUMMA platform. Data in SUMMA is securely stored. More details available in D10.1 – POPD - Requirement No. 2. |
| Analysing/Processing personal data | Personal data must be protected at all times, also during analysis and processing. | SUMMA minimizes identification of personal data by the use of anonymization and pseudonymization. |

Table 1: Data Protection by Design Measures

[1] http://www.eudataprotectionregulation.com/data-protection-design-by-default
[2] http://ec.europa.eu/justice/data-protection/files/data-protection-big-data factsheet web en.pdf

8 Scenario Models

Scenario modelling can be helpful to better understand the risks arising from data management and personal data processing. Below we present two scenarios as examples of risks relating to SUMMA.

8.1 Scenario 1

Consider a citizen of an autocratic country with a poor human rights record, resident in Europe, who posts a tweet revealing a political opinion that could lead to arrest by the regime in their home country. One could argue that if they are posting on Twitter, a public platform, it is their own responsibility. Also, if they are a known political activist, they are probably being monitored by the regime, including on social media. How could the use of SUMMA have a negative impact in this situation? It’s well known that Western and other governments use powerful information gathering and analysis platforms, such as Palantir and Sail Labs. It’s unlikely that a regime such as that described would not already have its own human and technological capacity for monitoring citizens it views as a threat. However, if we imagine that an oppressive government has decided to use SUMMA for its media monitoring, there are two potential ways the platform could exacerbate the situation. 1. The country’s intelligence and security services notice the political tweet because SUMMA highlights it to them. 2. The tweet is highlighted in reporting by a media or human rights organization.

In the first case, the problem is caused by the open-source nature of the platform, meaning that anyone can use it or adapt it for their own purposes, including nefarious ends. The consortium will write conditions of use that expressly forbid this. But realistically, there are no means of enforcing them.

In the second case, political activists use Twitter and other social media specifically to highlight their causes and often to bring about Western media and government pressure. It would be impossible for the platform to judge if this is the purpose. However, Western media and rights organizations have editorial standards and policies and make judgements in cases like this. SUMMA is not designed as an autonomous reporting machine with no human intervention. It is designed to be used as a tool to help journalists and others to find information they need for reporting. Those journalists, and consequently other users, should be expected to adhere to the relevant standards, ethics and duty of care of their profession.

8.2 Scenario 2

An organisation or individual user collects illegal material - let’s say extremist literature - either by accident due to a source they are following broadcasting it or with genuine intentions to conduct research.

If by accident, it’s entirely possible the material might never be noticed unless it surfaces in a related search. If we imagine it’s possible the security services have a way of knowing when someone accesses extremist material in this manner, then they could come to the user and ask questions. One would hope that a look through the material gathered from other sources, and at the source that had suddenly sent out the extremist content, would convince them that this was an aberration. However, it may be wise for the organisation to institute regular checks, involving searches on certain terms and on whether the content of monitored sources has changed to include undesirable content.

In the case of research, universities, for example, already have guidelines involving the oversight of an ethics officer and a ’rapid response’ procedure, as well as advice to inform the local police of such work; see, for example, the Universities UK document http://www.universitiesuk.ac.uk/policy-and-analysis/reports/Documents/2012/oversight-of-security-sensitive-research-material.pdf.

In the case of individuals, since the platform is intended to be open-source, the SUMMA project should ensure prominent warnings and links to advice are visible wherever the platform is available and in, say, the loading screen of the tool.

9 Ethics recommendations

A second meeting of the Ethics Advisory Board (EAB) was held on 4th July 2017. The external advisers assessed how the consortium had addressed issues raised at the first meeting a year earlier, discussed the latest areas of interest and concern, and then made new and updated recommendations.

Firstly, the committee recognized that the consortium had done a good amount of work to implement their advice. They also said there was still more to do:

“The EAB noticed that many of the recommendations resulting from the previous meeting had been taken into account, notably in the updated data management report (D2.3). At the same time, some misunderstandings remain and this management plan could still be further improved”.

The EAB’s report is expected in August 2017, and we shall further update the data management plan according to their recommendations.

Please see D9.4 ‘Updated ethics review and recommendations’ for details of the main areas here and steps taken or to be taken to address them for the next iteration of the plan. We will also liaise with the EAB to ensure they are happy with how we have responded to their guidance.

10 Conclusion

The present Deliverable D2.3, the Updated Data Management Plan (DMP), provides the basis for the SUMMA project management strategy and planning, as extensively discussed and agreed by all the partners.

D2.3 addresses most identified issues related to the collection and generation of data, data set identifiers and descriptions, standards, data sharing, property rights and privacy protection, and long-term preservation and re-use. Standards and metadata formats have also been agreed upon.

This is the second of three iterations; the final update is due at the end of the project in M36. Data collection, generation and processing are key areas in this monitoring project and will be discussed, elaborated upon, and further specified throughout the project.

Arriving at a balance between providing results to potential users as well as the open data platform, on the one hand, and protecting the copyright and privacy of content providers and end users, on the other, will be the continuous endeavour of the consortium.

A Personal Data

This appendix catalogues all the personal data sources used in the project. We specify data categories (AV, social media etc.), whether or not they are used in SUMMA and the protective measures we have taken to ensure compliance with data privacy legislation. In SUMMA we have agreed that personal data is not going to be distributed. It will be downloaded by each instance of the SUMMA platform, and data will remain on site.

| Content type | Used in SUMMA (Y/N) | Protective Measures |
|---|---|---|
| AV or Text Content | | |
| Videos featuring people | Y | Only material approved for broadcast distribution is used. Analysis results are for research purposes only. |
| Interviews with people in audio, video or text | Y | Only material approved for broadcast distribution is used. Analysis results are for research purposes only. |
| Descriptions of people in audio, video or text | Y | Only material approved for broadcast distribution is used. Analysis results are for research purposes only. |
| Mentions of people in audio, video or text | Y | Only material approved for broadcast distribution is used. Analysis results are for research purposes only. |
| Social media | | |
| Social media accounts of public figures | Y | We consider people who have verified Twitter accounts to be public figures. |
| Social media accounts of non-public figures | Y | This data will be used for trending and sentiment analysis. |
| Mentions of people in social media | Y | |
| Images of people in social media | N | Only text in social media is analysed. |
| Links to non-public people’s SM accounts | N | Links to unverified Twitter accounts will be anonymised. |
| Other sources | | |
| Blogs | Y | DW live blog: DW and ScribbleLive service provider abide by EU data protection laws |
| Podcasts | Y | DW podcasts: Same protection as other AV DW content, covered by German and EU data protection laws |
| RSS feeds | Y | DW feeds: Same protection as other AV or textual DW content, covered by German and EU data protection laws |
| Analysed data | | |
| Statistical data | Y | The same measures apply as to the original data |
| Summaries | Y | The same measures apply as to the original data |
| Clustering | Y | The same measures apply as to the original data |
| Automated translations | Y | The same measures apply as to the original data |
| Knowledge bases | Y | The same measures apply as to the original data |

Table 2: Personal data sources used in the project

B EU/National data protection laws

This annex describes the legal responsibilities of partners as relating to EU and their national data protection laws. We first describe the upcoming changes to EU law, and then we detail all the partners’ legal responsibilities in a table.

EU regulations as a one-stop shop

According to the EU Data Protection Reform policy [3], 28 national legislations will be replaced by one simple and clear legal framework and a one-stop shop for governance and enforcement. Thus, within the timeframe of the SUMMA project, this EU Data Protection regulation will be applicable and all partners should adhere to the measures in that regulation. Some of the aspects of the new regulation are described below. The objective of this new set of rules is to give citizens back control over their personal data, and to simplify the regulatory environment for business. The data protection reform is a key enabler of the Digital Single Market which the Commission has prioritised. The reform will allow European citizens and businesses to fully benefit from the digital economy.

In January 2012, the European Commission proposed a comprehensive reform of data protection rules [4] in the EU. On 4 May 2016, the official texts of the Regulation and the Directive were published in the EU Official Journal in all the official languages. The Regulation entered into force on 24 May 2016 and shall apply from 25 May 2018. The Directive entered into force on 5 May 2016 and EU Member States have to transpose it into their national law by 6 May 2018.

Regulation (EU) 2016/679 [5] of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)

Directive (EU) 2016/680 [6] of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and on the free movement of such data, and repealing Council Framework Decision 2008/977/JHA.

A very helpful summary of how these changes will affect the activities of the SUMMA project is given in the following article: https://iapp.org/news/a/how-gdpr-changes-the-rules-for-research/.

[3] http://ec.europa.eu/justice/data-protection/files/data-protection-big-data factsheet web en.pdf
[4] http://ec.europa.eu/justice/data-protection/reform/index en.htm
[5] http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:OJ.L .2016.119.01.0001.01.ENG&toc=OJ:L:2016:119:TOC
[6] http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:OJ.L .2016.119.01.0089.01.ENG&toc=OJ:L:2016:119:TOC

Partner Legislation Responsibilities ActionDeutscheWelle

Germanand EU-wide* dataprotectionlaws

According to section 42 of theGerman Federal Data ProtectionStatute, Deutsche Welle needs toappoint a data protection officer.Publish data and privacy protec-tion policy.

DW has appointed a dedicated dataprotection officer: currently this isThomas Gardemann, contact: [email protected] has published its data and privacyprotection policy on its distributionchannels. For the website, see:http://www.dw.com/en/data-privacy-policy/a-18265246DW has published its termsof use for interactive content:http://www.dw.com/en/conditions-of-participation/a-16372765DW has published its termsof use for the DW App 2.1:http://www.dw.com/en/general-conditions-of-use-for-the-dw-app-21/a-18532587

BBC UK andEU-wide*data pro-tectionlaws

Abide by the UK 1998 Data Pro-tection Act which controls howpersonal information is usedby organisations, businesses orthe government. Supportedby advice and guidelines fromthe Information Commissioner’sOffice (ICO), https://ico.org.uk,the UK’s independent body setup to uphold information rights.

The BBC provides personal data to anyconsortium member that needs a li-cence of our content.Those project partners have a licencein place with us that includes data pro-tection provisions. We have looked atways of reducing risks, for example in-stead of providing social media con-tent we are providing our data scrap-ing tool. BBC have worked with ourlegal team in Research and Develop-ment to draft data agreements betweenourselves and all consortium membersthat use BBC-sourced media for train-ing and testing purposes.We are supplying a means to scrape so-cial media to the platform and do notsupply any links to social media ac-counts.

page 32 of 38

Page 33: Scalable Understanding of Multilingual MediA (SUMMA)summa-project.eu/wp-content/uploads/2017/08/SUMMA... · SUMMA (Scalable Understanding of Multilingual Media) is a three-year H2020

SUMMA H2020–688139 D2.3 Updated Data Management Plan

UEDIN UK andEU-wide*data pro-tectionlaws

Abide by the UK 1998 Data Pro-tection Act which controls howpersonal information is usedby organisations, businesses orthe government. Supportedby advice and guidelines fromthe Information Commissioner’sOffice (ICO), https://ico.org.uk,the UK’s independent body setup to uphold information rights.

UEDIN has a data protection officerwho can be contacted at [email protected] has published its data pro-tection policy: http://www.ed.ac.uk/records-management/data-protectionUEDIN has a research datamanagement policy: http://www.ed.ac.uk/information-services/about/policies-and-regulations/research-data-policyThis is supported by theUEDIN’s Research Data Service:http://www.ed.ac.uk/information-services/research-support/research-data-service

UCL UK andEU-wide*data pro-tectionlaws

Abide by the UK 1998 Data Pro-tection Act which controls howpersonal information is usedby organisations, businesses orthe government. Supportedby advice and guidelines fromthe Information Commissioner’sOffice (ICO), https://ico.org.uk,the UK’s independent body setup to uphold information rights.

UCL has a data protection of-ficer who can be contacted [email protected] UCLhas published its data protectionpolicy: https://www.ucl.ac.uk/informationsecurity/policy/public-policy/DataProtectionPolicy1016.pdfUCL has a research datapolicy: https://www.ucl.ac.uk/isd/services/research-it/documents/uclresearchdatapolicy.pdf

University of Sheffield: UK and EU-wide* data protection laws

Abide by the UK 1998 Data Protection Act, which controls how personal information is used by organisations, businesses or the government. Supported by advice and guidelines from the Information Commissioner’s Office (ICO), https://ico.org.uk, the UK’s independent body set up to uphold information rights.

USFD has a research data management officer who can be contacted at [email protected]
USFD has published its research data management policy: https://www.sheffield.ac.uk/polopoly_fs/1.553350!/file/GRIPPolicyextractRDM.pdf
USFD has a data protection policy: https://www.sheffield.ac.uk/cics/dataprotection

IDIAP: Swiss and EU-wide* data protection laws

The protection of private data (mainly in the field of telecommunication) is stated in Article 13 of the Swiss Federal Constitution of 18 April 1999, which relates to the protection of private life information and to protection against abusive use of this private information. Furthermore, specific regulations related to data protection are stated in the federal law of June 1992 and the Acts of 14 June 1993, 30 April 1997, and 31 October 1997. More information about these regulations can be found at www.edoeb.admin.ch. On 26 July 2000, pursuant to Directive 95/46/EC of the European Parliament and of the Council on the adequate protection of personal data provided in Switzerland, the EC acknowledged the fact that Switzerland ratified (on 2 October 1997) the Council of Europe Convention on the Protection of Individuals with regard to Automatic Processing of Personal Data (http://conventions.coe.int/Treaty/en/Treaties/Html/108.htm). Based on the above Federal Law, all private data must be declared to, and cleared by, the Federal Data Protection and Information Commissioner (FDPIC, www.edoeb.admin.ch, art. 11a, al. 3, LPD).

IDIAP will declare private data by filling in one form for each data set. This form will be downloaded from www.leprepose.ch, or filed online at www.datareg.admin.ch.
Since June 2012, IDIAP has regularly declared to the FDPIC all its privacy-sensitive data (although used for research purposes only, and including voice, video, pictures, text, etc.), which is in some cases distributed through our own data portal (www.idiap.ch/dataset).
IDIAP is currently setting up its own Ethics Committee, aligned with EPFL practice: http://research-office.epfl.ch/research-ethics/research-ethics-assessment/epfl-human-research-ethics-committee/hrec

LETA: Latvian and EU-wide* data protection laws

Abide by the law that is intended to protect the fundamental rights and freedoms of individuals, in particular privacy, with regard to the gathering, processing and publishing of Personal Data.

LETA has appointed its Data Protection Officer, LETA lawyer Mr Janis Bulis, contact [email protected].

Priberam: Portuguese and EU-wide* data protection laws

Abide by the PT 1998 Data Protection Law [Lei n.º 67/98 de 26 de Outubro], which controls how personal information is used by organisations, businesses or the government.

The data protection officer role and data protection related matters are in the scope of the Direcção Jurídica, which can be contacted at [email protected]

QCRI: Qatar and EU-wide* data protection laws

The Qatar Data Privacy Law is currently awaiting signature. It has been developed to ensure that the personal data of companies and of their customers is kept confidential. The law defines personal information and sets out the measures that processors of that information are required to take to protect it. The law imposes penalties on all those who disclose any financial or non-financial information of their customers without their consent. As an organization located in Qatar, QCRI will be bound by this law.

As QCRI is not located in an EU country, EU laws are not legally binding on it. However, as a partner in an EU project, QCRI will follow the EU-wide data protection laws unless they conflict with Qatari laws.

Table 3: Partners’ legal responsibilities and actions

* Regulation (EU) 2016/679 (see footnote 7) of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). On 4 May 2016, the official text of the Regulation was published in the EU Official Journal in all the official languages. The Regulation entered into force on 24 May 2016 and shall apply from 25 May 2018.

7 http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:OJ.L_.2016.119.01.0001.01.ENG&toc=OJ:L:2016:119:TOC

C Roles of SUMMA Partners

Here we describe the role of each partner in terms of data protection, for example who is a data controller, a data provider and a data processor. These follow the definitions of these roles as described by the ICO (see footnote 8). In the SUMMA project all partners take part in decisions about what data is ingested, and how it is processed and stored. The data controller role is therefore shared amongst the partners. Furthermore, once the platform is open-source, anyone who downloads and runs the SUMMA platform is both a data processor and a data controller.

Deutsche Welle is a data provider in the first instance in the SUMMA project. It supplies audiovisual and textual data from some DW broadcast sources (online, live TV and social media in eight SUMMA languages). Its data is subsequently processed in the SUMMA platform and assessed by SUMMA and DW staff. A subselection of the processed content (all originating from previously broadcast DW material) is used for dissemination purposes. As a data provider, Deutsche Welle abides by German and European data protection laws and applies protective measures to protect (personal) data as well as the privacy of its users.

BBC is a provider of recorded Broadcast Media (in the form of data-dumps) as well as as-live Broadcast Media (stored within a secure AWS environment) for the purposes of training and testing the SUMMA Platform. BBC has also been responsible for providing the means to scrape social media feeds (not the feeds themselves) and for integrating this component into the SUMMA Platform. The BBC abides by UK and EU data protection laws. The BBC has also drafted its own Data Protection Policies and Information Security Policies and abides by these.

LETA is a data provider for Latvian public broadcast content. LETA is also a data processor and provides SUMMA Platform integration technology and hosting for DW content used in the SUMMA Project. Finally, LETA provides technology which is used for data processing, particularly the AMR (Abstract Meaning Representation) module. LETA abides by LV and EU data protection laws.

UEDIN is primarily a data processor. We process data supplied to us by BBC and DW both to train natural language models and to test them. The University of Edinburgh abides by UK and EU data protection laws. The University of Edinburgh has also drafted its own Data Protection Policies and Information Security Policies and abides by these.

PRIBERAM provides technology which is used for data processing. Priberam is also a data processor in the scope of the Project. Priberam abides by PT and EU data protection laws.

IDIAP/Switzerland is primarily a data processor.

UCL is primarily a data processor. We process data supplied to us by BBC and DW both to train natural language models and to test them. University College London abides by UK and EU data protection laws. University College London has also drafted its own Data Protection Policies and Information Security Policies and abides by these.

QCRI is primarily a data processor for Arabic speech recognition and Arabic to English machine translation. QCRI processes data supplied by BBC and DW. QCRI also supports the project by defining test sets, i.e., selecting some documents from the provided data to evaluate the Arabic language processing modules.

8 https://ico.org.uk/for-organisations/guide-to-data-protection/key-definitions/

Partner     Data Provider   Data Processor   Data Controller
BBC         Y               -                Y
DW          Y               -                Y
LETA        Y               Y                Y
UEDIN       -               Y                Y
PRIBERAM    -               Y                Y
IDIAP       -               Y                Y
Sheffield   -               Y                Y
UCL         -               Y                Y
QCRI        -               Y                Y

Table 4: Partners’ roles regarding data protection


SUMMA

H2020-ICT-2015 688139

D2.3 Updated Data Management Plan
