
Project funded by the European Union’s Horizon 2020 Research and Innovation Programme (2014 – 2020)

Coordination and Support Action

Big Data Europe – Empowering Communities with

Data Technologies

Deliverable 6.3: Pilot Evaluation and Community-Specific Assessment

Dissemination Level Public

Due Date of Deliverable M23, 30/11/2016

Actual Submission Date M24, 03/12/2016

Work Package WP6, Real-Life Deployment & User Evaluation

Task T6.3

Type Report

Approval Status Approved

Version 1.0

Number of Pages 36

Filename D6.3_Pilot_Evaluation_and_Community-Specific_Assessment.pdf

Abstract: Report summarizing the deployment of the pilots, the obtained results, the evaluation of the results, and the acquired recommendations for improvements.

The information in this document reflects only the author’s views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided “as is” without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/ her sole risk and liability.

Project Number: 644564 Start Date of Project: 01/01/2015 Duration: 36 months


History

Version Date Reason Revised by

0.1 26/09/2016 Initial version Ronald Siebes

0.2 14/10/2016 Outline Ronald Siebes

0.3 19/10/2016 SC update Michele Lazzarini

0.4 21/10/2016 SC update Timea Turdean

0.5 03/11/2016 SC update Iraklis Klampanos

0.6 07/11/2016 SC Update Luigi Selmi, George Papadakis

0.7 08/11/2016 SC6 information reworked Martin Kaltenböck

0.8 09/11/2016 SC update Fragiskos Mouzakis

0.9 14/11/2016 editorial work Ronald Siebes

1.0 19/11/2016 SC update Pythagoras Karampiperis

Author List

Organisation Name Contact Information

OpenPHACTS Bryn Williams-Jones [email protected]

VUA Victor de Boer [email protected]

VUA Ronald Siebes [email protected]

NCSR-D S. Konstantopoulos, A. Charalambidis, I. Mouchakis, G. Stavrinos

[email protected]

UoA George Papadakis [email protected]

NCSR-D/ SC5

I. Klampanos, S. Andronopoulos, M. Vlachogiannis

[email protected]

SWC Martin Kaltenböck Timea Turdean Jürgen Jakobitsch

[email protected] [email protected] [email protected]


Executive Summary

This document details the results and evaluation of the first round of pilots and provides the recommendations for the next piloting cycle. It follows the methodology described in D6.1. Given the high variety in requirements and domains, the focus is on a descriptive evaluation in which the feedback of domain experts from within and outside the BDE consortium, together with the feedback from the mid-term review meeting, leads to the recommendations for the second cycle of pilots.


Abbreviations and Acronyms

LOD Linked Open Data

SC Societal Challenge

BDE BigData Europe

GB Gigabyte

TB Terabyte

PB Petabytes

JSON JavaScript Object Notation

SC1 Societal Challenge 1

API Application programming interface

PDF Portable Document Format

JPEG Joint Photographic Experts Group

PNG Portable Network Graphics

GIF Graphics Interchange Format

GML Geography Markup Language

GeoTIFF Geographic Tagged Image File Format

TIF Tagged Image Files

GMLJP2 Geography Markup Language JPEG 2000

CCTV Closed-circuit television

CMIP5 Coupled Model Intercomparison Project Phase 5

CMIP6 Coupled Model Intercomparison Project Phase 6

CORDEX Coordinated Regional Climate Downscaling Experiment

SPECS Seasonal-to-decadal climate Prediction for the improvement of European Climate Services

HDF Hierarchical Data Format

NetCDF Network Common Data Form

ASCII American Standard Code for Information Interchange

MRI Magnetic resonance imaging

INSPIRE Infrastructure for Spatial Information in Europe

PPT PowerPoint templates


Table of Contents

1. Introduction ................................................................................................................ 8

2. Rationale for the Choice of Evaluation Methodologies ............................................... 9

3. Pilot Evaluations ...................................................................................................... 10

3.1 SC1: Health, Demographic Change and Wellbeing ................................................ 10

3.1.1 Use Case Description ...................................................................................... 10

3.1.2 Key Evaluation Questions: .............................................................................. 11

3.1.3 Other Evaluation Questions Based on the Requirements Specified in D5.2: ... 12

3.1.4 SC1 Summary and Recommendations for the Next Piloting Cycle .................. 13

3.2 SC2: Food Security, Sustainable Agriculture and Forestry, Marine and Maritime and Inland Water Research, and the Bioeconomy ..................................................................... 13

3.2.1 Use Case Description ...................................................................................... 13

3.2.2 Key Evaluation Questions ............................................................................... 14

3.2.3 SC2 Summary and Recommendations for the Next Piloting Cycle .................. 15

3.3 SC3: Secure, Clean and Efficient Energy .............................................................. 16

3.3.1 Use Case Description ...................................................................................... 16

3.3.2 Key Evaluation Questions ............................................................................... 17

3.3.3 SC3 Summary and Recommendations for the Next Piloting Cycle .................. 19

3.4 SC4: Smart, Green and Integrated Transport ........................................................ 19

3.4.1 Use Case Description ...................................................................................... 19

3.4.2 Key Evaluation Questions ............................................................................... 21

3.5 SC5: Climate, Environment, Resource Efficiency and Raw Materials .................... 23

3.5.1 Use Case Description ...................................................................................... 23

3.5.2 Evaluation Results .......................................................................................... 25

3.6 SC6: Inclusive, Innovative and Reflective Societies ............................................... 27

3.6.1 Use Case Description ...................................................................................... 27

3.6.2 Evaluation Approach & Key Evaluation Questions .......................................... 30

3.6.3 SC6 Summary and Recommendations for the Next Piloting Cycle .................. 31

3.7 SC7: Secure Societies ........................................................................................... 31

3.7.1 Use Case Description ...................................................................................... 31

3.7.2 Key Evaluation Questions ............................................................................... 32

4. Conclusion ............................................................................................................... 36


List of Figures

Figure 1: Work packages and respective project implementation phases .......................... 8

Figure 2: pilot planning ...................................................................................................... 9

Figure 3: Architecture of first SC1 pilot ............................................................................ 10

Figure 4: Architecture of SC2 pilot - Cycle 1 .................................................................... 14

Figure 5: Architecture of the first SC3 pilot ...................................................................... 16

Figure 6: Architecture of the first SC4 pilot ...................................................................... 20

Figure 7: Final architecture of the first SC5 pilot .............................................................. 24

Figure 8: Screenshot of the evaluation Jupyter notebook ................................................ 25

Figure 9: Architecture of SC6 pilot - Phase 1 ................................................................... 29

Figure 10: Architecture of SC6 pilot - Phase 1 ................................................................. 29

Figure 11: Architecture of the first SC7 pilot .................................................................... 32


List of Tables

Table 1: Evaluation questions for the first SC1 pilot ........................................................ 13

Table 2: Evaluation questions for the first SC3 pilot ........................................................ 19

Table 3: Evaluation questions for the first SC4 pilot ........................................................ 23

Table 4: Evaluation questions for the first SC7 pilot ........................................................ 35


1. Introduction

According to the planning in the Technical Annex, WP6 – Real-life Deployment & User Evaluation – starts at the end of year one and ends in the third quarter of year three (cf. Figure 1).

Figure 1: Work packages and respective project implementation phases

The pilot deployment and evaluations will be done in three cycles, so that each pilot can be improved, adjusted and extended based on the evaluation results of the previous cycle. Given the ambitious goal of the BDE project to provide an infrastructure that facilitates at least three pilots for each of the seven Societal Challenges, the evaluation methodology described in D6.1 for this first round of pilots is best served by a descriptive “lessons learned” approach, which will form the prescriptive “must-haves” for the second round of pilots, starting in M18.

During year one of the project, the first implementation of the BDE generic infrastructure was delivered (M12); the first cycle of pilots followed in M18. The design and architectural decisions are based on the numerous feedback sessions and interviews with the project partners and with the domain experts consulted by the various domain partners. The requirements and design specifications of the platform are described in deliverables 3.3, 3.5 and 4, which form the basis for the generic evaluation criteria. The specification of the first cycle of pilots for each of the seven challenges is worked out in Deliverable 5.2; however, after D5.2 was delivered the pilots of the first cycle were still being improved upon, and therefore the reader will notice some differences between the evaluation criteria from D6.1 and the aforementioned deliverables.


Figure 2: Pilot planning

2. Rationale for the Choice of Evaluation Methodologies

Deliverable 2.4.1 provides the results of an extensive requirements elicitation phase, which, combined with the technical requirements analysis in Deliverable 2.3 and the results of the interviews, contains the functional and non-functional requirements for the BDE Platform. The goal of the evaluation process is to investigate to what extent these requirements are met during the various implementation phases of this project.

For the functional and non-functional requirements of the generic infrastructure, the FURPS model1 is followed, which classifies software quality attributes as follows:

● Functionality - Capability (Size & Generality of Feature Set), Reusability (Compatibility, Interoperability, Portability), Security (Safety & Exploitability)

● Usability (UX) - Human Factors, Aesthetics, Consistency, Documentation, Responsiveness

● Reliability - Availability (Failure Frequency (Robustness/Durability/Resilience), Failure Extent & Time-Length (Recoverability/Survivability)), Predictability (Stability), Accuracy (Frequency/Severity of Error)

● Performance - Speed, Efficiency, Resource Consumption (power, ram, cache, etc.), Throughput, Capacity, Scalability

● Supportability (Serviceability, Maintainability, Sustainability, Repair Speed) - Testability, Flexibility (Modifiability, Configurability, Adaptability, Extensibility, Modularity), Installability, Localizability

1 "FURPS - Wikipedia, the free encyclopedia." 2011. 2 Nov. 2015 <https://en.wikipedia.org/wiki/FURPS>


The details of each of these requirements are different for each challenge and need to be addressed as such in our evaluation strategy. However, as mentioned before, the generic BDE infrastructure can also be evaluated independently of these challenges according to the FURPS model. This evaluation strategy for the first development cycle of the BDE infrastructure is described in section 2. The challenge-specific evaluation strategies require a fine-tuned approach, which is described in section 3.

3. Pilot Evaluations

The first cycle of pilots is specified in Deliverable 5.2. As mentioned before, each pilot is evaluated on both BDE-generic and pilot-specific requirements. The questionnaire described in section 2 deals with the generic part, while the pilot-specific questionnaires in this section cover the latter.

3.1 SC1: Health, Demographic Change and Wellbeing

3.1.1 Use Case Description

The first pilot in SC1 “Health, demographic change and wellbeing” tries to duplicate the OpenPHACTS functionality on the BDE infrastructure.

The OpenPHACTS functionality is twofold: 1) a RESTful interface to 2) an integrated RDF store containing the data relevant to drug discovery. The current Open PHACTS infrastructure uses some commercial components (e.g. the cluster version of Virtuoso for RDF storage and 3Scale for delegating the API requests to the reasoner). The “BDE Open PHACTS” pilot will use open-source solutions for RDF reasoning and will provision the API via the BDE infrastructure (cf. Figure 3).

Figure 3: Architecture of first SC1 pilot


3.1.2 Key Evaluation Questions:

1. Did you manage to store all the RDF data and answer curated queries in a reasonable time?

We successfully tested the storage on various machines. The following minimum system requirements have to be met in order to get the storage functional:

● 250 GB SSD disk space

● 32 GB RAM memory

● 8 CPU cores

● Recent x64 Linux distribution (e.g. Ubuntu 14.04 LTS, CentOS 7)

● Docker 1.7.1 or later

● Good internet connection when loading external data (only needed at the setup stage); this takes around 6 hours with an 8 Mb/s connection

● Docker Compose 1.5.2 or later

Tested on: CentOS 6.7 (with kernel 3.18.21-17.el6 - yum install centos-release-xen ; yum update) and Ubuntu 14.04 LTS

We also successfully tested the API, as documented in the SWAGGER file from which the web interface is rendered.

2. Were you able to fill the Puelia SPARQL templates with the HTTP-GET parameters to execute RDF queries on the 4Store DB?

Yes, Demokritos demonstrated a successful execution in which the Virtuoso RDF component is replaced by 4Store. Part of this effort is published at the BLINK workshop2.

3. How many of the 21 Open PHACTS research questions3 were you able to answer?

With the Virtuoso RDF Docker component we can answer 18 of the 21 questions. The remaining 3 cannot be answered because they depend on the patent data set, which is not available as an open dataset; using open data is one of the requirements in the BDE project.

2 Antonis Troumpoukis, Angelos Charalambidis, Giannis Mouchakis, Stasinos Konstantopoulos, Ronald Siebes, Victor de Boer, Stian Soiland-Reyes and Daniela Digles. Developing a Benchmark Suite for Semantic Web Data from Existing Workflows. Workshop on Benchmarking Linked Data (BLINK), ISWC conference, October 18, 2016, Kobe, Japan 3 http://www.openphacts.org/documents/registered/deliverables/D%206.1%20Prioritised%20Research%20Questions_final%20version.pdf
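To make the template mechanism above more concrete, the following minimal Python sketch shows how a filled-in SPARQL template could be executed against the pilot's RDF store. It is an illustration only: the endpoint URL, the example query and the vocabulary terms are assumptions and are not taken from the pilot code.

```python
# Hypothetical sketch: executing a filled-in SPARQL template against the
# pilot's RDF store (Virtuoso or 4Store). Endpoint URL, graph and property
# names are illustrative assumptions, not part of the pilot code.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:8890/sparql"  # assumed local SPARQL endpoint

# A template as Puelia would fill it: the compound URI comes from an
# HTTP-GET parameter of the RESTful API call.
TEMPLATE = """
SELECT ?target ?label WHERE {{
  <{compound_uri}> <http://example.org/vocab/interactsWith> ?target .
  ?target <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}} LIMIT 10
"""

def run_query(compound_uri):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(TEMPLATE.format(compound_uri=compound_uri))
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [(b["target"]["value"], b["label"]["value"])
            for b in results["results"]["bindings"]]

if __name__ == "__main__":
    for target, label in run_query("http://example.org/compound/42"):
        print(target, label)
```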


3.1.3 Other Evaluation Questions Based on the Requirements Specified in D5.2:

Requirement Evaluation questions

R1 The solution should be packaged in a way that makes it possible to combine the Open PHACTS Docker and the BDE Docker to achieve a custom integrated solution.

Is SWAGGER suitable for specifying a REST API in a dynamic distributed environment like the BDE infrastructure? Yes. Docker has a very intuitive networking model which makes it easy to deploy the RESTful API in a secure way and to set up a gateway via a public IP address. The Web GUI rendering of the SWAGGER script is done via a custom PHP script on the Puelia RDF web server.

R2 RDF data storage. What are the experiences with the Docker version of the open-source Virtuoso software with respect to this pilot? In general the software is performing very well: it is stable and able to load all the data in a reasonable amount of time (around 4 hours) on a high-end consumer PC. One issue that needs improvement is the way the Docker stack deals with temporary node failures and recovery from the cache files. Currently, if for any reason the Virtuoso process halts, the whole data loading procedure (which takes more than four hours) has to be repeated; this could be prevented by using the automatic persistent caching mechanism provided by Virtuoso.

R3 Datasets are aligned and linked at data ingestion time, and the transformed data is stored.

How does the BDE infrastructure communicate with the external IMS provider? The IMS became an independent Docker module and is initiated via the docker-compose script. This allows customizing the various parameters that deal with the communication between the other components, such as the Puelia RDF abstraction layer, the Virtuoso store and the IMS. What is the difference in delay between the current OPS system and the pilot version? When running the whole OPS Docker stack on the same LAN, the delay is in the order of milliseconds, which is much faster than the current OPS system.

R4 Queries are expanded or otherwise processed and the processed query is applied to the data.

Were any changes needed in the SPARQL templates due to the transition to another RDF store (it is known that some providers include extra ‘shortcuts’ and functionality beyond the SPARQL standard)? At this moment no changes were needed, because the first cycle of this pilot uses the same components as the current OPS system. The only difference is that the OPS system uses the commercial version of Virtuoso while the OPS-docker stack uses the open-source version of Virtuoso; there is no difference in the language constructs between the two versions. In the next cycle we are going to experiment further with other RDF stores like 4Store, which will most likely require adaptations to the query constructs.

R5 Data and query security and privacy requirements.

Are there currently vulnerabilities in the BDE infrastructure that might reveal any sort of communication to a 3rd party (e.g. queries and results, or IP addresses)? The current OPS-docker has no additional ACL functionality besides the built-in options from the Docker stack. The Docker stack provides sufficient mechanisms to 'shield' components from outside access by maintaining a virtual local network stack and bridging them internally. When desired, the Docker configuration file provides the possibility to proxy incoming and outgoing traffic to any of the components, like the OpenPHACTS explorer (part of the OPS-docker).

Table 1: Evaluation questions for the first SC1 pilot

3.1.4 SC1 Summary and Recommendations for the Next Piloting Cycle

The SC1 pilot is successful: all the requirements are implemented, the software can be easily deployed and the code is well documented. The feedback from the community gave clear directions for the next cycle:

● Broaden the community: not only drug research

● Add more functionality to support extended domain requirements

In the next piloting cycle the functionality and data sources will be extended to address the domain of food safety in the context of academic and not-for-profit organisations.

3.2 SC2: Food Security, Sustainable Agriculture and Forestry, Marine and Maritime and Inland Water Research, and the Bioeconomy

3.2.1 Use Case Description

The pilot was carried out by SWC, AK and FAO. The goal of the SC2 Pilot Cycle 1 was to demonstrate / evaluate the ability of BDE proposed technologies to complement existing community-driven systems (e.g. VITIS for the Viticulture Research Community) with efficient large-scale back-end text-mining workflows, which can utilise any Spark-based Natural Language Processing (NLP) module.

The focus of the demonstrator was on the Big Data aspects of such a workflow (i.e. storage, messaging and failure management) and not on the specificities of the NLP modules/tools used in this demonstrator.

The goal of the text-mining workflow demonstrator was to automatically annotate scientific publications relevant to Viticulture, available at FAO/AGRIS4 and NCBI/PubMed5 in PDF format (about 26K and 7K publications respectively), by extracting (a) named entities (locations, domain terms), (b) images / figures and tables as digital objects, and (c) the captions of images / figures and tables.

4 http://agris.fao.org/agris-search/home 5 www.ncbi.nlm.nih.gov/pubmed

The extracted information (metadata and digital objects) extends the Knowledge Base of the VITIS application: metadata are stored as triples in GraphDB, and digital objects (files) are stored in HDFS.

Figure 4 presents the architecture containing BDE components which was used in the SC2 Cycle 1 pilot.

Figure 4: Architecture of SC2 pilot - Cycle 1

3.2.2 Key Evaluation Questions

1. Can the proposed architecture easily handle a new variety of data models and formats? Although the overall architecture is data-model and format agnostic by design (data is moved between components as streams of bytes), it exposes a specific point of control where a variety of data formats can easily be tackled. For our use case we created a plugin system using Java’s Service Provider Interfaces (SPI), so as to enable framework extension and component replaceability. The different data types provided on harvesting are unified such that the rest of the pipeline can work on one data model and format.
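The pilot implements this plugin point with Java SPI; purely as an illustration of the idea, the sketch below shows an analogous plugin registry in Python that normalises source-specific records into one common model. All class, field and source names are hypothetical.

```python
# Conceptual Python analogue of the pilot's Java SPI plugin mechanism:
# each harvester plugin converts its source-specific input into one common
# record model used by the rest of the pipeline. Names are illustrative.
from dataclasses import dataclass

@dataclass
class CommonRecord:
    source: str      # e.g. "AGRIS" or "PubMed"
    identifier: str  # publication identifier
    payload: bytes   # raw PDF content, passed on as a stream of bytes

# Registry playing the role of Java's ServiceLoader
HARVESTER_PLUGINS = {}

def harvester(source_name):
    """Decorator registering a plugin for a given source."""
    def register(func):
        HARVESTER_PLUGINS[source_name] = func
        return func
    return register

@harvester("AGRIS")
def from_agris(raw):
    return CommonRecord("AGRIS", raw["arn"], raw["pdf_bytes"])

@harvester("PubMed")
def from_pubmed(raw):
    return CommonRecord("PubMed", raw["pmid"], raw["pdf_bytes"])

def normalise(source_name, raw):
    """Dispatch to the registered plugin; new sources only need a new plugin."""
    return HARVESTER_PLUGINS[source_name](raw)
```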

2. Can data be processed in a continuous manner depending on when it is harvested? Yes. Through the use of Apache Flume the pipeline is continuously accepting new input. Through the use of Apache Kafka those chunks of data of any kind are distributed in a failsafe manner, meaning that the system knows which chunks of data have already been processed successfully.

3. Can the pipeline, in case of a failure, recover and continue from where it left off? Yes; as mentioned above, Apache Kafka will store which messages have been successfully processed by the data processing unit (e.g. a Spark job). As an example, if the HDFS system is full, the pipeline can continue from where it left off once the HDFS system gets more space.
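The following minimal sketch, using the kafka-python client, illustrates the recovery behaviour described above: offsets are committed only after a chunk has been processed successfully, so a restarted consumer resumes from the last committed offset. The topic name, consumer group and broker address are assumptions, not the pilot's actual configuration.

```python
# Minimal sketch of the recovery behaviour described above, using the
# kafka-python client. Topic, group id and broker address are assumptions.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "harvested-publications",          # assumed topic fed by Flume
    bootstrap_servers="kafka:9092",
    group_id="annotation-pipeline",    # offsets are tracked per consumer group
    enable_auto_commit=False,          # commit only after successful processing
    auto_offset_reset="earliest",
)

def process(message_value):
    # placeholder for the real work, e.g. writing the digital object to HDFS
    ...

for message in consumer:
    try:
        process(message.value)
        consumer.commit()   # mark this chunk as successfully processed
    except IOError:
        # e.g. HDFS full: do not commit; after the problem is fixed the
        # consumer restarts from the last committed offset
        break
```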

4. Is it possible to parallelize the analysis process? Parallelization is guaranteed by the data processing unit in use, such as Apache Spark or Apache Flink. Apache Kafka will additionally make sure that data messages are distributed evenly amongst all consumers, e.g. a Spark job running on multiple nodes.

5. Can we easily scale up? Yes, all components that have been chosen to be part of the implementation of this architecture are horizontally scalable by design.

3.2.3 SC2 Summary and Recommendations for the Next Piloting Cycle

SC2 Pilot Cycle 1 demonstrated the ability of BDE proposed technologies to set up efficient large-scale back-end processing workflows. The ability to effectively handle pipeline failures showcases the choice of Apache Kafka, especially when combined with Apache Flume for seamless data input.

Recommendations for the next piloting cycle include the extension of functionality, by extending the Flume/Kafka pipeline to handle data other than bibliographic data (e.g. sensor / weather data), and the inclusion of use case scenarios which combine / link more heterogeneous data sources.


3.3 SC3: Secure, Clean and Efficient Energy

3.3.1 Use Case Description

The first pilot in SC3 demonstrates the condition monitoring workflow in cases where high volumes of sensor data streams are collected and need to be managed and analysed for operational state identification and model research. It is foreseen that custom analysis modules created by the user will perform the analytics tasks.

The data involved are sensor data collected by a distributed data acquisition system, specifically built for the BDE pilot case. The data stream reaches a volume of 16 Gb per hour. The aim is to support condition monitoring and research on the specific unit.

All data are delivered in an industrial third-party structured binary format. A preprocessor performs the transformation of the raw data so that it can be handled in HDFS and by the SPARK executor. Analytics result data are stored in PostgreSQL for the subsequent query and presentation tasks.

Data processing includes the near-real-time execution of parameterized models to extract operational parameters, the periodic evaluation of system and condition models, and the update and execution of operational statistics.

Figure 5 presents the architecture containing BDE components which was used in the first SC3 pilot.

Figure 5: Architecture of the first SC3 pilot


3.3.2 Key Evaluation Questions

1. Were you able to store and retrieve the binary blobs of temporal slices of the sensor data by using the HDFS module from the BDE?

Raw data are stored in binary files. Each file corresponds to a specific data acquisition unit and contains data of multiple sensors for a specific time window. Each file contains the metadata for the subsequent management and analysis. For the analysis of the raw data, a preprocessor is utilised to transform the data so that it can be used by HDFS and the SPARK executor. The key action is the reconstruction of the binary stream so that the data chunks distributed by the executor correspond to the time length predefined (by the engineering requirements) for the analysis.

2. Were the processing algorithms able to efficiently work with the HDFS module?

The raw data analytics are supported by independent binaries running in parallel that process the binary stream handed by the SPARK executor.
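A minimal PySpark sketch of this pattern is given below: binary chunks stored in HDFS are distributed to the executors, and each chunk is handed to an external, user-provided analysis binary. The HDFS path and the name of the analysis executable are hypothetical and not taken from the pilot.

```python
# Sketch of the pattern described above, using PySpark: binary chunks stored
# in HDFS are distributed to executors, and each chunk is processed by an
# external analysis binary. The HDFS path and the binary name
# ("analyse_chunk") are illustrative assumptions.
import subprocess
from pyspark import SparkContext

sc = SparkContext(appName="sc3-condition-monitoring")

# Each element is a (filename, bytes) pair; one file = one DAQ unit / time window
chunks = sc.binaryFiles("hdfs:///sc3/preprocessed/")

def analyse(filename_and_content):
    filename, content = filename_and_content
    # Hand the binary chunk to the user-provided analysis executable via stdin
    result = subprocess.run(["analyse_chunk"], input=content,
                            stdout=subprocess.PIPE, check=True)
    return (filename, result.stdout.decode("utf-8"))

# Results (e.g. operational statistics) could then be written to PostgreSQL
for name, stats in chunks.map(analyse).collect():
    print(name, stats)
```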

3. Please describe the positive and negative points of the first GUI for visualizing and querying the (derived and raw) data.

The first GUI is kept minimal, yet it provides the basic functionalities.

4. Other evaluation questions based on the requirements specified in D5.2:

Requirement Evaluation questions

R1 The data will be sent (via ftp, or otherwise) from the intermediate (local) processing level to BDE.

How was the data transferred to the BDE HDFS module and/or Cassandra module? A data preprocessor was developed that provides the transformation of the files to binaries handled by HDFS. Thus the format of the raw data is irrelevant, and as such the applicability of the pilot is enhanced.

R2 The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources.

Did you simulate ‘outages’? If yes, was the data connector able to request the missing data? The BDE pilot scans the local data repository and processes data upon file reception. In the current version the pilot does not perform data transfer from the data acquisition units to the data repository. This task will be examined in the second pilot round.

R3 Binaries of existing analysis software will be used

Please describe how Spark is fulfilling your analysis requirements. What are the positive and negative experiences?


The developed SPARK executor performs the distributed parallel analysis on the data chunks with the user-provided binaries. Apart from the parallel analysis approach, the capability for the user to provide proprietary modules is the key advantage of the pilot.

R4 Weekly execution of model parameterization and operational statistics.

The development of the analysis tool is one of the key components for this pilot. Did you use any of the BDE scheduling tools for initiating the weekly executions? Were the statistical results according to your expectations? The analysis is performed by existing codes delivered as binaries compatible with the pilot. As the aim is long-term monitoring and system degradation identification, this evaluation item is ongoing (the pilot will remain active for one year).

R5 Near-real time execution of parameterized models to return operational statistics, including correlation analysis of data across units.

Was the BDE infrastructure able to successfully process the near-real-time data analysis? Please elaborate. The BDE pilot concept is not to substitute a SCADA system, which performs the real-time (yet low-data-volume) monitoring for basic operation and safety (for example, a very high vibration recorded by a conditioner with a refresh rate of 1 second drives the SCADA to issue a warning and stop the system). The pilot supports the handling of data streams for identifying, for example, trends in system degradation, and as such a latency of hours is acceptable. Due to the volume of the data, its transfer from the data acquisition point to the BDE pilot is not performed over a direct link (the DAQ units and the BDE pilot host cluster are not on a high-speed network allowing real-time transfer). In the second round the transfer of reduced data volumes will be addressed.

R6 Flexibility of data file input formats for future applications.

Please specify the formats and data models that the ingestion component supports after delivering this first pilot. How did you test the correctness of these results (e.g. which external tools were able to read the results successfully)? The analysis binaries dictate the format and data model of their input. Thus these binaries, along with the data preprocessor, are provided per case by the user. In the subsequent evaluation the pilot will be used by a stakeholder for their specific case.

R7 The GUI supports database querying and data visualization for the analytics results.

Please describe the experiences with the GUI. In particular, which tasks are you able to do via this GUI? The GUI in the first round supports the basic functionalities; in the second cycle the tool will be enriched.

Table 2: Evaluation questions for the first SC3 pilot

3.3.3 SC3 Summary and Recommendations for the Next Piloting Cycle

During the first cycle of the SC3 pilot, the effort was put into the data acquisition and data analysis driver. The concept facilitates the use of the pilot by stakeholders, as it is open to incorporating third-party analytics binaries. Also, the pilot adequately addresses all relevant cases with high sensor data volumes (specifically, highly sampled sensors).

Yet, there are fields that have to be addressed in the next rounds as follows:

● demonstration of BDE platform real-time use for the case of a high number of low-sampled sensors. The same DAQ infrastructure will be used, where at FPGA level the data will be reduced in volume. In this case the pilot will handle real-time processes in parallel to the off-line processes of the first version

● inclusion of a forecasting module related to production. This will be facilitated by the inclusion of BDE components used in other pilots

● further investigation into binary data stream handling options

● enrichment of the GUI

3.4 SC4: Smart, Green and Integrated Transport

3.4.1 Use Case Description

In the first cycle of the pilot the focus of our work has been on acquiring the knowledge of the domain experts, learning about the use cases that have been addressed using legacy systems, and implementing some of them to show how big data frameworks, and specifically the tools developed within the BDE project, can help to solve problems like scalability and fault tolerance. We established bi-weekly calls with our partner CERTH-HIT, a research institute in the field of transportation in Greece. CERTH-HIT has provided the expertise in the traffic domain, the data to test the pilot, and the map-matching algorithm as an R script. In the first cycle we have used the Floating Car Data (FCD) through a web service that CERTH-HIT provides as open data.

There were four main requirements for the pilot in the first cycle. First, being able to ingest the data as streams, not only as batch files, as the data comes from sensors and GPS devices and its value depends on the speed with which it is processed and delivered. The second requirement was to match the location of the vehicles (taxis) carrying the GPS devices, given as longitude and latitude coordinates, to the road segments (links) of the road network extracted from a subset of the OpenStreetMap database. Third, the geographical data must be indexed in order to achieve low latency. Our fourth and last main requirement in the first cycle was to deploy the pilot and its components using the Docker images of the components, so that the pilot can be started using Docker containers.

The components chosen for the transport pilot are currently in use in companies that provide transport services. The architecture is based on the concept of microservices, where an application or service is implemented using different loosely coupled components for data ingestion, communication, processing and storage. Apache Kafka is used as a message broker and allows the components to communicate asynchronously. Apache Flink is used to process the stream data, like the near-real-time taxi data provided by our partner CERTH, and also batch data like the historical data. Elasticsearch, a document database based on the open source search engine Apache Lucene, is used to index and store the records after the processing. All these components and others specific to our pilot are provided by the BDE project as Docker images that can be deployed on a single host where a Docker engine has been installed, or in a distributed environment with nodes that are members of a Docker swarm. The BDE platform also provides a tool to orchestrate and start the components according to their dependencies, and user interfaces to monitor the components used by an application.

The pilot in the first cycle can ingest the taxi records and match each record containing the geographical coordinates of the vehicle to the closest road segment, or “link” in the terminology of OpenStreetMap, which provides the road network data. The link identifier is added to the record, which is then stored in Elasticsearch. We use Kibana to visualize the results of queries, for example showing on a map the number of vehicles on each road segment in a certain time window.

Figure 6: Architecture of the first SC4 pilot


3.4.2 Key Evaluation Questions

Q1) What have been the main considerations in designing the architecture for the pilot?

The data sources in the transport domain are characterized by the time dimension and the spatial dimension, and most of the data comes as streams of records whose value decreases quickly with time. A simple use case like showing the number of taxis on a road segment within a time window needs a component that can buffer the GPS records and send them to a consumer that can process the data, e.g. by matching the coordinates of the vehicles to the road segments, and finally store the result so that it can be used for visualization or further processing.
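As an illustration of such an ingestion component, the sketch below shows a simple producer that polls the FCD web service and publishes each record to a Kafka topic for downstream processing. The service URL, topic name and broker address are placeholders, not the pilot's actual values.

```python
# Sketch of such a producer: fetch FCD records from the web service and
# publish them to a Kafka topic for the stream processor to consume.
# The service URL, topic name and broker address are placeholders.
import json
import time

import requests
from kafka import KafkaProducer

FCD_SERVICE = "http://example.org/certh-hit/fcd"   # placeholder URL
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

while True:
    records = requests.get(FCD_SERVICE, timeout=30).json()
    for record in records:            # each record: taxi id, lat, lon, speed, timestamp
        producer.send("taxi-fcd", record)
    producer.flush()
    time.sleep(60)                    # poll the service once per minute
```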

Q2) What have been the specific issues in dealing with stream data?

Contrary to the case of batch processing, stream data is by definition unbounded, and messages can be lost or arrive late even if they refer to an event that happened before the last processed ones. Therefore a stream data processing pipeline must decide when a computation must be performed. This is usually done by defining a time window; for example, the count of the records will take into account all the records that have arrived within an hour, every hour. Most of the time we want the data to be processed according to the event time, usually set as a timestamp in the record. Unfortunately some messages can arrive late, and the stream processor should be able to handle these records.
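The pilot relies on Apache Flink for event-time windowing; the small, self-contained Python sketch below only illustrates the underlying idea, namely that records are assigned to a window based on the timestamp they carry rather than their arrival time. The field names and window size are assumptions.

```python
# Pure-Python illustration of an event-time tumbling window (the pilot itself
# uses Apache Flink for this): records are assigned to a window based on the
# timestamp they carry, not on their arrival time, so late records still end
# up in the correct window. Field names are assumptions.
from collections import defaultdict

WINDOW_SECONDS = 15 * 60   # 15-minute tumbling windows

def window_start(event_timestamp):
    """Floor the event time to the start of its tumbling window."""
    return event_timestamp - (event_timestamp % WINDOW_SECONDS)

counts = defaultdict(int)   # (window_start, link_id) -> number of records

def on_record(record):
    key = (window_start(record["timestamp"]), record["link_id"])
    counts[key] += 1

# A late record with an old timestamp is still counted in its own window:
on_record({"timestamp": 1468324805, "link_id": "osm-123"})
on_record({"timestamp": 1468325995, "link_id": "osm-123"})
on_record({"timestamp": 1468324810, "link_id": "osm-123"})  # arrives late

for (start, link), n in sorted(counts.items()):
    print(start, link, n)
```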

Q3) What are the pros and cons of the frameworks used?

A record must be sent to Kafka with a schema so that it is easier for other components to use the data. Flink also needs a schema to specify which field is the timestamp, so that it can manage checkpoints and watermarks and process the data using the event time. The problem is that currently we do not have a component that can be used to share the schemas among all the components that consume a certain data set. This problem has been addressed in the commercial version of Apache Kafka and will likely be included soon in the open source version.

Q4) How did the integration of R in the pilot work? Can it be improved?

R is known to be a great tool for performing statistical analysis on a single machine, but it does not scale out in its default version. There have been improvements in recent times to run R on more than one core and also in distributed environments based on Hadoop. It is increasingly adopted as a programming language by Big Data frameworks such as Apache Spark, and it is supported not only by a strong and active community of statisticians but also by software companies like Microsoft as a powerful complement to SQL. In our pilot, for the first cycle, we used R to implement the map-matching algorithm because the developer at CERTH had implemented it in Matlab and R. The reason to use R in a distributed environment instead of switching to some other programming language is twofold. First, R has lots of packages with good implementations of many algorithms; second, many researchers use R to implement algorithms used in statistics or machine learning and are not familiar with languages like Java or Scala, which are more common in production environments.

Q5) What are the benefits of using Dockerized versions of the pilot components, and of the BDE platform in particular?

Docker enables the separation of concerns between the development phase of an application and its testing and deployment in different environments. This is particularly useful in a microservices architecture where different components are used to provide a service. Providing each component as a Docker image to be run in a Docker container, each with its own operating system, software packages, data and configuration files, reduces the complexity of deploying an application in different environments and enables the update of each component without worrying about possible unwanted consequences or side effects on the other components. The BDE platform provides many components to build Big Data architectures, and tools to orchestrate and manage the components of a service deployed on a single host or on multiple nodes in a Docker swarm. The BDE platform also provides a common user interface to monitor all the components used in an application.

Other evaluation questions based on the requirements specified in D5.2 are outlined in the Table below:

Requirement Evaluation Questions

R1 Ingest the FCD stream data and historical data

What is the format of the data and how is it imported into the pilot? CERTH-HIT provides the near-real-time FCD data as a web service in JSON format and the historical data as zipped CSV files. We developed an application (producer) to fetch the data from the web service (source) and send it to a Kafka topic, where a Flink application (consumer) processes the data.

R2 Ingest Bluetooth sensors stream data and historical data (raw detections number as well as travel times between BT sensors)

The Bluetooth sensor data has not been used in the first cycle.

R3 Import the geographical data in a relational database.

Did you use a relational database to import the road network used in the map-matching algorithm? The geographical data was imported into PostGIS in order to be used by the R script that implements the map-matching. Since the performance was not good enough, we decided to load the OSM data from a shapefile into the memory available to the same Docker container where Rserve and the map-matching script are deployed.

R4 The FCD data is processed using the geographical data to match the position of cabs to roads

How did you implement the map-matching algorithm? The map-matching algorithm is used to match the location of a vehicle to a road segment (link) from the OSM database. The algorithm has been implemented by CERTH as an R script; R normally runs as a single process. The R script was chosen because our partners at CERTH use Matlab or R as programming languages, not Java or Scala, which are more common in big data frameworks, and because R provides many packages that perform well with complex algorithms.

R5 The pilot will enable the evaluation of the present and future traffic condition (e.g. congestion) within temporal windows.

Can you explain how the data, enriched by the map-matching algorithm with the information about the road segment, can be used to infer the road status? The map-matching algorithm adds to each FCD record a field storing the identifier of the closest road segment. Using this information it is possible to say how many taxis were on that road segment in a certain time interval and what the average speed was, and to infer the road status (normal/congested). The status can be determined by setting a threshold on the number of taxis and the average speed recorded on a road segment in a time frame (e.g. 15 minutes).

R6 The traffic conditions and predictions will be saved in a database

Are the data aggregated for further analysis? In which format? How will the predictions be compared with the actual data? The pilot saves the records from the taxis, with the road segment information and the timestamp, in Elasticsearch, a document database. Elasticsearch indexes the records using the location (latitude and longitude) and the timestamp. A simple query can be sent to the database to find out how many taxis were on a road segment and what the average speed was within a certain time interval (an illustrative query sketch is given after the table). The status can be determined as normal if the number of taxis is below the density threshold or the average speed is above the velocity threshold, and as congested if the number of taxis is above the density threshold and the average speed is below the velocity threshold.

R7 The pilot can be started in two configurations: single node (for development and testing) and cluster (production)

Can the pilot be started in a single node for development and testing? The pilot (cycle 1) can be started manually on a single node using the Docker images of the BDE components and of its own specific components. The BDE components are Zookeeper, Kafka, Flink and Elasticsearch. The components developed specifically for the pilot are available from the same GitHub repository as the BDE project. A project in the repository (pilot-sc4-cycle1) contains the docker-compose file used to start the pilot.

Table 3: Evaluation questions for the first SC4 pilot
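The sketch referred to in R6 above is given here: an aggregation query, expressed with the elasticsearch-py client, that counts taxis and averages their speed per road segment within a recent time interval and derives a simple status from thresholds. The index name, field names and threshold values are assumptions, not the pilot's actual schema.

```python
# Sketch of the query described for R6, using the elasticsearch-py client.
# Index name and field names (link_id, speed, timestamp) are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

query = {
    "size": 0,
    "query": {
        "range": {"timestamp": {"gte": "now-15m", "lte": "now"}}
    },
    "aggs": {
        "per_link": {
            "terms": {"field": "link_id"},
            "aggs": {"avg_speed": {"avg": {"field": "speed"}}}
        }
    },
}

response = es.search(index="fcd-records", body=query)
for bucket in response["aggregations"]["per_link"]["buckets"]:
    taxis = bucket["doc_count"]
    avg_speed = bucket["avg_speed"]["value"]
    # illustrative density / velocity thresholds
    status = "congested" if taxis > 50 and avg_speed < 20 else "normal"
    print(bucket["key"], taxis, avg_speed, status)
```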

3.5 SC5: Climate, Environment, Resource Efficiency and Raw Materials

3.5.1 Use Case Description

The first pilot in SC5 “Climate, environment, resource efficiency and raw materials” developed a series of functionalities allowing big data (as instantiated by the BDE platform) to be brought closer to institutional infrastructures operating climate modelling procedures, and in particular downscaling via the WRF modelling system. These functions operate mainly on WRF-compatible NetCDF data and provide data ingestion, exporting, the recording of data lineage and basic data analytics. Functions are parameterisable via user-defined parameters, such as geographical areas, time periods, physical variables and time steps.

Ingestion would ingest a chosen NetCDF file into the BDE platform (into Hive specifically). The ingested data would acquire a Hive key and would then be queryable either by the key or by its metadata. Exporting would be the selection of data present inside the BDE platform and its exporting to a NetCDF file. Basic analytics functionality was implemented as Hive queries onto the raw data previously ingested, either from a NetCDF file acquired through ESGF (Earth System Grid Federation) or directly as a product of a WRF downscaling operation. Basic data lineage was implemented as a small subset of the W3C PROV standard (https://www.w3.org/TR/prov-overview/). The source code for the 1st SC5 pilot prototype can be found at https://github.com/iaklampanos/bde-climate-1.
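To give a flavour of what such basic analytics look like in practice, the sketch below issues a Hive query for the average of a physical variable over a user-defined area and time period via the PyHive client. The table layout, column names, dataset key and host are assumptions; the actual implementation is the one in the bde-climate-1 repository linked above.

```python
# Sketch of the kind of basic analytics the pilot exposes as Hive queries:
# average of a physical variable over a user-defined geographical area and
# time period. Table and column names, the dataset key scheme and the Hive
# host are assumptions.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()

cursor.execute("""
    SELECT AVG(value) AS mean_t2
    FROM climate_data
    WHERE dataset_key = 'wrf_run_2016_07'
      AND variable = 'T2'
      AND lat BETWEEN 34.0 AND 42.0               -- geographical area
      AND lon BETWEEN 19.0 AND 28.0
      AND t BETWEEN 1467331200 AND 1469923200     -- time period (epoch seconds)
""")
print(cursor.fetchone())
```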

The final design describing the 1st SC5 pilot is depicted below. Differences in the final design are due to practicalities, such as the ESGF web services being unavailable for maintenance until after implementation had commenced.

Figure 7: Final architecture of the first SC5 pilot

The functionality described above was wrapped inside a Web-based Jupyter interactive python UI (http://jupyter.org) for hangout participants to use and evaluate. This evaluation exercise took place during the SC5 hangout of 12 July 2016.


Figure 8: Screenshot of the evaluation Jupyter notebook

The evaluation performed was qualitative. Measuring performance in terms of time, disk I/O, etc. was not applicable to the purposes of the pilot, which were to supplement the local/institutional downscaling modelling workflow of climate researchers. As such, the heavy climate modelling would still take place on appropriately configured clusters outside BDE, with exporting and ingestion performance being predictable and proportional to the NetCDF size in terms of variable number and length.

3.5.2 Evaluation Results

The hangout of 12 July 2016 was attended by approximately 16 people representing the climate research community. The 1st pilot was sampled by 8 people, 7 of whom provided feedback by filling in the evaluation form. The evaluation questions and the outcome of the 1st SC5 pilot are as follows:

Domain of expertise (accumulated)

Environmental & Agricultural Science, Remote Sensing Analyst Climate Research, Data Analytics, Data Dissemination Platform, Big Data, ICT, Software Development, Regional Climate Modeling, Scientific Computing, Climate research


How would you grade the NetCDF ingestion and exporting functionality of the pilot?

How would you grade the potential of data lineage for your work?

How would you grade the analytics potential of BigDataEurope for your field of interest?

What additional analytics would you consider important for your line of work?

● averages, extremes, interpolation using additional datasets, e.g. height info, station data

● multi-temporal analytics

● Simple statistics, climate system processes, climate indices

● temporal analytics

● min/max, average etc.

● seasonal analytics of max/min temp, differences, more climate indices

What other environment-related thematic areas or applications do you see the pilot being potentially used/expanded/adapted for?

● impact modelling in various domains (water management, agriculture, disaster risk reduction etc.)

● for urban applications could be interesting, especially with heat-island analytics

● Meteorology, climate change impact models

● all impact fields (energy, tourism, agriculture etc)

● ocean CFD simulation (Sirocco etc), urban scale environmental CFD simulation etc

● seasonal forecasting

Any Other Comments?

“It would be nice to see some background regarding the technologies used. The lineage registration is very useful, although it might be improved by allowing users/modellers to also provide some "manual" documentation.”

“The proposed technology is very interesting and innovative. It could be interesting how all these steps works considering a more user-friendly interface (e.g. web-based clients). Even if the pilot is not directly related to my field of work, the approach proposed could be useful and potentially extendable to cover other Earth-Observation areas.”

“Work needed on the user interface. Could be teamed up with climate4impact.eu technologies (e.g. web interface, ESGF search front API)”

Based on the evaluation outcome presented above, we conclude that (a) users are keen on using the BDE platform for specific, albeit very diverse, analytics goals, and (b) that meta-services such as data-lineage and provenance will prove very important for the development of future research in the climate research area. The overall functionality of the pilot is available via the BDE pilot source code distribution, and allows users to add this functionality to their solutions.

3.6 SC6: Inclusive, Innovative and Reflective Societies

3.6.1 Use Case Description

The pilot was carried out by SWC, CESSDA and NCSR Demokritos and the title of the pilot is:

Citizens & Researchers Budget on Municipal Level.

Problem Statement

Local government budgets (e.g. of cities) are filled with so many numbers and so much technical jargon that ordinary readers cannot easily understand what they mean. Such terminology is not fully harmonised (only at regional level, sometimes at national level, but not at cross-country level in the European Union). This means that A) budget data is hard to understand for non-experts (and sometimes even for experts) and B) budget data is hard to analyse and/or compare across municipalities because of the different terminology (and thereby also meaning) used.

Additionally, the 24 languages used in the EU28 are an issue for this Pilot that needs to be solved, as data descriptions (e.g. attributes) are often available in local languages only. At the same time, data on municipal level cannot be widely reused by researchers because they are often aggregated in time and scale and do not use common international standards.

Finally, we see that more and more municipalities across Europe and beyond are publishing budget data as open data (but in heterogeneous structures and formats, so we face an issue of Variety), and that municipalities are starting to publish budget execution data on a continuous basis (daily, weekly, monthly), which requires real-time data harvesting and processing as well as an infrastructure that can deal with large amounts of data and can process and analyse such data fast, sometimes even in parallel streams.

Hence, a clear and simple summary guide to the municipal budget, together with the underlying data, is needed both for citizens (the so-called citizens’ budget) and for researchers.

Objectives

Main objective: can we make budgets more useful for citizens, researchers and decision makers?

The SC6 pilot is about acquiring and harvesting, processing and analysing economic data (budget data and budget execution data) from several municipalities in Europe; the starting point in Phase 1 are the municipalities of Thessaloniki, Athens and Kalamaria in Greece. The system has, however, been designed so that additional municipality data sources can be docked onto the environment at a later stage.

In Phase 1 of the pilot we are building the basic infrastructure for the pilot system on top of the BDE core components (and the BDE platform; see details below).

The following Big Data requirements have been identified as part of this pilot:

● Variety: requirement based on the harvesting of budget data and budget execution data from several sources, available in different structures and formats.

● Volume: requirement regarding the growing amount of open budget data available, as well as of budget execution data.

● Velocity: requirement regarding budget execution data that is provided on a continuous basis by the publisher (daily, weekly, monthly).

● Veracity: veracity refers to the biases, noise and abnormalities in data. Even within the same country there are differences in the published data, because the data often comes from different systems or because public accounting standards are not enforced entirely uniformly (e.g. by different municipal departments).

The overall objective of the pilot system, after all planned phases, is to provide municipal budget data (budget data and budget execution data), other socio-demographic data, and potentially contract (document) data from several sources in different (a) languages, (b) structures and (c) terminologies. These data will be harvested, normalised, processed and analysed, and finally distributed/provided through the Web as:

● Open Data in RDF format (and other formats provided by the endpoint/API)

● Auto-updated infographics (in the form of a dashboard)

● Simplified categories of revenues and expenses

● Comparative Economic Indicators per municipality and across municipalities

● Data dumps

● SPARQL endpoint


Figures 9 and 10 present the architecture, built from BDE components, that was used in the SC6 Cycle 1 pilot.

Figure 9: Architecture of SC6 pilot - Phase 1

Figure 10: Architecture of SC6 pilot - Phase 1


3.6.2 Evaluation Approach & Key Evaluation Questions

Evaluation Approach

The approach for the evaluation of the SC6 Pilot is as follows:

● Invite municipalities to evaluate and use the system

● Invite community (open data, data community, BDE community, W3C) to use the system and provide feedback

● Evaluate the pilot within the 3 participating projects (BDE, DataStories, OpenBudget)

● Discuss the pilot at the BDE SC6 workshop in Cologne, taking place on 5.12.2016, as well as at the overall BDE Tech WS (November 2016, in the course of ApacheCon)

Additional evaluation / tests over time with:

● A growing amount of data

● A growing number of different sources & formats docked onto the system

● Additional analytics in place

Pilot Key Questions

1. Can the proposed architecture handle a variety of data models and formats from different data sources?

Although the overall architecture is data model- and format-agnostic by design (data is moved between components as streams of bytes), it exposes a specific point of control where the variety of data formats can easily be tackled. For our use case we created a plugin system, very similar to the SC2 approach, using Java’s Service Provider Interface (SPI), so as to enable framework extension and component replaceability. The different data types from the different data sources are unified in the data acquisition mechanism, so that the rest of the pipeline can work on one data model and format.
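To illustrate what such an SPI-based plugin mechanism typically looks like, the sketch below shows a hypothetical connector contract and its discovery via java.util.ServiceLoader; the interface and class names (BudgetSourceConnector, UnifiedBudgetRecord, ConnectorRegistry) are invented for this example and do not reflect the actual pilot code base.

    import java.util.ServiceLoader;

    // Hypothetical connector contract: one implementation per data source/format.
    // In a real project each type would live in its own file.
    interface BudgetSourceConnector {
        boolean supports(String sourceId);               // e.g. "athens", "thessaloniki"
        UnifiedBudgetRecord convert(byte[] rawPayload);  // normalise to the pipeline's single data model
    }

    // Minimal unified record; the pilot's actual data model is richer.
    class UnifiedBudgetRecord {
        String municipality;
        String category;
        double amount;
    }

    // Connectors are discovered at runtime via the Service Provider Interface:
    // implementations are listed in META-INF/services/<fully qualified interface name>.
    public class ConnectorRegistry {
        public static BudgetSourceConnector forSource(String sourceId) {
            for (BudgetSourceConnector connector : ServiceLoader.load(BudgetSourceConnector.class)) {
                if (connector.supports(sourceId)) {
                    return connector;
                }
            }
            throw new IllegalArgumentException("No connector registered for source: " + sourceId);
        }
    }

With such a setup, supporting a new municipality’s format amounts to shipping one additional connector implementation on the classpath, without touching the rest of the pipeline.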

2. Can data be processed in a continuous manner depending on when it is harvested?

Yes it can. Through the use of Apache Flume the pipeline continuously accepts new input. Through the use of Apache Kafka these chunks of data of any kind are distributed in a failsafe manner, meaning that the system knows which chunks of data have already been processed successfully.

3. Can the pipeline, in case of a failure, recover and continue from where it left off? Yes it can: Apache Kafka stores which messages have successfully been processed by the data processing unit (e.g. an Apache Spark job). As an example, if the HDFS system is full, Apache Kafka can continue where it left off once the HDFS system gets more space.
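The following minimal sketch shows how this recover-and-resume behaviour is typically obtained with Kafka consumer offsets; the broker address, topic name and consumer group below are illustrative and not the pilot’s actual configuration.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class BudgetChunkConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");   // illustrative broker address
            props.put("group.id", "budget-processing");     // illustrative consumer group
            props.put("enable.auto.commit", "false");       // commit offsets only after successful processing
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("budget-data"));  // illustrative topic
                while (true) {
                    ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, byte[]> record : records) {
                        process(record.value());  // e.g. hand the chunk over to the Spark job / HDFS writer
                    }
                    // Offsets are committed only once the whole batch has been processed; if processing
                    // fails (e.g. HDFS is full), the uncommitted records are simply re-delivered later.
                    consumer.commitSync();
                }
            }
        }

        private static void process(byte[] chunk) {
            // placeholder for the actual processing step
        }
    }

Because offsets are committed only after successful processing, an interrupted run resumes from the last committed offset instead of losing data or re-reading the whole topic.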

4. Is it possible to parallelize the analysis process? Parallelization is guaranteed by the data processing unit in use, here Apache Spark. In addition, Apache Kafka ensures that data messages are distributed evenly amongst all consumers, e.g. Spark jobs running on multiple nodes.

5. Can we easily scale up? Yes: all components chosen for the pilot implementation are horizontally scalable by design.


3.6.3 SC6 Summary and Recommendations for the Next Piloting Cycle

SC6 Pilot Cycle 1 demonstrated the ability of the technologies proposed by BDE to set up efficient large-scale data harvesting workflows from different sources (variety of data) arriving at different time intervals (velocity of data/processing), as well as the ability to store such data according to different requirements. The ability to handle pipeline failures effectively validates the choice of Apache Kafka, especially when combined with Apache Flume for seamless data input. Finally, we demonstrated that Apache Spark was the right choice for enabling on-the-fly RDF conversion and data linking, as well as storage into a triple store.
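To give a flavour of what such an on-the-fly conversion step can look like, the sketch below serialises a harvested record to N-Triples; the vocabulary and URIs under example.org are invented for this illustration, and the pilot’s actual mapping is defined in its source code distribution.

    // Illustrative record-to-RDF conversion; the example.org vocabulary is invented for this sketch.
    public class BudgetRdfMapper {

        public static String toNTriples(String recordId, String municipality,
                                        String category, double amount) {
            String subject = "<http://example.org/budget/" + recordId + ">";
            return subject + " <http://example.org/ontology/municipality> \"" + municipality + "\" .\n"
                 + subject + " <http://example.org/ontology/category> \"" + category + "\" .\n"
                 + subject + " <http://example.org/ontology/amount> \"" + amount
                 + "\"^^<http://www.w3.org/2001/XMLSchema#double> .";
        }
    }
    // In a Spark job this function would be applied to every harvested record, e.g.
    // records.map(r -> BudgetRdfMapper.toNTriples(r.id, r.municipality, r.category, r.amount)),
    // and the resulting triples bulk-loaded into the triple store.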

Recommendations for the next piloting phase include extending the data sources, and thereby the data, by adding budget and budget execution data from more European cities, including data available in other languages, which needs to be mapped across languages to enable efficient use and comparison. Further extensions concern adding more socio-economic data and potentially also related contract data (i.e. unstructured data in the form of documents), which means extending the Flume/Kafka pipeline to handle other data types and extending the methods for mapping multilingual data using the PoolParty Semantic Suite. Finally, the data analysis methods and visualisations will be expanded, and all manuals and documentation will be adapted accordingly.

3.7 SC7: Secure Societies

3.7.1 Use Case Description

The first pilot for Secure Societies is focusing on the integration and fusion of data coming from remote and social sensing in order to add value to the current data exploitation practices; this is key in the Space and Security domain, where useful information can be derived not only from satellite data, but also from data coming from social media and other sources.

The pilot considers two different workflows of data:

The first workflow, called the Change Detection workflow, ingests satellite images to detect areas with changes in land cover or land use by using change detection techniques; the identified Areas of Interest (AoIs) are then associated with social media and news items, which are presented to the end-user for cross-validation.

The reverse procedure is applied in the second workflow, called the Event Detection workflow. Event detection is triggered by news and social media information, where trending topics (i.e. document clusters) with a geospatial connotation constitute a time- and space-localized event; given such an event, the user has the possibility to activate the Change Detection workflow, in which the corresponding satellite images are acquired and processed in order to check for changes in land cover or land use.


Figure 11: Architecture of the first SC7 pilot

3.7.2 Key Evaluation Questions

1. Was the generic BDE HDFS module able to store all the satellite images relevant to this pilot? And how did you retrieve them (i.e. annotate them with relevant meta-data)? The satellite images that are compared to detect AoIs with man-made changes are retrieved on demand from ESA’s Sentinels Scientific Data Hub (SCIHUB). To enable their parallel processing, they are temporarily stored in the HDFS that is running on the BDE infrastructure. After their processing is concluded, they are removed, and the only information that is retained is the set of polygons that corresponds to the detected AoIs. This information is kept in textual form (WKT in particular) and is stored in Strabon as RDF triples, along with all the relevant meta-data (e.g., the filenames of the compared images), so that it can later be retrieved by the user. In this way, we minimize the storage requirements of the pilot and simplify its architecture: if we stored the retrieved satellite images permanently, the functionality of the Image Aggregator would have to be more complex in order to ensure that the stored information is up to date and can be retrieved from disk instead of SCIHUB.
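As an illustration of how a detected polygon can be retained as an RDF triple with a WKT-typed literal, the sketch below uses the standard GeoSPARQL asWKT property and wktLiteral datatype; the AoI URI is invented, and the pilot’s exact vocabulary and graph layout may differ.

    // Illustrative serialisation of a detected AoI polygon as a GeoSPARQL-style triple.
    public class AoiSerialiser {
        public static String asWktTriple(String aoiId, String wktPolygon) {
            return "<http://example.org/aoi/" + aoiId + "> "
                 + "<http://www.opengis.net/ont/geosparql#asWKT> "
                 + "\"" + wktPolygon + "\"^^<http://www.opengis.net/ont/geosparql#wktLiteral> .";
        }
    }
    // Example: asWktTriple("42", "POLYGON((23.70 37.90, 23.80 37.90, 23.80 38.00, 23.70 37.90))")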

2. Was Cassandra the right choice for storing news and tweets? How did you index the content?


Cassandra was a fine choice for storing news and tweets, offering the scalability and efficient access the pilot needs. Table indexing was designed to allow fast reads for the query set of each use case of the event detection pipeline. News items are keyed by their URL and a reversed substring of the latter, to utilize reversed-host bucketing, which allows fast lookups. For tweets, a unique identifier is supplied by the Twitter API. Given the non-relational nature of the store, data duplication schemes were adopted to support scenarios beyond the pilot, allowing retrieval by each column of interest, such as publication, crawl date or referred place. These columns are also regularly used by modules of the pipeline and affect performance positively. Specifically for the Twitter tables, this is expanded to various metadata in a post, for example hashtags and geolocation coordinates, for potential future use.
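A hedged sketch of what such a table might look like, created here through the DataStax Java driver, is given below; the keyspace, table and column names are invented for the example and do not reproduce the pilot’s actual schema. The reversed-host column acts as the partition key, so look-ups by URL stay within a single partition.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    // Illustrative schema creation; assumes the keyspace "sc7" already exists.
    public class NewsSchemaSketch {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("cassandra").build();
                 Session session = cluster.connect()) {
                session.execute(
                    "CREATE TABLE IF NOT EXISTS sc7.news_items ("
                  + "  reversed_host text,"      // e.g. "com.reuters.www", for reversed-host bucketing
                  + "  url text,"
                  + "  crawl_date timestamp,"
                  + "  publication text,"
                  + "  place text,"
                  + "  body text,"
                  + "  PRIMARY KEY (reversed_host, url))");
            }
        }
    }

Additional tables keyed by publication, crawl date or place would then hold duplicated copies of the same rows, which is the usual way to support several query patterns in a non-relational store.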

3. What triggered the ‘event detection’ that led to initiating the relevant satellite image comparison procedures?

In the first version of the pilot, the Event Detection and the Change Detection are separate workflows that are manually triggered by the users. In the next phases of the pilot, we will add the automated activation of Change Detection from Event Detection.

4. How satisfied are you with Strabon for storing the meta-data (e.g. the geo-locations)?

Strabon constitutes a state-of-the-art triplestore for spatiotemporal data. Thus, it offers three benefits that are crucial for BDE and the SC7 pilot in particular: (i) it supports the materialized semantification of satellite image processing (qualitative advantage), (ii) it can be federated with SemaGrow, and is thus compatible with Cassandra, which stores the news items and events (qualitative advantage), and (iii) it offers very efficient querying of large volumes of spatiotemporal RDF data (quantitative advantage). In fact, it constitutes one of the leading open-source RDF stores with respect to querying time [6].

5. Did you use Spark and/or Flink for detecting changes? If yes, how? If no, why not and which technology did you use?

Our initial plan was to use both Spark and Flink for both the Event and the Change Detection workflows. Yet Flink is specialized in stream processing, which is not the case in either workflow: in the Event Detection workflow, the news items do arrive in a quasi-stream fashion, but they are clustered into events periodically, at predetermined points in time, and are thus best suited to batch processing. The same applies to Change Detection, where all necessary information is available a priori as soon as the relevant satellite images have been retrieved. As a result, we exclusively used Spark.

In more detail, for the Change Detection workflow, a series of operators is applied to a pair of satellite images. To accelerate their processing, we parallelized every operator such that every node in the cluster applies the operator functionality to a set of tiles (i.e., subsets of the input images). In this way, the computational cost of every operator is distributed among the available cluster nodes using Spark.
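The following minimal sketch shows this tile-based distribution pattern with Spark’s Java API; the Tile type and the applyOperator method are placeholders for the pilot’s actual image operators.

    import java.util.Collections;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class TileBasedChangeDetectionSketch {

        // Placeholder for an image tile (a subset of the "before"/"after" image pair).
        public static class Tile implements java.io.Serializable {
            public int row, col;
            public float[] beforePixels, afterPixels;
        }

        // Placeholder for one change-detection operator (e.g. differencing, thresholding).
        public static Tile applyOperator(Tile tile) {
            return tile;
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("change-detection-sketch");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                List<Tile> tiles = Collections.emptyList();   // placeholder: split the image pair into tiles
                JavaRDD<Tile> distributed = sc.parallelize(tiles);
                // Every cluster node applies the operator to its own subset of tiles in parallel.
                List<Tile> processed = distributed
                        .map(TileBasedChangeDetectionSketch::applyOperator)
                        .collect();
                System.out.println("Processed " + processed.size() + " tiles");
            }
        }
    }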

For the Event Detection workflow, graph representations and textual similarities between unique pairs in a set of news items have to be computed. To minimize the overhead of this process, Spark distributes the computation of textual similarities across the available cluster nodes.
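A corresponding sketch for scoring unique pairs of news items in parallel is given below; the similarity function is a placeholder, and in practice the computation is restricted to the most recent batch of items rather than an unbounded cartesian product.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PairwiseSimilaritySketch {

        // Placeholder similarity measure between two news texts.
        static double similarity(String a, String b) {
            return 0.0;
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("pairwise-similarity-sketch");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> news = sc.parallelize(Arrays.asList("item one", "item two", "item three"));
                // Index the items so that every unordered pair is scored exactly once (i < j).
                JavaPairRDD<String, Long> indexed = news.zipWithIndex();
                JavaRDD<Double> scores = indexed.cartesian(indexed)
                        .filter(pair -> pair._1()._2() < pair._2()._2())
                        .map(pair -> similarity(pair._1()._1(), pair._2()._1()));
                System.out.println("Computed " + scores.count() + " pairwise similarities");
            }
        }
    }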

[6] George Garbis, Kostis Kyzirakos and Manolis Koubarakis. Geographica: A Benchmark for Geospatial RDF Stores (Long Version). 12th International Semantic Web Conference (ISWC), pages 343-359, 2013.


Requirement / Evaluation questions

R1 Monitor multiple text services (Twitter and Reuters).

Did the NOMAD connectors for the Twitter and Reuters data successfully store the data into Cassandra, including provenance and other metadata such as locations? Text is retrieved and stored in Cassandra together with provenance and any metadata provided by the service (notably, location). The metadata include automatically identified locations, based on Named Entity Recognition within the evolved NLP analysis pipeline.

R2 Regularly execute event detection on a single thread over the most recent text batch.

Did you use the generic BDE Scheduling module to automatically trigger the event detection pipeline? The Event Detection workflow is triggered at specific time intervals, constantly populating the event store. A corresponding parameter for the interval has been incorporated into the module’s internal functionality. Thus, there is no need to employ the BDE Scheduling module.

R3 Download images for a given Area Of Interest (AoI) from the Sentinel service.

What is your experience with the Sentinel data connector to download image data for a given GIS location? We performed analytical experiments to measure the time efficiency of the corresponding pilot component (Image Aggregator). First, we selected 10 pairs of images of various sizes from 10 different countries in all continents. Then, we performed 10 measurements for the download time of each pair using the BDE infrastructure at NCSR Demokritos, which is equipped with a 100Mbps connection. The following diagram presents the average performance for every pair of images:

We can deduce that SCIHUB is the bottleneck of the Change Detection workflow, requiring almost 10 minutes on average to download a pair of images with a total size of 1.5 GB. Most importantly, though, SCIHUB is quite unstable, with frequent downtimes.

R4 AoIs defined by the user by selecting a map area.

Please describe your experiences with the GUI developed for selecting AoIs. Sextant’s interface allows the user to specify a location in two ways: (i) by navigating the Earth map and drawing a polygon on it, and (ii) through a Location bar that allows for an autocomplete keyword search in Bing; as soon as the user selects one of the proposed locations, the map automatically zooms into the corresponding area.

R5 Define AoIs from the event detection workflow.

Please look at the false-positives and false-negatives that the automatic event detector identified. What are your recommendations for improving the detection algorithm?

AoIs resulting from the Event Detection workflow have to be properly evaluated in the following phase of the pilots, as will the activation of the Change Detection workflow from the Event Detection output.

R6 Change detection will be experimented with Spark and/or Flink implementations.

How do the state-of-the-art detection implementations compare to the performance of the Spark and Flink versions? Which of these implementations is the simplest to read, write and debug? There are several tools for remote sensing, but few of them are open source and free to use. The best choice in this respect is ESA’s SNAP toolbox, whose Java source code is freely available on GitHub [7]. To the best of our knowledge, none of the available open-source solutions is able to exploit the shared-nothing architecture of MapReduce that lies at the core of Spark and Flink. Instead, they are only suitable for stand-alone processing, using multiple threads for higher time efficiency. There is a single exception that proves this rule, namely the Calvalus project [8], which uses Hadoop to parallelize image processing. However, instead of following a tile-based approach that distributes parts of images to the available cores, it assigns every pair of images to a specific node. Thus, its scalability is limited by the number of available cluster nodes. On the whole, our approach of adapting existing image processing operators to Spark goes beyond the state of the art. However, it is a time-consuming process, due to the lack of documentation in SNAP’s code base.

R7 Change detection and event detection store locations of changes in a Strabon database.

Is Strabon able to join geographical events? Yes: Strabon is able to identify AoIs from Change Detection that overlap with the areas of detected events. This can be achieved through a simple SPARQL query; a sketch of such a query is given after Table 4.

R8 End-user interface is based on Sextant.

What is your opinion on the presentation layer that combines the text and image analysis output? The presentation of the results from the Change and Event Detection workflows will be improved and properly evaluated in the next phases of the pilots.

Table 4: Evaluation questions for the first SC7 pilot

[7] https://github.com/senbox-org/
[8] http://www.brockmann-consult.de/calvalus/
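As noted in the answer to R7 above, overlapping AoIs and event areas can be retrieved with a spatial SPARQL join. Such a query can be sketched as follows, built here as a Java string; the class and property names under the example.org prefix are invented, while geof:sfIntersects and the geo: vocabulary are standard GeoSPARQL, which Strabon supports alongside its own stSPARQL extensions. The pilot’s actual graph layout and endpoint configuration are not reproduced.

    // Illustrative GeoSPARQL query joining change-detection AoIs with detected event areas.
    public class OverlapQuerySketch {
        public static final String QUERY =
              "PREFIX geo:  <http://www.opengis.net/ont/geosparql#>\n"
            + "PREFIX geof: <http://www.opengis.net/def/function/geosparql/>\n"
            + "PREFIX ex:   <http://example.org/sc7/>\n"
            + "SELECT ?aoi ?event WHERE {\n"
            + "  ?aoi   a ex:AreaOfInterest ; geo:asWKT ?aoiGeom .\n"
            + "  ?event a ex:DetectedEvent  ; geo:asWKT ?eventGeom .\n"
            + "  FILTER (geof:sfIntersects(?aoiGeom, ?eventGeom))\n"
            + "}";
        // The query string would be submitted to Strabon's SPARQL endpoint (e.g. via an HTTP POST)
        // to retrieve the overlapping AoI/event pairs.
    }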


4. Conclusion

This report provides the evaluation of the first cycle of pilots for the seven Societal Challenges and the generic BDE platform, together with recommendations for the next cycle. Deliverable 5.2 identified four challenges as having a feasible implementation plan for this first cycle; at this point of evaluation, however, we are able to reflect on each SC. This report has outlined the evaluation results for all seven pilots and the BDE platform. The main goal of this document is to answer the practical and to-the-point open set of questions from D6.1 and to show how the requirements from D5.2 are met. Together with the feedback from the external community gathered via workshops, hangouts, etc., this document provides the input needed for the development of the next round of pilots.