BigDataEurope @BDVA Summit2016 1: The BDE Platform

Post on 20-Jan-2017

BIG DATA EUROPE'S INTEGRATOR PLATFORM: A ONE-STOP SOLUTION FOR BIG AND SMART DATA MANAGEMENT

BDVA Summit 2016, Valencia, 1 December 2016

Talk outline

The BigDataEurope Project, Mission & BDVA Synergies
The Big Data Integrator (BDI) platform
  o Stakeholder Requirements
  o Architecture
  o Supported Components
  o Beyond the State-of-the-Art
A look into the BDI platform [DEMO]

6 Dec 2016 · www.big-data-europe.eu

Supporting the Societal Domains with Big Data Technology

BigDataEurope Project

BigDataEurope Action

EC Horizon 2020 Coordination & Support Action
  o ~5 million €, 2015-2017
Show societal value of Big Data
  o Across all societal challenges addressed by H2020
Lower barrier for using big data technologies
  o Effort to set up and deploy use-case workflows
  o Lack of skills & expertise
Help establish data value chains across domains & organisations

Consortium

NCSR DEMOKRITOS

Stakeholder Engagement Cycle

Present action, showcase deployments

Raise awareness about BDE results, what they mean for stakeholders

Collect requirements to drive further development

[Timeline: M6, M12, M18, M24, M30]

Data Value Chain Evolution

[Figure: evolution of the data value chain over time, from proprietary, 'locked-in' solutions to open-source (OS) solutions and Big Data stacks. Value-chain steps: Extraction, Curation, Quality, Linking, Integration, Publication, Visualization, Analysis. Domains: Health, Transport, Security, Food, Societies, Climate, Energy. Further labels: Data Repositories, Linked Open Data.]

Parallels to BDVA Mission

Task Force 6 (Technical)
  o SG1: Management
  o SG2: Big Data Architectures and Infrastructures
The Big Data Integrator Platform (SG2)
  o Generic Architecture (Blueprint) & Instances
Smart Big Data Management (SG1)
  o Support for Semantic Components & Data Lakes

A flexible, generic platform for (Big) Data Value Chain Deployment

1. Stakeholder Requirements

Big Data Integrator

Requirement Elicitation

Workshops: 7 held per year with Societal Communities
Face-to-face interviews
Feedback from each of the 7 pilots, in 3 phases
Supported by BDVA

Key Results from the Survey (I)

[Charts: Importance of Volume; Importance of Velocity]

Key Results from the Survey (II)

[Charts: Importance of Variety; Efficiency of Data Infrastructures]

Societal Data Value Chain Requirements

A flexible, generic platform for (Big) Data Value Chain Deployment

2. Architecture

Big Data Integrator

Big Data Integrator Architecture

Prototype developed by BDE
  o Incorporates existing BD technology
  o Facilitates integration and deployment
Main points of the architecture
  o Dockerization
  o Support layer, including integrated UI
  o Semantification layer

Generic Architecture

Plug-and-play BD Platform

Cloud-deployment ready

Domain independent, Customisable

Stacks Open Source solutions

BDI Prototype Releases

1. [July 2016]
2. December 2016
3. ….

Docker containers

Docker offers lightweight virtualization
  o Containers can be shared/provisioned on different Linux variations/versions
Identical base system
  o NOT required
All BDI components
  o Docker containers

Architectural design 1.1

Architectural design 1.2

Architectural design 1.3

Stack

Architectural design 1.4 (released)

BDE vs Hadoop distributions

BDE is not built on top of existing distributions
Targets
  o Communities
  o Research institutions
Bridges scientists and open data
Multi-Tier research efforts towards Smart Data

BDE vs Hadoop distributions

Feature | Hortonworks | Cloudera | MapR | Bigtop | BDE
File System | HDFS | HDFS | NFS | HDFS | HDFS
Installation | Native | Native | Native | Native | Lightweight virtualization
Plug & play components (no rigid schema) | no | no | no | no | yes
High Availability | Single failure recovery (yarn) | Single failure recovery (yarn) | Self healing, mult. failure rec. | Single failure recovery (yarn) | Multiple failure recovery
Cost | Commercial | Commercial | Commercial | Free | Free
Scaling | Freemium | Freemium | Freemium | Free | Free
Addition of custom components | Not easy | No | No | No | Yes
Integration testing | yes | yes | yes | yes | --
Operating systems | Linux | Linux | Linux | Linux | All
Management tool | Ambari | Cloudera manager | MapR Control system | - | Docker swarm UI + Custom

A flexible, generic platform for (Big) Data Value Chain Deployment

3. Supported Components

Big Data Integrator

Dockerized Components

Processing and storage components
  o Re-used existing Docker containers (where available)
  o Dockerized by BDE otherwise
  o Ensuring all can be provisioned through Docker Swarm
Other Components
  o Semantic Layer
  o Support Layer

BDI Docker Containers (..and counting)

Data Acquisition: Apache Flume
Data Storage: Hue, Apache Cassandra, ScyllaDB, Apache Hive, PostGIS
Search/Indexing: Apache Solr
Message Passing: Apache Kafka
Data Processing: Spark, Flink
Semantic Components: SANSA, Silk, Strabon, Sextant, GeoTriples, Semagrow, LIMES, 4store, OpenLink Virtuoso

A flexible, generic platform for (Big) Data Value Chain Deployment

4. In-use: Deployment & Installation

Big Data Integrator

BDI User profiles

Platform installation

Manual installation guide

Using Docker Machine
  o On local machine (VirtualBox)
  o In cloud (AWS, DigitalOcean, Azure)
  o Bare metal

Screencasts (Getting Started with the Platform)

Developing a component

Base Docker images
  o Serve as a template for a (Big Data) technology
  o Easily extendable with a custom algorithm/data
Published components
  o Responsibilities divided between partners
  o Image repositories on GitHub
  o Automated builds on DockerHub
  o Documentation on BDE Wiki

Deploying a Big Data Stack

Stack: Collection of communicating components to solve a specific problem

Described in Docker Compose
  o Component configuration
  o Application topology
Orchestrator required for initialization process
  o Components may depend on each other
  o Components may require manual intervention
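
The point that components may depend on each other can be illustrated with a minimal Python sketch that derives a start-up order from a dependency map. The component names and the `depends_on` mapping are hypothetical and are not taken from a BDE stack definition. The sketch uses the standard-library `graphlib` module (Python 3.9+) and is not meant to reflect how the BDE orchestrator is actually implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical stack: each component maps to the components it depends on.
depends_on = {
    "namenode": [],
    "datanode": ["namenode"],
    "spark-master": ["namenode"],
    "spark-worker": ["spark-master"],
    "app": ["spark-worker", "datanode"],
}


def startup_order(dependencies):
    """Return component names so that every dependency comes before its dependants."""
    return list(TopologicalSorter(dependencies).static_order())


order = startup_order(depends_on)
print(order)
```

Several valid orders may exist, so the printed list is only one of them. A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is one situation where manual intervention would be needed.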

Support Layer (User Interfaces)

Integrator UI
  o Web UIs from BDE dockers (including 3rd party components) follow the BDE stylesheets
Stack Monitor App
  o Workflow Builder
  o Workflow Monitor
Swarm UI
  o Allows scaling up/down multiple Docker instances

Stack

Integrator UI

BDE Workflow Builder

BDE Workflow Monitor

Swarm UI

BDI Platform – A Demo

Demonstrating the ease-of-use in deploying custom instances of the BDI Platform

Recorded video showing an example is available: https://www.youtube.com/watch?v=1zHIhFDDdCg

A flexible, generic platform for (Big) Data Value Chain Deployment

5. Beyond the State-of-the-Art

Big Data Integrator

Smart Big Data

Increase Big Data value by adding meaning to it!

Source: Gesellschaft für Informatik

Variety – The most neglected V?

Data Source Heterogeneity

Lack of interoperability/semantics

Semantic Layer tools

BDE tooling for Semantic Data Lake:
  o Swagger: Semantics of RESTful APIs
  o Semantic Analytics Stack (SANSA): Distributed data processing over large-scale Knowledge Graphs
  o Semagrow: SPARQL over Big Data stores
  o Ontario: Querying over Semantic Data Lakes

Semantic Layer

Semantic Data Lakes
  o Minimal ingestion pre-processing
  o Semantic layer maintains metadata
  o Add meaning when retrieving/processing

[Figure labels: Data Lake: scalable unstructured data store; Relationship definitions and metadata; JSON-LD, CSVW, R2RML, XML2RDF]

Ongoing Research for Semantic Big Data & Analytics

Knowledge Graphs

Ontario: Semantic Data Lakes

Repository of data in its raw format
  o Structured, semi-structured, unstructured
Schema-less
  o No schema is defined on write, it is defined only on read
Open to any kind of processing
Add a Semantic layer on top of the source datasets
  o Semantic data is handled as-is
  o Non-Semantic data is semantically lifted using existing ontology terms
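
Schema-on-read and semantic lifting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Ontario's implementation: the raw records, the `field_to_term` mapping, and the `http://example.org/record/` subject prefix are all hypothetical. The only real vocabulary used is FOAF (`foaf:name`, `foaf:age`).

```python
import json

# Raw records are stored as-is; no schema is enforced on write.
raw_store = [
    '{"name": "Alice", "note": "n/a"}',
    '{"name": "Bob", "age": 42}',
]

# Hypothetical mapping from source field names to existing ontology terms (FOAF).
FOAF = "http://xmlns.com/foaf/0.1/"
field_to_term = {"name": FOAF + "name", "age": FOAF + "age"}


def lift(record_id, raw_line, mapping):
    """Interpret one raw record on read and return (subject, predicate, object) triples."""
    record = json.loads(raw_line)
    subject = f"http://example.org/record/{record_id}"
    return [
        (subject, mapping[field], value)
        for field, value in record.items()
        if field in mapping  # unmapped fields stay in the lake, untouched
    ]


triples = [t for i, line in enumerate(raw_store) for t in lift(i, line, field_to_term)]
for triple in triples:
    print(triple)
```

The raw store is never rewritten; extending `field_to_term` later gives more fields a meaning without re-ingesting any data.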

Ontario: Architecture

[Figure: query processing steps: Decompose SPARQL Query; Decompose to Source-specific Entities; Translate and execute Query via Source-specific Access Method]
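
The steps named in the architecture figure can be sketched as follows. The sketch assumes one possible decomposition strategy, grouping triple patterns by shared subject into star-shaped sub-queries; the slides do not state which strategy Ontario uses. The query, the `ex:` predicates, and the `sources` catalogue are hypothetical.

```python
from collections import defaultdict

# A basic graph pattern written as (subject, predicate, object) triple patterns.
query = [
    ("?drug", "ex:name", "?name"),
    ("?drug", "ex:targets", "?protein"),
    ("?protein", "ex:label", "?label"),
]

# Hypothetical catalogue: predicates each source can answer, and its access method.
sources = {
    "drugs_csv": {"predicates": {"ex:name", "ex:targets"}, "access": "SQL"},
    "proteins_rdf": {"predicates": {"ex:label"}, "access": "SPARQL"},
}


def decompose(patterns):
    """Group triple patterns that share a subject into star-shaped sub-queries."""
    stars = defaultdict(list)
    for s, p, o in patterns:
        stars[s].append((s, p, o))
    return dict(stars)


def select_source(star, catalogue):
    """Pick the first source whose predicates cover every pattern in the star."""
    needed = {p for _, p, _ in star}
    for name, info in catalogue.items():
        if needed <= info["predicates"]:
            return name, info["access"]
    return None, None


plan = {subject: select_source(star, sources) for subject, star in decompose(query).items()}
print(plan)
```

Each entry of `plan` names the source and access method for one sub-query; translating the sub-query into that access method and joining the partial results are left out of the sketch.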

SANSA: Semantic Analytics Stack

Abundant machine-readable structured information is available (e.g. in RDF)
  o Across SCs, e.g. Life Science Data (OpenPhacts)
  o General: DBpedia, Google Knowledge Graph
  o Social graphs: Facebook, Twitter
Need for scalable querying, inference & ML
  o Link prediction
  o Knowledge base completion
  o Predictive analytics

SANSA Stack

More Information

Big Data Integrator: https://github.com/big-data-europe

README includes extensive documentation, instructions and information on supported components

Free Workshops, Hangouts & Webinars

BigDataEurope Activities

2nd round of Societal Workshops

Domain | Date | Location | Format
Transport | 22 September 2016 | Brussels | Collocated with Big Data for Transport, Tisa workshop
Food & Agri | 30 September 2016 | Brussels | Collocated with DG AGRI WP2018-20 stakeholder consultation
Energy | 4 October 2016 | Brussels | Collocated with EC H2020 Info Day on “Smart Grids and Storage”
Climate | 11 October 2016 | Brussels | Collocated with Melodies Project Event – Exploiting Open Data
Security | 18 October 2016 | Brussels | Standalone Workshop
Societies | 5 December 2016 | Cologne | Collocated with EDDI16 - 8th Annual European DDI User Conference
Health | 9 December 2016 | Brussels | Standalone Workshop

Other Activities

Fresh set (7) of Societal Workshops in 2017

Various SC-focussed and general hangouts, follow!
  o Apache Flink & BDE (20 Oct) – available online
  o BDVA & BDE Webinar planned early next year
  o Keep track on BDE Website (Events)

Questions & Contacts

WEB: www.big-data-europe.eu
EMAIL: info@big-data-europe.eu

BIG DATA INTEGRATOR: www.github.com/big-data-europe

PROJECT COORDINATION (Fraunhofer IAIS)
Prof. Sören Auer, auer © cs.uni-bonn · de
Dr. Simon Scerri, scerri © cs.uni-bonn · de
EIS Department/Group, Fraunhofer IAIS & CS Department Uni-Bonn, Bonn, Germany

#BigDataEurope

leads the Fraunhofer Big Data Alliance

SANSA: Read Write Layer

Ingest RDF and OWL data in different formats using Jena / OWL API style interfaces
Represent data in multiple formats (e.g. RDD, DataFrames, GraphX, Tensors)
Allow transformation among these formats
Compute dataset statistics and apply functions to URIs, literals, subjects, objects → DistributedLODStats
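
The kind of dataset statistics mentioned above can be illustrated with a small, single-machine Python sketch. It is not SANSA or DistributedLODStats code, which run distributed on top of a cluster framework; the sample triples, the naive `parse` helper, and the chosen statistics are assumptions made for illustration.

```python
from collections import Counter

# A few N-Triples-style lines; the data is made up for illustration.
lines = [
    '<http://ex.org/a> <http://ex.org/knows> <http://ex.org/b> .',
    '<http://ex.org/a> <http://ex.org/name> "Alice" .',
    '<http://ex.org/b> <http://ex.org/name> "Bob" .',
]


def parse(line):
    """Naive split of a simple N-Triples line into subject, predicate, object."""
    s, p, o = line.rstrip(" .").split(" ", 2)
    return s, p, o


def statistics(triples):
    """Compute a handful of simple statistics over an iterable of triples."""
    triples = list(triples)
    return {
        "triples": len(triples),
        "distinct_subjects": len({s for s, _, _ in triples}),
        "literal_objects": sum(o.startswith('"') for _, _, o in triples),
        "predicate_usage": Counter(p for _, p, _ in triples),
    }


stats = statistics(parse(line) for line in lines)
print(stats)
```

Each statistic is a count or a set-size over the triples, which is what makes this kind of computation a natural fit for map/reduce-style distribution.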

SANSA: Query Layer

To make generic queries efficient and fast using:
  o Intelligent indexing
  o Splitting strategies
  o Distributed Storage
  o Distributed/Federated Querying
Early work in progress: query evaluation (SPARQL-to-SQL approaches, Virtual Views)
Provision of W3C SPARQL compliant endpoint

SANSA: Inference Layer

W3C Standards for Modelling: RDFS and OWL
Parallel in-memory inference via rule-based forward chaining
Beyond state of the art: dynamically build a rule dependency graph for a rule set
→ Adjustable performance levels
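
Rule-based forward chaining can be sketched in plain Python with two RDFS entailment rules (rdfs9 and rdfs11). This is a naive, sequential illustration and not SANSA's parallel in-memory implementation; the `ex:` facts are made up, and prefixed names are used as plain strings for brevity.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Made-up input facts.
facts = {
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:rex", RDF_TYPE, "ex:Dog"),
}


def rdfs11(triples):
    """rdfs11: subClassOf is transitive."""
    return {
        (a, SUBCLASS, c)
        for a, p1, b in triples if p1 == SUBCLASS
        for b2, p2, c in triples if p2 == SUBCLASS and b2 == b
    }


def rdfs9(triples):
    """rdfs9: an instance of a class is also an instance of its superclasses."""
    return {
        (x, RDF_TYPE, d)
        for x, p1, c in triples if p1 == RDF_TYPE
        for c2, p2, d in triples if p2 == SUBCLASS and c2 == c
    }


def forward_chain(triples, rules):
    """Apply all rules repeatedly until no new triple is derived (fixpoint)."""
    closure = set(triples)
    while True:
        derived = set().union(*(rule(closure) for rule in rules)) - closure
        if not derived:
            return closure
        closure |= derived


closure = forward_chain(facts, [rdfs11, rdfs9])
print(sorted(closure - facts))
```

Here rdfs9 consumes the `rdfs:subClassOf` triples that rdfs11 produces, while rdfs11 never consumes output of rdfs9. Dependencies of this kind are what a rule dependency graph records, and they can be used to decide the order in which rules are applied.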

SANSA: ML Layer

Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics
Work in Progress:
  o Tensor Factorization for e.g. KB completion (testing stage)
  o Simple spatiotemporal analytics (idea stage)
  o Graph Clustering (testing stage)
  o Association rule mining (evaluation stage)
  o Semantic Decision trees (idea stage)