BigDataEurope - Empowering Communities with Data Technologies
BigDataEurope @BDVA Summit2016 1: The BDE Platform
Transcript of BigDataEurope @BDVA Summit2016 1: The BDE Platform
BIG DATA EUROPE'S INTEGRATOR PLATFORM A ONE-STOP SOLUTION FOR BIG AND
SMART DATA MANAGEMENT
BDVA Summit 2016, Valencia, 1 December 2016
Talk outline
The BigDataEurope Project, Mission & BDVA Synergies
The Big Data Integrator (BDI) platform
o Stakeholder Requirements
o Architecture
o Supported Components
o Beyond the State-of-the-Art
A look into the BDI platform [DEMO]
6 Dec 2016, www.big-data-europe.eu
Supporting the Societal Domains with Big Data Technology
BigDataEurope Project
BigDataEurope Action
EC Horizon 2020 Coordination & Support Action
o ~€5 million, 2015-2017
Show societal value of Big Data
o Across all societal challenges addressed by H2020
Lower barrier for using big data technologies
o Effort to set up and deploy use-case workflows
o Lack of skills & expertise
Help establish data value chains across domains & organisations
Consortium
NCSR Demokritos
Stakeholder Engagement Cycle
Present action, showcase deployments
Raise awareness about BDE results, what they mean for stakeholders
Collect requirements to drive further development
Timeline: M6, M12, M18, M24, M30
Data Value Chain Evolution
[Figure: evolution of the data value chain over time.
Value chain steps: Extraction, Curation, Quality, Linking, Integration, Publication, Visualization, Analysis.
Societal domains: Health, Transport, Security, Food, Societies, Climate, Energy.
Labels: Data Repositories; Linked Open Data; Proprietary, ‘locked-in’ solutions; OS Solutions, Big Data Stacks.]
Parallels to BDVA Mission
Task Force 6 (Technical)
o SG1: Management
o SG2: Big Data Architectures and Infrastructures
The Big Data Integrator Platform (SG2)
o Generic Architecture (Blueprint) & Instances
Smart Big Data Management (SG1)
o Support for Semantic Components & Data Lakes
A flexible, generic platform for (Big) Data Value Chain Deployment
1. Stakeholder Requirements
Big Data Integrator
Requirement Elicitation
Workshops: 7 held per year with Societal Communities
Face-to-face interviews
Feedback from each of the 7 pilots, in 3 phases
Supported by BDVA
Key Results from the Survey (I)
[Charts: Importance of Volume; Importance of Velocity]
Key Results from the Survey (II)
[Charts: Importance of Variety; Efficiency of Data Infrastructures]
Societal Data Value Chain Requirements
A flexible, generic platform for (Big) Data Value Chain Deployment
2. Architecture
Big Data Integrator
Big Data Integrator Architecture
Prototype developed by BDE
o Incorporates existing BD technology
o Facilitates integration and deployment
Main points of the architecture
o Dockerization
o Support layer, including integrated UI
o Semantification layer
Generic Architecture
Plug-and-play BD Platform
Cloud-deployment ready
Domain independent, Customisable
Stacks Open Source solutions
BDI Prototype Releases
1. [July 2016]
2. December 2016
3. ….
Docker containers
Docker offers lightweight virtualization
o Containers can be shared/provisioned on different Linux variations/versions
Identical base system
o NOT required
All BDI components
o Docker containers
Architectural design 1.1
Architectural design 1.2
Architectural design 1.3
Architectural design 1.4 (released)
BDE vs Hadoop distributions
BDE is not built on top of existing distributions
Targets
o Communities
o Research institutions
Bridges scientists and open data
Multi-tier research efforts towards Smart Data
BDE vs Hadoop distributions
Feature | Hortonworks | Cloudera | MapR | Bigtop | BDE
File System | HDFS | HDFS | NFS | HDFS | HDFS
Installation | Native | Native | Native | Native | Lightweight virtualization
Plug & play components (no rigid schema) | no | no | no | no | yes
High Availability | Single failure recovery (YARN) | Single failure recovery (YARN) | Self-healing, multiple failure recovery | Single failure recovery (YARN) | Multiple failure recovery
Cost | Commercial | Commercial | Commercial | Free | Free
Scaling | Freemium | Freemium | Freemium | Free | Free
Addition of custom components | Not easy | No | No | No | Yes
Integration testing | yes | yes | yes | yes | --
Operating systems | Linux | Linux | Linux | Linux | All
Management tool | Ambari | Cloudera Manager | MapR Control System | - | Docker Swarm UI + Custom
A flexible, generic platform for (Big) Data Value Chain Deployment
3. Supported Components
Big Data Integrator
Dockerized Components
Processing and storage components
o Re-used existing Docker containers (where available)
o Dockerized by BDE otherwise
o Ensuring all can be provisioned through Docker Swarm
Other Components
o Semantic Layer
o Support Layer
BDI Docker Containers (..and counting)
Data Acquisition: Apache Flume
Data Storage: Hue, Apache Cassandra, ScyllaDB, Apache Hive, PostGIS
Search/Indexing: Apache Solr
Message Passing: Apache Kafka
Data Processing: Spark, Flink
Semantic Components: SANSA, Silk, Strabon, Sextant, GeoTriples, Semagrow, LIMES, 4store, OpenLink Virtuoso
A flexible, generic platform for (Big) Data Value Chain Deployment
4. In-use: Deployment & Installation
Big Data Integrator
BDI User profiles
Platform installation
Manual installation guide
Using Docker Machine
o On local machine (VirtualBox)
o In cloud (AWS, DigitalOcean, Azure)
o Bare metal
Screencasts (Getting Started with the Platform)
Developing a component
Base Docker images
o Serve as a template for a (Big Data) technology
o Easily extendable with a custom algorithm/data
Published components
o Responsibilities divided between partners
o Image repositories on GitHub
o Automated builds on DockerHub
o Documentation on BDE Wiki
Deploying a Big Data Stack
Stack: Collection of communicating components to solve a specific problem
Described in Docker Compose
o Component configuration
o Application topology
Orchestrator required for initialization process
o Components may depend on each other
o Components may require manual intervention
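The need for an orchestrator can be illustrated with a minimal Python sketch. It is not the BDE orchestrator: the component names and dependencies below are hypothetical, and the ordering uses the standard-library graphlib module (Python 3.9+) to derive a start order in which every component comes after the components it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical stack: each component maps to the components it depends on.
STACK_DEPENDENCIES = {
    "hdfs-namenode": [],
    "hdfs-datanode": ["hdfs-namenode"],
    "spark-master": ["hdfs-namenode"],
    "spark-worker": ["spark-master"],
    "demo-app": ["spark-worker", "hdfs-datanode"],
}


def startup_order(dependencies):
    """Return a start order in which every dependency precedes its dependants."""
    # TopologicalSorter expects {node: predecessors}; static_order() yields
    # predecessors first and raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, component in enumerate(startup_order(STACK_DEPENDENCIES), start=1):
        print(f"{step}. start {component}")
```

Steps that need manual intervention are not modelled here; a real orchestrator would pause between components until the operator confirms.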
Support Layer (User Interfaces)
Integrator UI
o Web UIs from BDE Docker containers (including 3rd-party components) follow the BDE stylesheets
Stack Monitor App
o Workflow Builder
o Workflow Monitor
Swarm UI
o Allows scaling up/down multiple Docker instances
Integrator UI
BDE Workflow Builder
BDE Workflow Monitor
Swarm UI
BDI Platform – A Demo
Demonstrating the ease-of-use in deploying custom instances of the BDI Platform
Recorded video showing an example is available: https://www.youtube.com/watch?v=1zHIhFDDdCg
A flexible, generic platform for (Big) Data Value Chain Deployment
5. Beyond the State-of-the-Art
Big Data Integrator
Smart Big Data
Increase Big Data value by adding meaning to it!
Source: Gesellschaft für Informatik
Variety – The most neglected V?
Data Source Heterogeneity
Lack of interoperability/semantics
Semantic Layer tools
BDE tooling for Semantic Data Lake:
o Swagger: Semantics of RESTful APIs
o Semantic Analytics Stack (SANSA): Distributed data processing over large-scale Knowledge Graphs
o Semagrow: SPARQL over Big Data stores
o Ontario: Querying over Semantic Data Lakes
Semantic Layer
Semantic Data Lakes
o Minimal ingestion pre-processing
o Semantic layer maintains metadata
o Add meaning when retrieving/processing
Data Lake: scalable unstructured data store
Relationship definitions and metadata
JSON-LD, CSVW, R2RML, XML2RDF
Ongoing Research for Semantic Big Data & Analytics
Knowledge Graphs
Ontario: Semantic Data Lakes
Repository of data in its raw format
o Structured, semi-structured, unstructured
Schema-less
o No schema is defined on write, it is defined only on read
Open to any kind of processing
Add a Semantic layer on top of the source datasets
o Semantic data is handled as-is
o Non-Semantic data is semantically lifted using existing ontology terms
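The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration, not Ontario's implementation: the sample CSV, the column-to-term mapping and the example.org namespace are all assumptions. The raw file stays untouched, and triples are produced only when the data is read.

```python
import csv
import io

# Raw data kept in the lake exactly as ingested (hypothetical sample).
RAW_CSV = "id,name,city\n1,Alice,Bonn\n2,Bob,Valencia\n"

# Hypothetical mapping from column names to existing ontology terms.
COLUMN_TO_PREDICATE = {
    "name": "http://schema.org/name",
    "city": "http://schema.org/addressLocality",
}
BASE_IRI = "http://example.org/person/"  # assumed namespace for subjects


def lift_rows(raw_csv, mapping, base_iri, id_column="id"):
    """Yield (subject, predicate, object) triples while reading the raw CSV."""
    for row in csv.DictReader(io.StringIO(raw_csv)):
        subject = base_iri + row[id_column]
        for column, predicate in mapping.items():
            if row.get(column):  # skip empty or missing cells
                yield (subject, predicate, row[column])


if __name__ == "__main__":
    for triple in lift_rows(RAW_CSV, COLUMN_TO_PREDICATE, BASE_IRI):
        print(triple)
```

Because the mapping is applied at read time, changing the ontology terms does not require re-ingesting the data.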
Ontario: Architecture
Decompose SPARQL Query
Decompose to Source-specific Entities
Translate and execute Query via Source-specific Access Method
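The decomposition step can be illustrated with a small Python sketch. It is not Ontario's algorithm: the catalogue that assigns predicates to sources, the source names and the ex: terms are hypothetical, and the query is given as already-parsed triple patterns. Each pattern is routed to the source assumed to hold its predicate, so that each group could then be translated for that source's access method.

```python
from collections import defaultdict

# Hypothetical catalogue: which source in the lake can answer which predicate.
PREDICATE_TO_SOURCE = {
    "ex:name": "people.csv",
    "ex:livesIn": "people.csv",
    "ex:population": "cities-rdf",
}


def decompose(triple_patterns, catalogue):
    """Group the triple patterns of a query by the source that serves them."""
    groups = defaultdict(list)
    for pattern in triple_patterns:
        _, predicate, _ = pattern
        # A KeyError here means no known source serves the predicate.
        groups[catalogue[predicate]].append(pattern)
    return dict(groups)


if __name__ == "__main__":
    query = [
        ("?p", "ex:name", "?n"),
        ("?p", "ex:livesIn", "?c"),
        ("?c", "ex:population", "?pop"),
    ]
    for source, patterns in decompose(query, PREDICATE_TO_SOURCE).items():
        print(source, patterns)
```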
SANSA: Semantic Analytics Stack
Abundant machine-readable structured information is available (e.g. in RDF)
o Across SCs, e.g. Life Science Data (OpenPhacts)
o General: DBpedia, Google Knowledge Graph
o Social graphs: Facebook, Twitter
Need for scalable querying, inference & ML
o Link prediction
o Knowledge base completion
o Predictive analytics
SANSA Stack
More Information
Big Data Integrator: https://github.com/big-data-europe
README includes extensive documentation, instructions and information on supported components
Free Workshops, Hangouts & Webinars
BigDataEurope Activities
2nd round of Societal Workshops
Domain | Date | Location | Details
Transport | 22 September 2016 | Brussels | Collocated with Big Data for Transport, Tisa workshop
Food&Agri | 30 September 2016 | Brussels | Collocated with DG AGRI WP2018-20 stakeholder consultation
Energy | 4 October 2016 | Brussels | Collocated with EC H2020 Info Day on “Smart Grids and Storage”
Climate | 11 October 2016 | Brussels | Collocated with Melodies Project Event – Exploiting Open Data
Security | 18 October 2016 | Brussels | Standalone Workshop
Societies | 5 December 2016 | Cologne | Collocated with EDDI16, 8th Annual European DDI User Conference
Health | 9 December 2016 | Brussels | Standalone Workshop
Other Activities
Fresh set (7) of Societal Workshops in 2017
Various SC-focussed and general hangouts, follow!
o Apache Flink & BDE (20 Oct), available online
o BDVA & BDE Webinar planned early next year
o Keep track on BDE Website (Events)
Questions & Contacts
WEB: www.big-data-europe.eu EMAIL: [email protected]
BIG DATA INTEGRATOR: www.github.com/big-data-europe
PROJECT COORDINATION (Fraunhofer IAIS)
Prof. Sören Auer, auer © cs.uni-bonn · de
Dr. Simon Scerri, scerri © cs.uni-bonn · de
EIS Department/Group, Fraunhofer IAIS & CS Department Uni-Bonn, Bonn, Germany
#BigDataEurope
leads the Fraunhofer Big Data Alliance
SANSA: Read Write Layer
Ingest RDF and OWL data in different formats using Jena / OWL API style interfaces
Represent data in multiple formats (e.g. RDD, DataFrames, GraphX, Tensors)
Allow transformation among these formats
Compute dataset statistics and apply functions to URIs, literals, subjects, objects → DistributedLODStats
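The kind of dataset statistics mentioned above can be sketched on a single machine in plain Python. This is not the SANSA or DistributedLODStats API: the function names are hypothetical, triples are plain tuples, and literals are assumed to be written with a leading double quote as in N-Triples.

```python
from collections import Counter


def is_literal(term):
    """Assumption: literals start with a double quote, as in N-Triples."""
    return term.startswith('"')


def dataset_statistics(triples):
    """Compute a few simple statistics over (subject, predicate, object) tuples."""
    triples = list(triples)
    return {
        "triples": len(triples),
        "distinct_subjects": len({s for s, _, _ in triples}),
        "distinct_predicates": len({p for _, p, _ in triples}),
        "literal_objects": sum(1 for _, _, o in triples if is_literal(o)),
        "predicate_usage": Counter(p for _, p, _ in triples),
    }


# Hypothetical sample data.
SAMPLE = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", '"Alice"'),
    ("ex:bob", "rdf:type", "ex:Person"),
]

if __name__ == "__main__":
    for name, value in dataset_statistics(SAMPLE).items():
        print(name, value)
```

In a distributed setting the same counts would be computed per partition and then merged, which is why simple counters and sets fit this task well.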
SANSA: Query Layer
To make generic queries efficient and fast using:
o Intelligent indexing
o Splitting strategies
o Distributed Storage
o Distributed/Federated Querying
Early work in progress: query evaluation (SPARQL-to-SQL approaches, Virtual Views)
Provision of W3C SPARQL compliant endpoint
SANSA: Inference Layer
W3C Standards for Modelling: RDFS and OWL
Parallel in-memory inference via rule-based forward chaining
Beyond state of the art: dynamically build a rule dependency graph for a rule set
→ Adjustable performance levels
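Rule-based forward chaining can be shown with a minimal, sequential Python sketch. It is not SANSA's parallel implementation and builds no rule dependency graph: it only applies two RDFS entailment rules (rdfs9 and rdfs11) repeatedly until no new triple is derived. The ex: terms in the example are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def forward_chain(triples):
    """Apply RDFS rules rdfs9 and rdfs11 until a fixpoint is reached."""
    inferred = set(triples)
    while True:
        new = set()
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        for sub, sup in subclass_pairs:
            for s, p, o in inferred:
                # rdfs11: subClassOf is transitive.
                if p == SUBCLASS_OF and s == sup:
                    new.add((sub, SUBCLASS_OF, o))
                # rdfs9: an instance of a subclass is an instance of the superclass.
                if p == RDF_TYPE and o == sub:
                    new.add((s, RDF_TYPE, sup))
        if new <= inferred:  # nothing new was derived
            return inferred
        inferred |= new


if __name__ == "__main__":
    facts = {
        ("ex:Student", SUBCLASS_OF, "ex:Person"),
        ("ex:Person", SUBCLASS_OF, "ex:Agent"),
        ("ex:alice", RDF_TYPE, "ex:Student"),
    }
    for triple in sorted(forward_chain(facts)):
        print(triple)
```

The naive loop re-evaluates every rule in every round; knowing which rules can feed which others is what lets an engine skip rounds that cannot produce anything new.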
SANSA: ML Layer
Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure/semantics
Work in Progress:
o Tensor Factorization for e.g. KB completion (testing stage)
o Simple spatiotemporal analytics (idea stage)
o Graph Clustering (testing stage)
o Association rule mining (evaluation stage)
o Semantic Decision trees (idea stage)