PROTEUS
Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization
BONAVENTURA DEL MONTE
RESEARCHER @DFKI GMBH
PH.D. STUDENT @TU BERLIN
EUROPRO WORKSHOP, EDBT 2017
This project is funded by the European Union. Horizon 2020
BIG DATA: Volume, Velocity, Variety, Veracity, Value (€€€ ???)
PROTEUS is an EU H2020-funded research project which aims to design, develop, and provide an open-source, ready-to-use Big Data solution able to perform real-time interactive analytics and predictive analysis through massive online machine learning, efficiently dealing with extremely large historical data and data streams.
CONTENTS
1. PROJECT DETAILS
2. VALIDATION SCENARIO
3. HYBRID PROCESSING ENGINE
4. SCALABLE ONLINE MACHINE LEARNING
5. REAL-TIME INTERACTIVE VISUAL ANALYTICS
6. CONCLUSION
Project Consortium
Project details: Expected Outcomes
Hybrid processing
  Batch & Stream processing engine
  Declarative Language for batch & streams analytics
Scalable Online Machine Learning
  SOLMA Library
Real-time interactive Visual Analytics
  Web charts library
  Incremental engine for interactive analytics
Business Impact
  Validation in realistic industrial use case
Hot Strip Mill: Big Data scenario
System Architecture
Hybrid Processing
Smoother processing of data streams and historical data in the same Flink job
A declarative language for batch and streaming analytics
  ETL and ML pipelines expressed in a unified language are holistically optimized
Figure: pipeline over data sources D1, D2, D3 (gather and clean sensor data, PCA, train ML model)
"Bridging the Gap: Towards Optimizations across Linear and Relational Algebra". Andreas Kunft, Alexander Alexandrov, Asterios Katsifodimos, Volker Markl. BeyondMR workshop @SIGMOD 2016.
Scalable Online Machine Learning
ML challenge: Distributed Data Streams
The current state of the art in machine learning for Big Data is dominated by offline learning algorithms that process data-at-rest
Many current data sources are streaming (online, data-in-motion): sensors, social networks, clickstreams, etc.
In online learning, the algorithms see the data only once. The traditional meaning of online is that data is processed sequentially, one item at a time, but for many epochs: prequential evaluation
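Prequential evaluation is commonly described as "test-then-train": each arriving example is first used to score the current model and only then used to update it, so the data is seen once and no separate hold-out set is needed. The following is a minimal sketch of that loop under stated assumptions. The `OnlinePerceptron` learner, the `prequential_accuracy` helper, and the synthetic stream are hypothetical illustrations, not part of SOLMA or of the PROTEUS code base.

```python
import random


class OnlinePerceptron:
    """Hypothetical minimal online learner (not a SOLMA class)."""

    def __init__(self, n_features, lr=1.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0.0 else -1

    def learn_one(self, x, y):
        # Classic perceptron rule: update the weights only on a mistake.
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y


def prequential_accuracy(stream, model):
    """Test-then-train: score each example before learning from it."""
    correct = 0
    seen = 0
    for x, y in stream:
        if model.predict(x) == y:  # 1) test on the not-yet-seen example
            correct += 1
        model.learn_one(x, y)      # 2) then train on it, exactly once
        seen += 1
    return correct / seen if seen else 0.0


def synthetic_stream(n, seed=0):
    """Linearly separable toy stream (an assumption, not PROTEUS data)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        y = 1 if x[0] + x[1] >= 0.0 else -1
        yield x, y


if __name__ == "__main__":
    acc = prequential_accuracy(synthetic_stream(2000), OnlinePerceptron(2))
    print(f"prequential accuracy: {acc:.3f}")
```

Because the stream is consumed as a generator, the loop keeps only the model and two counters in memory, which is the property that matters when the input is unbounded.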
Real-time Interactive Visual Analytics
How to interactively visualize Big Data?
Incremental Analytics engine: incremental partial results in ~ O(1)
Visualization Layer: SSR-enabled web-based library seamlessly connected to the Incremental Analytics engine
https://github.com/proteus-h2020/proteic
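To illustrate what "incremental partial results in ~ O(1)" can mean in practice, here is a minimal sketch of an aggregate whose partial result is refreshed in constant time and constant memory per arriving value, using Welford's running mean and variance. The `RunningStats` class is a hypothetical example and does not reflect the actual design of the PROTEUS incremental engine.

```python
class RunningStats:
    """Running mean and population variance with O(1) work per update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Welford's update: no history is stored, only three numbers.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        return self.mean  # partial result available after every item

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        partial = stats.update(value)
        print(f"n={stats.n} partial mean={partial:.3f}")
```

A chart bound to such an aggregate can redraw after every update without rescanning earlier data, which is what makes interactive latencies plausible on a stream.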
Conclusions
PROTEUS is an EU H2020 international research project
PROTEUS will contribute to the Big Data ecosystem with:
  An innovative hybrid engine for processing both data-at-rest and data-in-motion
  SOLMA: a new library for scalable online machine learning
  Big Data Visualization guidelines: new ways of presenting and working with Big Data
  Real-time interactive visualization technology: incremental engine & web-based library
PROTEUS will validate its innovations in a realistic industrial scenario
PROTEUS will provide full-scale evaluation and impact assessment including benchmarks, KPIs and anonymized datasets
  Specific metrics for the ArcelorMittal use case
  Generic indicators on the advancements in scalable machine learning, hybrid computation and real-time interactive visual analytics
Thanks for your attention!
Questions?
Contact us: Bonaventura Del Monte
bonaventura dot delmonte at dfki dot de
www.dfki.berlin
www.proteus-bigdata.com
www.github.com/proteus-h2020
Extra Slides
Apache Flink 101
Massively parallel data flow engine with unified batch and stream processing
Rich set of operators (including native iteration)
Flink Optimizer
  Inspired by optimizers of parallel database systems
  Physical optimization follows a cost-based approach
Memory Management
  Flink manages its own memory
  Never breaks the JVM heap
Scalable Online Machine Learning
PROTEUS contribution: SOLMA
User-friendly
Extensibility
Basic scalable stream sketches that enable querying the stream
Iterative algorithms for approximating the outcome of offline computation
Ready-to-use (supervised & unsupervised) online ML algorithms in Apache Flink
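As an illustration of a stream sketch that can be queried, below is a minimal Count-Min sketch, a well-known structure for approximate frequency queries over a stream in fixed memory. It is shown only as one example of the idea. The slides do not say which sketches SOLMA provides, and the `CountMinSketch` class and its parameters here are hypothetical.

```python
import hashlib


class CountMinSketch:
    """Approximate per-item counts in fixed memory (may overestimate)."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by prefixing the row number.
        data = f"{row}:{item}".encode("utf-8")
        digest = hashlib.sha256(data).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    cms = CountMinSketch()
    for sensor_id in ["s1", "s2", "s1", "s3", "s1"]:
        cms.add(sensor_id)
    print("estimated count for s1:", cms.estimate("s1"))
```

The estimate is never below the true count. Increasing `width` lowers the expected overestimate, and increasing `depth` lowers the chance of a large one, so both trade memory for accuracy.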