Machine Learning for Data Streams Albert Bifet (@abifet) DATAIA-JST International Symposium on Data Science and AI 11 July 2018


Machine Learning for Data Streams

Albert Bifet (@abifet)

DATAIA-JST International Symposium on Data Science and AI

11 July 2018


AI Systems

According to Nikola Kasabov, AI systems should exhibit the following characteristics:

• Accommodate new problem-solving rules incrementally

• Adapt online and in real time

• Be able to analyze themselves in terms of behavior, error and success

• Learn and improve through interaction with the environment (embodiment)

• Learn quickly from large amounts of data (Big Data)

• Have memory-based exemplar storage and retrieval capacities

• Have parameters to represent short- and long-term memory, age, forgetting, etc.


Data Streams

• Maintain models online

• Incorporate data on the fly

• Unbounded training sets

• Resource efficient

• Detect changes and adapt

• Dynamic models


Analytic Standard Approach

Finite training sets

Static models

[Diagram: Data Set → Classifier Algorithm, which builds the Model]


Data Stream Approach

Infinite training sets

Dynamic models

[Diagram: a sequence of data chunks D, each followed by an Update Model step that yields a new model M]
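The update-the-model-per-chunk idea can be sketched as a test-then-train loop: each arriving example is first used to test the current model and then to update it, so no example is stored. This is only a minimal illustration, not the method used in the talk. The `OnlinePerceptron` class, the `synthetic_stream` generator and the `prequential_accuracy` helper are hypothetical names introduced here, and the perceptron stands in for any incremental learner.

```python
import random


class OnlinePerceptron:
    """Hypothetical incremental learner: updated one example at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        score = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if score > 0 else 0

    def learn_one(self, x, y):
        # Classic perceptron rule: adjust weights only on a mistake.
        error = y - self.predict(x)
        if error != 0:
            for i, xi in enumerate(x):
                self.weights[i] += self.learning_rate * error * xi
            self.bias += self.learning_rate * error


def synthetic_stream(n, seed=42):
    """Illustrative stream: label is 1 when x0 + x1 > 0 (an assumption)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, (1 if x[0] + x[1] > 0 else 0)


def prequential_accuracy(model, stream):
    """Test-then-train: predict on each example before learning from it."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test first
        model.learn_one(x, y)                  # then update the model
        total += 1
    return correct / total if total else 0.0


accuracy = prequential_accuracy(OnlinePerceptron(2), synthetic_stream(5000))
print(f"prequential accuracy: {accuracy:.3f}")
```

Memory use stays constant regardless of stream length, because only the weights and two counters are kept.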


Importance of Online Learning

• As spam trends change, it is important to retrain the model with newly judged data

• Previously tested on news comments at Y! Inc

• Over a 29-day period, the base model (without active learning) degrades in performance compared with the online model (AUC stands for Area Under Curve)

• Original paper

Adversarial Learning

• Need to retrain!

• Things change over time

• How often?

• Data unused until next update!

• Value of data wasted


AI Challenges


Cédric Villani and Marc Schoenauer


1. Green AI


Green AI

• One pass over the data

• Approximation algorithms: small error ε with high probability 1-δ

• True hypothesis H, and learned hypothesis Ĥ

• Pr[ |H - Ĥ| < ε|H| ] > 1-δ
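As one concrete example of a one-pass (ε, δ) approximation, here is a minimal Count-Min sketch for item frequencies. This technique is swapped in for illustration and is not named on the slide. Its guarantee is also additive rather than the relative form above: the estimate never falls below the true count and, with probability at least 1-δ, exceeds it by no more than ε times the stream length. The class name, the hash construction and the parameter values are assumptions.

```python
import math
import random


class CountMinSketch:
    """Illustrative one-pass frequency summary with (epsilon, delta) sizing."""

    def __init__(self, epsilon, delta, seed=0):
        # Standard sizing: width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0
        # One (a, b) pair per row for a hash of the form ((a*h + b) mod p) mod width.
        self._prime = (1 << 61) - 1
        rng = random.Random(seed)
        self._params = [
            (rng.randrange(1, self._prime), rng.randrange(0, self._prime))
            for _ in range(self.depth)
        ]

    def _index(self, row, item):
        a, b = self._params[row]
        return ((a * hash(item) + b) % self._prime) % self.width

    def add(self, item, count=1):
        # Each item is seen once and then discarded: one pass over the data.
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest estimate.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


sketch = CountMinSketch(epsilon=0.01, delta=0.01)
for i in range(1000):
    sketch.add(i % 10)  # ten distinct items, 100 occurrences each

print(sketch.width, sketch.depth, sketch.estimate(3))
```

Memory depends only on ε and δ, not on the number of items in the stream.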


2. Explainable AI


Decision Tree

• Each node tests a feature

• Each branch represents a value

• Each leaf assigns a class

• Greedy recursive induction

• Sort all examples through tree

• xi = most discriminative attribute

• New node for xi, new branch for each value, leaf assigns majority class

• Stop if no error | limit on #instances

[Figure: example decision tree for "Car deal?" with tests RoadTested? (Yes/No), Mileage? (High/Low) and Age? (Old/Recent), and leaves ✅ / ❌]
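The step "xi = most discriminative attribute" can be sketched with information gain as the merit measure. The slide does not say which measure is meant, so information gain is an assumption here. The four example rows are invented for illustration and only reuse the attribute names from the car example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


def most_discriminative(rows, labels, attributes):
    """Greedy choice: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical training examples (not from the talk).
rows = [
    {"road_tested": "yes", "mileage": "low", "age": "recent"},
    {"road_tested": "yes", "mileage": "high", "age": "old"},
    {"road_tested": "no", "mileage": "low", "age": "old"},
    {"road_tested": "no", "mileage": "high", "age": "recent"},
]
labels = ["deal", "deal", "no deal", "no deal"]

best = most_discriminative(rows, labels, ["road_tested", "mileage", "age"])
print(best)
```

In this toy data only `road_tested` separates the classes, so it is chosen for the root. Induction would then recurse on each branch until the stopping rule is met.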


HOEFFDING TREE

• Sample of stream enough for near optimal decision

• Estimate merit of alternatives from prefix of stream

• Choose sample size based on statistical principles

• When to expand a leaf?

• Let x1 be the most informative attribute, x2 the second most informative one

• Hoeffding bound: split if G(x1) - G(x2) > ε, where

$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$

P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
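The split rule can be sketched numerically. In the bound ε = sqrt(R² ln(1/δ) / (2n)), R is the range of the merit measure G, n is the number of examples seen at the leaf, and 1-δ is the desired confidence. The function names and the values R = 1, δ = 1e-7 and n = 1000 below are illustrative assumptions, not settings taken from the talk.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(gain_best, gain_second, value_range, delta, n):
    """Split when the observed gap between the two best attributes exceeds epsilon."""
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)


# Illustrative values: merit range R = 1, delta = 1e-7.
epsilon_1k = hoeffding_bound(1.0, 1e-7, 1000)
epsilon_10k = hoeffding_bound(1.0, 1e-7, 10000)
print(f"epsilon after 1,000 examples:  {epsilon_1k:.4f}")
print(f"epsilon after 10,000 examples: {epsilon_10k:.4f}")

clear_winner = should_split(0.30, 0.15, 1.0, 1e-7, 1000)
too_close = should_split(0.30, 0.25, 1.0, 1e-7, 1000)
print(clear_winner, too_close)
```

Because ε shrinks as n grows, a leaf whose two best attributes are close simply waits for more examples before it commits to a split.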


Adaptive Random Forest

• Why Random Forests?

• Off-the-shelf learner

• Good learning performance

• Based on the original Random Forest by Breiman

Gomes, H. M.; Bifet, A.; Read, J.; Barddal, J. P.; Enembreck, F.; Pfahringer, B.; Holmes, G.; Abdessalem, T. “Adaptive random forests for evolving data stream classification.” Machine Learning, Springer, 2017.


3. Ethical Issues


Should data have an expiration date?


Other AI Challenges


1. Open AI


MOA

{M}assive {O}nline {A}nalysis, MOA (Bifet et al. 2010)

• {M}assive {O}nline {A}nalysis is a framework for online learning from data streams.

• It is closely related to WEKA

• It includes a collection of offline and online methods as well as tools for evaluation:

• classification, regression

• clustering, frequent pattern mining

• Easy to extend

• Easy to design and run experiments


2. Distributed Data Stream Mining


Vision

[Diagram labels: Streaming, Distributed, IoT Big Data Stream Mining]


APACHE SAMOA

http://samoa-project.net

Data Mining

• Distributed, Batch: Hadoop; Mahout

• Distributed, Stream: Storm, S4, Samza; SAMOA

• Non Distributed, Batch: R, WEKA,…

• Non Distributed, Stream: MOA

G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)


SAMOA ARCHITECTURE

5 Creating a Flink Adapter on Apache SAMOA

Apache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning algorithms, which can run on top of different Data Stream Processing Engines (DSPEs).

As depicted in Figure 20, Apache SAMOA offers the abstractions and APIs for developing new distributed ML algorithms to enrich the existing library of state-of-the-art algorithms [27, 28]. Moreover, SAMOA provides the possibility of integrating new DSPEs, allowing in that way the ML programmers to implement an algorithm once and run it in different DSPEs [28].

An adapter for integrating Apache Flink into Apache SAMOA was implemented in the scope of this master thesis, with the main parts of its implementation being addressed in this section. With the use of our adapter, ML algorithms can be executed on top of Apache Flink. The implemented adapter will be used for the evaluation of the ML pipelines and HT algorithm variations.

Figure 20: Apache SAMOA’s high level architecture.

5.1 Apache SAMOA Abstractions

Apache SAMOA offers a number of abstractions which allow users to implement any distributed streaming ML algorithms in a platform independent way. The most important abstractions of Apache SAMOA are presented below [27, 28].


StreamDM

http://huawei-noah.github.io/streamDM


scikit-multiflow


Jesse Read, Ecole Polytechnique, France

Jacob Montiel, Telecom ParisTech, France


Summary

• Machine Learning for Data Streams is useful for finding approximate solutions in a reasonable amount of time and with limited resources

• Challenges:

• Open AI

• Green AI

• Explainable AI

• Ethical Issues

• Distributed Data Stream Mining


Summary

• Green AI

• Explainable AI

• Ethical Issues

• Open AI

• Distributed Data Stream Mining


Green Data Mining


Thanks!


@abifet
