Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from...

14
Feature Store: the missing data layer in ML pipelines? 1 FOSDEM 2019 Brussels | HPC, Big Data and Data Science DevRoom Kim Hammar [email protected] @KimHammar1 February 1, 2019 1 Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/. 2018. Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 1 / 13

Transcript of Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from...

Page 1: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

Feature Store: the missing data layer in ML pipelines?1

FOSDEM 2019 Brussels | HPC, Big Data and Data Science DevRoom

Kim Hammar

[email protected]

@KimHammar1

February 1, 2019

1Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines?https://www.logicalclocks.com/feature-store/. 2018.

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 1 / 13

Page 2: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

ϕ(x)y1...yn

x1,1 . . . x1,n

... . . ....

xn,1 . . . xn,n

y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 2 / 13

Page 3: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

ϕ(x)y1...yn

x1,1 . . . x1,n

... . . ....

xn,1 . . . xn,n

y?_\_( ") )_/

_

“Data is the hardest part of ML and the most important piece toget right.

Modelers spend most of their time selecting and transformingfeatures at training time and then building the pipelines to deliverthose features to production models.”

- Uber2

2Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.https://eng.uber.com/scaling-michelangelo/. 2018.

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 3 / 13

Page 4: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

ϕ(x)y1...yn

x1,1 . . . x1,n

... . . ....

xn,1 . . . xn,n

y?_\_( ") )_/

_

“Data is the hardest part of ML and the most important piece toget right.

Modelers spend most of their time selecting and transformingfeatures at training time and then building the pipelines to deliverthose features to production models.”

- Uber2

2Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.https://eng.uber.com/scaling-michelangelo/. 2018.

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 3 / 13

Page 5: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

ϕ(x)y1...yn

x1,1 . . . x1,n

... . . ....

xn,1 . . . xn,n

yFeature Store

“Data is the hardest part of ML and the most important piece toget right.

Modelers spend most of their time selecting and transformingfeatures at training time and then building the pipelines to deliverthose features to production models.”

- Uber3

3Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.https://eng.uber.com/scaling-michelangelo/. 2018.

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 4 / 13

Page 6: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

Solution: Disentangle ML Pipelineswith a Feature Store

Raw/Structured Data

Feature StoreFeature Engineering Training

Modelsb0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y

A feature store is a central vault for storing documented, curated, andaccess-controlled features.

The feature store is the interface between data engineering and datamodel development

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 5 / 13

Page 7: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

Motivation for the Feature Store

The feature store enables:Reusability of features between models and teamsAutomatic backfilling of featuresAutomatic feature documentation and analysisFeature versioningStandardized access of features between training and servingFeature discovery

Num. Curated Features in Feature StoreAvg

.Cos

tof

ane

wM

LPro

ject

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 6 / 13

Page 8: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

Reusing FeaturesWithout a Feature Store is Complex

Data Sources Dataset 1 Dataset 2 . . . Dataset n

w1,1 . . . w1,n

.... . .

...wn,1 . . . wn,n

Features

w1,1 . . . w1,n

.... . .

...wn,1 . . . wn,n

Features

w1,1 . . . w1,n

.... . .

...wn,1 . . . wn,n

Features

Siloed Feature SetsWithout a feature store it is typical to

have feature sets stored inisolation from each other.

ModelsModels are trained using sets of features.

Without a feature store each model typicallydefines its own feature definitions,

without feature sharing across models.

b0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y≥ 0.9 < 0.9

≥ 0.2

< 0.2

≥ 11.2

< 11.2

B B

A

(−1,−1) (−8,−8)(−10, 0) (0,−10)

40 60 80 100

160

180

200

X

Y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 7 / 13

Page 9: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

Reusing FeaturesWith a Feature Store is Simple

Data Sources Dataset 1 Dataset 2 . . . Dataset n

Feature StoreFeature Store

A data management platform for machine learning.The interface between data engineering and data science.

ModelsModels are trained using sets of features.

The features are fetched from the feature storeand can overlap between models.

b0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y≥ 0.9 < 0.9

≥ 0.2

< 0.2

≥ 11.2

< 11.2

B B

A

(−1,−1) (−8,−8)(−10, 0) (0,−10)

40 60 80 100

160

180

200

X

Y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 8 / 13

Page 10: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

The Components of a Feature Store

The Storage Layer: For storing feature data in the feature storeThe Metadata Layer: For storing feature metadata (versioning,feature analysis, documentation, jobs)The Feature Engineering Jobs: For computing featuresThe Feature Registry: A user interface to share and discoverfeaturesThe Feature Store API: For writing/reading to/from the featurestore

Feature Storage

Feature Metadata Jobs

Feature Registry API

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 9 / 13

Page 11: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

Hopsworks Feature Store API

Reading from the Feature Store:

from hops import featurestore

features_df = featurestore.get_features([

"average_attendance",

"average_player_age"

])

Writing to the Feature Store:

from hops import featurestore

raw_data = spark.read.parquet(filename)

pol_features = raw_data.map(lambda x: x^2)

featurestore.insert_into_featuregroup(pol_features , "pol_featuregroup")

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 10 / 13

Page 12: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

Hopsworks Feature Store API Service

SQL

hops-util

Query PlannerFeature Store Data

Hive on HopsFS

Feature StoreMetadata

Dataframe With Features

Client Interface Feature Store Service Output

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 11 / 13

Page 13: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

Summary

Machine learning comes with a high technical costMachine learning pipelines needs proper data managementA feature store is a place to store curated and documented featuresThe feature store serves as an interface between feature engineeringand model development, it can help disentangle complex ML pipelinesHopsworks4 provides the world’s first open-source feature store

@hopshadoop

www.hops.io

@logicalclocks

www.logicalclocks.com

We are open source:https://github.com/logicalclocks/hopsworks

https://github.com/hopshadoop/hops

54Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.5Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou,

Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, and Alex Ormenisan

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 12 / 13

Page 14: Feature Store: the missing data layer in ML pipelines?1 · Hopsworks Feature Store API Reading from the Feature Store: from hops import featurestore features_df = featurestore.get_features(["average_attendance",

References

Hopsworks’ feature store6 (the only open-source one!)Uber’s feature store7

Airbnb’s feature store8

Comcast’s feature store9

GO-JEK’s feature store10

HopsML11

Hopsworks12

6Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines?https://www.logicalclocks.com/feature-store/. 2018.

7Li Erran Li et al. “Scaling Machine Learning as a Service”. In: Proceedings of The 3rd InternationalConference on Predictive Applications and APIs. Ed. by Claire Hardgrove et al. Vol. 67. Proceedings of MachineLearning Research. Microsoft NERD, Boston, USA: PMLR, 2017, pp. 14–29. URL:http://proceedings.mlr.press/v67/li17a.html.

8Nikhil Simha and Varant Zanoyan. Zipline: Airbnb’s Machine Learning Data Management Platform.https://databricks.com/session/zipline-airbnbs-machine-learning-data-management-platform. 2018.

9Nabeel Sarwar. Operationalizing Machine Learning—Managing Provenance from Raw Data to Predictions.https://databricks.com/session/operationalizing-machine-learning-managing-provenance-from-raw-data-to-predictions. 2018.

10Willem Pienaar. Building a Feature Platform to Scale Machine Learning | DataEngConf BCN ’18.https://www.youtube.com/watch?v=0iCXY6VnpCc. 2018.

11Logical Clocks AB. HopsML: Python-First ML Pipelines.https://hops.readthedocs.io/en/latest/hopsml/hopsML.html. 2018.

12Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.Kim Hammar (Logical Clocks) Hopsworks Feature Store February 1, 2019 13 / 13