Ballet: A lightweight framework for open-source ...

1
Ballet: A lightweight framework for open-source, collaborative feature engineering Micah J. Smith, Kelvin Lu, and Kalyan Veeramachaneni Massachusetts Institute of Technology Introduction and Motivation Open data science projects consist of community-driven, open-source analysis of data and development of predictive models to address societal problems. Motivation In the Frag- ile Families Challenge [1], which tasked participants with predicting GPA, Grit, Material hardship, Eviction, Layoff, and Job training, feature engineering was the main competition bottleneck. Figure 1: The Fragile Families Challenge dataset I Development of such models must support synchronous collaboration by hundreds of interested contributors I Need a framework to support splitting data science tasks and combining individual components of source code I Feature engineering, an important task in practical machine learning, is susceptible this strategy. I Goals: Accelerate development, improve quality of features, facilitate collaborative contributions, maintain pipeline integrity, avoid “heavy” infrastructure Ballet is a lightweight software framework for collaborative, open-source data science through a focus on feature engineering. https://github.com/HDI-Project/ballet The Ballet framework I Challenge 1: Facilitate incremental patches maintain feature engineering pipeline invariant through extensive software validation and streaming logical feature selection I Challenge 2: Can’t rely on any custom infrastructure (open-source world) design for lightweight architecture Initialize Ballet generates a new GitHub repository from a template which contains an empty feature engineering pipeline. Develop How do contributors grow the pipeline? Develop features clone project write feature Feature validation 1 CI ballet-project .../features/contrib/ create pull request want to improve model trigger build prune features trigger build update badge accepted? trigger build merge Feature pruning bot build X feature pruning evaluation Continuous metrics bot build X compute accuracy, MSE, compactness project structure check feature API check feature acceptance evaluation Maintainer 2 3 4 5 Figure 2: The feature development lifecycle import ballet.eng input = [ ' Full Bath ' , ' Half Bath ' , ' Bsmt Full Bath ' , ' Bsmt Half Bath ' ] def count_baths(df): return df[ ' Full Bath ' ] + 0.5 * df[ ' Half Bath ' ] + \ df[ ' Bsmt Full Bath ' ] + 0.5 * df[ ' Bsmt Half Bath ' ] transformer = ballet.eng.SimpleFunctionTransformer(func=count_baths) feature = Feature(input=input, transformer=transformer , name= ' Bathroom Count ' ) Figure 3: A Ballet feature from the Ames problem Model The resulting feature engineering pipeline is used as a dependency to an AutoML solution or a custom ML model (not the focus of this work). Logical feature selection A logical feature is a function that maps raw variables in one data instance to a vector of feature values, f D j : V p R q j , where q j is the dimensionality of the feature vector. Can have q j > 0, such as with one-hot encodings, embeddings, etc. The logical feature selection problem is to select a subset of feature functions, F * = arg min F 0 ∈P (F ) E[L(A F 0 )] Contrast this with the traditional feature selection problem to select a subset of feature values, X * X . Streaming logical feature selection (SLFS): Let F t be the set of features accepted as of time t, and let f t+1 be proposed at time t +1. I Acceptance problem: accept f t+1 , setting F t+1 = F t f t+1 , or reject, setting F t+1 = F t . I Pruning problem: remove a subset S ⊆F t+1 of low-quality features, setting F t+1 = F t+1 \ S . Example Adapt α-investing [2] to the logical feature selection setting: T = -2(logL(F t ) - logL(F t )) T χ 2 (q ) We can compute a p-value and accept if p t t . Evaluation Case study: Ames housing price prediction. I Extract 249 logical features from 9 public notebooks on Kaggle. I Simulate a scenario in which Kagglers submitted their features to a Ballet project instead. I Iteratively select random notebook, simulate its submission, and validate using SLFS. Figure 4: Performance of streaming and batch feature selection on Ames notebook features in terms of compactness (percentage of logical features selected by algorithm, left) and R 2 (center); mean accepted and rejected features per data scientist using SLFS (right). Takeaways: 72.4% of all features are rejected by the feature validation and SLFS algorithm, suggesting substantial work was redundant across notebooks. Every notebook had both accepted and rejected features, suggesting both that everyone had something to contribute to final pipeline but that everyone did redundant work. Conclusion and future work We describe the design and implementation of the Ballet framework and demonstrate its use through a case study on real-world data. As we continue to develop Ballet, in further research, we hope to assess impacts on developer efficiency through user studies, deploy it in large-scale collaborations (are you interested?), and investigate more sophisticated SLFS algorithms. References [1] A. Kindel, V. Bansal, K. Catena, T. Hartshorne, K. Jaeger, D. Koffman, S. McLanahan, M. Phillips, S. Rouhani, R. Vinh, and et al. Improving metadata infrastructure for complex surveys: Insights from the fragile families challenge, Sep 2018. [2] J. Zhou, D. Foster, R. Stine, and L. Ungar. Streaming feature selection using alpha-investing. Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining - KDD ’05, page 384, 2005. https://dai.lids.mit.edu {micahs,kelvinlu,kalyanv}@mit.edu

Transcript of Ballet: A lightweight framework for open-source ...

Page 1: Ballet: A lightweight framework for open-source ...

Ballet: A lightweight framework foropen-source, collaborative feature engineering

Micah J. Smith, Kelvin Lu, and Kalyan VeeramachaneniMassachusetts Institute of Technology

Introduction and Motivation

Open data science projects consist of community-driven, open-source analysis of data anddevelopment of predictive models to address societal problems.

Motivation In the Frag-ile Families Challenge [1],which tasked participantswith predicting GPA, Grit,Material hardship,Eviction, Layoff, andJob training, featureengineering was the maincompetition bottleneck. Figure 1: The Fragile Families Challenge dataset

I Development of such models must support synchronous collaboration by hundreds ofinterested contributors

I Need a framework to support splitting data science tasks and combining individualcomponents of source code

I Feature engineering, an important task in practical machine learning, is susceptible thisstrategy.

I Goals: Accelerate development, improve quality of features, facilitate collaborativecontributions, maintain pipeline integrity, avoid “heavy” infrastructure

Ballet is a lightweight software framework for collaborative, open-source datascience through a focus on feature engineering.

⇒ https://github.com/HDI-Project/ballet

The Ballet framework

I Challenge 1: Facilitate incremental patches→ maintain feature engineering pipelineinvariant through extensive software validation and streaming logical feature selection

I Challenge 2: Can’t rely on any custom infrastructure (open-source world)→ design forlightweight architecture

Initialize Ballet generates a new GitHub repository from a template which contains anempty feature engineering pipeline.

Develop How do contributors grow the pipeline?

Develop features clone project write feature

Feature validation

1

CI

ballet-project.../features/contrib/

createpull request

want toimprove model

trigger build

prune features

triggerbuild

updatebadge

accepted?trigg

er bu

ild merge

Feature pruning bot build X feature pruning evaluation

Continuous metrics bot build X compute accuracy, MSE, compactness

project structure checkfeature API checkfeature acceptance evaluation

Maintainer

2 3

4

5

Figure 2: The feature development lifecycle

� �import ballet.eng

input = ['Full Bath ', 'Half Bath ','Bsmt Full Bath ', 'Bsmt Half Bath ']

def count_baths(df):

return df['Full Bath '] + 0.5 * df['Half Bath '] + \

df['Bsmt Full Bath '] + 0.5 * df['Bsmt Half Bath ']transformer = ballet.eng.SimpleFunctionTransformer(func=count_baths)

feature = Feature(input=input ,

transformer=transformer ,

name='Bathroom Count ')� �Figure 3: A Ballet feature from the Ames problem

Model The resulting feature engineering pipeline is used as a dependency to an AutoMLsolution or a custom ML model (not the focus of this work).

Logical feature selection

A logical feature is a function that maps raw variables in one data instance to a vector offeature values,

fDj : Vp→ Rqj,where qj is the dimensionality of the feature vector.

⇒ Can have qj > 0, such as with one-hot encodings, embeddings, etc.

The logical feature selection problem is to select a subset of feature functions,

F∗ = argminF ′∈P(F)

E[L(AF ′)]

Contrast this with the traditional feature selection problem to select a subset of featurevalues, X∗ ⊆ X.

Streaming logical feature selection (SLFS): Let Ft be the set of features accepted as oftime t, and let ft+1 be proposed at time t+ 1.

I Acceptance problem: accept ft+1, setting Ft+1 = Ft ∪ ft+1, or reject, settingFt+1 = Ft.

I Pruning problem: remove a subset S ⊆ Ft+1 of low-quality features, settingFt+1 = Ft+1 \ S.

Example Adapt α-investing [2] to the logical feature selection setting:

T = −2(logL(Ft)− logL(F†t ))T ∼ χ2(q)

We can compute a p-value and accept if pt < αt.

Evaluation

Case study: Ames housing price prediction.I Extract 249 logical features from 9 public notebooks on Kaggle.I Simulate a scenario in which Kagglers submitted their features to a Ballet project instead.I Iteratively select random notebook, simulate its submission, and validate using SLFS.

batch streaming0.0

0.2

0.4

0.6

0.8

1.0

com

pact

ness

(%)

batch streaming0.0

0.2

0.4

0.6

0.8

1.0

R-sq

uare

d

1 2 3 4 5 6 7 8 9notebook

0

5

10

15

20

25

30

num

ber o

f fea

ture

s

acceptedrejected

Figure 4: Performance of streaming and batch feature selection on Ames notebook features interms of compactness (percentage of logical features selected by algorithm, left) and R2

(center); mean accepted and rejected features per data scientist using SLFS (right).

Takeaways: 72.4% of all features are rejected by the feature validation and SLFS algorithm,suggesting substantial work was redundant across notebooks. Every notebook had bothaccepted and rejected features, suggesting both that everyone had something to contribute tofinal pipeline but that everyone did redundant work.

Conclusion and future work

We describe the design and implementation of the Ballet framework and demonstrate its usethrough a case study on real-world data.

As we continue to develop Ballet, in further research, we hope to assess impacts on developerefficiency through user studies, deploy it in large-scale collaborations (are you interested?),and investigate more sophisticated SLFS algorithms.

References

[1] A. Kindel, V. Bansal, K. Catena, T. Hartshorne, K. Jaeger, D. Koffman, S. McLanahan, M. Phillips, S. Rouhani, R. Vinh, and et al.

Improving metadata infrastructure for complex surveys: Insights from the fragile families challenge, Sep 2018.

[2] J. Zhou, D. Foster, R. Stine, and L. Ungar.

Streaming feature selection using alpha-investing.

Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining - KDD ’05, page 384, 2005.

https://dai.lids.mit.edu {micahs,kelvinlu,kalyanv}@mit.edu