
Automatic Feature Engineering by Deep Reinforcement Learning

Extended Abstract

Jianyu Zhang, Jianye Hao∗, Françoise Fogelman-Soulié, Zan Wang
College of Intelligence and Computing, Tianjin University

Tianjin, China
[email protected],[email protected],[email protected],[email protected]

ABSTRACT
We present a framework called Learning Automatic Feature Engineering Machine (LAFEM), which formalizes the Feature Engineering (FE) problem as an optimization problem over a Heterogeneous Transformation Graph (HTG). We propose a Deep Q-learning approach on the HTG to support efficient learning of fine-grained and generalized FE policies that can transfer knowledge of engineering "good" features from a collection of datasets to unseen datasets.

KEYWORDS
Innovative agents and multiagent applications; Deep learning; Feature generation

ACM Reference Format:
Jianyu Zhang, Jianye Hao∗, Françoise Fogelman-Soulié, Zan Wang. 2019. Automatic Feature Engineering by Deep Reinforcement Learning. In Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), Montreal, Canada, May 13–17, 2019, IFAAMAS, 3 pages.

1 INTRODUCTION
Most existing methods of automatic FE either generate a large set of possible features with predefined transformation operators followed by feature selection (brute force) [1, 4, 9], or apply simple Machine Learning / Reinforcement Learning (a simple algorithm and/or simple meta-features derived from the FE process) to recommend a potentially useful feature [2, 3, 5]. The former makes the process computationally expensive, which is even worse for complex features, while the latter limits the performance boost.

A recently proposed FE approach [3] is based on RL. It treats all features in the dataset as a union and applies traditional Q-learning [8] on FE examples to learn a strategy for automating FE under a given budget. Reinforcement Learning (RL) is a suitable way to solve the FE problem, but this work uses Q-learning with linear approximation over only 12 simple manual features, which limits the ability of automatic FE. Furthermore, it ignores the differences between features and applies each transformation operator to all of them at every step. Because of this nondiscrimination between features, it is computationally expensive, especially for large datasets and complex transformation operators.

To address the above limitations, in this work we propose LAFEM (Learning Automatic Feature Engineering Machine), a novel approach

* Corresponding author.
Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), N. Agmon, M. E. Taylor, E. Elkind, M. Veloso (eds.), May 13–17, 2019, Montreal, Canada. © 2019 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

for automatic FE based on Deep Reinforcement Learning (DRL). We define a Heterogeneous Transformation Graph (HTG), which is a heterogeneous directed acyclic graph representing relationships between different transformed versions of features and datasets, to organize the FE process.

2 METHOD
In this section, we present a new framework called LAFEM. It contains a Heterogeneous Transformation Graph (HTG) to represent the FE process at the feature level and an off-policy RL algorithm on top of the HTG to find a good FE policy. We consider a collection of datasets $\mathcal{D} = \{D_0, D_1, \dots, D_N\}$; each dataset $D_i$ (or $D$ for short) can be represented as $D = \langle F, y \rangle$, where $F = \{f_0, f_1, \dots, f_n\}$ is a set of features and $y$ is the corresponding target variable we want to predict. We apply a classification algorithm $C$ (e.g. Random Forest) on dataset $D$ and use an evaluation measure $m$ (e.g. log-loss, F1-score) to measure the classification performance. We use $P_C^m(F, y)$ or $P(D)$ to denote the cross-validation performance of classification algorithm $C$ under evaluation measure $m$ on dataset $D$.
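As a concrete illustration of the performance measure, the following minimal Python sketch computes $P_C^m(F, y)$ by cross-validating a Random Forest with scikit-learn. This is our illustration, not the authors' code; the function name `evaluate_performance` and the exact scorer are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate_performance(X, y, scoring="f1_macro", n_folds=5):
    """Cross-validation performance P_C^m(F, y): classifier C is a Random
    Forest, measure m is chosen via `scoring` (e.g. "neg_log_loss" for
    log-loss, "f1_macro" for F1)."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=n_folds, scoring=scoring)
    return scores.mean()

# Toy dataset D = <F, y> with three features, for demonstration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(evaluate_performance(X, y))
```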

Since the FE process is time-consuming, especially model evaluation, in practice we usually need to set a budget (e.g. time) for a particular FE problem. A budget $B^e_{max}$ here indicates the maximum number of model evaluation steps we can take for a dataset.

A transformation operator $\tau$ in FE is a function applied to a set of features to generate a new feature $f^+ = \tau(\{f_i\})$, where the order of the operator is the number of features in $\{f_i\}$. We denote the set of derived features as $F^+$. For instance, a product transformation applied to two features (order 2) generates a new feature $f^+ = \mathrm{product}(f_i, f_j)$. We use $T$ to denote the set of operators.
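To make the operator notion concrete, here is a small hedged sketch of an operator set $T$; the specific operators and the dictionary layout are ours for illustration, since the extended abstract does not list its full operator catalogue:

```python
import numpy as np

def product(*features):
    """Order-k product transformation: elementwise product of k features."""
    return np.prod(np.stack(features, axis=0), axis=0)

# Illustrative operator set T, mapping name -> (function, order).
T = {
    "log":     (lambda f: np.log(np.abs(f) + 1e-8), 1),  # order 1
    "square":  (lambda f: f ** 2, 1),                    # order 1
    "product": (product, 2),                             # order 2
}

f_i = np.array([1.0, 2.0, 3.0])
f_j = np.array([4.0, 5.0, 6.0])
f_new = T["product"][0](f_i, f_j)   # f+ = product(f_i, f_j)
```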

2.1 Heterogeneous Transformation Graph
The Heterogeneous Transformation Graph (HTG) is a heterogeneous directed acyclic graph (Figure 1). Each node in the HTG corresponds to either a feature (feature node) or a dataset (dataset node). Each feature node corresponds to either one original feature or one feature derived from original features. Each dataset node corresponds to either the original dataset $D_0$ or a dataset derived from it. The edges are divided into three categories: (a) Feature transformation edge: an edge between two feature nodes $f_i$ and $f_j$, $i > j \geq 0$, indicates that $f_i$ is a feature derived from $f_j$. If more than one feature node $\{f_j\}$ connects to $f_i$, then $f_i$ is a feature derived from all of them, i.e. $f_i = \tau(\{f_j\})$. (b) Dataset transformation edge: an edge between two dataset nodes $D_i$, $D_j$, $i > j \geq 0$, indicates that $D_i$ is a dataset obtained from $D_j$ by feature generation, feature selection or model evaluation. (c) An edge from feature node $f_i$


to dataset node $D_k$, or in the inverse direction, indicates that $f_i$ is added to $D_k$ or dropped from $D_k$. Since we only allow one feature to be added or eliminated at a time, for any dataset node $D_k$ there is at most one feature node connected to it.

Figure 1: Example of HTG.
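One possible in-memory encoding of the HTG follows. This is a sketch under our own reading of Section 2.1; the paper prescribes no implementation, so all class and field names here are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class FeatureNode:
    name: str
    parents: List[str] = field(default_factory=list)  # feature transformation edges (a)
    op: Optional[str] = None                          # the tau that derived this feature

@dataclass
class DatasetNode:
    index: int
    features: Set[str]                   # names of the features present in D_i
    parent: Optional[int] = None         # dataset transformation edge (b)
    feature_edge: Optional[str] = None   # the single feature added/dropped (c)

class HTG:
    """Heterogeneous DAG over feature nodes and dataset nodes."""
    def __init__(self, original_features):
        self.features = {f: FeatureNode(f) for f in original_features}
        self.datasets = [DatasetNode(0, set(original_features))]  # D_0

    def generate(self, new_name, parent_names, op):
        """Feature generation: add a feature node plus the derived dataset node."""
        self.features[new_name] = FeatureNode(new_name, list(parent_names), op)
        cur = self.datasets[-1]
        self.datasets.append(DatasetNode(len(self.datasets),
                                         cur.features | {new_name},
                                         parent=cur.index,
                                         feature_edge=new_name))
```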

2.2 MDP Formulation
Consider the FE process with the HTG on one dataset $D$ as a Markov decision process (MDP). At each time step $t$, a state $s_t \in S$ consists of the HTG $G_t$ and the model evaluation budget $B^e$ available. An action $a_t \in A = A_G \cup A_S \cup A_E$ comes from one of the following three groups:

$A_G$ is a set of actions for feature generation, which apply a transformation $\tau \in T$ on one or more features $\{f\}$ to derive one new feature and add it to dataset $D_t$. $\forall a \in A_G$, $a = \langle \{f\}, \tau \rangle$, where $\{f\}$ is one or more feature nodes in the HTG and $\tau$ is a transformation operator from $T$.

$A_S$ is a set of actions for feature selection by RL, which either drop one feature $f$ from dataset $D_t = \langle F_t, y \rangle$ or add back one feature $f$ that exists in the HTG but not in the current dataset $D_t$.

$A_E$ contains one model evaluation action, which applies classification algorithm $C$ and measure $m$ on dataset $D_t$ to get the performance of $D_t$ as $P_C^m(D_t)$. Because we can only get the performance after the model evaluation action, we call the state reached by an action in $A_E$ a model evaluated state.
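Continuing the HTG sketch above, the three action groups might be encoded as follows; the class names and the single order-1 operator enumerated per feature are our simplifications, not the paper's specification:

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Generate:                 # a in A_G: a = <{f}, tau>
    features: FrozenSet[str]
    op: str

@dataclass(frozen=True)
class Select:                   # a in A_S: drop f, or add back a feature from the HTG
    feature: str
    drop: bool

@dataclass(frozen=True)
class Evaluate:                 # the single model evaluation action in A_E
    pass

def action_space(htg, dataset):
    """Enumerate A = A_G u A_S u A_E for the current dataset node D_t."""
    actions = [Evaluate()]
    for f in dataset.features:
        actions.append(Generate(frozenset({f}), "log"))   # one order-1 example
        actions.append(Select(f, drop=True))
    for f in htg.features.keys() - dataset.features:      # in HTG but not in D_t
        actions.append(Select(f, drop=False))             # add back
    return actions
```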

The reward $r_t$ of this FE problem in the HTG at time step $t$ is:

$$r_t = \max_{i \in [0, t+1]} P_C^m(D_i) \;-\; \max_{j \in [0, t]} P_C^m(D_j) \qquad (1)$$
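In code, Eq. (1) reduces to a difference of running maxima. This is a direct transcription, assuming a list `perf` that holds $P_C^m(D_i)$ for every state, with non-evaluated states carrying the last known value so the maxima are well defined:

```python
def reward(perf, t):
    """Eq. (1): the gain in best-so-far performance contributed by step t."""
    return max(perf[:t + 2]) - max(perf[:t + 1])  # indices 0..t+1 vs 0..t

# e.g. reward([0.70, 0.70, 0.74], 1) -> 0.04 (up to float rounding)
```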

2.3 LAFEM Algorithm
So far, we have introduced the organization of the FE process and the MDP formulation of the FE problem. The most critical part is the algorithm that finds the optimal FE strategy. We introduce the LAFEM framework, which can apply an off-policy DRL algorithm $A$ (such as DQN, Double DQN [6], or Dueling DQN [7]) on the HTG to perform automatic FE.

To increase the generalization ability of the strategy learned by the LAFEM framework, we train the agent on many datasets. To prevent the algorithm from always applying feature generation or feature selection on the dataset and never evaluating the performance, we set a constraint $B^a_{max}$ on the maximum number of steps between two model evaluated states.

To train an agent that generalizes across many datasets, each time we apply an $\epsilon$-greedy strategy on one dataset $D$ sampled from $\mathcal{D}$, store the resulting transitions into the replay buffer, and then use mini-batch gradient descent to update $A$. We repeat this process until convergence. The details are shown in Algorithm 1.

The reward function in Eq. (1) is the original reward for the automatic FE problem in the HTG. From Eq. (1), rewards only exist when we apply

Algorithm 1 LAFEM

input: a set of datasets $\mathcal{D} = \{D\}$, replay buffer $R$, budgets $B^e_{max}$ and $B^a_{max}$, an off-policy DRL algorithm $A$

1: while not done do
2:   Bootstrap-sample a dataset $D$ from $\mathcal{D}$
3:   Initialize budgets: $B^e \leftarrow B^e_{max}$, $B^a \leftarrow B^a_{max}$
4:   while $B^e > 0$ do
5:     if $B^a > 0$ then
6:       Get an action $a_t$ by $\epsilon$-greedy and execute $a_t$
7:     else
8:       Choose action $a_t \in A_E$ and execute $a_t$
9:     end if
10:    Store the transition in $R$ and set $B^a \leftarrow B^a - 1$
11:    if $a_t \in A_E$ then
12:      Set $B^e \leftarrow B^e - 1$, $B^a \leftarrow B^a_{max}$
13:    end if
14:  end while
15:  while not done do
16:    Sample a mini-batch from replay buffer $R$
17:    Perform one optimization step on $A$
18:  end while
19: end while
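A Python rendering of Algorithm 1 follows. This is a sketch, not the authors' implementation: `act`, `learn`, and `step` stand in for the $\epsilon$-greedy policy, optimization step, and FE environment of the off-policy algorithm $A$, whose interfaces we assume rather than define:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform replay buffer R."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)
    def store(self, transition):
        self.buf.append(transition)
    def sample(self, k=32):
        return random.sample(list(self.buf), min(k, len(self.buf)))

def lafem_train(datasets, act, learn, step, B_e_max, B_a_max, n_epochs=100):
    EVAL = "evaluate"                            # the single action in A_E
    R = ReplayBuffer()
    for _ in range(n_epochs):                    # outer "while not done" loop
        D = random.choice(datasets)              # step 2: bootstrap-sample a dataset
        B_e, B_a = B_e_max, B_a_max              # step 3: initialize budgets
        state = D                                # stands in for (G_t, B^e)
        while B_e > 0:                           # step 4
            a = act(state) if B_a > 0 else EVAL  # steps 5-9: force evaluation
            state_next, r = step(state, a)
            R.store((state, a, r, state_next))   # step 10
            B_a -= 1
            if a == EVAL:                        # steps 11-13: reset between-eval budget
                B_e, B_a = B_e - 1, B_a_max
            state = state_next
        learn(R.sample())                        # steps 15-18: mini-batch update of A
```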

model evaluation, so a feature generation or feature selection step never receives an immediate reward. As a result, the reward is extremely sparse and delayed (the so-called delayed reward problem). To solve this, we design reward shaping by modifying the value of $P_C^m(D_t)$, where $D_t$ is not an evaluated state, as:

$$P_C^m(D_t) = \frac{P_C^m(D_{t+\alpha}) - P_C^m(D_{t-\beta})}{\alpha + \beta + 1} \qquad (2)$$

where $\alpha$ and $\beta$ are the numbers of steps to the next and the previous model evaluated state, respectively. Hence, feature generation and feature selection actions $a \in A_G \cup A_S$ can gain an immediate reward at each step. On most datasets, the framework outperforms state-of-the-art automatic FE approaches in both model performance and time efficiency, for both simple and complex transformation operators.
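Eq. (2) can be read as spreading the performance gain between two evaluations uniformly over the intermediate steps. A sketch follows, under our own encoding of the evaluated time steps as a sorted list `eval_steps`:

```python
def shaped_perf(perf, t, eval_steps):
    """Eq. (2): value assigned to a non-evaluated state D_t, assuming t lies
    between two evaluated steps; `eval_steps` lists the time steps at which
    the action in A_E was executed."""
    if t in eval_steps:
        return perf[t]                           # evaluated states keep P_C^m(D_t)
    nxt = min(s for s in eval_steps if s > t)    # t + alpha
    prv = max(s for s in eval_steps if s < t)    # t - beta
    return (perf[nxt] - perf[prv]) / (nxt - prv + 1)   # alpha + beta + 1

# e.g. shaped_perf([0.70, None, None, 0.76], 1, [0, 3]) -> 0.06 / 4 = 0.015
```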

3 CONCLUSION
In this paper, we present a novel framework called LAFEM to perform automatic feature engineering (FE), combining feature generation, feature selection and model evaluation. It includes a heterogeneous transformation graph (HTG) that organizes the FE process and a Deep Reinforcement Learning algorithm that performs automatic FE on the HTG.

ACKNOWLEDGEMENT
The work is supported by the National Natural Science Foundation of China (Grant Nos. 61702362, U1836214), the Special Program of Artificial Intelligence, the Tianjin Research Program of Application Foundation and Advanced Technology (No. 16JCQNJC00100), and the Special Program of Artificial Intelligence of the Tianjin Municipal Science and Technology Commission (No. 17ZXRGGX00150).


REFERENCES
[1] James Max Kanter and Kalyan Veeramachaneni. 2015. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 1–10.
[2] Gilad Katz, Eui Chul Richard Shin, and Dawn Song. 2016. ExploreKit: Automatic feature generation and selection. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM 2016). IEEE, 979–984.
[3] Udayan Khurana, Horst Samulowitz, and Deepak Turaga. 2017. Feature Engineering for Predictive Modeling using Reinforcement Learning. arXiv preprint arXiv:1709.07150 (2017).
[4] Hoang Thanh Lam, Johann-Michael Thiebaut, Mathieu Sinn, Bei Chen, Tiep Mai, and Oznur Alkan. 2017. One button machine for automating feature engineering in relational databases. arXiv preprint arXiv:1706.00327 (2017).
[5] Fatemeh Nargesian, Horst Samulowitz, Udayan Khurana, Elias B. Khalil, and Deepak Turaga. 2017. Learning feature engineering for classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Vol. 17. 2529–2535.
[6] Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep Reinforcement Learning with Double Q-Learning. In AAAI, Vol. 2. Phoenix, AZ, 5.
[7] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. 2015. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581 (2015).
[8] Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3-4 (1992), 279–292.
[9] Jianyu Zhang, Françoise Fogelman-Soulié, and Christine Largeron. 2018. Towards Automatic Complex Feature Engineering. In International Conference on Web Information Systems Engineering. Springer, 312–322.
