Feature surfacing - meetup

Feature surfacing: Discover, Aggregate & Evaluate. Jean-Baptiste PRIEZ, 8 March 2017

Transcript of Feature surfacing - meetup

Page 1: Feature surfacing  - meetup

Feature surfacing: Discover, Aggregate & Evaluate

Jean-Baptiste PRIEZ, 8 March 2017

Page 2: Feature surfacing  - meetup

Feature engineering

[Figure: XOR example with variables X and Y, and Z = (XY > 0)]
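A minimal sketch of why the product feature helps on XOR-like data. The toy grid of points and the helper `best_threshold_accuracy` are assumptions made for illustration, as is the choice of taking the XOR of the signs of X and Y as the class. No single threshold on X or Y alone does better than chance, while Z = (XY > 0) separates the classes.

```python
# Toy XOR-like data (hypothetical): the class is the XOR of the signs of X and Y.
points = [(x, y) for x in (-2, -1, 1, 2) for y in (-2, -1, 1, 2)]
labels = [(x > 0) != (y > 0) for x, y in points]


def best_threshold_accuracy(values, labels):
    """Best accuracy reachable by a single-threshold rule on one raw variable."""
    n = len(values)
    best = 0.0
    for t in sorted(set(values)):
        hits = sum((v >= t) == l for v, l in zip(values, labels))
        best = max(best, hits / n, (n - hits) / n)  # either polarity of the rule
    return best


acc_x = best_threshold_accuracy([x for x, _ in points], labels)
acc_y = best_threshold_accuracy([y for _, y in points], labels)

# Engineered feature from the slide: Z = (XY > 0).
# With this labeling, the class is exactly "not Z".
z = [x * y > 0 for x, y in points]
acc_z = sum((not zi) == l for zi, l in zip(z, labels)) / len(labels)

print(acc_x, acc_y, acc_z)
```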

Page 3: Feature surfacing  - meetup

Users (CustomerId, Firstname, Lastname, Age)

Sales (CustomerId, Product, Amount, Time)

Web (CustomerId, Page, Time)

Columns of the resulting table:
• Users.Customer_Id
• Users.Firstname
• Users.Lastname
• Users.Age
• Outcome
• Count(Sales.Product)
• CountDistinct(Sales.Product)
• Mean(Sales.Amount)
• Sum(Sales.Amount) where Sales.Product = 'Mobile Data'
• Count(Web.Page) where Day(Web.Time) in [6;7]
• …
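As an illustration, the aggregates named on this slide can be computed by hand on a few rows. Everything in the data below is made up, and reading Day(Web.Time) in [6;7] as Saturday and Sunday (ISO weekday numbers) is an assumption. The Outcome column is left out because it is a label, not an aggregate.

```python
from datetime import datetime
from statistics import mean

# Hypothetical rows following the schemas shown on the slide.
users = [
    {"CustomerId": 1, "Firstname": "Ada", "Lastname": "A", "Age": 34},
    {"CustomerId": 2, "Firstname": "Bob", "Lastname": "B", "Age": 51},
    {"CustomerId": 3, "Firstname": "Cy", "Lastname": "C", "Age": 27},  # no sales, no visits
]
sales = [
    {"CustomerId": 1, "Product": "Mobile Data", "Amount": 10.0, "Time": datetime(2017, 3, 1)},
    {"CustomerId": 1, "Product": "Handset", "Amount": 200.0, "Time": datetime(2017, 3, 2)},
    {"CustomerId": 1, "Product": "Mobile Data", "Amount": 15.0, "Time": datetime(2017, 3, 3)},
    {"CustomerId": 2, "Product": "Handset", "Amount": 150.0, "Time": datetime(2017, 3, 3)},
]
web = [
    {"CustomerId": 1, "Page": "/home", "Time": datetime(2017, 3, 4, 10)},    # Saturday
    {"CustomerId": 1, "Page": "/offers", "Time": datetime(2017, 3, 6, 18)},  # Monday
    {"CustomerId": 2, "Page": "/home", "Time": datetime(2017, 3, 5, 11)},    # Sunday
]


def surface(user):
    """Return one flat row: the user's own fields plus aggregates of Sales and Web."""
    cid = user["CustomerId"]
    s = [r for r in sales if r["CustomerId"] == cid]
    w = [r for r in web if r["CustomerId"] == cid]
    amounts = [r["Amount"] for r in s]
    return {
        **user,
        "Count(Sales.Product)": len(s),
        "CountDistinct(Sales.Product)": len({r["Product"] for r in s}),
        "Mean(Sales.Amount)": mean(amounts) if amounts else None,
        "Sum(Sales.Amount) where Sales.Product = 'Mobile Data'": sum(
            r["Amount"] for r in s if r["Product"] == "Mobile Data"
        ),
        "Count(Web.Page) where Day(Web.Time) in [6;7]": sum(
            1 for r in w if r["Time"].isoweekday() in (6, 7)
        ),
    }


flat = [surface(u) for u in users]
for row in flat:
    print(row)
```

A customer with no rows in a secondary table still gets one output row: counts and sums are 0, and the mean is left undefined (`None` here).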

Feature surfacing

Page 4: Feature surfacing  - meetup

LET’S START WITH AN EXAMPLE…

Feature Surfacing

Page 5: Feature surfacing  - meetup

Example: Outbound Mail Campaign

1 Central Table

Customer

(Id, e-mail address, age, state, will buy within 5 days)

Page 6: Feature surfacing  - meetup

Example: Outbound Mail Campaign

3 Peripheral Tables

Pages visited on the website
(visited pages, duration of the session, browser type…)

Orders
(number of products, amount spent, order status...)

E-mail campaign reactions
(action, action type, time since e-mail was sent…)

Page 7: Feature surfacing  - meetup
Page 8: Feature surfacing  - meetup
Page 9: Feature surfacing  - meetup

Which are the sources, variables to choose? How to represent them?

Should we sit and meditate around a table? Should we try each and every variable manually?

What if we let the machine work?

It’s a Machine Learning problem:
• How to smartly explore the entire set of possible aggregates?
• Without under/overfitting?
• With a linearithmic complexity?

Page 10: Feature surfacing  - meetup

What is Feature Surfacing?

1. Extraction of information contained in a multi-table data source
• Aggregation operators
• Filter operators

2. Evaluation of aggregates extracted from a star-relational data schema

Feature surfacing consists in applying a set of aggregation operators on the peripheral tables to generate features in the central table.

[Diagram: star schema. One central table linked to four peripheral tables (Peripheral table 1 to Peripheral table 4); each link is annotated with the cardinalities 1,1 and 0,n, and marked *.]

* 1 row per entity in the central table, corresponding to several rows for the same entity in the peripheral table.

Extraction, Evaluation (supervised)

Page 11: Feature surfacing  - meetup

What are the operators?

Some aggregation operators:

Name            Return type   Operands      Label
Count           Num           Table         Number of records
CountDistinct   Num           Table, Cat    Number of distinct values
Mode            Cat           Table, Cat    Most frequent value
Mean            Num           Table, Num    Mean value
StdDev          Num           Table, Num    Standard deviation
Median          Num           Table, Num    Median value
Min             Num           Table, Num    Min value
Max             Num           Table, Num    Max value
Sum             Num           Table, Num    Sum of values

Some filter operators:

Name    Return type   Operands       Label
<, ≤    Table         Table, Num     Table filtered over field values smaller than (or equal to) a given value
>, ≥    Table         Table, Num     Table filtered over field values greater than (or equal to) a given value
=       Table         Table, Field   Table filtered over field values equal to a given value

Customize your operators:
• Date: before, after, week-end, etc…
• Time: morning, afternoon, etc…
• String: split, infinitive verb, etc…
• ...
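The operator tables above suggest a simple composition rule: a filter takes a table and returns a table, so its result can feed any aggregation. Here is a minimal sketch of that idea, assuming a table is a list of dict records. The function names, the field names and the noon cut-off for "morning" are all illustrative choices, not taken from the slides.

```python
from collections import Counter
from datetime import datetime
from statistics import median


# Aggregation operators: Table (+ field) -> Num or Cat.
def count(table):
    return len(table)


def count_distinct(table, field):
    return len({r[field] for r in table})


def mode(table, field):
    return Counter(r[field] for r in table).most_common(1)[0][0] if table else None


def median_of(table, field):
    return median([r[field] for r in table]) if table else None


# Filter operators: Table -> Table.
def where_le(table, field, value):
    return [r for r in table if r[field] <= value]


def where_eq(table, field, value):
    return [r for r in table if r[field] == value]


# Custom Time operator (hypothetical definition: before noon).
def where_morning(table, field):
    return [r for r in table if r[field].hour < 12]


# Hypothetical page visits of one customer.
pages = [
    {"duration": 30, "device": "smartphone", "time": datetime(2017, 3, 1, 9)},
    {"duration": 120, "device": "desktop", "time": datetime(2017, 3, 1, 15)},
    {"duration": 50, "device": "smartphone", "time": datetime(2017, 3, 2, 20)},
    {"duration": 70, "device": "smartphone", "time": datetime(2017, 3, 3, 8)},
]

median_smartphone = median_of(where_eq(pages, "device", "smartphone"), "duration")
morning_visits = count(where_morning(pages, "time"))
short_visits = count(where_le(pages, "duration", 50))
favourite_device = mode(pages, "device")
print(median_smartphone, morning_visits, short_visits, favourite_device)
```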

Page 12: Feature surfacing  - meetup

Presentation of some smart aggregates

1. Count(Pages visited)
Number of visited pages by the customer

2. Max(Orders, amount spent)
The maximal amount spent by the customer

3. Mode(Email reactions, action type)
The most frequent email request of the customer

4. Median(Pages visited, duration) when Pages visited.device = “smartphone”

Page 13: Feature surfacing  - meetup

How to be smart?

• Good aggregate
• 1st: Aggregation ☀❤🐰
• 2nd: Filter + Aggregation ⭐
• 3rd: Filter + Filter + Aggregation ⚠♨🤔
• … etc ... ⛔🔞

M. BOULLÉ. Towards Automatic Feature Construction for Supervised Classification. In ECML/PKDD, p. 181-196, 2014.
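The ranking above can be read as "explore simple aggregates first". The sketch below enumerates candidate aggregates by number of operators, which also shows how quickly the candidate set grows with each added filter. It is only an illustration of that ordering and does not reproduce the method of the cited paper. The schema, the filter list and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical schema of one peripheral table: field -> type.
schema = {"duration": "Num", "device": "Cat"}

# Aggregation operators and the operand type they accept (None = whole table).
aggregations = {"Count": None, "CountDistinct": "Cat", "Mode": "Cat",
                "Mean": "Num", "Max": "Num"}

# Hypothetical elementary filters.
filters = ["device = 'smartphone'", "duration <= 60", "hour < 12"]


def aggregates(table="Pages"):
    """All type-compatible aggregations without any filter."""
    out = []
    for op, operand_type in aggregations.items():
        if operand_type is None:
            out.append(f"{op}({table})")
        else:
            out += [f"{op}({table}, {f})" for f, t in schema.items() if t == operand_type]
    return out


def candidates(max_filters):
    """Yield (number of operators, expression), simplest expressions first."""
    for k in range(max_filters + 1):
        for combo in combinations(filters, k):
            for agg in aggregates():
                suffix = " where " + " and ".join(combo) if combo else ""
                yield k + 1, agg + suffix


cands = list(candidates(max_filters=2))
per_level = Counter(k for k, _ in cands)
print(per_level)  # how many candidates use 1, 2 or 3 operators
print(cands[:3])
```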

Page 14: Feature surfacing  - meetup

How to evaluate and select features?

• Discretization / Grouping → Correlation with the target
• Select (the most) correlated features

Target set (e.g. sick, healthy)

split such that the trade-off between entropy & compression is optimal
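To see why entropy alone is not enough, here is a small sketch that measures the entropy of the target inside the intervals of a discretization. The data and the function names are made up for the example. Adding cut points can only lower this entropy, down to zero when every instance sits alone in its interval, which is why a compression term is needed to penalize over-fine splits.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(values, labels, cuts):
    """Entropy of the target inside the intervals defined by cuts, weighted by interval size."""
    bins = {}
    for v, l in zip(values, labels):
        idx = sum(v > c for c in cuts)  # index of the interval containing v
        bins.setdefault(idx, []).append(l)
    n = len(values)
    return sum(len(b) / n * entropy(b) for b in bins.values())


# Hypothetical feature values and targets.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = ["healthy", "healthy", "healthy", "sick", "healthy", "sick", "sick", "sick"]

h_none = conditional_entropy(values, labels, [])       # a single interval
h_one = conditional_entropy(values, labels, [3.5])     # two intervals
h_all = conditional_entropy(values, labels, [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5])
print(h_none, h_one, h_all)
```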

Page 15: Feature surfacing  - meetup

Discretization algorithms

• ChiMerge (R, SAS)
• Optimize entropy
• C4.5 (…)
• Optimize compression
• Fusinter (Zighed & co - Sipina)
• MDL-disc / MDLP (Fayyad & Irani, Pfahringer - Spark)
• MODL (Boullé)
• Optimize both: entropy & compression

Page 16: Feature surfacing  - meetup

Popularize: MODL

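Target set (e.g. sick, healthy)

I: $i_1, i_2, i_3, i_4, i_5, i_6, i_7$

Discretize with MODL = minimize the following formula:

$$\mathrm{Value}(D) = \log n + \log \binom{n+I-1}{I-1} + \sum_{i=1}^{I} \log \binom{n_i+J-1}{J-1} + \sum_{i=1}^{I} \log \frac{n_i!}{n_{i,1}!\, n_{i,2}! \cdots n_{i,J}!}$$

where n is the number of instances, I the number of intervals, J the number of target values, n_i the number of instances in interval i, and n_{i,j} the number of instances of interval i with target value j.

Labels on the formula: entropy, compression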

Page 17: Feature surfacing  - meetup

Interpretation of smart aggregates calculated over the visited pages table

For each customer:
Count(VisitedPages) = Number of visited pages

Interpretation graphic shows that:
• there is a niche of future buyers: those who have visited more than 96.5 pages over the period (top segment)
• the majority of the base has visited no or only a few pages of the site over the period

For each customer:
Median(VisitedPages, duration) = median duration of stay on a specific page

Page 18: Feature surfacing  - meetup

+ & -

• + Good complexity
• + Statistically efficient
• + Manages overfitting by design

• - Not enough to win every Kaggle contest…

Page 19: Feature surfacing  - meetup

Let’s stay in touch!

Jean-Baptiste PRIEZ
Data Scientist

[email protected]