Be naive. Not idiot.
Leveraging the class-conditional independence assumption.

Sylvain Ferrandiz

21 February 2017

[Figure: decision boundaries on the XOR problem for LogisticRegression, an MLP with 4 neurons, and a DecisionTree]

Class-conditional independence assumption

Often called simple, or naive, even idiot*

* Hand, D.J. & Yu, K. (2001). Idiot's Bayes - not so stupid after all? International Statistical Review, 69(3), 385-399. ISSN 0306-7734.

$$\arg\max_y \, p(y \mid x_1, \ldots, x_K) = \arg\max_y \, p(y) \prod_k p(x_k \mid y)$$
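A minimal sketch of this decision rule in Python, assuming the class priors and the per-feature class-conditional densities have already been estimated (the dict-of-callables layout is illustrative, not from the talk):

```python
import numpy as np

def naive_bayes_predict(x, priors, cond_densities):
    """Return argmax_y p(y) * prod_k p(x_k | y).

    priors: dict mapping class y -> p(y)
    cond_densities: dict mapping class y -> list of callables p_k(. | y)
    """
    best_y, best_score = None, -np.inf
    for y, prior in priors.items():
        # Sum logs rather than multiplying densities, to avoid underflow
        # when the number of features K is large.
        score = np.log(prior) + sum(
            np.log(p_k(x_k)) for p_k, x_k in zip(cond_densities[y], x)
        )
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```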

Estimate univariate distributions

Parametric assumption?
→ Gaussian mixture
→ Multinomial distribution

Nonparametric assumption?
→ Kernels
→ Binning (discretization / grouping)
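A sketch contrasting the two routes on a single feature of a single class, with a moment-matched Gaussian on the parametric side and equal-frequency binning on the nonparametric side (both specific choices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1000)  # one feature, one class

# Parametric route: fit a Gaussian by moment matching.
mu, sigma = x.mean(), x.std()
def gaussian_density(v):
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Nonparametric route: equal-frequency bins -> piecewise-constant density.
n_bins = 10
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
counts, _ = np.histogram(x, bins=edges)
bin_density = counts / (counts.sum() * np.diff(edges))
def binned_density(v):
    i = np.clip(np.searchsorted(edges, v) - 1, 0, n_bins - 1)
    return bin_density[i]
```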

Winning binning?

Outliers

Missing values

Stability

* Boullé, M. (2006). MODL: a Bayes optimal discretization method for continuous attributes. Machine Learning, 65(1), 131-165.

No parameters to validate

O(n log n)
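MODL searches the bin boundaries themselves with a Bayes-optimal criterion; as a stand-in only, here is how fixed equal-frequency bins turn one numeric feature into smoothed class-conditional probabilities for the naive Bayes product (this is not the MODL algorithm):

```python
import numpy as np

def binned_conditional_probs(x, y, n_bins=10):
    """Estimate p(bin | y) from equal-frequency bins of feature x.

    A supervised method such as MODL would instead optimize the bin
    boundaries (in O(n log n)); the fixed bins here are a simplification
    for illustration.
    """
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, x) - 1, 0, n_bins - 1)
    # Laplace smoothing keeps empty bins from zeroing the product.
    return edges, {
        c: (np.bincount(bins[y == c], minlength=n_bins) + 1.0)
           / ((y == c).sum() + n_bins)
        for c in np.unique(y)
    }
```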

Selective Naive Bayes

Kontkanen, P., Myllymäki, P., Silander, T., Tirri, H. & Grünwald, P. (2000). On predictive distributions and Bayesian networks. Statistics and Computing, 10, 39-54.

$s_k \in \{0, 1\}$

$$\arg\max_y \, p(y \mid x_1, \ldots, x_K) = \arg\max_y \, p(y) \prod_k p(x_k \mid y)^{s_k}$$
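A sketch of the selective score, where the binary mask simply drops deselected features out of the product (names and layout are illustrative):

```python
import numpy as np

def snb_log_score(x, prior, cond_densities, s):
    """log p(y) + sum_k s_k * log p(x_k | y), with s_k in {0, 1}.

    Features with s_k = 0 contribute nothing to the score.
    """
    log_liks = np.array(
        [np.log(p_k(x_k)) for p_k, x_k in zip(cond_densities, x)]
    )
    return np.log(prior) + np.dot(s, log_liks)
```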

Select features

Guyon, I. & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 1157-1182.

Wrapper approach?
→ Cross-validation

Embedded approach?
→ Greedy optimization
→ Nested subsets
→ Direct objective optimization

Filter approach?
→ Mutual information
→ Weak learner
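The filter route, for instance, can be sketched with scikit-learn's mutual_info_classif, which ranks features by mutual information with the class independently of any downstream model (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
mi = mutual_info_classif(X, y, random_state=0)
# Highest mutual information first; keep the top features, drop the rest.
print("features ranked by mutual information:", np.argsort(mi)[::-1])
```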

Forward Feature Selection

[Diagram: candidate features A, B, C, D in the pool of actual candidates, E in the pool of future candidates, and the set of features already included in the model]

Draw independently
Include iff the AUROCC is improved
Keep it safe otherwise

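A sketch of this greedy loop, taking GaussianNB and a fixed holdout split as stand-ins for the talk's naive Bayes setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           random_state=0)
train, valid = np.arange(0, 400), np.arange(400, 600)

def auc_with(features):
    if not features:
        return 0.5  # empty model: no better than chance
    model = GaussianNB().fit(X[np.ix_(train, features)], y[train])
    scores = model.predict_proba(X[np.ix_(valid, features)])[:, 1]
    return roc_auc_score(y[valid], scores)

rng = np.random.default_rng(0)
selected, best = [], 0.5
for k in rng.permutation(X.shape[1]):    # draw candidates independently
    trial = selected + [int(k)]
    auc = auc_with(trial)
    if auc > best:                       # include iff the AUROCC improves
        selected, best = trial, auc
    # otherwise keep it safe: the current selection is left untouched
print(selected, round(best, 3))
```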

$w_k \in [0, 1]$

Soft selection

$$\arg\max_y \, p(y \mid x_1, \ldots, x_K) = \arg\max_y \, p(y) \prod_k p(x_k \mid y)^{w_k}$$

The averaging trick

$$w_k = \frac{\sum_{s \in S} s_k \, p(s \mid d)}{\sum_{s \in S} p(s \mid d)}$$

Explored models only
Nonparametric prior

* Boullé, M. (2009). A Parameter-Free Classification Method for Large Scale Learning. Journal of Machine Learning Research, 10, 1367-1385.
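A sketch of the weight computation, assuming the explored selections and their unnormalized log posteriors are stored as arrays (the layout is an assumption; the normalizer of p(s|d) cancels between numerator and denominator, so unnormalized posteriors suffice):

```python
import numpy as np

def averaged_weights(S, log_post):
    """w_k = sum_s s_k p(s|d) / sum_s p(s|d), over explored models only.

    S: (n_models, K) binary selection vectors s
    log_post: (n_models,) unnormalized log posteriors log p(s|d)
    """
    post = np.exp(log_post - log_post.max())  # stabilized, unnormalized
    return (post[:, None] * S).sum(axis=0) / post.sum()

S = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
log_post = np.array([-10.0, -11.0, -12.5])
print(averaged_weights(S, log_post))  # soft weights w_k in [0, 1]
```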

+ Performance
- Algorithm complexity

Nonparametric and stable (bye bye cross-validation!)

It’s up to the user to find ‘composite’ features and capture correlational relationships, but

It’s where the fun is, ain’t it?

Numeric / Categorical (bye bye dummy-encoding!)

Interpretable*

*https://www.quora.com/What-makes-a-model-interpretable/answer/Claudia-Perlich

Feature engineering

[Figure: XOR data plotted on axes X and Y; the engineered feature Z = (XY > 0) separates the classes]
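A sketch of that engineered feature on synthetic XOR-style data (labels generated from the XOR rule itself): a linear model is stuck near chance on the raw coordinates, and becomes trivial once Z is supplied.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)       # XOR-like labels

# Raw X, Y coordinates: logistic regression hovers around 0.5 accuracy.
print(LogisticRegression().fit(X, y).score(X, y))

# Engineered Z = (XY > 0): one feature and the problem is linearly trivial.
Z = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)
print(LogisticRegression().fit(Z, y).score(Z, y))
```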

Feature surfacing

[Diagram: three linked tables, Users, Sales, and Web]

Users: CustomerId, Firstname, Lastname, Age
Sales: CustomerId, Product, Amount, Time
Web: CustomerId, Page, Time

Surfaced features: Users.Customer_Id, Users.Firstname, Users.Lastname, Users.Age, Outcome, Count(Sales.Product), CountDistinct(Sales.Product), Mean(Sales.Amount), Sum(Sales.Amount) where Sales.Product = 'Mobile Data', Count(Web.Page) where Day(Web.Time) in [6;7], …
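A sketch of surfacing such aggregates with pandas, on toy tables mirroring the slide's schema (all values invented):

```python
import pandas as pd

users = pd.DataFrame({"CustomerId": [1, 2], "Age": [34, 51]})
sales = pd.DataFrame({
    "CustomerId": [1, 1, 2],
    "Product": ["Mobile Data", "Phone", "Mobile Data"],
    "Amount": [10.0, 200.0, 15.0],
})

# Aggregate the secondary table up to the user level.
agg = sales.groupby("CustomerId").agg(
    count_product=("Product", "count"),
    count_distinct_product=("Product", "nunique"),
    mean_amount=("Amount", "mean"),
).reset_index()

# Conditional aggregate: Sum(Sales.Amount) where Sales.Product = 'Mobile Data'.
mobile = (sales[sales["Product"] == "Mobile Data"]
          .groupby("CustomerId")["Amount"].sum()
          .rename("sum_amount_mobile_data").reset_index())

flat = users.merge(agg, on="CustomerId", how="left") \
            .merge(mobile, on="CustomerId", how="left")
print(flat)
```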

Let’s stay in touch!

Sylvain FERRANDIZ, Machine Learning Scientist

[email protected]