
Machine Learning and Econometrics

Sendhil Mullainathan

Understand OLS

•  The real problem here is minimizing the “wrong” thing: In-sample fit vs out-of-sample fit

AVERAGES NOTATION

Decision Tree Example

Fitting

•  Suppose we fit the best tree we could to some dataset

•  What would we get?

•  How do we resolve this problem?

OLS vs Subset Selection

• If the problem is that we are using too many variables, what if we…
  – Looked at functions that only used s of the k variables?

• Example:
  – The single variable that fits best

• Isn't there overfit here too?
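The subset-selection idea can be sketched directly: search over all size-s subsets of the k variables and keep the one with the best in-sample fit. The simulated data, coefficients, and subset size below are illustrative assumptions, not from the lecture.

```python
# A minimal sketch of best-subset selection: among all size-s subsets of the
# k candidate variables, pick the one whose OLS fit has the lowest in-sample
# mean squared error. Data here is simulated.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n, k, s = 100, 6, 2
X = rng.normal(size=(n, k))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 0.0])  # only vars 0 and 2 matter
y = X @ beta_true + rng.normal(size=n)

def in_sample_mse(cols):
    """Fit OLS on the chosen columns and return in-sample mean squared error."""
    Xs = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.mean((y - Xs @ beta) ** 2)

best = min(combinations(range(k), s), key=in_sample_mse)
print(best)  # with this strong a signal, the truly nonzero variables win
```

Note that the slide's worry is real: in-sample MSE can only improve as s grows, so choosing s itself already requires out-of-sample evaluation.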

Let's do the same thing here

Unconstrained:

f_A = argmin_{f ∈ F_A} E_H[L(f(x), y)]

Constrained: why not do this instead?

argmin_{f ∈ F} E_H[L(f(x), y)]   s.t.   R(f) ≤ c

where R(f) is a complexity measure: the tendency to overfit

Constrained minimization

• We could do a constrained minimization

• But notice that this is equivalent to:

• Want the complexity measure to capture the tendency to overfit

f_{A,λ} = argmin_{f ∈ F_A} E_H[L(f(x), y)] + λ·R(f)

(want: λ·R(f) ≈ L(f) − L̂(f), the out-of-sample minus in-sample loss)

Basic insight

• Data has signal and noise

• More complex function classes:
  – Allow us to pick up more of the signal
  – But also pick up more of the noise

• So the problem of prediction becomes the problem of choosing complexity
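The signal/noise point shows up in a toy simulation (an assumed setup, not from the lecture): polynomial classes of increasing degree fit the training sample ever better, but past some complexity they start absorbing noise and do worse on fresh draws.

```python
# Toy illustration: richer polynomial classes always improve in-sample fit,
# but out-of-sample error eventually worsens as the fit absorbs noise.
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, 30)
x_test = rng.uniform(-1, 1, 1000)
signal = lambda x: np.sin(3 * x)                       # the "signal"
y_train = signal(x_train) + rng.normal(0, 0.3, x_train.size)  # signal + noise
y_test = signal(x_test) + rng.normal(0, 0.3, x_test.size)

train_mse, test_mse = {}, {}
for degree in (1, 4, 15):
    coefs = np.polyfit(x_train, y_train, degree)       # least-squares poly fit
    train_mse[degree] = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)

# In-sample error can only fall as the degree grows; out-of-sample error
# typically falls and then rises again.
print(train_mse)
print(test_mse)
```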

Overall Structure

• Create a regularizer that:
  – Measures complexity

• Penalize the algorithm for choosing more expressive functions
  – The tuning parameter lambda is the price

• Let it weigh this penalty against in-sample fit

Decision Tree Regularizer

•  What makes a good regularizer?
   – Depth
   – Number of data points per leaf
   – Number of splits

•  What happens as complexity gets higher?

Linear Example

• Linear function class

• Regularized linear regression

Regularizers for Linear Functions

• Linear functions are more expressive if they use more variables

• Can weight coefficients

R(f) = Σ_{j=1}^{k} 1{β_j ≠ 0}

R(β) = Σ_{j=1}^{k} |β_j|^p
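The two penalties above can be computed directly; the coefficient vector here is an illustrative assumption.

```python
# The subset-selection penalty counts nonzero coefficients, while the
# |beta_j|^p family gives the lasso (p = 1) and ridge (p = 2) penalties.
import numpy as np

beta = np.array([1.5, 0.0, -0.5, 2.0])   # illustrative coefficients

r_subset = int(np.sum(beta != 0))        # R(f) = sum of 1{beta_j != 0}
r_lasso = np.sum(np.abs(beta))           # p = 1: sum of |beta_j|
r_ridge = np.sum(beta ** 2)              # p = 2: sum of beta_j^2

print(r_subset, r_lasso, r_ridge)        # 3 4.0 6.5
```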

Computationally More Tractable

•  Lasso (p = 1): R(β) = Σ_j |β_j|

•  Ridge (p = 2): R(β) = Σ_j β_j²
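One reason these penalties are tractable: for squared loss, ridge has a closed form, and lasso, while lacking one, is a convex problem solved by fast iterative methods (e.g. coordinate descent). A minimal numpy sketch of the ridge closed form, on simulated data:

```python
# Ridge with squared loss has the closed form beta = (X'X + lam*I)^{-1} X'y.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form ridge solution: solve (X'X + lam*I) beta = X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(ridge(X, y, 0.0))    # lam = 0 recovers OLS
print(ridge(X, y, 100.0))  # a heavy penalty shrinks coefficients toward zero
```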

Half the Sauce

• Regularization is one half of the secret sauce

• Gives a one-dimensional way of capturing expressiveness

• The missing ingredient is lambda: how much complexity do we want?

f_{A,λ} = argmin_{f ∈ F_A} E_H[L(f(x), y)] + λ·R(f)     (want: λ·R(f) ≈ L(f) − L̂(f))

Choosing lambda

•  How much should we penalize expressiveness?

•  How do you make the over-fit approximation tradeoff?

•  The tuning problem.

The tuning problem

• Back to where we started?

• We have parametrized the tradeoff

• But we still have no way of choosing the level of complexity

Want Out-of-Sample, But Only Have In-Sample

Sample S_n:   E_{S_n}[L(f(x), y)]   (can measure)
New data:     E[L(f(x), y)]         (want)

Split S_n into Train and Tune:
  E_{S_Train}[L(f(x), y)]   (can measure)
  E_{S_Tune}[L(f(x), y)]    (can measure)

Back to our original problem: in-sample, no regularization is the best regularization

Traditional Model Selection: make structural assumptions on the DGP and analytically calculate the difference


Empirical Tuning

• But now we can see what level of regularization does best out of sample

• So estimate for many values of lambda:

f_{A,λ} = argmin_{f ∈ F_A} E_H[L(f(x), y)] + λ·R(f)     (want: λ·R(f) ≈ L(f) − L̂(f))

Now in this case

•  See the performance of each λ in the new "tune" data

•  A few assumptions and…
   – Simple convex optimization
   – So choosing between infinitely many procedures

λ̂ ≈ argmin_λ E_H[L(f_λ(x), y)]
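The empirical tuning loop can be sketched as follows; ridge regression as the algorithm, the λ grid, the split sizes, and the simulated data are all illustrative assumptions.

```python
# Fit f_lambda on a train split for a grid of lambdas, then pick the lambda
# with the lowest loss on a held-out tune split.
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 30
X = rng.normal(size=(n, k))
beta_true = np.zeros(k)
beta_true[:3] = [2.0, -1.0, 1.5]          # sparse signal
y = X @ beta_true + rng.normal(size=n)

X_tr, y_tr = X[:100], y[:100]             # train: used to fit f_lambda
X_tu, y_tu = X[100:], y[100:]             # tune: used only to choose lambda

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
tune_loss = {lam: np.mean((y_tu - X_tu @ ridge(X_tr, y_tr, lam)) ** 2)
             for lam in grid}
lam_hat = min(tune_loss, key=tune_loss.get)
print(lam_hat, tune_loss)
```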

Overfit Dominates

Creating "Out-of-Sample" In-Sample

•  Major point:
   – Not many assumptions
   – Don't need to know the true model
   – Don't need to know much about the algorithm

•  Something profound here:
   – We use the data itself to choose complexity

•  Aside: What happens as the sample size goes up?

Why does this work?

1.  Not just because we can split a sample and call it out of sample

–  It’s because the thing we are optimizing is observable

This is more than a trick

•  It illustrates what separates prediction from estimation:
   – I can't "observe" my prior
      •  Whether the world is truly drawn from a linear model
   – But prediction quality is observable

•  Put simply:
   – The validity of predictions is measurable
   – The validity of coefficient estimates requires structural knowledge

This is the essential ingredient of prediction: prediction quality is an empirical quantity, not a theoretical guarantee

Why does this work?

1.  It's because the thing we are optimizing is observable
    •  Notice that this works irrespective of the number of variables
       – This was not directly hard-wired into our calculations

Why does this work?

1. It's because the thing we are optimizing is observable
2. By focusing on prediction quality we have reduced dimensionality

To understand this…

•  Suppose you tried to use this to choose coefficients
   – Ask which set of coefficients worked well out-of-sample

•  Does this work?

•  Problem 1: Estimation quality is unobservable
   – You need the same assumptions as the algorithm to know whether you "work" out of sample
   – If you just go by fit, you are conceding that you want the best-predicting model
   – Coefficients don't exist in the same way predictions do

•  Problem 2: No dimensionality reduction
   – You've got as many coefficients as before to search over

We can be more efficient than this

•  Will use a tool called cross-validation

•  Basic insight:
   – Why not use the hold-out to estimate another function and see how it does on the train set?

Cross Validation

[Figure: the sample is split into train and tune portions repeatedly; each tuning set is 1/5 of the training set]
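The fold-by-fold procedure can be sketched as follows; ridge regression, the λ grid, and the simulated data are illustrative assumptions.

```python
# 5-fold cross-validation for choosing lambda: each fold takes a turn as the
# tune set while the model is fit on the other four, so every observation is
# used both for fitting and for tuning.
import numpy as np

rng = np.random.default_rng(4)
n, k = 150, 20
X = rng.normal(size=(n, k))
y = X @ rng.normal(size=k) + rng.normal(size=n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_loss(lam, n_folds=5):
    """Average held-out squared error of ridge(lam) across the folds."""
    folds = np.array_split(np.arange(n), n_folds)
    losses = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)   # fit on everything but the fold
        beta = ridge(X[train], y[train], lam)
        losses.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return np.mean(losses)

grid = [0.01, 0.1, 1.0, 10.0]
lam_hat = min(grid, key=cv_loss)
print(lam_hat)
```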

Some Notation

•  Cross-validation is used for tuning

•  But after we’ve done that, we cannot use it also to evaluate how well our algorithm is doing

•  Why??

Secret Sauce

•  Key ingredients:
   1.  Dimensionality reduction through regularization
   2.  A focus on predictions means quality is observable
       •  Which means we can empirically tune

[Figure: Data → Engineering → Fitting → Testing. Fitting runs on a fitting sample and outputs a predictor f; testing runs on a hold-out sample and reports the estimated loss of f under the chosen loss function]

Overview of ML Playbook

Train: ∀λ, fit predictors that are testable out-of-sample
  – Divide the data T into k folds {F_1, …, F_k}: observation (y_i, x_i) ∈ F_{φ(i)}
  – ∀λ: estimate f̂_λ^(−j) on T \ F_j, giving the set {f̂_λ^(−j)}

Tune: use out-of-sample performance to choose λ̂
  – Which λ leads to the best prediction of y_i by f̂_λ^(−φ(i))(x_i)?

Output: use λ̂ to form f̂
  – Fit on T using λ̂ and output f̂_{A,T}
  – Sometimes instead: (1/k) Σ_{j=1}^{k} f̂_λ^(−j)

Overview of Steps

[Figure: the Data → Engineering → Fitting → Testing pipeline, with a fitting sample producing f and a hold-out sample evaluating its loss]

Applications of Machine Learning

•  New Data

•  Prediction in Policy


New Data

•  An Example

Xie et al. (2016)

What does this have to do with ML?

•  Processing of data requires machine learning

Blumenstock (2015)


Crop Yield


•  Two kinds of processing:
   – Pre-processing
      •  Extracting any sort of features from the image
   – Processing
      •  Conversion of features to economically meaningful units

Find Farms

Relate visual features to yield

Considerations

•  Need training data
   – Hand labeling
   – Merging to other data sets

•  Don’t be stingy

New Data

•  An Example

•  Kinds of New Data
   – Satellite Data
   – Language data

"This class was a religious experience for me... I had to take it all on faith."

"I am convinced that you can learn by osmosis by just sitting in his class."

"Most of us spent the 1st 3 weeks terrified of the class. Then solidarity kicked in."

"The course was very thorough. What wasn't covered in class was covered on the final exam."

TEXT  

Language Features

•  Bag of words

Bag of Words

TEXT → Dictionary

"This class was a religious experience for me... I had to take it all on faith."
→ This, Class, Was, A, Religious, Experience, For, Me, I, Had, To, Take, It, All, On, Faith

"I am convinced that you can learn by osmosis by just sitting in his class."
→ Am, Convinced, That, You, Can, Learn, By, Osmosis, Just, Sitting, In, His

"Most of us spent the 1st 3 weeks terrified of the class. Then solidarity kicked in."
→ Most, Of, Us, Spent, The, First, Three, Weeks, …

              This  Am  Most  Of  Class  Convinced  By  Osmosis  Three  Weeks
Quote 1:        1    0    0    0    1       0        0     0       0      0
Quote 2:        0    1    0    0    1       1        1     1       0      0

Quote 1: "This class was a religious experience for me... I had to take it all on faith."
Quote 2: "I am convinced that you can learn by osmosis by just sitting in his class."
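A minimal bag-of-words sketch in pure Python; the example sentences are from the slide, while the tokenizer and the 0/1 encoding are simplifying assumptions (real pipelines add stemming, stop-word removal, counts, etc.).

```python
# Build a dictionary from the corpus, then represent each document as a
# vector of 0/1 word indicators over that dictionary.
docs = [
    "This class was a religious experience for me... I had to take it all on faith.",
    "I am convinced that you can learn by osmosis by just sitting in his class.",
]

def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    tokens = (w.strip(".,!?'\"").lower() for w in text.split())
    return [w for w in tokens if w]

vocab = sorted({w for d in docs for w in tokenize(d)})
vectors = [[1 if w in tokenize(d) else 0 for w in vocab] for d in docs]

# Each row of `vectors` is one document; each column is one dictionary word.
print(len(vocab))
print(dict(zip(vocab, vectors[0])))
```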

Can Predict Which Bills Survive Committee

Yano, Smith, and Wilkerson

Financial Information

Kogan et al.

10-K Forms

Predicting Volatility

What predicts?

Language Features

•  Bag of words
•  Modifying bag of words: similarity/synonyms
•  Syntactic structure
•  Meaning: sentiment analysis

Can use sentiment as a feature

Language Features

•  Bag of words
•  Modifying bag of words: similarity/synonyms
•  Syntactic structure
•  Meaning: sentiment analysis
•  LIWC

New Data

•  An Example

•  Kinds of New Data
   – Satellite Data
   – Language data
   – Digital Exhaust

Google Searches for "iPhone slow"

Choi and Varian

New Data

•  An Example

•  Kinds of New Data
   – Satellite Data
   – Language data
   – Digital Exhaust
   – Network Data
   – …

Applications of Machine Learning

•  New Data

•  Prediction in Policy


Question

•  Can prediction be directly useful in policy?

•  These decisions seem inherently causal:
   – "Should we do policy X?"
   – "What will X do?"
   – "What happens with and without X?"

Two Toy Policy Decisions

•  Rain Dance

•  Umbrella

•  Common Elements
   – Both are decisions with payoffs
   – Both rely on data of the type:
      •  Y = rain, X = variables correlated with rain
   – Both use data to estimate a function y = f(x)

Framework

[Figure: two versions of the decision diagram, one per toy problem. In the rain-dance problem the decision X0 (rain dance) is meant to affect Y (rain), so what matters is the causal effect of X0 on Y: causation. In the umbrella problem the decision X0 (umbrella) does not affect Y (rain); given X (atmospheric conditions), what matters is predicting Y: prediction.]

Causation → Experiments          Prediction → Machine Learning

Are there Umbrella Problems?

•  Decisions where predictions matter…

•  Where we can have big social impact

•  And with enough data

•  Prediction policy problems

Prediction

A Policy Problem in the US

•  Each year police make over 12 million arrests

•  Where do people wait for trial?

•  Release vs. detain is high stakes
   – Pre-trial detention spells average 2-3 months (can be up to 9-12 months)
   – Nearly 750,000 people are in jails in the US
   – Consequential for jobs and families, as well as crime

Kleinberg, Lakkaraju, Leskovec, Ludwig, and Mullainathan

Judge’s Problem

•  The judge must decide whether or not to release (bail)

•  A defendant out on bail can behave badly:
   – Fail to appear in court
   – Commit a crime

•  The judge is making a prediction

[Figure: the judge's problem as a diagram. Past record → crime risk is PREDICTION; crime risk, together with the release decision, determines social costs. A second version adds an intervention (a bracelet), whose effect on social costs is CAUSATION.]

How big is this effect?

•  Effect of another police officer
   – Chalfin and McCrary 2013
   – To get a 4 percentage point reduction in crime…
      •  Would need ~40,000 more officers nationwide
      •  Costs more than 4.8 billion dollars per year
   – Or just implement this prediction rule
      •  Some fixed costs and minimal flow cost

•  And we're not even considering the other benefits

Important Caveat in this Analysis

•  Selective labels
   – The literature ignores this

•  How do we resolve it?

Bail Not Unique

Prediction Policy Problems

•  Decision aids, not a substitute for humans

•  Must resolve important policy considerations