
Human Decisions and Machine Predictions

Jure Leskovec (@jure). Including joint work with Himabindu Lakkaraju, Sendhil Mullainathan, Jon Kleinberg, and Jens Ludwig.

Machine Learning

Humans vs. Machines

§ Given the same data/features X, machines tend to beat humans:
§ Games: chess, AlphaGo, poker
§ Image classification
§ Question answering (IBM Watson)

§ But humans tend to see more than machines do. Humans make decisions based on (X, Z)!

Humans Make Decisions
Humans make decisions based on (X, Z)!
§ The machine is trained on P(Y|X)
§ But humans use P(Y|X,Z) to make decisions
§ Could this be a problem?

Yes! Why? Because the data are selectively labeled, and cross-validation can severely overestimate model performance.

Plan for the Talk
1) Can machines make better decisions than humans?

2) Why do humans make mistakes?

3) Can machines help humans make better decisions?


Example Problem: Criminal Court Bail Assignment

Example Decision Problem
U.S. police make >12M arrests/year.
Q: Where do people wait for trial? Release vs. detain is high stakes:

§ Pre-trial detention spells average 2-3 months (and can be up to 9-12 months)
§ Nearly 750,000 people are in jails in the US
§ Consequential for jobs and families, as well as for crime

Judge's Decision Problem
The judge must decide whether to release (bail) or not (jail).
Outcomes: a defendant out on bail can behave badly:

§ Fail to appear at trial
§ Commit a non-violent crime
§ Commit a violent crime

The judge is making a prediction!

Bail as a Decision Problem
§ Not assessing guilt on this crime
§ Not a punishment
§ The judge can only pay attention to failure-to-appear and crime risk

The ML Task
We want to build a predictor that, based on a defendant's criminal history, predicts the defendant's future bad behavior.

Defendant characteristics → Learning algorithm → Prediction of bad behavior in the future

What Characteristics to Use?
Only administrative features:
§ Common across regions
§ Easy to get
§ No "gaming the system" problem

Only features that are legal to use:
§ No race, gender, religion

Note: Judge predictions may not obey either of these constraints.

Data: Defendant Features
§ Age at first arrest
§ Times sentenced to residential correction
§ Level of charge
§ Number of active warrants
§ Number of misdemeanor cases
§ Number of past revocations
§ Current charge is domestic violence
§ Is first arrest
§ Prior jail sentence
§ Prior prison sentence
§ Employed at first arrest
§ Currently on supervision
§ Had previous revocation
§ Arrest for new offense while on supervision or bond
§ Has active warrant
§ Has active misdemeanor warrant
§ Has other pending charge
§ Had previous adult conviction
§ Had previous adult misdemeanor conviction
§ Had previous adult felony conviction
§ Had previous failure to appear
§ Prior supervision within 10 years

(only legal features and easy to get)

Data: Kentucky & Federal

Jurisdiction            | Number of cases | Fraction released | Fraction of crime | Failure to appear at trial | Non-violent crime | Violent crime
Kentucky                | 362k            | 73%               | 17%               | 10%                        | 4.2%              | 2.8%
Federal Pretrial System | 1.1m            | 78%               | 19%               | 12%                        | 5.4%              | 1.9%

Applying Machine Learning
Just apply machine learning:

§ Use the labeled dataset (released defendants)
§ Train a learning model to predict whether a defendant, when out on bail, will misbehave
§ Report accuracy on the held-out test set
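To make this naive pipeline concrete, here is a minimal sketch in Python with scikit-learn. The file name, column names, and tree depth are hypothetical placeholders rather than the actual study data or code; the point is only that both training and evaluation happen on released (labeled) defendants.

```python
# Minimal sketch of the naive pipeline described above (hypothetical file and
# column names; not the actual study data or code).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Labeled data exist only for defendants the judges released.
df = pd.read_csv("released_defendants.csv")
X = df.drop(columns=["committed_crime"])   # administrative features only
y = df["committed_crime"]                  # FTA or new crime while on bail

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The talk reports that decision trees beat logistic regression here.
model = DecisionTreeClassifier(max_depth=20, random_state=0)
model.fit(X_train, y_train)

# Held-out performance on *released* defendants only; this is exactly the
# number that selective labels can make look too optimistic.
print("AUC on released defendants:",
      roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```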


Prediction Model
Top four levels of the decision tree predict the probability of NOT committing a crime.
Typical tree depth: 20-25

But there is a problem with this…

The Problem
§ Issue: The judge sees factors the machine does not
§ The judge makes decisions based on P(Y|X,Z)
  § X … prior crime history
  § Z … other factors not seen by the machine
§ The machine makes decisions based on P(Y|X)
§ Problem: The judge is selectively labeling the data based on P(Y|X,Z)!

Problem: Selective Labels
The judge is selectively labeling the dataset:

§ We can only train on released people
§ By jailing, the judge is selectively hiding labels!

[Figure: Judge → Release: crime outcome observed (Yes/No); Judge → Jail: crime outcome unobserved (?).]

Selection on Unobservables
Selective labels introduce bias:

§ Say young people with no tattoos have no risk of crime. The judge releases them.
§ But the machine does not observe whether defendants have tattoos.
§ So the machine would falsely conclude that all young people commit no crime.
§ And falsely presume that, by releasing all young people, it does better than the judge!
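A toy simulation makes this mechanism concrete. Everything below is invented for illustration (the "tattoo" unobservable, the risk coefficients, the release rule); it only shows how evaluating on the selectively labeled (released) population understates the true risk of a "release all young people" rule.

```python
# Toy simulation of selective labels with an unobservable Z ("tattoo").
# All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
young = rng.integers(0, 2, n)      # X: seen by judge and machine
tattoo = rng.integers(0, 2, n)     # Z: seen by the judge only

# Made-up ground truth: tattooed defendants are much riskier.
p_crime = 0.02 + 0.10 * (1 - young) + 0.40 * tattoo
crime = rng.random(n) < p_crime

# The judge uses Z: releases only defendants without tattoos.
released = tattoo == 0

# What the machine infers from labeled (released) data, using X only:
rate_young_released = crime[released & (young == 1)].mean()
print("observed crime rate of released young defendants:", rate_young_released)

# What would actually happen if the machine released *all* young defendants:
rate_young_all = crime[young == 1].mean()
print("true crime rate of all young defendants:", rate_young_all)
# The first number is far lower than the second: selective labels make the
# "release all young people" rule look much safer than it really is.
```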

Challenges: (1) Selective labels, (2) Unobservables Z

Solution: 3 Properties
Our solution depends on three key properties:
§ Multiple judges, with natural variability between judges
§ Random assignment of cases to judges
§ One-sidedness of the problem

Solution: Three Observations
1) The problem is one-sided:

§ Releasing a jailed person is a problem (we cannot observe what they would have done)
§ Jailing a released person is not (their outcome was observed)

[Figure: 2×2 table of judge decision × algorithm decision. When the judge released, the outcome is observed and the algorithm can be evaluated regardless of its decision (✓); the only problematic cell is when the judge jailed but the algorithm would release (✗), because that outcome is unobserved.]

Solution: Three Observations
2) In the Federal system, cases are randomly assigned:
§ This means that, on average, all judges have similar cases

Solution: Three Observations
3) Natural variability between judges:
§ Due to human nature, there will be some variability between judges
§ Some judges are more strict while others are more lenient

Solution: Contraction
Solution: use the most lenient judges!

§ Take the released population of a lenient judge
§ Ask which of those defendants the algorithm would jail to become less lenient
§ Compare the resulting crime rate to that of a strict human judge

Note: this does not rely on judges having "similar" predictions.

Solution: Contraction
Making lenient judges strict. What makes contraction work:
§ 1) Judges have similar cases (random assignment of cases)
§ 2) Selective labels are one-sided

[Figure: defendants ordered by predicted crime probability. Starting from those released by a lenient judge, the algorithm "jails" the riskiest of them to make the lenient judge more strict; the resulting crime rate is compared against a strict human judge.]
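Below is a hedged sketch of the contraction idea in Python. The data layout (a per-case table with `released`, `crime`, and model `risk` columns) and the helper name are assumptions for illustration; the actual procedure is described in Lakkaraju et al. (KDD 2017).

```python
# Hedged sketch of contraction (hypothetical column names and data layout).
import pandas as pd

def contraction_crime_rate(cases: pd.DataFrame, target_release_rate: float) -> float:
    """Estimate the crime rate the algorithm would achieve at a stricter
    release rate, using only the caseload of the most lenient judge(s).

    Expects columns: 'released' (bool), 'crime' (bool, defined only when
    released), 'risk' (model-predicted crime probability).
    """
    n_cases = len(cases)
    released = cases[cases["released"]].sort_values("risk")  # safest first
    # Keep the lowest-risk defendants until we hit the target release rate;
    # the rest are the ones the algorithm would additionally jail. This only
    # works if the lenient judge released at least that fraction of cases.
    n_keep = int(target_release_rate * n_cases)
    kept = released.iloc[:n_keep]
    # Outcomes are observed because the lenient judge really released them.
    return kept["crime"].mean()

# Usage sketch: compare with a strict judge who releases, say, 60% of cases.
# lenient = cases[cases["judge_id"].isin(most_lenient_judges)]
# print(contraction_crime_rate(lenient, target_release_rate=0.60))
# print(strict_judge_cases.loc[strict_judge_cases["released"], "crime"].mean())
```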

Evaluating the Prediction
Predictions create a new release rule:

§ Parameterized by the % released
§ For every defendant, predict P(crime)
§ Sort them by increasing P(crime) and "release" the bottom k
§ Note: this is different from AUC

What is the fraction-released vs. crime-rate tradeoff?
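As a sketch, this tradeoff can be traced by sorting defendants by predicted risk and "releasing" the bottom k, as below. This is the naive version computed on labeled data only; contraction is what makes the comparison valid under selective labels. The function name and signature are illustrative, not from the paper.

```python
# Sketch of the release-rule evaluation: release the k lowest-risk defendants
# and record the crime rate among them, for a range of release rates.
import numpy as np

def release_curve(p_crime: np.ndarray, crime: np.ndarray,
                  ks=np.linspace(0.1, 1.0, 10)):
    order = np.argsort(p_crime)            # lowest predicted risk first
    crime_sorted = crime[order]
    curve = []
    for k in ks:
        n_release = int(k * len(p_crime))
        released_crime = crime_sorted[:n_release]
        curve.append((k, released_crime.mean() if n_release else 0.0))
    return curve   # list of (release rate, crime rate among released)
```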

Overall Predicted Crime
[Figure: release rate vs. overall crime rate (includes FTA, NCA, NVCA) for the ML algorithm using all features, compared to the judges. Judges: 73% release rate, 17% crime rate. Holding the release rate constant, the crime rate is reduced by 0.102 (a 58.45% decrease). Other annotations on the figure: "Machine Learning 11%", "Release Rate 88%". Standard errors are too small to display. Decision trees beat logistic regression by 30%. Contrast with AUC/ROC.]

Predicted Failure to Appear
[Figure: release rate vs. FTA rate for the ML algorithm vs. the judges' decisions. Judges: 73% release rate, 10% FTA rate; the ML algorithm at the same release rate: about 4.1%. Holding the release rate constant, the FTA rate is reduced by 0.06 (a 60% decrease).]

Predicted Non-Violent Crime
[Figure: release rate vs. NCA rate. Judges: 73% release rate, 4.2% NCA rate; the ML algorithm at the same release rate: about 1.7%. Holding the release rate constant, the NCA rate is reduced by 0.025 (a 58% decrease). Standard errors are too small to display on the graph.]

Predicted Violent Crime
[Figure: release rate vs. NVCA rate. Judges: 73% release rate, 2.8% NVCA rate; the ML algorithm at the same release rate: about 1.4%. Holding the release rate constant, the NVCA rate is reduced by 0.014 (a 50% decrease). Standard errors are too small to display on the graph.]

Effect of Race

ML does not beat the judge by racially discriminating

Source of Errors
[Figure: judge release rate vs. predicted risk; judges err on the riskiest defendants.]

Predicted Risk vs. Jailing
[Figure: predicted risk vs. the judge's jailing propensity, from defendants least likely to be jailed by the judge to most likely.]

Why do Judges do Poorly?
The odds are against the algorithm: the judge sees many things the algorithm does not:

§ Offender history: observable by both
§ Demeanor ("I looked in his eyes and saw his soul"): unobservable by the algorithm

Can we diagnose judges' mistakes and help them make better decisions?

Why do Judges do Poorly?
Two possible reasons why judges make mistakes:

§ Misuse of observable features
  – These are the administrative features available to the algorithm, e.g. previous crime history
§ Misuse of unobservable features
  – Features not seen by the algorithm ("I looked in his eyes and saw his soul")

A Different Test
Predict the judge's decision:
§ Training data does NOT include whether the defendant actually commits a crime

This gives a release rule: release if the predicted judge would release.
§ This weights features the way the judge does

What does it do differently, then?
§ It does not use the unobservables!

Predicting the Judge
Build a model to predict the judge's behavior:
§ We predict what the judge will do (and NOT whether a defendant will commit a crime)

Mix the judge and the model:

(1 − γ) · Judge's real decision + γ · Predicted decision

Does this beat the judge (for some γ)?
§ The model will do better only if the judge misweighs unobservable features
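One way to read the mixing rule is as a convex combination of the judge's actual decision and the model-of-the-judge's release probability, turned into a release rule at a fixed release rate. The sketch below is one assumed implementation of such a blend, not the exact construction used in the paper; all names are illustrative.

```python
# Hedged sketch of the gamma-blend: (1 - gamma) * judge's real decision
# + gamma * predicted decision, turned into a release rule.
import numpy as np

def blended_release(judge_released: np.ndarray,   # 0/1 actual decisions
                    p_release_model: np.ndarray,  # model of the judge: P(release | X)
                    gamma: float,
                    release_rate: float) -> np.ndarray:
    score = (1 - gamma) * judge_released + gamma * p_release_model
    # Release the defendants with the highest blended release propensity,
    # holding the overall release rate fixed so comparisons are fair.
    n_release = int(release_rate * len(score))
    order = np.argsort(-score)
    released = np.zeros(len(score), dtype=bool)
    released[order[:n_release]] = True
    return released
```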

Artificial Judge Beats the Judge
The model of the judge beats the judge by 35%.
[Figure: performance along the blend from the pure judge (γ = 0) to the pure ML model of the judge (γ = 1).]

Putting It All Together
[Figure: percent change in FTA vs. percentage point change in release rate, comparing Judges, Lenient Judge + ML, Predicted Judge, and Random release rules.]

Recap so far…
The algorithm beats the human judge in the bail assignment problem.
The artificial judge beats the real judge; this means the judge is misweighing unobservable features.

Can we model human mistakes?

Modeling Human Decisions
[Figure: judge j decides on items i with true labels (no crime / crime); we observe judge j's decisions (release / jail) on each item.]

Modeling Human Decisions
[Figure: judge j, defendant i with true label t_i (no crime / crime), and the decision of judge j on defendant i.]

A judge's behavior is summarized by a confusion matrix (rows: truth, columns: decision):

                   decision "crime"   decision "no crime"
truth "crime"          p_c,c              p_c,n
truth "no crime"       p_n,c              p_n,n

For example, p_c,n is the probability that judge j decided "no crime" for a defendant who will do "crime".

Problem Setting
Input:
§ Judges and defendants described by features

Output:
§ Discover groups of judges and defendants that share common confusion matrices

Joint Confusion Model
[Figure: judge attributes determine a judge cluster c_j; item attributes determine an item cluster d_i; the pair (c_j, d_i) selects a confusion matrix (decision vs. truth) which, together with the item's true label z_i, generates the decision of judge j on item i. Similar judges share confusion matrices when deciding on similar items.]
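The actual model is a Bayesian latent-cluster model (Lakkaraju et al., SDM 2015). Purely to show the shape of the output, here is a much-simplified stand-in that clusters judges and defendants by their attributes with k-means and then tabulates an empirical confusion matrix per (judge-cluster, defendant-cluster) pair; the function name, argument layout, and clustering choice are all assumptions.

```python
# Much-simplified stand-in for the joint confusion model (not the Bayesian
# model from the paper): k-means clusters plus empirical confusion counts.
import numpy as np
from sklearn.cluster import KMeans

def cluster_confusions(judge_feats, item_feats, judge_ids, item_ids,
                       decisions, truths, k_judges=3, k_items=3):
    """decisions/truths: 0 = 'no crime', 1 = 'crime' (one entry per case).
    judge_ids/item_ids index rows of judge_feats/item_feats."""
    cj = KMeans(n_clusters=k_judges, n_init=10, random_state=0).fit_predict(judge_feats)
    di = KMeans(n_clusters=k_items, n_init=10, random_state=0).fit_predict(item_feats)
    conf = np.zeros((k_judges, k_items, 2, 2))
    for j, i, d, t in zip(judge_ids, item_ids, decisions, truths):
        conf[cj[j], di[i], t, d] += 1
    # Normalize rows so conf[a, b, t] is P(decision | truth = t) for the pair
    # of judge-cluster a and defendant-cluster b.
    return conf / conf.sum(axis=3, keepdims=True).clip(min=1)
```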

What do we learn on Bail?
Judge attributes:
§ Year, county, number of previous cases, number of previous felony cases, number of previous misdemeanor cases, number of previous minor cases

Defendant attributes:
§ Previous crime history, personal attributes (children/married etc.), social status (education, home ownership, moved a lot in the past 10 years), current offense details

Why judges make mistakes
Less experienced judges make more mistakes.
§ Left: judges with a high number of cases
§ Right: judges with a low number of cases
[Figure: confusion matrices (truth vs. decision) for the two groups; the high-caseload judges' matrix is close to the identity (entries around 0.9/0.1), while the low-caseload judges' matrix is much noisier (entries around 0.6/0.4).]

Why judges make mistakes
Judges with many felony cases are stricter.
§ Left: judges with many felony cases
§ Right: judges with few felony cases
[Figure: confusion matrices (truth vs. decision) for the two groups, with entries around 0.8/0.2 for one group and 0.5/0.5 and 0.3/0.7 for the other.]

Why judges make mistakes
§ Defendants who are single, committed felonies, and moved a lot are accurately judged.
§ Defendants who have kids are confusing to judges.
[Figure: confusion matrices (truth vs. decision) for the two groups; the first is close to the identity (entries around 0.9/0.1), while the second has all entries around 0.5.]

Conclusion
A new tool to understand human decision making. The framework can be applied to various other questions:

§ Decisions about patient treatments
§ Assigning students to intervention programs
§ Hiring and promotion decisions

Power of Algorithms
Algorithms can help us understand whether human judges are biased.
Algorithms can be used to diagnose the reason for bias.

Key insight:
§ More than prediction tools
§ They serve as behavioral diagnostic tools

Focusing on Decisions
Not just about prediction. The key is starting with the decision:

§ Performance benchmark: current "human" decisions
§ Not ROC but "human ROC"
§ What are we really optimizing?

Focusing on decisions also raises new concerns.

Reflections: New Concerns
1) Selective labels

§ We don't see the labels of people who are jailed
§ Extremely common problem: Prediction → Decision → Action
§ Which outcomes we see depends on our decisions

Reflections: New Concerns
2) Data respond to prediction

§ Whenever we make a prediction/decision, we affect the labels/data we see in the future
§ This creates a self-reinforcing feedback loop

Reflections: New Concerns
3) Constraints on input features
For example, race is an illegal feature:

§ Our models don't use it
But is that enough?
§ Many other characteristics correlate with race

How do we deal with this additional constraint?

Reflections: New Concerns
4) Omitted payoff bias

§ Is minimizing the crime rate really the right goal?
§ There are other important factors:
  – Consequences of jailing for the family
  – Jobs and the workplace
  – Future behavior of the defendant
§ How could we model these?

Conclusion
Bail is not the only prediction problem.
Many important problems in public policy are prediction problems!
There is potential for large impact.

References
§ The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables. H. Lakkaraju, J. Kleinberg, J. Leskovec, J. Ludwig, S. Mullainathan. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2017.
§ Human Decisions and Machine Predictions. J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, S. Mullainathan. NBER Working Paper, 2017.
§ A Bayesian Framework for Modeling Human Evaluations. H. Lakkaraju, J. Leskovec, J. Kleinberg, S. Mullainathan. SIAM International Conference on Data Mining (SDM), 2015.
§ Confusions over Time: An Interpretable Bayesian Model to Characterize Trends in Decision Making. H. Lakkaraju, J. Leskovec. Neural Information Processing Systems (NIPS), 2016.