Post on 18-Jun-2020

© 2016 IBM Corporation

The Practice of Data Science

Lisa Sokol, Ph.D.
lsokol@us.ibm.com


Look At All The Data

Let Data Lead the Way

Leverage Data as it is Captured

Changing the Way We Do Analytics


Basic Process

INPUT
− Understand problem and domain
− Ingest data
− Explore and understand data

ANALYSIS
− Transform: clean
− Transform: shape
− Create and build model
− Evaluate

OUTPUT
− Deliver and deploy model
− Communicate results


Work Task Example

Given that a person is arrested: who gets released on bond, and how fast?


Understand the Domain

• Analytics requires an understanding of the data and the judicial process
• Need to learn how a judge decides whether or not to allow bond
  − SMEs indicate judicial bond decisions are based on:
    • “Threat to community” = qualitative assessment (current charges + past charges + timeline)
    • Ties to community


The Hunt for Data -- The Rap Sheet

• Charges
  − Criminal code: thousands of numbers
  − Time/date of arrest
  − Sentenced or not (sometimes) / released or not
• Personal information
  − Dirty and incomplete
• Arresting organization


• Acquiring rap sheet data
  − Access required all sorts of agreements
  − Different jurisdictions, different content and form
  − Task requirement: metadata mapping and integration
    • Consistent crime codes (NCIC)


Explore and Understand the Data

• Analyze variables
  − Values, max, min, number of variables, coverage or % missing data, distribution shapes, etc.
  − Outliers
  − Anomalous values
  − Number of days from arrest to adjudication
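The checks listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the project: the field name `days_to_adjudication`, the sample values, and the 1.5 × IQR outlier rule are all assumptions made for the example.

```python
from statistics import quantiles

def profile(records, field):
    """Summarize one numeric variable: coverage, range, and IQR-based outliers."""
    values = [r[field] for r in records if r.get(field) is not None]
    missing_pct = 100.0 * (len(records) - len(values)) / len(records)
    # Quartiles of the observed values; the 1.5 * IQR fence is an assumed rule.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "missing_pct": missing_pct,
        "min": min(values),
        "max": max(values),
        "outliers": [v for v in values if v < low or v > high],
    }

# Invented sample: days from arrest to adjudication, with one gap and one suspect value.
records = [{"days_to_adjudication": d}
           for d in (12, 30, 45, 28, 33, None, 41, 2900, 37, 25)]
summary = profile(records, "days_to_adjudication")
print(summary)
```

A value such as 2900 days is flagged for review, not dropped automatically: whether it is an error or a real case is a domain question.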


Data Transformation

• To clean or not to clean
  − Strategic decision
  − Decision criteria
• Identity normalization
  − Alias challenge
  − Alias challenge as it relates to data science
    • Model creation
    • Model score


Data Transformation: Enriching the Data by Adding Context

Context: the cumulative history derived from data observations about entities

• Example: safety of firefighters
  − Current environmental temperature
  or
  − Current environmental temperature and temperature history for that person
  or
  − Current environmental temperature, temperature history for that person, and how long it will take to exit the building


Data Transformation: Threat

NCIC Charge Code    NCIC Charge Category    NCIC Charge
101                 Sovereignty             Treason
105                 Sovereignty             Sedition

• There are several thousand codes
  − Which codes are considered a “threat”?
  − How do codes compare in “threat”?
  − How do you combine codes?
  − How do you figure in the temporal aspect of crime?


Scoring: Threat to Community

• Two main components: scoring of individual charges and scoring of crime history
• Scoring of each charge is derived from two parameters
  − A loss-of-memory parameter, which determines how fast the severity of the charge declines over time (this parameter might be zero)
  − A lack-of-forgiveness parameter, which determines what proportion of the original severity level remains forever
• Scoring of crime history
  − Scores of each charge/conviction are accumulated (the model determines how)

Scoring flow:
  − For each crime, look up the scoring parameters and the time the crime was committed, and evaluate the individual crime score
  − Submit all of those scores to the cumulative history scoring function
  − Result: the offender's threat to the community
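One way to read the two parameters is as a decay rate and a floor. The sketch below is only an illustration under that reading: the slide does not give the formula, so the exponential decay, the summation used to accumulate history, and every numeric value in `SCORING_PARAMS` are assumptions. Only the NCIC codes 101 and 105 come from the deck.

```python
import math

# Hypothetical parameters per NCIC code:
# (severity, loss_of_memory, lack_of_forgiveness). All numbers are invented.
SCORING_PARAMS = {
    "101": (10.0, 0.0, 1.0),  # Treason: zero decay, full severity remains
    "105": (8.0, 0.1, 0.5),   # Sedition: decays toward half its severity
}

def charge_score(code, years_since):
    """Score one charge: severity decays over time toward a permanent floor."""
    severity, loss_of_memory, lack_of_forgiveness = SCORING_PARAMS[code]
    floor = lack_of_forgiveness * severity  # proportion that remains forever
    # Assumed form: exponential decay of the part of severity above the floor.
    return floor + (severity - floor) * math.exp(-loss_of_memory * years_since)

def threat_to_community(history):
    """Accumulate charge scores. Plain summation is an assumption; the deck
    says only that the model determines how scores are combined."""
    return sum(charge_score(code, years) for code, years in history)

# history: (NCIC code, years since the charge)
print(threat_to_community([("105", 0.0), ("105", 12.0), ("101", 30.0)]))
```

With a loss-of-memory value of zero the score never declines, which matches the slide's note that the parameter might be zero.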


Data Transformation

• Target variable: time to release
  − The time-to-release variable is obtained by subtracting the booking time stamp from the release time stamp
• Counting
  − Total number of a type of crime
  − Total number of a specific threat-to-community grouping
• Distance variables
  − Compare ZIP codes of booking location and arrestee’s home to determine whether the arrestee is “local” to the booking locality
• Date stamp variables
  − Did the booking date stamp occur on a weekend?
  − Did the booking date stamp occur on a holiday, during a holiday period, or just before a holiday?
• Time of day
  − Early in shift? Late in shift?
• Net: about 1,600 variables created
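A few of these derived variables can be sketched as follows. The function name, the record layout, the timestamp format, and the "same ZIP code means local" rule are assumptions for illustration; the project's actual definitions are not given in the deck, and holiday and shift flags are left out here because they depend on a calendar and shift schedule the slides do not specify.

```python
from collections import Counter
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M"  # assumed timestamp layout

def derive_features(booking_ts, release_ts, booking_zip, home_zip, prior_codes):
    """Build a handful of derived variables for one arrest record."""
    booked = datetime.strptime(booking_ts, TS_FORMAT)
    released = datetime.strptime(release_ts, TS_FORMAT)
    return {
        # Target variable: release time stamp minus booking time stamp.
        "hours_to_release": (released - booked).total_seconds() / 3600.0,
        # Date stamp variable: Monday is 0, so 5 and 6 are Saturday and Sunday.
        "booked_on_weekend": booked.weekday() >= 5,
        # Distance variable: exact ZIP match is a simplifying assumption.
        "is_local": booking_zip == home_zip,
        # Time of day, which a shift schedule could later bucket.
        "booking_hour": booked.hour,
        # Counting: number of prior charges per charge code.
        "prior_counts": Counter(prior_codes),
    }

features = derive_features("2016-03-05 23:30", "2016-03-06 09:00",
                           "20001", "20001", ["101", "105", "105"])
print(features)
```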

© 2016 IBM Corporation14

Modeling Process is Iterative

− Divide data set into 3 segments
− Predictive modeling algorithm: train model
− Evaluate and tweak model
− Score and assess model
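A three-way split might look like the sketch below. The slide says only that the data set is divided into 3 segments; the segment names (train, validation, test), the 60/20/20 proportions, and the fixed seed are assumptions, following the common practice of training on one segment, tweaking against a second, and doing the final assessment on a third.

```python
import random

def split_three_ways(rows, train=0.6, validate=0.2, seed=0):
    """Shuffle rows reproducibly and cut them into three disjoint segments."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validate = round(len(shuffled) * validate)
    return (
        shuffled[:n_train],                        # train the model
        shuffled[n_train:n_train + n_validate],    # evaluate and tweak
        shuffled[n_train + n_validate:],           # final score and assessment
    )

train_set, validation_set, test_set = split_three_ways(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

Keeping the third segment untouched until the end is what makes the final assessment an honest estimate.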


Picking a Model

• Target variable characteristics (binary, continuous, etc.) typically dictate model selection
• Model selection
  − Assessment via accuracy and error
  − Different models can select different variables as predictors


Predictive Modeling Environment


Model – Decision Tree


Scoring - How good is the Model?

• Mission dictates model accuracy requirements
• Lots of different measurements of goodness
  − Model confidence
  − Two types of error
    • Number of people who were predicted to be released AND were not
    • Number of people who were not predicted to be released AND were
  − A number of other scoring mechanisms
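The two error types can be counted directly from predicted and actual outcomes. The sketch below is illustrative only: the function name, the dictionary keys, and the sample predictions are invented, and the target is treated as binary (released or not).

```python
def release_errors(predicted, actual):
    """Count both error types and overall accuracy for a binary release target."""
    pairs = list(zip(predicted, actual))
    # Predicted to be released AND were not.
    predicted_not_released = sum(1 for p, a in pairs if p and not a)
    # Not predicted to be released AND were.
    unpredicted_released = sum(1 for p, a in pairs if not p and a)
    correct = len(pairs) - predicted_not_released - unpredicted_released
    return {
        "predicted_released_but_not": predicted_not_released,
        "not_predicted_but_released": unpredicted_released,
        "accuracy": correct / len(pairs),
    }

# Invented example: True means "released".
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]
report = release_errors(predicted, actual)
print(report)
```

Which of the two errors matters more depends on the mission, which is why a single accuracy figure is rarely enough.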


Disappointment

• Horrible accuracy and error
• Re-think assumptions
• Aha moment


Deployment

• Models (or rules) get deployed to the mission environment
  − Can deploy more than one model
• Models should exploit new data as it arrives
• Predictive power of models must be monitored over time
  − Develop thresholds that define the limits of allowable model variance; if the model exceeds a threshold, it must be recalibrated
  − Need to establish a monitoring mechanism
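A monitoring mechanism of this kind could be as simple as comparing recent accuracy with the accuracy measured at deployment. In the sketch below, the metric (accuracy), the weekly cadence, the 0.05 allowed drop, and the function names are all assumptions chosen for illustration; the deck does not say how the thresholds were defined.

```python
def needs_recalibration(baseline_accuracy, recent_accuracy, allowed_drop=0.05):
    """True when accuracy has fallen below the baseline by more than the allowed drop."""
    return (baseline_accuracy - recent_accuracy) > allowed_drop

def monitor(baseline_accuracy, periodic_accuracy, allowed_drop=0.05):
    """Return the 1-based periods in which the model breached the threshold."""
    return [
        period
        for period, accuracy in enumerate(periodic_accuracy, start=1)
        if needs_recalibration(baseline_accuracy, accuracy, allowed_drop)
    ]

# Invented weekly accuracy figures measured on newly arriving, labeled data.
flagged = monitor(0.80, [0.81, 0.78, 0.74, 0.70])
print(flagged)
```

Any flagged period would trigger the recalibration step described above.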