
Asymmetric Loss Functions for Forecasting in Criminal Justice Settings∗

Richard Berk
Department of Statistics

Department of Criminology
University of Pennsylvania

April 19, 2010

Abstract

The statistical procedures typically used for forecasting in criminal justice settings rest on symmetric loss functions. For quantitative response variables, overestimates are treated the same as underestimates. For categorical response variables, it does not matter in which class a case is inaccurately placed. In many criminal justice settings, symmetric costs are not responsive to the needs of stakeholders. It can follow that the forecasts are not responsive either. In this paper, we consider asymmetric loss functions that can lead to forecasting procedures far more sensitive to the real consequences of forecasting errors. Theoretical points are illustrated with examples using criminal justice data of the kind that might be used for “predictive policing.”

1 Introduction

Forecasting has been an important part of criminal justice practice since the early 20th century. Much of the initial work was undertaken for parole prediction (Burgess, 1928; Borden, 1928), which later evolved into forecasting re-offending more generally (Glueck and Glueck, 1950). Early in the post-war era, a number of key methodological concerns were raised in a manner that is still insightful (Ohlin and Duncan, 1949; Goodman, 1952; 1953a; 1953b). The need to evaluate forecasting accuracy with real forecasts, not fitted values, is one example.

∗Thanks go to Roger Koenker for several helpful e-mail exchanges on his program in R for quantile regression. Thanks also go to Larry Brown and Andreas Buja for suggestions on statistical inference for quantile regression. The work reported in this paper was supported by a grant from the National Institute of Justice Predictive Policing Demonstration and Evaluation Program.

Over the past 30 years, criminal justice forecasting applications have become more varied: forecasts of crime rates (Blumstein and Larson, 1969; Cohen and Land, 1987; Eckberg, 1995), prison populations (Berk et al., 1983; Austin et al., 2007), repeat domestic violence incidents (Berk et al., 2005), and others. New methodological issues continue to surface, and in that context it is not surprising that the performance of criminal justice forecasts has been mixed at best (Farrington, 1987; Gottfredson and Moriarty, 2006). Complaints about “weak models” are an illustration.

In this paper, we consider a somewhat unappreciated statistical feature of criminal justice forecasts: the role of loss functions. The most popular loss functions are symmetric. For quantitative response variables, the direction of the disparity between a forecast and what eventually occurs does not matter. Forecasts that are too small are treated the same as forecasts that are too large. In many policy settings, symmetric loss is not responsive to the costs associated with actions that depend on the forecasts. The recent interest in “predictive policing” (Bratton and Malinowski, 2008) is one instance. Is an overestimate of the number of future robberies in a neighborhood as costly as an underestimate? In one case, too many resources may be allocated to a particular area. In the other case, that area is underserved. The first may be wasteful, while the second may lead to an increase in crime. The consequences could be rather different, and the resulting costs rather different as well. Yet this point is overlooked in some very recent and sophisticated criminal justice forecasting (Cohen and Gorr, 2006).

Analogous issues can arise when the response variable is categorical. The class into which an observation is incorrectly forecasted does not matter. If, for example, the outcome to be forecasted is whether an individual on parole or probation will commit a homicide, false positives can have very different consequences from false negatives (Berk et al., 2009). A false positive can mean that an individual is incorrectly labeled as “high risk.” A false negative can mean that a homicide that might have been prevented is not. There is no reason to assume that the costs of these two outcomes are even approximately the same. Qualitative outcomes can also play a role in predictive policing.

In the pages ahead, we expand on these illustrations. We look more formally at forecasting loss functions, consider the potential problems associated with symmetric loss, and examine some statistical forecasting procedures that build on asymmetric loss functions. Loss functions for quantitative and categorical response variables are considered. Several examples from real data are used to illustrate key points. The overall message is straightforward: when the costs of forecasting errors are asymmetric and an asymmetric loss function is properly deployed, forecasts will more accurately inform criminal justice activities than if a symmetric loss function had been used instead.

2 An Introduction to Loss Functions

All model-based statistical forecasts have loss functions built in. The loss functions are introduced as part of the model-building process and generally carry over to the forecasts that follow. Occasionally, a new loss function is introduced at the forecasting stage.

Consider a quantitative response variable Y and a set of regressors X.1

There is a statistical procedure that computes some function of the regressors, f(X), minimizing a particular loss function L(Y, f(X)). For squared error, such as is used in least squares methods,

L(Y, f(X)) = (Y − f(X))^2. (1)

For absolute error such as used in quantile regression,2

L(Y, f(X)) = |Y − f(X)|. (2)

Sometimes these loss functions are minimized directly.3 Least squares methods for quadratic loss are the obvious example. An illustration is shown in Figure 1. Note that the loss for a residual value of 3 is the same as the loss for a residual value of -3.
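As a quick numerical check, the symmetry of both loss functions is easy to verify. The sketch below is illustrative Python, not code from the paper:

```python
def quadratic_loss(y, fitted):
    """Squared-error (L2) loss of equation 1: the residual's sign is lost."""
    return (y - fitted) ** 2

def absolute_loss(y, fitted):
    """Absolute-error (L1) loss of equation 2: likewise symmetric."""
    return abs(y - fitted)

# Residuals of 3 and -3 incur identical loss under both functions.
print(quadratic_loss(10, 7), quadratic_loss(10, 13))  # 9 9
print(absolute_loss(10, 7), absolute_loss(10, 13))    # 3 3
```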

1The bold font denotes a matrix. For ease of exposition, we are not including lagged values of the response in X. For this discussion, there is no important loss of generality.

2Absolute error, also known as L1 loss, comes up in a variety of topics within theoretical statistics (e.g., Bickel and Doksum, 2001: 18-19). Quadratic loss is sometimes called L2 loss.

3Equations 1 and 2 are notational. They are not expressions for how the actual loss value would be computed.


Sometimes loss functions are minimized indirectly. Maximum likelihood procedures applied to linear regression with normally distributed disturbances are the obvious example. We are concerned in this paper with the loss function itself, not the methods by which it is minimized.

As already noted, there can on occasion be one loss function for the forecasting model and another loss function for the forecast itself. For example, one might use quadratic loss for normal regression and then, once the conditional distribution of the forecast is determined, apply linear loss. Instead of forecasting the conditional mean, one might forecast the conditional median.

[Figure: quadratic loss plotted against residual values ranging from -3 to 3; axes labeled “Residual Value” and “Loss Value”]

Figure 1: An Example of A Quadratic Loss Function

When the response variable is categorical, the ideas are the same, but the details differ substantially. Drawing on the exposition of Hastie and his colleagues (2009: 221), consider a categorical response variable G with 1, 2, . . . , K categories, sometimes called classes. There is a function of the regressors, f(X), that transforms the values of the regressors into the probability p_k of a case falling into class k. There is another function of the regressors, G(X), that uses p_k to place each case in a class. Thus, there can in principle be two different loss functions. With logistic regression, for example, f(X) is a linear combination of regressors that computes fitted probabilities. Then, G(X) assigns a class to each case using some threshold on those fitted probabilities (e.g., assign the case to the class for which it has the highest probability). Two different loss functions can be employed.

Alternatively, the procedure goes directly to G(X). There is no intervening function. Then, there is a single loss function. Nearest neighbor methods as commonly used are one example, although one-step approaches usually can be reformulated, if necessary, as two-step approaches.

Two common loss functions for categorical response variables are

L(G, G(X)) = I(G ≠ G(X)), (3)

where I() denotes an indicator variable,4 and

L(G, p(X)) = −2 ∑_{k=1}^{K} I(G = k) log p_k(X). (4)

The first loss function is called “0-1 loss,” which codes whether a classification error has been made. The sum of the classification errors is commonly transformed into a measure of forecasting accuracy (e.g., the proportion of cases correctly forecasted).5

The second loss function is −2 × the log likelihood and is called the “deviance” or “cross-entropy loss.”6 The deviance is often reported for generalized linear models. For the special case of normal regression, the error sum of squares is the deviance.
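For a single case, the two loss functions can be sketched in a few lines of illustrative Python (the class labels and probabilities below are hypothetical, not from the paper):

```python
import math

def zero_one_loss(actual, assigned):
    """0-1 loss (equation 3): 1 for any misclassification, whatever the class."""
    return 1 if actual != assigned else 0

def deviance(actual, probs):
    """Deviance / cross-entropy (equation 4): -2 times the log of the
    probability the model assigned to the case's actual class."""
    return -2.0 * math.log(probs[actual])

# Hypothetical two-class example: which wrong class a case lands in never
# changes the 0-1 loss, and the deviance depends only on p_k.
print(zero_one_loss("fail", "succeed"))                             # 1
print(round(deviance("fail", {"fail": 0.25, "succeed": 0.75}), 3))  # 2.773
```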

Each of the loss functions shown in equations 1 through 4 is an example of symmetric loss. For equation 1, squaring removes the sign of the residuals. For equation 2, the sign is removed by computing the absolute value of the residual. In both cases, once the sign is removed, fitted values that are too large are treated the same as fitted values that are too small. For equation 3, all misclassifications are treated the same regardless of the incorrect class into which a case is placed. For equation 4, all classes are treated the same with respect to classification errors because the weight given to each class is solely determined by the probability of placement in that class. And that probability has no necessary connection to the costs of classification errors.

4In this case, the indicator operator returns a value of 1 when a case is placed in an incorrect class, and 0 otherwise.

5This measure assumes that the costs of false positives and false negatives are the same.

6Again, these equations are notational, not computational.


Statistical procedures that rest upon the loss functions briefly introduced above, and all other common loss functions (e.g., McCullagh and Nelder, 1989: 34; Hastie et al., 2009: 343-350), assume symmetric loss. It follows that symmetric loss functions dominate statistical practice. Forecasting applications are no different. Yet, as Nobel Laureate Clive Granger observed long ago for a quadratic loss function C(e) used in forecasting applications, “The obvious problem with this choice for C(e) is that it is a symmetric function, whereas actual cost functions are often nonsymmetric” (Granger, 1989: 15).7 Granger goes on to explain that it can be difficult for a stakeholder to specify a preferred loss function. And even if that were possible, there were at the time some rather daunting technical obstacles to overcome before asymmetric loss functions could be readily implemented.

Over the past 20 years, there have been some important statistical developments that can make asymmetric loss functions practical, and some of these require less of stakeholders than Granger assumed. We turn to a simple but powerful example.

3 Quantile Regression and Beyond

Consider a quantitative random response variable Y. Drawing on Koenker’s exposition (2005: Section 3.1), Y has a continuous distribution function so that

F(y) = P(Y ≤ y). (5)

F(y) is the cumulative distribution function for Y. Then for 0 < τ < 1 and an inverse function F^{-1}(),8

F^{-1}(τ) = inf{y : F(y) ≥ τ}. (6)

The τth quantile is a value of Y. It is the smallest value of y for which the probability that Y is less than or equal to y is at least τ. In the most common instance, τ = .50, which represents the median. Stated less formally, the median is the value of y beneath which Y falls with probability .50.

7Working within econometric traditions, Granger prefers the term cost function to the term loss function, but in this context, these are just two names for the same thing. Likewise, “nonsymmetric” means the same thing as “asymmetric.”

8An inverse function f^{-1}(), when it exists, is defined so that if f(a) = b, then f^{-1}(b) = a.


All other values of τ have analogous interpretations. Common values of τ are .25, symbolized by Q1, and .75, symbolized by Q3. These are called the “first quartile” and “third quartile” respectively. The median can be called the “second quartile.”

Let d_i = y_i − f_τ(y_i), where f_τ(y_i) is a potential estimate of the τth quantile; d_i is a conventional deviation score but computed for a quantile, not the mean. The quantile loss function is then

L(Y, f_τ(y)) = ∑_{i=1}^{n} d_i (τ − I(d_i < 0)), (7)

where, as before, I() is an indicator function, and the sum is over all cases in the data set. The indicator variable is equal to 1 if the deviation score is negative, and 0 otherwise. So, each term on the right hand side is equal to the deviation score multiplied by either τ or τ − 1. This is how τ directly affects the loss.

Suppose τ = .50. Then according to equation 7, the loss for deviation score d_i = 3 is 1.5. If the deviation score is d_i = −3, the loss is also 1.5. But suppose that τ = .75. Now the loss for d_i = 3 is 2.25, and the loss for d_i = −3 is .75.

When the median is the target, the sign of the deviation score does not matter. The loss is the same. But when the third quartile is the target, the sign matters a lot. Positive deviation scores are given 3 times more weight than negative deviation scores as the sum over the n cases is being computed. In other words, for all quantiles other than the median, the loss function is asymmetric. Moreover, one can alter the weight given to the deviation scores by the choice of τ. Thus, by choosing the third quartile, positive deviation scores are given 3 times the weight of negative deviation scores. Had one chosen the first quartile, positive deviation scores would be given 1/3 the weight of negative deviation scores. Values of τ > .5 make positive deviations more costly than negative deviations. Values of τ < .5 make negative deviations more costly than positive deviations.
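The arithmetic above can be verified with a few lines of illustrative Python (not from the paper), implementing the check-loss term of equation 7:

```python
def quantile_loss(d, tau):
    """Check loss of equation 7 for a single deviation score d:
    d * (tau - I(d < 0))."""
    return d * (tau - (1 if d < 0 else 0))

# Reproduces the worked numbers in the text:
print(quantile_loss(3, 0.50), quantile_loss(-3, 0.50))  # 1.5 1.5
print(quantile_loss(3, 0.75), quantile_loss(-3, 0.75))  # 2.25 0.75

# For tau = .75, positive deviations get weight .75 and negative
# deviations weight .25 -- a 3-to-1 asymmetry.
```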

Figure 2 illustrates an asymmetric linear loss function when the quantile is less than .50. The loss for negative deviations increases much faster than the loss for positive deviations. Overestimates are more costly than underestimates. Figure 3 illustrates an asymmetric linear loss function when the quantile is greater than .50. The loss for positive deviations increases much faster than the loss for negative deviations. Underestimates are more costly than overestimates.9

[Figure: V-shaped linear loss plotted against the deviation score, with the arm for negative deviations rising more steeply]

Figure 2: An Example of An Asymmetric Linear Loss Function for τ < .50

The same principles can apply to regression when the target is a conditional quantile. Such quantiles are computed as a function of predictors so that the forecasts fall on a hyperplane in the space defined by the predictors and the response.10 We are basically back to equation 2. In equation 7, f_τ(X) replaces f_τ(y), and instead of talking about deviation scores, we talk about residuals. Looked at another way, in broad brush strokes this is just like least squares regression, but it is not the sum of the squared residuals that is minimized. It is the sum of the absolute values of the residuals, with the option of treating negative residuals differently from positive residuals.
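To see concretely why minimizing equation 7 targets a quantile, consider a small pure-Python sketch (illustrative; the data values are made up): over a sample, the value minimizing the summed check loss at τ = .75 is the sample third quartile.

```python
def total_quantile_loss(data, f, tau):
    """Sum of equation-7 losses when f is used as the fitted quantile."""
    return sum((y - f) * (tau - (1 if y - f < 0 else 0)) for y in data)

data = [2, 4, 5, 7, 9, 11, 15]
best = min(data, key=lambda f: total_quantile_loss(data, f, 0.75))
print(best)  # 11, the smallest y with F(y) >= .75 in this sample
```

Restricting the search to the observed values is enough here because the summed check loss is piecewise linear with breakpoints at the data points.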

9When τ = .50, the loss function is a “perfect” V such that both arms of the V have the same angle with the horizontal axis.

10In the simplest case of a single regressor, the conditional quantiles are assumed to fall on a straight line.

[Figure: V-shaped linear loss plotted against the deviation score, with the arm for positive deviations rising more steeply]

Figure 3: An Example of An Asymmetric Linear Loss Function for τ > .50

There is a large literature on quantile regression. Gilchrist (2008: 402) points out that quantile regression has its roots in work by Galton in the 1880s, but that Galton’s work was largely ignored until more recent papers by Parzen (e.g., Parzen, 1979). It then fell to Koenker and Bassett (1978) and Koenker (2005) to provide a full-blown, operational regression formulation. A number of important extensions followed (e.g., Oberhofer, 1982; Tibshirani, 1996; Kriegler, 2007). The general properties of quantile regression are now well known (Berk, 2004: 26). Britt (2008) offers a criminal justice application in which sentencing disparities are analyzed.

In contrast, the literature on forecasting with quantile regression is relatively sparse, and there the emphasis has been primarily on the benefits when conditional quantiles are used rather than the conditional mean (Koenker and Zhao, 1996; Tay and Wallius, 2000). The loss function follows from the conditional quantile to be forecasted. That reasoning legitimately can be reversed: the relative costs of forecasting errors are first determined, from which one arrives at the appropriate quantile to be forecasted. That is the stance taken in this paper.

When quantile regression is used as just described, it is a one-step procedure. A forecast is just a conditional quantile, much like a fitted value. For the forecast, the outcome is simply not yet known. The same loss function is used for both the fitted values and the forecast. However, one can forecast quantiles in a two-step manner that in some situations provides greater precision. For example, one can apply normal regression to forecast the full conditional distribution of the response and then use as the forecast the conditional third quartile of that conditional distribution (Brown et al., 2005). The first step employs quadratic loss. The second step employs linear loss.
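The two-step idea can be sketched in a few lines of illustrative Python. The conditional mean and residual standard deviation below are hypothetical stand-ins for normal-regression output, not values from the paper:

```python
from statistics import NormalDist

# Hypothetical numbers standing in for step one (normal regression):
conditional_mean = 46.0   # fitted conditional mean of next week's count
residual_sd = 7.0         # estimated residual standard deviation

# Step two: instead of forecasting the conditional mean, read off the
# conditional third quartile of the (normal) forecast distribution.
z_75 = NormalDist().inv_cdf(0.75)   # roughly 0.674
q3_forecast = conditional_mean + z_75 * residual_sd
print(round(q3_forecast, 2))  # about 50.72
```

Step one uses quadratic loss to fit the model; step two applies linear loss by choosing which quantile of the forecast distribution to report.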

3.1 An Illustration

Quantile regression can be a useful forecasting tool for the amount of crime in a particular geographical area. In this instance, the data come from the Central police reporting district in the city of Los Angeles. There are over 4 years of data on the number of violent crimes aggregated up to the week. One might want to forecast the amount of violent crime a week in advance for that reporting district. District commanders often allocate law enforcement resources on a week-to-week basis.

Predictors include earlier crime counts for the Central Reporting District and earlier crime counts for the four contiguous reporting districts: Newton, Rampart, North East, and Hollenbeck. All of the predictors are lagged because the goal is forecasting. From what is known this week and perhaps earlier, the predicted amount of crime can be used to anticipate what may happen the following week.

The lagged values for the Central Reporting District can capture important temporal trends. The lagged values for the other four reporting districts can capture spatial dependence, such as offenders working across reporting district boundaries. With longer predictor lags (e.g., a month), more distant forecasting horizons can be employed. In principle, other predictors could be included.

Any of several conventional regression forecasting procedures could be used, such as those from the general linear model, ARIMA models, the generalized additive model, and more. But all assume quadratic loss. Here, this means that crime forecasts that are too high are treated the same as crime forecasts that are too low. In practice, this is probably unreasonable. Forecasts that are too high may lead to an allocation of resources to where they are not really needed. Forecasts that are too low may lead to an increase in crime that could have been prevented. There is no reason to assume that the costs of these outcomes are the same. Indeed, informal discussions with police officials in several cities clearly indicate that they are not.

One can take an important aspect of such costs into account if it is possible to specify the relative costs of overestimates and underestimates. In many instances, relative costs can be elicited from stakeholders, at least to the satisfaction of those stakeholders (Berk et al., 2009; Berk, 2009). Moreover, if one proceeds with conventional quadratic loss, equal relative costs are automatically being assumed. These also need to be explicitly justified. There is really no way to avoid the issue of relative costs.

Consider the three cost ratios of overestimates to underestimates used earlier: 3 to 1, 1 to 1, and 1 to 3.11 One quantile regression analysis was undertaken for each cost ratio. The first implies fitting the conditional first quartile (Q1|X). The second implies fitting the conditional second quartile (Q2|X), also known as the median. The third implies fitting the conditional third quartile (Q3|X).
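The mapping from a stated cost ratio to the quantile to fit can be written compactly. The formula below is implied by the examples in the text (a 3-to-1 cost ratio of overestimates to underestimates corresponds to Q1, and footnote 11's 1-to-4 ratio to a quantile of .8); the Python is an illustrative sketch, not code from the paper:

```python
def tau_from_costs(cost_over, cost_under):
    """Quantile implied by the relative costs of over- and underestimates:
    tau = cost_under / (cost_under + cost_over)."""
    return cost_under / (cost_under + cost_over)

print(tau_from_costs(3, 1))  # 0.25 -> fit Q1
print(tau_from_costs(1, 1))  # 0.5  -> fit the median
print(tau_from_costs(1, 3))  # 0.75 -> fit Q3
print(tau_from_costs(1, 4))  # 0.8, matching footnote 11's 1-to-4 example
```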

Variables     Q1 Coefficients   Q2 Coefficients   Q3 Coefficients
Intercept          15.04             18.98             19.53
Newton              0.28              0.34              0.27
Rampart            -0.01             -0.01              0.01
North East         -0.19              0.12              0.21
Hollenbeck          0.04              0.10              0.24
Central             0.28              0.29              0.31

Table 1: Regression Coefficients for Quantile Regression Forecasting Models of Violent Crime in the Central Reporting District Using One-Week Lags for All Predictors (coefficients with p-values < .05 for the usual null hypothesis in bold font)

Table 1 shows the three sets of regression coefficients and the results of bootstrapped statistical tests. Each regression coefficient represents the change in the conditional quantile for a unit change in its associated reporting district regressor. For these data, a unit change is one crime. Thus, for every three additional violent crimes per week in the Central Reporting District, there is a little less than one additional violent crime the following week. The lagged values for violent crime in the Newton Reporting District have a similar interpretation. No causal inferences need be made, in part because the three quantile regressions are not intended to be causal models. Moreover, precisely why there is apparently a relatively strong link between violent crime in the Central Reporting District and violent crime in the Newton Reporting District for all three regressions should be explored, but it is not a salient matter here. The intent is to concentrate on forecasting.

11Any other cost ratios are fair game. For example, using a quantile of .8 would imply relative costs of 1 to 4 for overestimates compared to underestimates.

Table 1 also reports the test results for the usual null hypothesis that the parameter value for any given regression coefficient is zero. There are several different methods to arrive at reasonable standard errors for quantile regression coefficients, but no consensus yet on which is best (Koenker, 1994: section 3.10). For these data, the residuals are effectively independent with constant variance, given the past values of the regressors, so most of the available methods should work in principle. And in fact, all tests conditional on the past values of the regressors lead to effectively the same results.12 It is important to keep in mind, however, that the reported test results can depend heavily on the regression coefficient estimates being unbiased. We make no such claims here, and will return to these issues shortly.

Many of the regression coefficients are much the same for the three regressions, implying roughly parallel sets of fitted values.13 Does the quantile chosen then make a difference in the one-week-ahead fitted values for the Central Police Division? One could reasonably expect to see meaningful differences in at least the fitted values because the intercept estimates in Table 1 vary from a little over 15 crimes to a little less than 20 crimes.

Figure 4 shows the Central Reporting District violent crime time series by week with the three sets of fitted values overlaid. Each set of fitted values has a correlation of nearly .70 with its observed crime counts. The fit quality implies that there is some important structure in the violent crime time series for the Central Reporting District. The violent crime patterns over weeks are not just white noise. This is very important because otherwise one can do no better than forecasting an unconditional quantile (e.g., the overall median).

[Figure: weekly violent crime counts for roughly 400 weeks, with the Q1 (green), Q2 (blue), and Q3 (red) fitted-value series overlaid]

Figure 4: Time Series of Violent Crime with Three Sets of Fitted Values Overlaid

The fitted values from the Q1 regression are shown in green. The fitted values from the Q2 regression are shown in blue. They assume equal costs and represent forecasting business as usual. The fitted values from the Q3 regression are shown in red. One can easily see how the fitted values are shifted upward as one moves from Q1 to Q2 to Q3. This is consistent, respectively, with overestimates being 3 times more costly than underestimates for Q1, overestimates being equally costly as underestimates for Q2, and overestimates being 1/3 as costly as underestimates for Q3. The quantile used really matters for the fitted values. Therefore, it should substantially matter for forecasts.14

12There are a number of very interesting bootstrap options that were also examined (Kocherginsky et al., 2005; He and Hu, 2002; Parzen et al., 1994; Bose and Chatterjee, 2003). Again, conditioning on “history,” the results were pretty much the same. One conditions on history because of the forecasting application. Given what is known now, what do we anticipate a week from now?

13In general, this need not be the case. Regression coefficients can vary substantially depending on the quantile used.

A week’s worth of new data was appended to the existing data set, and one forecast for each quantile regression was made using the existing model; the model’s parameters were not re-estimated with the new data included. For this demonstration, the one-week-ahead violent crime count is known.

14The fitted values are not forecasts even though they are one-week-ahead conditional quantiles. The one-week-ahead observed values of the response variable were used to construct them. This is important because fitted values will generally overstate how well a model forecasts.


In practice, it would not be.15

Table 2 includes, for each of the three quartiles, the forecasted crime counts and estimated upper and lower forecasting bounds. The estimates increase from 39.9 to 46.9 to 51.5 from Q1 to Q2 to Q3. For this illustration, the one-week-ahead crime count is known to be 34.0. Although the estimate using Q1 is the closest to 34.0, the estimate chosen is determined by the relative costs of overestimates to underestimates one imposes. The conventional default value of 46.9 assumes equal costs. If overestimates are 3 times more costly than underestimates, 39.9 is used as the forecast. If overestimates are 1/3rd as costly as underestimates, 51.5 is used as the forecast. The difference between equal-costs business as usual and the two other forecasts is roughly 10%. These may be modest differences, but the cost differentials are modest too. More dramatic cost differences will lead to more dramatic differences in forecasts. We will see later a case in which a 20 to 1 cost ratio, rather than 1 to 1, was properly used in a real application.

The lower and upper bounds were constructed by bootstrapping the quantile regression model. They are used to compute error bars.16 One can assume that 95% of the bootstrapped forecasted values fell between the lower and upper bounds. However, the bounds do not represent true confidence intervals because the values forecasted almost certainly contain some bias. Rather, the upper and lower bounds indicate the likely range of forecasted values in independent realizations of the data, assuming that the social processes responsible for violent crime in the Central Reporting District do not change systematically between the time the forecast is made and the time the crime count for the following week materializes.
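A percentile-style construction of such bounds can be sketched as follows. This is illustrative Python; the bootstrap forecasts are simulated rather than produced by actually refitting a quantile regression on resampled data:

```python
import random

random.seed(0)

# Simulated stand-ins for forecasts produced by refitting the quantile
# regression on bootstrap resamples (one forecasted value per resample).
boot_forecasts = sorted(random.gauss(46.9, 1.0) for _ in range(1000))

# Percentile-style 95% bounds: the middle 95% of the bootstrap
# forecast distribution.
lower = boot_forecasts[int(0.025 * len(boot_forecasts))]
upper = boot_forecasts[int(0.975 * len(boot_forecasts)) - 1]
print(lower < 46.9 < upper)  # True
```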

The quantile approach to asymmetric loss functions can be applied with several other statistical procedures sometimes used in forecasting. One powerful example adopts a quantile approach within an analogue of the generalized additive model (Hastie and Tibshirani, 1990). The right-hand side can be a linear combination of parametric and/or nonparametric quantile functions linking each regressor to the response variable (Koenker, Ng, and Portnoy, 1994).

15 It is also possible to update the model over time (Swanson and White, 1997). As each forecasted observation materializes, that new observation is added to the data set and the model is rebuilt before a new forecast is made. This is an appealing idea in principle, but it raises some tricky issues for statistical inference and the proper discounting of p-values for multiple and dependent statistical tests. Moreover, updating the forecasting model after each new observation becomes available may not always be practical.

16 Standard errors for quantile regression forecasts can be computed in several different ways. Bootstrap methods require the fewest assumptions but will often lead to the widest error bars. Consistent with how the statistical tests for the quantile regression coefficients were computed, we treated earlier values of all regressors as fixed because the forecast is made conditional on "history." Several alternative bootstrap approaches based on somewhat different formulations produced error bars that were quite similar.

              Q1 Model   Q2 Model   Q3 Model
Lower Bound       37.9       44.9       49.8
Estimate          39.9       46.9       51.5
Upper Bound       43.0       48.7       53.3

Table 2: Forecasted Values and Error Bands When the Observed Weekly Value is 34 Crimes

Figure 5 is an illustration. The data are from the 7th Police District in Washington, D.C. The observed values, shown with blue circles, are the number of Part I crimes per week. The fitted values are shown in black. They are constructed from three predictors: lagged values of the response, a regressor for non-seasonal trends, and a regressor for seasonal cycles. All are implemented in a nonparametric fashion as cubic regression splines from a B-spline basis (Koenker, 2009). The forecast and the bootstrapped 95% forecasting error bands are shown in red. A seasonal pattern over the full time series is apparent, with a linear upward trend starting early in 2005. In this instance, the third quartile is fitted.

Another good approach, especially when there are a very large number of predictors and highly nonlinear functional forms are anticipated, is machine learning. One option is to use quantiles in stochastic gradient boosting (Kriegler, 2007). Another option is to use quantile random forests (Meinshausen, 2006). Random forests is implemented as usual for quantitative response variables. But in the final step, when fitted values are constructed, the conditional quantile is used rather than the conditional mean. Quantile stochastic gradient boosting may well turn out to be the best choice (Kriegler and Berk, 2009), but a discussion is beyond the scope of this paper.17
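The final step of quantile random forests described above can be illustrated with made-up numbers: pool the training responses associated with a case across trees, then report a conditional quantile instead of the conditional mean:

```python
import statistics

# Sketch of the last step of quantile random forests as described in
# the text. The pooled response values are invented for illustration.

leaf_responses = [31, 34, 36, 38, 41, 44, 47, 49, 55, 62]

mean_forecast = statistics.mean(leaf_responses)  # ordinary regression output
q3_forecast = sorted(leaf_responses)[int(0.75 * len(leaf_responses))]

print(mean_forecast)  # 43.7
print(q3_forecast)    # 49
```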

To date, there seems to be no principled way to build in asymmetric costs for other than linear loss functions. So, there does not seem to be an obvious analogue to quantile regression for squared loss. There is some promising work, however, within machine learning traditions (Kriegler, 2007).

17 All three procedures — quantile nonparametric regression, quantile stochastic gradient boosting, and quantile random forests — are available in R and work well.

[Figure 5: Time Series of Violent Crime Forecast from Nonparametric Quantile Regression. Multiple B-spline quantile autoregression with weekly trends for the 7th District (Tau = .75); number of crimes per week, 2001-2008.]

4 Logistic Regression and Beyond

When a response variable is categorical, the approach can be quite different. For ease of exposition, we focus on the binary case. For a binary response variable Y, we begin with conventional logistic regression in which

ln(P / (1 − P)) = Xβ,    (8)

and the loss function is the deviance as defined in equation 4. The goal is to accurately forecast the class into which a case will fall.

For logistic regression, there is an easy way to take the costs of forecasting errors into account. Suppose logistic regression is applied to a data set that includes the values of key predictors and the binary outcome of interest. Estimates of the regression coefficients are obtained. Consider now a new observation for which only the values of the predictors are known. The intent is to forecast into which of the two outcome classes the case will fall.


For example, one might be interested in forecasting failure on parole from information available shortly before release from prison.

Conventionally, the forecast into one class or the other (e.g., fail or not) depends on which side of the 50-50 midpoint a case falls. If for a given case the model's fitted probability is equal to or greater than .50, the case is assigned to one category. If for that case the fitted probability is less than .50, the case is assigned to the other category. This is analogous to using the median (i.e., Q2) in quantile regression. Misclassification into either class has the same costs.

The analogy to quantile regression carries over more broadly. Suppose in the parole failure analysis, an arrest is coded as "1" and no arrest is coded as "0." Then, if one uses a threshold of .75 instead of .50, one has assumed that the cost of falsely forecasting an arrest is 3 times the cost of falsely forecasting no arrest. Likewise, if one uses a threshold of .25, one has assumed that the cost of falsely forecasting an arrest is 1/3rd the cost of falsely forecasting no arrest.
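A small Python sketch of how a threshold encodes this cost ratio (the probabilities and costs below are hypothetical; the paper supplies no code):

```python
# Choosing the classification threshold from relative costs.
# Predicting "arrest" only when the fitted probability reaches
# t = c_fp / (c_fp + c_fn) minimizes expected misclassification cost,
# so a 3-to-1 cost of falsely forecasting an arrest gives t = 0.75.

def threshold_for_costs(cost_false_arrest, cost_false_no_arrest):
    """Probability cutoff implied by the two misclassification costs."""
    return cost_false_arrest / (cost_false_arrest + cost_false_no_arrest)

def forecast_class(p, threshold):
    """Forecast arrest (1) when the fitted probability reaches the cutoff."""
    return 1 if p >= threshold else 0

t = threshold_for_costs(3.0, 1.0)
print(t)                          # 0.75
print(forecast_class(0.60, 0.5))  # 1 under equal costs
print(forecast_class(0.60, t))    # 0 once false arrest forecasts cost 3x
```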

But unlike quantile regression, the introduction of asymmetric costs is undertaken in the very last analysis step. Only the forecasted class is affected. The regression coefficients and all other output do not change. If there are no plans to use the regression output for anything other than the forecasts, no important harm may be done. Usually, however, there is at least some reliance on the regression coefficients to help justify the forecasts made. For example, stakeholders might be skeptical about a forecast if an offender's prior record were not strongly related to the chances of failure on parole, other things equal.

Probably the earliest multivariate classification procedure that could build in asymmetric costs at the front end is classification and regression trees, also known as CART (Breiman et al., 1984). CART is now familiar to many researchers. What may not be so widely known is that there are two ways within CART to properly introduce asymmetric costs. One way is to explicitly specify different costs for false positives and false negatives. Another way is to alter the prior probability of the response.18 Either method leads to the same outcome because, somewhat surprisingly, the two methods are mathematically identical in CART. Both serve to weight the fitting function. How this works is discussed in considerable detail elsewhere (Berk, 2008: Section 3.5.2).

18 The prior probability of the response in CART is in practice usually the marginal distribution of the response in the data being analyzed.

CART is not much used. Among its problems is instability. Small differences in the data can lead to trees with very different features. However, CART has become a building block for several very effective machine learning procedures. Random forests is an instructive example.

The random forests algorithm takes the following form, substantially quoted from a recent paper by Berk et al. (2009: 209):

1. From a data set with N observations, a probability sample of size n is drawn with replacement.

2. Observations not selected are retained as the out-of-bag (OOB) data that can serve as test data for that tree. Usually, about a third of the observations will be OOB.

3. A random sample of predictors is drawn from the full set that is available. For technical reasons that are beyond the scope of this paper (Breiman, 2001), the sample is typically small (e.g., 3 predictors). The number of sampled predictors is a tuning parameter, but the usual default value for determining the predictor sample size seems to work quite well in practice.

4. Classification and Regression Trees (CART) is applied (Breiman et al., 1984) to obtain the first partition of the data.

5. Steps 3 and 4 are repeated for all subsequent partitions until further partitions do not improve the model's fit. The result for a categorical outcome is a classification tree.

6. Each terminal node is assigned a class in the usual CART manner. In the simplest case, the class assigned is the class with the greatest number of observations.

7. The OOB data are dropped down the tree, and each observation is assigned the class associated with the terminal node in which that observation falls. The result is the predicted class for each OOB observation for a given tree.

8. Steps 1 through 7 are repeated a large number of times to produce a large number of classification trees. The number of trees is a tuning parameter, but the results are usually not very sensitive for any number of trees of about 500 or more.

9. For each observation, classification is by majority vote over all trees for which that observation was OOB.
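The claim in step 2 that about a third of the observations are OOB can be checked with a quick simulation: each observation is missed by a bootstrap sample with probability (1 - 1/N)^N, which approaches 1/e, about 0.368:

```python
import random

# Simulating step 1's bootstrap sampling to verify step 2's claim
# that roughly a third of the observations end up out-of-bag.

random.seed(0)
N = 10_000
in_bag = {random.randrange(N) for _ in range(N)}  # indices sampled at least once
oob_fraction = 1.0 - len(in_bag) / N
print(round(oob_fraction, 3))  # near 0.368
```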

The averaging over trees introduces stability that CART lacks. Sampling predictors facilitates construction of nonparametric functional forms that can be highly nonlinear and responsive to subtle features of the data. The use of OOB data means that random forests does not overfit no matter how many trees are grown. No classifiers to date consistently classify and forecast more accurately than random forests (Breiman, 2001; Berk, 2008).

Because random forests builds on CART, an asymmetric loss function can be implemented in a similar manner. In the first random forests step, a stratified sampling procedure can be used. Each outcome class becomes a stratum. One can then sample disproportionately so that, in effect, the prior distribution of the response is altered. When a given class is over-sampled relative to the prior distribution of the response, the prior is shifted toward that class. Misclassifications for this class are then given more weight (see Berk, 2008: Section 5.5).19
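A schematic of the stratified sampling idea (an illustration of the principle, not the randomForest internals): with a roughly 2%/98% outcome split, drawing equal-sized bootstrap strata shifts the working prior to 50/50, so misclassifying the rare class weighs far more heavily:

```python
import random

# Disproportionate stratified bootstrap sampling: each outcome class
# is a stratum, sampled with replacement at a chosen size. Equal
# stratum sizes turn a 2%/98% marginal split into a 50/50 prior.

random.seed(1)
fails = ["fail"] * 200          # the rare outcome, about 2% of cases
no_fails = ["no_fail"] * 9800   # the common outcome

sample = random.choices(fails, k=500) + random.choices(no_fails, k=500)
print(sample.count("fail") / len(sample))  # 0.5 by construction
```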

To see how this can play out, consider a recent study in which the goal was to forecast parole or probation failures. A failure was defined as committing a homicide or attempted homicide, or being the victim of a homicide or an attempted homicide. In the population of offenders, about 2 in 100 individuals failed by this definition within 18 months of intake. After much discussion, the stakeholders agreed on a 20 to 1 ratio for the costs of false negatives to false positives. Failing to identify a prospective perpetrator or victim was taken to be 20 times more costly than falsely identifying a prospective perpetrator or victim.

One can see in Table 3 that, just as intended, there are about 20 false positives for every false negative (987/45). One false negative has about the same cost as 20 false positives. The random forests procedure correctly forecasts failure on probation or parole with about 77% accuracy and correctly forecasts no failure with 93% accuracy. The 77% accuracy is especially notable because in the overall population of probationers and parolees, only about 2% fail. Put another way, for the subset of offenders identified by the random forests procedure, approximately 77 out of each 100 offenders will actually fail, compared to approximately 2 out of each 100 for the general population of individuals under supervision.20

19 If one uses the procedure randomForest in R (written by Leo Breiman and Adele Cutler and ported to R by Andy Liaw), there are several other options for taking costs into account. Experience to date suggests that the stratified sampling approach should be preferred.

              Forecast: No   Forecast: Yes   Forecast Error
Actual: No            9972             987              .09
Actual: Yes             45             153              .23

Table 3: Confusion Table for Forecasts of Shooters and Victims Using a 20 to 1 Cost Ratio (Yes = shooter or victim, No = not a shooter or victim)

              Forecast: No   Forecast: Yes   Forecast Error
Actual: No           14208               2            .0001
Actual: Yes            151             101              .60

Table 4: Confusion Table for Forecasts of Shooters and Victims Using the Cost Ratio Implied by the Prior (Yes = shooter or victim, No = not a shooter or victim)
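The quantities discussed here can be read directly off the confusion tables. A small Python sketch using the counts from Tables 3 and 4:

```python
# Reading the two confusion tables; counts are taken from Tables 3 and 4.

def table_stats(tn, fp, fn, tp):
    """Row-wise forecast errors and the realized false-error balance."""
    return {
        "no_error": fp / (tn + fp),    # error rate among actual "no" cases
        "yes_error": fn / (fn + tp),   # error rate among actual "yes" cases
        "fp_per_fn": fp / fn,          # false positives per false negative
        "fn_per_fp": fn / fp,          # false negatives per false positive
    }

t3 = table_stats(tn=9972, fp=987, fn=45, tp=153)   # 20-to-1 cost ratio
t4 = table_stats(tn=14208, fp=2, fn=151, tp=101)   # default prior costs

print(round(t3["no_error"], 2), round(t3["yes_error"], 2))  # .09 and .23
print(round(t3["fp_per_fn"], 1))  # near the intended 20-to-1 target
print(round(t4["fn_per_fp"], 1))  # the ratio flips to about 75 the other way
```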

The default is to draw an unstratified random sample of observations (with replacement). Whatever the relative costs implied by the prior distribution of the response, these are the relative costs imposed. What are the consequences here?

Table 4 contains random forests output for the same data used for Table 3, but with the default of equal a priori costs. Because the prior distribution of the response is so unbalanced and no cost adjustments are made, there are now 151 false negatives for 2 false positives. The realized costs are dramatically upside down. Instead of a 20 to 1 cost ratio of false negatives to false positives, the cost ratio is now about 1 to 75. Most stakeholders would find this reversal of the cost ratio completely inconsistent with how the relative costs of false negatives and false positives are actually viewed.

The forecasting consequences that can result from accepting the prior distribution of the response when it is inappropriate are dramatically represented in the table. Random forests is able to forecast with nearly 100% accuracy the offenders who did not fail. But one can forecast the absence of failure with about 98% accuracy using no predictors whatsoever, given the 98% to 2% split in the response. The gain in accuracy is unimpressive. And although the forecasting error drops from 9% to virtually 0%, that is not the group that policy makers are most worried about.

20 It is important to emphasize that because of the OOB data, confusion tables show real forecasting accuracy, not goodness of fit. Each tree is grown from one random sample of the observations, with forecasts made for a non-overlapping random sample of the observations.

Random forests was correct over 75% of the time forecasting failures under relative 20 to 1 costs, but with no cost adjustments, random forests is only 40% accurate. Forecasting skill is cut nearly in half. To appreciate what this means, with 20 to 1 relative costs there would be about 25 offenders in every 100 whose failure as a perpetrator or victim would not be predicted. And if not predicted, prevention is made moot. With relative costs determined solely by the prior distribution of the response, there would be approximately 60 offenders in every 100 whose failure as a perpetrator or victim would not be predicted. The number of violent crimes that might have been prevented more than doubles. In short, we see once again that the costs assigned to forecasting errors can dramatically affect the results, and automatically accepting the usual default costs implied by the prior distribution of the response can lead to forecasts that are not responsive to the needs of stakeholders.

5 Conclusions

Not all forecasting errors have the same consequences. For decision makers who will be using forecasts, these different consequences can have real bite. It stands to reason, therefore, that forecasts that will be used should be shaped sensibly and sensitively by the costs of forecasting errors. A key instance is when the costs are asymmetric. We have considered in this paper how asymmetric costs can be introduced into loss functions to affect the forecasts produced. Sometimes different cost ratios can lead to very different forecasts. This is a good thing. Accepting the usual default of equal a priori costs in an unexamined manner is not.

In practice, it is important to elicit from stakeholders their views on the relative costs of forecasting errors. If the stakeholders cannot arrive at a single assessment, it can be instructive to work with several and provide the stakeholders with the several forecasts. They can then decide how to proceed.


If there are no stakeholders when the forecasts are being constructed, it can still make good sense to apply several different cost ratios and produce the forecasts for each. It will often be useful to know how sensitive the forecasts are to a likely range of cost ratios.


References

Austin, J., Clear, T., Duster, T., Greenberg, D.F., Irwin, J., McCoy, C., Mobley, A., Owen, B., and J. Page (2007) Unlocking America: Why and How to Reduce America's Prison Population. Washington, D.C.: JFA Institute.

Berk, R.A., He, Y., and S.B. Sorenson (2005) "Developing a Practical Forecasting Screener for Domestic Violence Incidents." Evaluation Review 29(4): 358-383.

Berk, R.A. (2008) Statistical Learning from a Regression Perspective. New York: Springer.

Berk, R.A. (2009) "The Role of Race in Forecasts of Violent Crime." Race and Social Problems 1: 231-242.

Berk, R.A., Messinger, S.L., Rauma, D., and J.E. Berecochea (1983) "Prisons as Self-Regulating Systems: A Comparison of Historical Patterns in California for Male and Female Offenders." Law and Society Review 17(4): 547-586.

Berk, R.A., Sherman, L., Barnes, G., Kurtz, E., and L. Ahlman (2009) "Forecasting Murder within a Population of Probationers and Parolees: A High Stakes Application of Statistical Learning." Journal of the Royal Statistical Society (Series A) 172, part 1: 191-211.

Blumstein, A., and R. Larson (1969) "Models of a Total Criminal Justice System." Operations Research 17(2): 199-232.

Borden, H.G. (1928) "Factors Predicting Parole Success." Journal of the American Institute of Criminal Law and Criminology 19: 328-336.

Bose, A., and S. Chatterjee (2003) "Generalized Bootstrap for Estimators of Minimizers of Convex Functions." Journal of Statistical Planning and Inference 117: 225-239.

Bratton, W.J., and S.W. Malinowski (2008) "Police Performance Management in Practice: Taking COMPSTAT to the Next Level." Policing 2(3): 259-265.


Breiman, L. (2001) "Random Forests." Machine Learning 45: 5-32.

Breiman, L., Friedman, J.H., Olshen, R.A., and C.J. Stone (1984) Classification and Regression Trees. Monterey, CA: Wadsworth Press.

Britt, C.L. (2009) "Modeling the Distribution of Sentence Length Decisions Under a Guideline System: An Application of Quantile Regression Models." Journal of Quantitative Criminology, 4, online.

Brown, L., Gans, N., Mandelbaum, A., Sakov, A., Shen, H., Zeltyn, S., and L. Zhao (2005) "Statistical Analysis of a Telephone Call Center: A Queueing-Science Perspective." Journal of the American Statistical Association 100(449): 36-50.

Burgess, E.W. (1928) "Factors Determining Success or Failure on Parole," in A.A. Bruce, A.J. Harno, E.W. Burgess, and J. Landesco (eds.) The Working of the Indeterminate Sentence Law and the Parole System in Illinois. Springfield, Illinois: State Board of Parole: 205-249.

Cohen, J., and W.L. Gorr (2006) Development of Crime Forecasting and Mapping Systems for Use by Police. Final Report, U.S. Department of Justice (Grant # 2001-IJ-CX-0018).

Cohen, L.W., and K.C. Land (1987) "Age Structure and Crime: Symmetry Versus Asymmetry and the Projection of Crime Rates Through the 1990s." American Sociological Review 52: 170-183.

Eckberg, D.L. (1995) "Estimates of Early Twentieth-Century U.S. Homicide Rates: An Econometric Forecasting Approach." Demography 32: 1-16.

Gilchrist, W. (2008) "Regression Revisited." International Statistical Review 76(3): 401-418.

Glueck, S., and E. Glueck (1950) Unraveling Juvenile Delinquency. Cambridge, Mass.: Harvard University Press.

Goodman, L.A. (1952) "Generalizing the Problem of Prediction." American Sociological Review 17: 609-612.

Goodman, L.A. (1953a) "The Use and Validity of a Prediction Instrument. I. A Reformulation of the Use of a Prediction Instrument." American Journal of Sociology 58: 503-510.


Goodman, L.A. (1953b) "II. The Validation of Prediction." American Journal of Sociology 58: 510-512.

Gottfredson, S.D., and L.J. Moriarty (2006) "Statistical Risk Assessment: Old Problems and New Applications." Crime & Delinquency 52(1): 178-200.

Granger, C.W.J. (1989) Forecasting in Business and Economics, second edition. New York: Academic Press.

Hastie, T.J., and R.J. Tibshirani (1990) Generalized Additive Models. New York: Chapman & Hall.

Hastie, T.J., Tibshirani, R.J., and J. Friedman (2009) Elements of Statistical Learning, second edition. New York: Springer.

He, X., and F. Hu (2002) "Markov Chain Marginal Bootstrap." Journal of the American Statistical Association 97(459): 783-795.

Kocherginsky, M., He, X., and Y. Mu (2005) "Practical Confidence Intervals for Regression Quantiles." Journal of Computational and Graphical Statistics 14: 41-55.

Koenker, R.W., and G. Bassett (1978) "Regression Quantiles." Econometrica 46: 33-50.

Koenker, R.W. (2005) Quantile Regression. Cambridge: Cambridge University Press.

Koenker, R.W. (2009) "Quantile Regression in R: A Vignette." cran.r-project.org/web/packages/quantreg/vignettes/rq.pdf.

Koenker, R., Ng, P., and S. Portnoy (1994) "Quantile Smoothing Splines." Biometrika 81: 673-680.

Koenker, R., and Q. Zhao (1996) "Conditional Quantile Estimation and Inference for ARCH Models." Econometric Theory 12: 793-813.

Kriegler, B. (2007) Cost-Sensitive Stochastic Gradient Boosting Within a Quantitative Regression Framework. PhD dissertation, UCLA Department of Statistics.


Kriegler, B., and R.A. Berk (2009) "Estimating the Homeless Population in Los Angeles: An Application of Cost-Sensitive Stochastic Gradient Boosting." Working paper, Department of Statistics, UCLA, under review.

McCullagh, P., and J.A. Nelder (1989) Generalized Linear Models. New York: Chapman & Hall.

Meinshausen, N. (2006) "Quantile Regression Forests." Journal of Machine Learning Research 7: 983-999.

Ohlin, L.E., and O.D. Duncan (1949) "The Efficiency of Prediction in Criminology." American Journal of Sociology 54: 441-452.

Parzen, M.I. (1979) "Nonparametric Statistical Data Modeling." Journal of the American Statistical Association 74: 105-121.

Parzen, M.I., L. Wei, and Z. Ying (1994) "A Resampling Method Based on Pivotal Estimating Functions." Biometrika 81: 341-350.

Oberhofer, W. (1982) "The Consistency of Nonlinear Regression Minimizing the L1-Norm." The Annals of Statistics 10(1): 316-319.

Swanson, N.R., and H. White (1997) "Forecasting Economic Time Series Using Flexible Versus Fixed Specification and Linear Versus Nonlinear Econometric Models." International Journal of Forecasting 13(4): 439-461.

Tay, A.S., and K.F. Wallis (2000) "Density Forecasting: A Survey." Journal of Forecasting 19: 235-254.

Tibshirani, R.J. (1996) "Regression Shrinkage and Selection Via the LASSO." Journal of the Royal Statistical Society, Series B, 58: 267-288.
