Federal University of Rio Grande do Norte
Center of Exact and Earth Sciences
Systems and Computing - PPgSC/UFRN
Predspot: Predicting Crime Hotspots with Machine Learning
Adelson Araújo jr
Natal-RN
September 2019
Adelson Araújo jr
Predspot: Predicting Crime Hotspots with Machine Learning
Master's dissertation presented to the Program of Postgraduate Studies in Systems and Computing (PPgSC) of the Federal University of Rio Grande do Norte as a requirement for the M.Sc. degree.
Supervisor
Prof. Dr. Nélio Alessandro Azevedo Cacho
Federal University of Rio Grande do Norte – UFRN
Natal-RN
September 2019
Araújo jr, Adelson. Predspot: predicting crime hotspots with machine learning / Adelson Dias de Araújo Júnior. - 2019. 80f.: il.
Dissertação (Mestrado) - Universidade Federal do Rio Grande do Norte, Centro de Ciências Exatas e da Terra, Programa de Pós-Graduação em Sistemas e Computação. Natal, 2019. Orientador: Nélio Alessandro Azevedo Cacho. Coorientador: Leonardo Bezerra.
1. Computação - Dissertação. 2. Policiamento preditivo - Dissertação. 3. Manchas criminais - Dissertação. 4. Aprendizado de máquina - Dissertação. I. Cacho, Nélio Alessandro Azevedo. II. Bezerra, Leonardo. III. Título.
RN/UF/CCET CDU 004
Universidade Federal do Rio Grande do Norte - UFRN
Sistema de Bibliotecas - SISBI
Catalogação de Publicação na Fonte. UFRN - Biblioteca Setorial Prof. Ronaldo Xavier de Arruda - CCET
Elaborado por Joseneide Ferreira Dantas - CRB-15/324
Dedicated to all potiguar people.
Acknowledgment
I thank God.
I thank the support of my extraordinary and close family, Flávia Bezerril, Adelson
Araújo, Fernando Bezerril, Graça Bezerril, Paula Sette, Fernando Bezerril Neto, Matheus
Coutinho, Fernanda Bezerril, Marina Bezerril, Maria Fernanda Bezerril, Mariana Bezerril
and Alexandre Serafim.
I thank my closest academic educators, Nélio Cacho, Leonardo Bezerra, Renzo Tor-
recuso and many others. Also those that financially contributed with my research, the
SmartMetropolis project and Google Latin America.
I thank my closest friends Giovani Tasso, Pedro Araújo, José Lucas Ribeiro, Mickael
Figueredo, João Marcos do Valle, Beatriz Vieira, Júlio Freire, Ottony Chamberlaine and
Lúcio Soares.
It is sometimes difficult to avoid the impression that
there is a sort of foreknowledge of the coming series of events.
Carl Jung
Predspot: Predicting Crime Hotspots with Machine Learning
Author: Adelson Dias de Araújo Júnior
Supervisor: Prof. Dr. Nélio Alessandro Azevedo Cacho
Abstract
Smart cities are increasingly adopting data infrastructure and analysis to improve the
decision-making process for public safety issues. Although traditional hotspot policing
methods have shown benefits in reducing crime, previous studies suggest that the adop-
tion of predictive techniques can produce more accurate estimates for future crime con-
centration. In previous work, we proposed a framework to generate near-future hotspots
using spatiotemporal features. In this work, we redesign the framework to support (i) the
widely used crime mapping method kernel density estimation (KDE); (ii) geographic feature extraction with data from OpenStreetMap; (iii) feature selection; and (iv) gradient boosting regression. Furthermore, we provide an open-source implementation of
the framework to support efficient hotspot prediction for police departments that can-
not afford proprietary solutions. To evaluate the framework, we consider data from two
cities, namely Natal (Brazil) and Boston (US), comprising twelve crime scenarios. We take
as baseline the common police prediction methodology also employed in Natal. Results
indicate that our predictive approach estimates hotspots 1.6-3.1 times better than the
baseline, depending on the crime mapping method and machine learning algorithm used.
From a feature importance analysis, we found that features from trend and seasonality
were the most essential components to achieve better predictions.
Keywords : predictive policing, hotspot prediction, machine learning, crime forecasting.
Predspot: Predizendo Hotspots Criminais com Aprendizado de Máquina
Autor: Adelson Dias de Araújo Júnior
Orientador: Prof. Dr. Nélio Alessandro Azevedo Cacho
Resumo
As cidades inteligentes estão adotando cada vez mais infraestrutura e análise de da-
dos para melhorar o processo de tomada de decisões em questões de segurança pública.
Embora os métodos tradicionais de policiamento de hotspot tenham se mostrado eficazes
na redução do crime, estudos anteriores sugerem que a adoção de técnicas preditivas
pode produzir estimativas mais precisas para a concentração espacial de crimes de um
futuro próximo. Em nossas pesquisas anteriores, propusemos uma metodologia para gerar
hotspots do futuro usando variáveis espaço-temporais. Neste trabalho, redesenhamos a es-
trutura do framework para suportar (i) o método de mapeamento de crimes amplamente
utilizado - estimativa de densidade de kernel (KDE); (ii) extração de características ge-
ográficas com dados do OpenStreetMap; (iii) seleção de atributos e; (iv) regressão com
o algoritmo Gradient Boosting. Além disso, fornecemos uma implementação de código
aberto da estrutura para suportar a predição eficiente de hotspots. Para avaliar nossa
abordagem, consideramos dados de duas cidades, Natal (Brasil) e Boston (EUA), com-
preendendo doze divisões de tipo de crime. Tomamos como método base de comparação
uma metodologia comumente utilizada e também empregada em Natal. Os resultados
indicam que nossa abordagem preditiva estima hotspots em média 1,6 a 3,1 vezes mel-
hor que a abordagem tradicional, dependendo do método de mapeamento do crime e do
algoritmo de aprendizado de máquina usado. A partir de uma análise de importância de
atributos, descobrimos que tendência e sazonalidade eram os componentes mais essenciais
para obter melhores previsões.
Palavras-chave: policiamento preditivo, previsão de hotspot, aprendizado de máquina,
previsão de crimes.
List of Figures
1 An illustration of KGrid. The events are clustered by using the K-Means
algorithm. Each cluster has external edges, which form a convex polygon.
These polygons are the topological separation of the city into subregions
or cells of a grid. By aggregating the count values in each cell, one can
map hotspots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 22
2 An illustration of Kernel Density Estimation and its parameters. Each
bold point (grid cell) represents an arbitrary place in which a kernel
function applies a density estimation around a bandwidth. For a set of
events/points, this procedure returns an array of KDE values indexed by
the cells identifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 23
3 KDE results for different bandwidth values. A clear difference in resolution is observed between 0.1 and 0.01 miles (bottom). We observe underfitting in the top left and overfitting in the bottom right situations. . . . p. 24
4 Seasonal and trend components of a time series. In blue, the original
series show a varying behaviour which can be further explained by a
trend (in black) and a seasonality (in orange). The trend follows the
moving average of the series and the seasonality represents the cyclical
aspect of the original series in monthly oscillation. . . . . . . . . . . . . p. 25
5 Results of the survey with 54 police sergeants about the rating of different landmarks and demographic aspects for determining hotspots. At the top, the rating for property crimes (CVP); in the middle, for lethal or violent crimes (CVLI); and at the bottom, for drug-related crimes (TRED). . . . p. 27
6 Geographic feature layers from Natal generated using KDE of residential
streets (left) and schools (right). Note that residential streets are denser
and concentrated in the north of the city, but still widespread in other
places. Schools are more concentrated in the center of the city, extending
to the south, but with some concentration in the north. . . . . . . . . . p. 29
7 Model selection begins with loading and preparing datasets. Required
data sources include a crime database for model training and connection
to the OpenStreetMap API to load PoI data. Also, the city’s shapefile is
important for filtering the data that falls within its borders. . . . . . . p. 35
8 Between data loading and model building, the feature ingest process is
responsible for assembling the independent variables and the variable to
be predicted. This starts with the spatiotemporal aggregations of crimes,
through crime mapping method and time series manipulation, and PoI
data spatial aggregation. The grid is a supporting element in this step
and will be used as the places where criminal incidence will be predicted.
Time series Cij extracted from grid cells indicate a set of values from a
grid cell i in period j, and PoI features Gki represent the density of PoI
category k located in grid cell i. . . . . . . . . . . . . . . . . . . . . . p. 37
9 The extraction of the independent variables (features) of the time series
in Predspot is conducted through a trend and seasonality decomposition
(through STL) and series differentiation. Each decomposition will gener-
ate a new series, and the variables will be composed of k lags from each
of these series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 38
10 An artificial example of temporal features for two places and three time
intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 38
11 In the machine learning modeling step, the best qualified features are
selected and fed into the algorithms. By adjusting various algorithms,
we can evaluate the models to use the one that has the best predictive
performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 39
12 In the prediction pipeline, one must tailor the model selection steps to
return predictions using previously trained models. This means no longer
loading the entire dataset, just the crimes from the last period. Then,
extracting temporal features, filtering previously selected features, and
requesting the trained model for predictions. . . . . . . . . . . . . . . . p. 41
13 The web service manages the prediction pipeline to serve online
requests. This requires the use of a file system, which we call
Volume, for caching the prediction results and an ETL process controller
to trigger prediction generation for each new period. . . . . . . . . . . . p. 43
14 The sample sizes of the twelve crime scenarios. Natal has more hetero-
geneous crime scenarios compared to Boston. . . . . . . . . . . . . . . . p. 51
15 Monthly sampled time series of the twelve crime scenarios. . . . . . . . p. 52
16 Time series decomposition for Residential-Burglary (in Boston) daytime
and nighttime scenarios. The second row is composed of trend series,
which are a smoothed version of the original. The third row shows the
seasonal patterns, clearly distinct between day and night. The fourth
row represents the differentiated component. . . . . . . . . . . . . . . . p. 53
17 Spatial representations of the two crime mapping methods for Residential-
Burglary (in Boston) daytime and nighttime crime scenarios. In the first
row, KGrid aggregation, and in the second row KDE. . . . . . . . . . . p. 53
18 PoI data from Natal (left) and Boston (right) extracted from Open-
StreetMap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 54
19 A representation of the geographic features taken from three PoI cate-
gories, hospitals (first row), residential streets (second row) and touristic
places (third row) of Natal (left) and Boston (right). . . . . . . . . . . p. 55
20 Cross-validation MSE results for KGrid-based models for Natal (in red)
and Boston (in blue) crime scenarios. . . . . . . . . . . . . . . . . . . . p. 59
21 Cross-validation MSE results for KDE-based models for Natal (in red)
and Boston (in blue) crime scenarios. . . . . . . . . . . . . . . . . . . . p. 60
22 PRRMSE results of five trials evaluating trained models for each crime
scenario. On the left side, the two crime mapping methods are compared
and, on the right, the machine learning algorithms. Note that KDE
outperforms KGrid in all crime scenarios, but more sharply in crime
scenarios with fewer data points, such as CVLI. On the other hand, one can
note that GB models have higher percentiles, but with much more variance. p. 61
23 Feature importance of KGrid-RF models. . . . . . . . . . . . . . . . . . p. 66
24 Feature importance of KGrid-GB models. . . . . . . . . . . . . . . . . . p. 67
25 Feature importance of KDE-RF models. . . . . . . . . . . . . . . . . . p. 68
26 Feature importance of KDE-GB models. . . . . . . . . . . . . . . . . . p. 69
List of Tables
1 KDE parameters and machine learning hyperparameters considered in
the grid search tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 56
2 Selected KDE parameters for the crime mapping methods applied in each
crime scenario. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 57
3 Selected KDE parameters for geographic feature extraction applied for
PoI data aggregation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 58
4 Average and standard deviation of PRRMSE for each predictive approach,
considering five trials of the twelve crime scenarios. . . . . . . . . . . . p. 62
5 A pairwise statistical comparison of the four predictive approaches, con-
sidering the results from post-hoc analysis. . . . . . . . . . . . . . . . . p. 63
6 Wall time spent on model selection phase for each dataset. . . . . . . . p. 63
7 Selected hyperparameters of the machine learning models for the Natal
crime scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 79
8 Selected hyperparameters of the machine learning models for the Boston
crime scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 80
Contents
1 Introduction p. 13
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 14
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 15
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 16
2 Background p. 18
2.1 Predictive Hotspot Policing . . . . . . . . . . . . . . . . . . . . . . . . p. 18
2.2 Spatiotemporal Modelling . . . . . . . . . . . . . . . . . . . . . . . . . p. 21
2.2.1 Crime Mapping Methods . . . . . . . . . . . . . . . . . . . . . . p. 21
KGrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 21
Kernel Density Estimation (KDE) . . . . . . . . . . . . . p. 22
2.2.2 Time Series Decompositions . . . . . . . . . . . . . . . . . . . . p. 23
2.2.3 Geographic Features . . . . . . . . . . . . . . . . . . . . . . . . p. 26
2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 28
2.3.1 Prediction Task . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 30
2.3.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . p. 31
3 The Predspot Framework p. 33
3.1 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 34
3.1.1 Dataset Preparation . . . . . . . . . . . . . . . . . . . . . . . . p. 34
3.1.2 Feature Ingest . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 36
3.1.3 Machine Learning Modelling . . . . . . . . . . . . . . . . . . . . p. 39
3.2 Prediction Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 40
3.2.1 Prediction Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . p. 41
3.2.2 Web Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 42
3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 43
3.3.1 Predspot python-package . . . . . . . . . . . . . . . . . . . . . . p. 43
3.3.2 Predspot service . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 44
4 Evaluation p. 46
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 47
4.2 Experiment Methods and Metrics . . . . . . . . . . . . . . . . . . . . . p. 47
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 50
4.3.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . p. 50
4.3.2 Parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . p. 56
4.3.3 Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . p. 58
4.3.4 Feature Importance Analysis . . . . . . . . . . . . . . . . . . . . p. 64
5 Concluding Remarks p. 70
5.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 72
References p. 74
Appendix A -- Selected Parameters of Machine Learning Models p. 79
1 Introduction
Cities are home to an increasing majority of people, and the management
of their resources has become a complex task. With the growing impact of Information and
Communication Technologies (ICT), coupled with the fact that information has become as
valuable as energy (BATTY, 2013), city governance is undergoing a technological paradigm
shift to the so-called smart cities. This demand, justified by the current belief that ICT can
be a great facilitator of city management (ANGELIDOU, 2014; LEE et al., 2008), expresses
the need for the constitution of more sustainable city models. We agree with Caragliu, Bo
and Nijkamp (2011) who define a smart city as one where investments in social and human
capital, as well as in traditional and modern infrastructure (ICT), lead to sustainable
economic growth, high quality of life and the management of city’s resources through
participatory governance.
However, improvements in quality of life are unlikely to be effective if they disregard
crime incidence levels. According to a recent BBC report (2018), among the 50 most
dangerous cities in the world (ranked by homicides per capita), only eight are outside
Latin America. Brazilian cities stand out in this ranking, with 17 cities. The city that hosts
this study, Natal, Brazil, holds the fourth position with slightly more than 100 homicides
per 100,000 inhabitants, a rate fifteen times higher than the global average measured
by the United Nations Office on Drugs and Crime (UNODC, 2013). Certainly,
these regions call for changes in the way police resources are allocated.
Using Geographical Information Systems (GIS) to support the patrol vehicle dispatch
has become routine in many law enforcement agencies in the world, in a so-called hotspot
policing manner (SHERMAN; GARTIN; BUERGER, 1989). Based on the empirical evidence
that many crimes concentrate in few places (WEISBURD, 2015), an increasing body of
research has explored the effects of sending patrol units to spots of high crime incidence.
Braga, Papachristos and Hureau (2014) have found strong evidence that it does help
reduce crimes and Weisburd et al. (2006) suggested other benefits aside crime reduction.
Traditional hotspot policing literature focused on the historical aggregation of crime data
until the first prediction efforts appeared in America in the late 1990s (GORR; HARRIES,
2003).
Nowadays, smart cities are gradually adopting predictive data analysis to enhance
decision-making in public safety. Related works have reported many examples of such
quantitative techniques to support patrol planning as "predictive policing" applications
(PERRY, 2013; MOSES; CHAN, 2018). Machine learning is one of the techniques that has
gained momentum in such context due to the accuracy of its estimation and the flexibility
to explore patterns from a range of data, such as geographic (LIN; YEN; YU, 2018), demographic (BOGOMOLOV et al., 2014) and social media (GERBER, 2014). However, crime
mapping methods (ECK et al., 2005) and spatiotemporal modeling may be crucial to
building efficient predictive models.
1.1 Problem Statement
Previous studies have shown that statistical models can produce estimates of crime
incidence over space and time that are more accurate than human predictions
(MOHLER et al., 2015). Nevertheless, Moses and Chan (2018) suggest a considerable body
of research lacks accountability and transparency, and that these models should not be
implemented without an adequate description of the processing steps involved. Indeed,
we have found no research that at the same time (i) evaluates its efficiency against a
traditional hotspot policing approach implemented by the police and (ii) provides a clear
breakdown of the processing steps involved to implement such a predictive system.
A considerable body of research (LIN; YEN; YU, 2018; MALIK et al., 2014) developed
steps, guidelines and frameworks to implement a crime hotspot prediction model, with a
variety of standards. In a previous work (ARAUJO et al., 2018), we designed a framework to
model machine learning with time series autoregressive features based on a crime mapping
method named KGrid (BORGES et al., 2017; ZIEHR, 2017). Our ambitious purpose was to
standardize the processing steps involved in building machine learning models
to predict hotspots, but this first effort fell short in some respects. First, despite having tested
several machine learning algorithms, there was still room for improvement in the results
found, perhaps because we considered a separate model for each place. Second, we did
not consider the kernel density estimation (KDE), extensively recommended by related
works (ECK et al., 2005; CHAINEY, 2013; HART; ZANDBERGEN, 2014), in our experiments.
We also consider as part of the transparency problem mentioned by Moses and Chan
(2018) the lack of open-source tools to enable efficient crime hotspot prediction. Large
police departments may not feel this as much, as they have the budget to purchase commercial
solutions that meet their needs. However, small police departments, which often face more
worrying levels of violence, may not be able to afford more efficient tools. Building
a prediction system in-house can cost even more than buying one and can take
considerable time. We argue that an open-source programming interface can ease
the implementation of a web service to be deployed in a low-budget police department.
1.2 Objectives
The purpose of this work is to improve our previously proposed prediction framework
through alternative crime mapping and feature engineering approaches, and provide an
open-source implementation that police analysts can use to deploy more effective predictive
policing.
Our first specific objective is to improve the efficacy of our previously proposed frame-
work. To do so, we consider alternative crime mapping and prediction algorithms, respec-
tively kernel density estimation (KDE) and gradient boosting regression (GB). We eval-
uate our expanded framework on two datasets, from Natal (Brazil) and Boston (US). We
compare our results with the traditional approach used by criminal analysts to generate
hotspots. Natal’s police department aggregates data from the previous month's time
window to build its patrol plans, and we noted that this is a common baseline practice
in related works (BROWN; OXFORD, 2001; COHEN; GORR; OLLIGSCHLAEGER, 2007).
By improving our framework, we also intend to describe better the process of producing
a crime hotspot predictive pipeline, from the model training to its online deployment.
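The previous-month baseline described above can be sketched in a few lines. This is a hedged illustration only: the tabular layout (one row per incident, with a grid-cell id and a month index) and every name in it are assumptions for the example, not the actual data schema used by the Natal police or by Predspot.

```python
import pandas as pd

def previous_month_baseline(crimes: pd.DataFrame, target_month: int) -> pd.Series:
    """Predict each cell's count for `target_month` as its count in the
    immediately preceding month (the map analysts reuse for patrol plans)."""
    last = crimes[crimes["month"] == target_month - 1]
    return last.groupby("cell").size()

# Toy incident table: one row per recorded crime.
crimes = pd.DataFrame({
    "cell":  [0, 0, 1, 2, 2, 2, 1],
    "month": [1, 1, 1, 1, 2, 2, 2],
})

# Baseline "prediction" for month 3 is simply the month-2 counts per cell.
pred = previous_month_baseline(crimes, target_month=3)
```

Any predictive approach evaluated in this work must beat this simple reuse of last month's hotspot map to be worth deploying.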
We also investigate challenges concerning the tools available and the methodology
involved. The procedures of translating criminal events into attributes that spatially de-
scribe the concentration of crime and adjusting a predictive algorithm may involve many
different tools. Proprietary solutions such as PredPol and HunchLab are often robust, but
they mainly serve cities with higher purchasing power. Also, these tools focus on
ease of use and can be rigid when configuring particular procedures, such
as crime mapping and prediction algorithms. The democratization of such technologies
emerges as a real necessity, since many of the poorer and more dangerous cities cannot
afford such solutions. Therefore, our second objective is to design open-source software
and detail its procedures to estimate future hotspots.
1.3 Contributions
In this study, we present a set of relevant contributions to predictive policing litera-
ture. First, we present qualitative improvements towards a transparent process necessary
to model machine learning for future hotspot crime estimation. We connect a broad set
of techniques and organize them into purpose-related processing steps for adjusting effi-
cient models and deploy them with a web service. As a second contribution, we provide
an open-source python-package implementation of our Predspot framework, to ensure re-
producibility of our approach and code reuse. The structure of the Predspot framework
enables the use of different crime mapping approaches and machine learning algorithms.
Thus, we argue that future work may use our standard framework to improve model
performance even more.
A third contribution is empirical evidence that the predictive models we built
estimate hotspots better than the traditional baseline implemented by the Natal police. Depending on
the crime mapping method and machine learning algorithm chosen, predictive approaches
average 1.6 to 3.1 times better than the baseline. Fourth, we find that KDE is a robust
technique to predict crime incidence when fewer samples are available. In the smallest
samples we analyzed (lethal crimes in Natal), the predictive approaches with KDE
outperformed the baseline by a wider margin than in the cases with more samples (property crimes
in Natal). Conversely, KGrid approaches estimated better when more samples are
available.
Finally, a fifth contribution is related to the importance of analyzing the spatiotem-
poral features used. We modelled features based on trend and seasonality components of
time series and also geographic features from OpenStreetMap data. In our feature impor-
tance analysis, we find that trend and seasonality lags were the features that contributed
most in the adjusted models. To our knowledge, no previous work has investigated such
temporal patterns alongside machine learning algorithms for crime hotspot modelling.
This study is divided as follows. In Chapter 2, we present the necessary theoretical
framework to follow the proposals made in this work, from the hotspot policing discus-
sions and critique, through the spatiotemporal modeling methods to finally the machine
learning background necessary for this work. In Chapter 3, we introduce our framework,
presenting the adaptations made to the previous version, dividing the purpose-related
steps involved and presenting our open-source implementation. In Chapter 4, we
show the datasets used in our experiments, explaining the evaluation process conducted
and present the results. In Chapter 5, we conclude our study with a resumption of our
objectives, methods and experimental results, and provide further recommendations for
future work.
2 Background
A broad and complex spectrum of concepts underlies a predictive policing solution.
In this chapter, we review some theoretical aspects, starting from traditional hotspot
policing literature and the application of predictive algorithms to estimate crime hotspots.
Specifically, Weisburd’s law of crime concentration in places is used as a premise for the
implementation of spatially focused policing strategies, and we explore related discussions
in criminology and hotspot prediction to implement predictive policing.
Proper modeling of crime variables can impact prediction model success, and efforts
to translate crime events into independent variables (features) are necessary. This spatiotemporal modeling process, also called feature engineering, starts with a crime mapping
method and can be assisted temporally with time series decomposition. In crime hotspot
prediction literature, it is also common to use ancillary variables to help describe crime
spatial patterns, such as demographic (BOGOMOLOV et al., 2014), geographic (LIN; YEN;
YU, 2018) and from social media data (GERBER, 2014). Particularly for geographic vari-
ables, we present some strategies to use data from points-of-interest (PoI) in the city as
an alternative to help in the predictions.
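As a rough illustration of the temporal side of this feature engineering, the sketch below splits a synthetic monthly crime series into trend and seasonal components using a centered moving average and per-calendar-month means. It is a simplified stand-in for the STL decomposition used in this work, and every name and number in it is an illustrative assumption, not Predspot code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.period_range("2015-01", periods=48, freq="M")

# Synthetic monthly crime counts: upward trend + yearly cycle + noise.
series = pd.Series(
    10 + 0.1 * np.arange(48)
    + 2 * np.sin(2 * np.pi * np.arange(48) / 12)
    + rng.normal(0, 0.3, 48),
    index=months,
)

# Trend: centered 12-month moving average of the series.
trend = series.rolling(12, center=True, min_periods=1).mean()
# Seasonality: average detrended value for each calendar month.
seasonal = (series - trend).groupby(series.index.month).transform("mean")
# Remainder after removing trend and seasonality.
residual = series - trend - seasonal
```

Lagged values of the trend and seasonal series (rather than of the raw counts) are the kind of temporal features the framework feeds to its predictive models.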
Last, an increasing body of research (VOMFELL; HÄRDLE; LESSMANN, 2018; PERRY,
2013; BOGOMOLOV et al., 2014) suggests that machine learning algorithms fit very well with
predictive policing. Still, the many facets of machine learning, such as the algorithms and
prediction tasks, make it necessary to discuss its application more thoroughly in the
context of spatiotemporal crime analysis.
2.1 Predictive Hotspot Policing
Criminologists point to different strategies for reducing crime, disorder and fear (WEIS-
BURD; ECK, 2004). Among the methodologies addressed, there is strong evidence of the
effectiveness in patrolling micro-regions of crime concentration (WEISBURD; ECK, 2004;
PERRY, 2013; BRAGA, 2001). Perhaps this can be justified by the fact that policing strate-
gies have considered Weisburd’s law of crime concentration in places (WEISBURD, 2015),
which states that "for a defined measure of crime at a specific microgeographic unit, the
concentration of crime will fall within a narrow bandwidth of percentages for a defined
cumulative proportion of crime". For example, experiments suggest that criminal occur-
rences are concentrated in around 10% of places in the cities (ANDRESEN; LINNING, 2012;
ANDRESEN; WEISBURD, 2018), with variations for each crime type and study region. A
2006 US national survey (KOCHEL, 2011) reported that 90% of large policing departments
have considered this crime concentration pattern to draw up the so-called hotspot policing
operations (SHERMAN; GARTIN; BUERGER, 1989). Researchers are continually reviewing
experiments on hotspot policing efficiency (BRAGA; PAPACHRISTOS; HUREAU, 2014),
suggesting that 20 out of 25 experiments observed benefits in reducing crime and
social disorder, and an overall improvement in the perception of community safety.
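The concentration pattern cited above can be made concrete with a small computation: given per-place incident counts, find the fraction of places needed to account for a chosen share of all crime. The function and numbers below are an illustrative toy, not data from any of the cited studies.

```python
import numpy as np

def concentration(counts, share=0.5):
    """Fraction of places needed to account for `share` of all incidents."""
    c = np.sort(np.asarray(counts, dtype=float))[::-1]  # busiest places first
    cum = np.cumsum(c) / c.sum()                        # cumulative crime share
    k = int(np.searchsorted(cum, share)) + 1            # places required
    return k / len(c)

# Toy per-place incident counts for ten micro-places.
counts = [50, 20, 10, 5, 5, 4, 3, 2, 1, 0]
frac = concentration(counts, share=0.5)  # -> 0.1: 10% of places hold 50% of crime
```

In Weisburd's terms, when this fraction stays within a narrow band across cities and years, crime is said to obey the law of crime concentration in places.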
Hotspot policing has received substantial interest and criticism, such as the claim
that crime displacement is a direct consequence of the former (REPPETTO, 1976). Conversely, according to Weisburd et al. (2006), the idea of inevitable crime displacement has
been questioned because displacement is rarely total, and most of the time irrelevant.
Moreover, the author argued that focused patrolling leads to the diffusion of "other benefits not related to crime" and also to discouraging criminals. Another criticism of
hotspot policing is that most hotspots are related to poverty and race issues, hence increasing inequality and even creating an environment of lowered police legitimacy (KOCHEL,
2011). In addition, Rosenbaum (2006) suggested that most police activity in hotspots is
enforcement-oriented and that aggressive strategies can increase negative contact with
citizens, mostly where perceptions of crime tend to be worse (GAU; BRUNSON, 2010). Al-
though reporting some short-term adverse effects, and pointing guidelines on minimizing
the latter, Kochel and Weisburd (2017) presented experimental results showing that there
is no long-term harm to communities’ public opinion when supported by continuous policing. However, the impossibility of recording every crime in databases remains an unsolved
problem, and victims affected by this data-gathering limitation will often continue to be ignored
by law enforcement (MOSES; CHAN, 2018).
In contrast to the criticism discussed above, police scholars have argued that hotspot
policing is a model for police innovation (WEISBURD; BRAGA, 2006). Further, in the pursuit
of innovation in hotspot policing, predictive algorithms have been used to support more precise estimators (MOHLER et al., 2015). According to Gorr and Harries (2003), crime forecasting stopped being considered infeasible at the beginning of the 2000s, after
a major success of crime mapping systems, when the US National Institute of Justice (NIJ) awarded five grants for studies to extend the accuracy of short-term forecasts. The aim was to estimate spatial crime concentration precisely, as a first step towards practical intervention. A decade later, the term predictive policing was coined and became a trend, reflecting "the application of analytical techniques to identify promising targets for police intervention and prevent or solve crimes", according to Perry (2013). From another perspective, Ratcliffe (2015) suggests that predictive policing involves "the use
of historical data to create a spatiotemporal forecast of areas of criminality or crime hot
spots that will be the basis for police resource allocation decisions with the expectation
that having officers at the proposed places and time will deter or detect criminal activity".
Still, predictive policing may require practical policing planning, in contrast to the role
of spatial crime forecast per se, as discussed by Gorr and Harries (2003).
Indeed, few studies have assessed the effect of predictive hotspots against traditional
crime mapping with GIS to reduce crime incidence (HUNT; SAUNDERS; HOLLYWOOD,
2014), and Moses and Chan (2018) suggested that there are two ways of evaluating a
predictive policing solution. The first is by reporting the "drops in particular categories
of crime in particular jurisdictions employing its software" and the second by measuring
the "predictive accuracy of particular tools". One may argue that the former is closer to the predictive policing definition and the latter to spatial crime forecasting analysis. Some studies have evaluated both: e.g., Hunt, Saunders and Hollywood (2014)
showed a null effect of applying predictive hotspot policing, attributing an important role to patrol program implementation failures and the low statistical power of the tests
conducted. On the other hand, an experiment in the Los Angeles Police Department
(MOHLER et al., 2015) showed promising results on both incidence decrease in particular
categories of crimes and predictive performance. In their study, models predicted 1.4
to 2.2 times better than trained crime analysts (accuracy evaluation), leading to 7.4%
crime reduction, compared with 3.5% for the baseline treatment (impact evaluation). The baseline treatment let crime analysts manually assign a risk value to the delimited regions based on their own knowledge.
In consultation with the Natal Police Department, we note that they use a strategy
for estimating hotspots using historical data. They apply a spatial aggregation using data
from the previous month to plan the next patrol. Such a methodology assumes that the
immediate past may indicate a good measure of the future, and has already been used in
related studies (BROWN; OXFORD, 2001; COHEN; GORR; OLLIGSCHLAEGER, 2007).
2.2 Spatiotemporal Modelling
To deliver accurate predictions, a hotspot prediction analysis must be based on solid
spatiotemporal modeling. Previous studies (ECK et al., 2005; CHAINEY; TOMPSON; UHLIG,
2008) have reported different methods of aggregating crime spatially, using crime mapping
methods. Here, we review two such methods, namely KGrid and KDE. We then discuss
strategies for representing temporal patterns using time series decomposition methods
and how the results of such transformation can be useful to map crimes in a rich set
of spatiotemporal variables. To complement them, ancillary geographic variables can be
derived from OpenStreetMap data of the city, and we discuss an alternative strategy to
do so.
2.2.1 Crime Mapping Methods
The literature of criminal spatial analysis commonly refers to the procedures of di-
viding the city into subregions and aggregating criminal events as crime mapping. This
aggregation creates the relationship between a subregion and a crime incidence level. In
a prediction task, this is the first step to generate the feature set, composed of crime incidence levels for each place in each time interval of available data.
Describing several techniques, Eck et al. (2005) discuss their qualitative pros and cons. Some of these methods divide the city based on the distribution of criminal events (e.g. forming spatial ellipses) and others based on regularly spaced geometries (e.g. rectangles or points). The forms of aggregating crimes may be divided into (1) counting crimes within
the area bounded by each subregion, and (2) calculating the weighted sum of the crimes
by their distances to the centre of each subregion. In this work, we compare two crime
mapping methods, KGrid and KDE, when translating crime events into features to fit
machine learning models.
KGrid Techniques that divide crime into spatial ellipses assume that crime clusters spatially into geographic units over a time window. One way to build these groups of places is by clustering. Recently, Borges et al. (2017) proposed a division of the city based on the construction of convex polygons drawn from the spatial grouping of criminal events. This technique, called KGrid, consists of applying the K-Means algorithm to the crimes' location attributes to define a grid. In Figure 1, we present an illustration of
crime mapping in KGrid. The spatial aggregation of crimes is made by counting events
Figure 1: An illustration of KGrid. The events are clustered using the K-Means algorithm. Each cluster has external edges, which form a convex polygon. These polygons are the topological separation of the city into subregions or cells of a grid. By aggregating the count values in each cell, one can map hotspots.
within the polygons, also called grid cells. The authors also mention that this method has the advantage of considering the topology of criminal incidence when projecting the regions, and that the resolution can be adjusted to meet different analytic demands. The parameter K controls this resolution and can affect the performance of the algorithms, as
verified by Araujo et al. (2018).
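As an illustrative sketch of the KGrid idea (not the implementation of Borges et al. (2017); the coordinates and the value of K below are synthetic), the mapping can be expressed with scikit-learn's K-Means:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic crime locations (longitude, latitude); in practice these would
# come from the police records discussed in Chapter 3.
rng = np.random.default_rng(0)
coords = rng.normal(loc=[-35.2, -5.8], scale=0.05, size=(500, 2))

# KGrid: K-Means over event coordinates defines K grid cells; each event is
# assigned to the nearest centroid, and counting events per cell yields the
# crime incidence level of that cell.
K = 10
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(coords)
cell_ids = kmeans.labels_
incidence = np.bincount(cell_ids, minlength=K)

assert incidence.sum() == len(coords)  # each event falls in exactly one cell
```

The incidence vector per time window then becomes the target (and lagged feature) values of each grid cell.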
Kernel Density Estimation (KDE) Among crime mapping methods, there is evi-
dence that KDE is the most appropriate technique for hotspot mapping (CHAINEY; TOMP-
SON; UHLIG, 2008). The practical reasons for that are the visual effect of meaningful heatmaps and the inherent spatial correlation considered in the data aggregation (HART; ZANDBERGEN, 2014). As illustrated in Figure 2, it consists of creating a grid of points regularly spaced and applying a kernel function to return a density estimation
for each point. Equation 2.1 formally defines it for bidimensional analysis, where h is the bandwidth parameter of the kernel function K, and d_{x,y}(i) is the distance between incident i and the centre of the grid point described by its coordinates x, y. To
analyze it temporally, one can (i) add the third dimension to the formula, or (ii) generate
a density estimation for each time window, forming a time series.
f(x, y \mid h) = \frac{1}{n h^2} \sum_{i=1}^{n} K\!\left(\frac{d_{x,y}(i)}{h}\right) \qquad (2.1)
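Equation 2.1 can be computed directly; the sketch below assumes a Gaussian kernel and synthetic event coordinates (the function and variable names are illustrative):

```python
import numpy as np

def kde_at_point(x, y, events, h):
    """Density at grid point (x, y) per Eq. 2.1: f = (1/nh^2) * sum K(d_i/h),
    with K a standard bivariate Gaussian kernel and d_i the distance from
    event i to the grid point."""
    n = len(events)
    d = np.hypot(events[:, 0] - x, events[:, 1] - y)
    K = np.exp(-0.5 * (d / h) ** 2) / (2 * np.pi)
    return K.sum() / (n * h ** 2)

# Synthetic events concentrated around the origin.
rng = np.random.default_rng(1)
events = rng.normal(size=(200, 2))
density = kde_at_point(0.0, 0.0, events, h=0.5)
```

Points near the mass of events receive higher densities than distant ones, which is the distance-decay behaviour discussed above.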
Figure 2: An illustration of Kernel Density Estimation and its parameters. Each bold point (grid cell) represents an arbitrary place at which a kernel function computes a density estimate within a bandwidth. For a set of events/points, this procedure returns an array of KDE values indexed by the cell identifiers.
As discussed, KDE presents user-defined parameters to be configured. Among them
are the kernel function, its bandwidth and the grid resolution. Previous studies have in-
vestigated the effect of these parameters on crime hotspot mapping precision (CHAINEY,
2013; HART; ZANDBERGEN, 2014), suggesting that kernel and bandwidth are the most rel-
evant factors to be analyzed. Figure 3 illustrates that selecting an appropriate bandwidth
has severe implications for crime incidence representation when using KDE. We can see
that when bandwidth is equal to 1 mile, it results in an underfitting situation and the
bandwidth of 0.01 mile gives an overfitting distribution.
A simple way to select bandwidth is to use Silverman’s rule of thumb (SILVERMAN,
2018). However, it is only applicable to estimations using the Gaussian kernel. By ana-
lyzing both bandwidth and kernel, Hart and Zandbergen (2014) showed that Gaussian
kernels are not the optimal choices for mapping crime occurrences, suggesting linear kernels instead. To retrieve the parameter combination that maximizes the estimated likelihood, Mohler et al. (2011) suggest running a grid search with cross-validation.
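Such a search can be sketched with scikit-learn, whose KernelDensity estimator scores held-out points by their total log-likelihood; the data and parameter ranges below are illustrative, not those of the cited studies:

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Synthetic event coordinates.
rng = np.random.default_rng(2)
events = rng.normal(size=(300, 2))

# Cross-validated search over kernel and bandwidth: each fold fits a KDE on
# the training split and scores the log-likelihood of the held-out split, so
# the best estimator maximizes out-of-sample likelihood.
search = GridSearchCV(
    KernelDensity(),
    param_grid={"kernel": ["gaussian", "linear"],
                "bandwidth": [0.1, 0.3, 0.5, 1.0]},
    cv=5,
)
search.fit(events)
best = search.best_params_
```

The selected kernel/bandwidth pair is then reused to build the density features for every time window.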
2.2.2 Time Series Decompositions
Temporal patterns have been explored by the crime prediction literature since researchers started to forecast crime one period ahead, which Gorr and Harries (2003) date to the 1990s. Roughly speaking, when humans want to predict something, they look for
Figure 3: KDE results for different bandwidth values. The clear difference in resolution is observed between 0.1 and 0.01 miles (bottom). We observe underfitting in the top-left and overfitting in the bottom-right situations.
different items in the past and estimate what they think will happen next. The forecasting
methods we discuss here were based on a similar intuition, using past observations (lags) of
time series to estimate a value one or more steps ahead. In this sense, autoregressive (AR)
models were extensively used, and Brown and Oxford (2001) suggested them as suitable
methods for baseline comparison. In a more comprehensive formulation, autoregressive
integrated moving average (ARIMA) models were proposed to deal with non-stationary
series (BOX et al., 2015). Besides extracting p AR lags, ARIMA works with q moving average components (smoothed versions of the original series) and d series differencings (see Eq. 2.2), which are much more likely to be stationary. After decomposing the series into these components, Box et al. (2015) suggest applying parameter estimation using non-linear methods.
Y_d(t) = Y(t) - Y(t-1) \qquad (2.2)
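The differencing in Eq. 2.2 is a one-line operation; a small sketch on a synthetic trended series:

```python
import numpy as np

# Non-stationary series: a linear trend plus a bounded oscillation.
t = np.arange(24)
Y = 100 + 3 * t + np.sin(t)

# First-order differencing, Y_d(t) = Y(t) - Y(t-1) (Eq. 2.2).
Yd = np.diff(Y)

# Differencing removes the linear trend: the mean of Yd is close to the
# slope (3) and its variance is far smaller than that of Y.
assert Yd.var() < Y.var()
```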
Seasonal-trend decomposition by loess (STL) (CLEVELAND et al., 1990) was also used
in forecasting models for crime prediction (BORGES et al., 2017; MALIK et al., 2014). By rep-
resenting series in an additive configuration of trend, seasonality and residuals (Equation
2.3), this decomposition method can reveal other temporal patterns. We depict seasonal
and trend components of an illustrative time series in Figure 4.
Y(v) = T(v) + S(v) + R(v) \qquad (2.3)
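To illustrate the additive form of Eq. 2.3, the sketch below uses a naive decomposition (a moving-average trend and a mean seasonal profile); STL proper replaces both with loess smoothing, and all data here are synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic monthly crime counts: trend + 12-month seasonality + noise.
rng = np.random.default_rng(3)
n, period = 48, 12
t = np.arange(n)
series = pd.Series(100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / period)
                   + rng.normal(scale=2, size=n))

# T(v): centered moving average; S(v): mean of the detrended values per
# month of the year; R(v): whatever remains.
trend = series.rolling(period + 1, center=True).mean()
detrended = series - trend
seasonal = detrended.groupby(t % period).transform("mean")
residual = series - trend - seasonal

# Y(v) = T(v) + S(v) + R(v) holds exactly wherever the trend is defined.
recon = (trend + seasonal + residual).dropna()
assert np.allclose(recon, series[recon.index])
```

Each component (and its lags) can then be fed to the predictor as a separate feature.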
Figure 4: Seasonal and trend components of a time series. In blue, the original series shows a varying behaviour which can be further explained by a trend (in black) and a seasonality (in orange). The trend follows the moving average of the series and the seasonality represents the cyclical aspect of the original series as a monthly oscillation.
Further, temporal modelling that uses previous observations requires an adequate
selection of the number of lags to be considered. In the case of ARIMA components, we did
not find an optimal methodology behind the selection of its lags, only practical guidelines
considering autocorrelation functions, but the seminal study of Box et al. (2015) suggested prioritizing parsimony (less complex models) to avoid overfitting. It is reasonable to think that lag selection should depend on the time series sample frequency. STL theorists recommend that the number of lags of the seasonal component be related to the time series sample frequency, e.g. taking 12 lags in monthly series (CLEVELAND et al., 1990). The underlying assumption is that the 13th month is highly correlated with the 1st, so the features introduced beyond that point would not add seasonal information in proportion to the complexity of adding one more feature. Still, while fewer features may keep sufficient information, we do not know which specific lags contribute most to each time series: one crime type might benefit most from the first three lags, while another benefits most from the last ones. To address this, we discuss later in this chapter feature selection techniques
based on machine learning.
2.2.3 Geographic Features
After applying crime mapping and extracting spatiotemporal features from crime
points, other secondary factors can help to explain the urban places from which the subregions were derived. Related studies have proposed joining exogenous information to help in crime prediction tasks, such as social media traffic (GERBER, 2014), demographic aspects (BOGOMOLOV et al., 2014) and the geographic location of PoI data (LIN; YEN; YU, 2018). We argue that these strategies are difficult to reproduce in an arbitrary crime prediction application, since data availability can be a problem. Among the three types of information mentioned, we find that the latter can be acquired through volunteered geographic information systems such as OpenStreetMap. Thus, our secondary set of features, namely geographic features, will be derived from the PoI data available on OpenStreetMap to ensure greater reproducibility potential across other cities.
To identify relevant PoI categories that help describe hotspots, we conducted an opinion survey in Natal's police department (SESED/RN) with police officers who work on patrol. A total of 54 interviewed officers were asked to assign a value for the relative importance (an integer between 1 and 5) of different PoI categories that may spatially explain the incidence of three crime types: property crimes (CVP in the nomenclature of Natal's Police Department), violent or lethal crimes (CVLI) and drug-related crimes (TRED). We instructed them to assign 5 to the items that most contribute to (attracting or repulsing) crime incidence and 1 to items that, in their opinion, have no influence. The results are
shown in Figure 5.
Figure 5: Results of the survey with 54 police sergeants rating different landmarks and demographic aspects for determining hotspots. At the top, the rating for property crimes (CVP); in the middle, for lethal or violent crimes (CVLI); and at the bottom, for drug-related crimes (TRED).
The officers believe "Gangs location", "Street lighting level" and "Public squares" are crucial aspects for the three crime categories. Particularly for TRED crimes, "Schools" and "Touristic places" arise as essential features. Demographic elements, such as the neighborhood's population and its per capita income, are other aspects they regard as relevant when describing dangerous places. Another interesting finding is that residential streets are highlighted for CVP crimes, as found in related work (DAVIES; JOHNSON, 2015). From our perspective, the results of the questionnaire do not necessarily reflect the aspects that determine which places are dangerous, but they give us a direction for selecting PoI categories among the broad number of OpenStreetMap features. We must also note that the purpose of the survey was not to compare the officers' opinions with algorithm results, but to consult their opinions regarding geographic risk factors and then model features considering data availability.
Related works have considered geographic features to model crime incidence using
PoI data aggregation. Caplan and Kennedy (2011) suggested that the density of some
facilities in city blocks, considering a bandwidth, can represent their spatial concentration.
They also indicated that, for violent crimes, the distance to the closest facilities may be another correlated spatial pattern. Wang, Brown and Gerber (2012) have used both
mentioned methods, counting PoI within city blocks and taking the distance from the
city block center to the closest PoI, generating spatial information regarding each layer of
PoI. Differently, Lin, Yen and Yu (2018) used the counting strategy but also weighted neighboring city blocks to increase the spatial autocorrelation of their aggregation.
We argue that these approaches can be adapted, using KDE, to capture both density and distance decay in a single variable (example in Figure 6), as we will discuss later. In the next chapter, we also present our approach to selecting the subset of PoI to be included in the predictions of each crime type, considering the particular correlations that each facility type may present.
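A sketch of this single-variable adaptation, assuming scikit-learn's KernelDensity and hypothetical PoI and grid coordinates:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical PoI layer (e.g. schools) and grid cell centres on a toy map.
rng = np.random.default_rng(4)
schools = rng.uniform(0, 10, size=(40, 2))
cells = np.array([[x, y] for x in range(10) for y in range(10)], dtype=float)

# One KDE per PoI layer: each cell gets a single value that encodes both
# how many facilities are nearby (density) and how far they are
# (distance decay through the kernel).
kde = KernelDensity(kernel="linear", bandwidth=2.0).fit(schools)
school_feature = np.exp(kde.score_samples(cells))  # one density per cell
```

This static layer is then joined to the spatiotemporal features of each grid cell.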
2.3 Machine Learning
In the previous section, we explored the methods behind translating crime events
and PoI locations into independent variables (features) for describing crime incidence in
space and time. We mentioned that the crime mapping method provides the spatial aggregation and that, given a temporal sample frequency, one can generate time series for each grid cell. Further, to extract a more diverse set of temporal patterns (such as trend and seasonality), we described time series decomposition methods, which we argue
Figure 6: Geographic feature layers from Natal generated using KDE of residential streets (left) and schools (right). Note that residential streets are denser and concentrated in the north of the city, but still widespread elsewhere. Schools are more concentrated in the center of the city, extending to the south, but with some concentration in the north.
are also useful as feature extraction methods. To complement the features with external information, we suggested using OpenStreetMap data and extracting geographic features based on PoI density, instead of the current practice of related studies. Nonetheless, the prediction methodology has not been discussed yet.
To forecast crime incidence levels several periods ahead using spatiotemporal features, supervised machine learning methods have emerged as efficient tools in many recent related studies (LIN; YEN; YU, 2018; VOMFELL; HÄRDLE; LESSMANN, 2018; ZIEHR, 2017; ARAUJO et al., 2017; BORGES et al., 2017). Within this class of heuristic algorithms, there is a set specifically designed to learn relationships between a group of inputs or features X and an output or target variable y. These algorithms are called supervised because they
iteratively adjust internal weights to minimize the error between the predictions and the
actual value. This process is called training and involves adjusting internal parameters
using features extracted from the dataset. After training a model using an algorithm, one
can use it to predict values for new data.
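The train-then-predict cycle can be sketched with scikit-learn; the lag-based feature layout below is illustrative, not the framework's exact schema:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic training set: each row is a grid cell at a time step, the
# features are lagged incidence values (X) and the target is the
# next-step incidence (y).
rng = np.random.default_rng(5)
lags = rng.poisson(lam=4.0, size=(300, 3)).astype(float)
y = lags.mean(axis=1) + rng.normal(scale=0.5, size=300)

# "Training": the algorithm adjusts internal parameters to minimize the
# error between its predictions and the actual values.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(lags, y)

# After training, the model predicts incidence for unseen feature rows.
preds = model.predict(rng.poisson(lam=4.0, size=(5, 3)).astype(float))
```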
In crime prediction studies, researchers have used many supervised machine learning
algorithms, such as Support Vector Machines (YU et al., 2011), Random Forest (VOMFELL; HÄRDLE; LESSMANN, 2018), Multilayer Perceptron (ARAUJO et al., 2018), approaches based on Deep Neural Networks (LIN; YEN; YU, 2018), and several others. To the best of our
knowledge, there is no consensus on the best algorithm for crime prediction tasks. In this work, we do not intend to search for the best algorithm among those mentioned; rather, we aim to evaluate how much performance can vary across different algorithms and different crime mapping approaches. In our experiments, we consider Random Forest and Gradient Boosting for two reasons. First, they were empirically suggested as efficient algorithms in crime prediction studies, respectively by Borges et al. (2017) and by Vomfell, Härdle and Lessmann (2018). Second, the two are similar to each other, being ensemble algorithms based on Decision Trees, i.e. they are constituted by a finite set of Decision Trees combined to provide a better prediction. The
assumption behind ensemble algorithms is that a group of weak learners forms a stronger
one (BREIMAN, 2001).
Still, each of these ensemble algorithms has its own way of combining learners, namely "bagging" in Random Forest and "boosting" in Gradient Boosting. In bagging, a model randomly chooses subsets of the data, with replacement, as training samples for the Decision Trees, fits them, and then returns the average of their predictions. In addition to this process, the so-called bootstrap aggregation, the Random Forest algorithm trains each of its trees with different features, randomly selected for each one. On the other hand, in boosting a model incrementally adds new learners, updating the weights (using gradient descent in the case of Gradient Boosting) of the samples with more mispredictions (VOMFELL; HÄRDLE; LESSMANN, 2018).
Besides the algorithm choice, there are other concerns when applying machine learning
for modelling hotspot predictions. First, supervised machine learning is often distinguished
between classification and regression tasks, and in crime analysis, this can change the
output considerably, as we will discuss. Second, we present feature selection methods, also based on machine learning, to filter the lag components of each temporal decomposition, as well as the PoI data layers, retaining those most important for a more accurate estimation of each type of crime.
2.3.1 Prediction Task
A prediction task is defined according to the target variable, which can be a class (hotspot or coldspot) or ordinal values (crime incidence levels). Previous work implemented both classifiers (BOGOMOLOV et al., 2014; ARAUJO et al., 2018; LIN; YEN; YU, 2018) and regressors (MALIK et al., 2014; BORGES et al., 2017; ARAUJO et al., 2017) to estimate dangerous places in the future. From our perspective, the latter strategy is more
appropriate, since classifying a place as a hotspot or not, or even as "low", "medium"
and "high" dangerousness, must follow an aggregation on an ordinal hierarchy derived
from crime incidence values. Thus, aggregating this value into classes would hide inherent
variance present in the data.
Another concern related to such aggregation is that, depending on the particular quantitative definition of a hotspot (e.g. more incidences than the average of the last four observations), class imbalance may be a problem (ARAUJO et al., 2018). A hotspot threshold may delimit only a few samples, leaving one class heavily overrepresented. Also, much of the sample variance is lost in this discretization. On the other hand, modelling regressors on highly variant samples requires further inspection and outlier filtering to prevent the model from making biased predictions. For instance, if samples are concentrated in lower values, the model will prefer to predict lower values to get better overall performance. We argue that the crime mapping parameter selection described in Section 2.2 is crucial to obtain parsimonious target variables and, consequently, more efficient models. Thus, the choice of the prediction task involves a trade-off between samples biased by the choice of a threshold and highly variant samples when considering raw crime incidence levels.
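The discretization trade-off can be seen on synthetic, right-skewed counts (the threshold below is illustrative):

```python
import numpy as np

# Synthetic, right-skewed crime counts per grid cell (Poisson-like data).
rng = np.random.default_rng(6)
counts = rng.poisson(lam=1.0, size=1000)

# Classification view: a cell is a "hotspot" when its count exceeds an
# illustrative threshold. The discretization hides within-class variance
# and can leave the positive class heavily under-represented.
threshold = 1
hotspot = counts > threshold
imbalance = hotspot.mean()   # fraction of positive (hotspot) samples
```

With skewed counts, the positive class is a small minority of the samples, which is exactly the imbalance problem discussed above.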
2.3.2 Feature Selection
As discussed, the selection of temporal lags and PoI layers leads to a leaner representation of spatiotemporal and geographic patterns. For instance, the trend component extracted from crime time series tends to be more correlated with the first lags, and the seasonal component with the last ones. Also, hospital density may be relevant to predict burglary, but not violent crimes. When a model uses many variables to predict a target, the fitting process becomes more complex; the model may end up with many more parameters than informative inputs, resulting in a lack of stability and overfitting (VERLEYSEN; FRANÇOIS, 2005). Sometimes adding variables can even disturb algorithm performance, because the algorithm will try to fit variables that are mere noise with respect to the predicted variable. Therefore, applying feature selection is an essential step to ensure the models use only the most informative of the available variables.
The feature selection task can be performed by machine learning algorithms in three different ways: using wrapper, embedded or filter methods. Wrapper methods combine a search strategy with a predictor to select the best subset of features, training a machine learning algorithm on randomly taken subsets of features and producing a set of models. The model with the best performance identifies the set of features to be selected. Embedded methods differ from wrapper methods because they analyze the model structure instead of its performance. They consider the weights assigned by the predictor to each feature as a measure of importance, excluding the least important ones. Finally, filter methods consider feature importance by using a correlation measure (e.g. χ2) with the target variable, and also calculate feature-to-feature correlation to avoid redundancy.
In this work, we followed the suggestion of Kniberg and Nokto (2018), who systematically evaluated several feature selection algorithms and provided useful guidelines. Although they did not identify a single algorithm that excels in both runtime and predictive performance, they suggested that a Decision Tree modelled as an embedded method gives reasonable results in both aspects. The idea of this embedded method is to calculate feature importance by applying random permutations to each feature and measuring the average performance drop when fitting the Decision Tree. Features with the lowest loss are assumed to be unimportant, because they do not influence the predictions as much as the others.
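A sketch of this permutation-based embedded selection, using scikit-learn's permutation_importance with a Decision Tree on synthetic features (only the first two actually drive the target):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.inspection import permutation_importance

# Synthetic features: only the first two drive the target; the last two
# are pure noise (stand-ins for irrelevant lag or PoI features).
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 4))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=400)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Randomly permute each feature and measure the average performance drop;
# features whose permutation barely hurts the score are deemed unimportant.
result = permutation_importance(tree, X, y, n_repeats=10, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]  # most important first
```

The least important features at the tail of the ranking are the candidates for exclusion from the final feature set.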
3 The Predspot Framework
Given that data-driven predictive analysis varies according to the developers’ exper-
tise, one can find different methodologies and frameworks to predict crime hotspots. For
instance, Malik et al. (2014) divided the stages of their processing into (1) geospatial division into subregions, (2) generation of time series, (3) prediction and (4) visualization of
results. Similarly, Lin, Yen and Yu (2018) proposed to (1) create a grid, (2) intersect the
grid in the city map, (3 to 6) extract data and grid features, (7) train a machine learning
algorithm and (8) assess the latter. The similarities across methodologies motivated us to
pursue a more generalized approach, namely Predspot, which we discuss in this chapter.
In previous work, we proposed a framework detailing the steps of spatiotemporal
modelling and machine learning for crime hotspot prediction (ARAUJO et al., 2018). Our
purpose was to improve the tasks’ transparency and the parameter selection involved.
In this chapter, we introduce a redesign of this framework to include a more generic approach for applying efficient crime mapping and a more detailed feature ingest procedure. We agree with Domingos (2012), who suggests that the success of a machine learning solution lies in feature engineering efforts. The framework is divided into two phases, namely
model selection and prediction service, analogously to the training and prediction steps of
machine learning algorithms. Each phase has its steps to achieve the final goal. This divi-
sion has the purpose of differentiating the model adjustment and its usage in operational
policing software.
Furthermore, in this chapter, we explain how our methodology was implemented.
We detail our python-package software to support model selection operations and a web
service interface to illustrate how the prediction service can be managed. These elements
shall guide the software routines involved in deploying the Predspot framework in a police department.
3.1 Model Selection
As discussed in Section 2.3, supervised machine learning algorithms need a training step to adjust internal parameters and to predict based on the patterns found in the training data. In this section, we overview how to prepare the dataset, apply a crime mapping method, extract features from time series and fit a model considering hyperparameter tuning. This workflow comprises three steps, namely "dataset preparation", "feature ingest" and "machine learning modelling". These three steps compose the so-called model selection phase, whose purpose is to train, evaluate and save an efficient model to be used operationally. We describe each step, providing explanations of the necessary inputs and parameters to be configured, as well as the corresponding workflow and how to evaluate the selected model. It is worth mentioning that the evaluation of the model selection phase is limited to assessing the predictions of the adjusted model. Thus, it is measured in terms of error or accuracy ratios and is not directly concerned with the practical impact of policing operations.
3.1.1 Dataset Preparation
From a systemic point of view, the inputs of the model selection phase are a vectorized
file of city map (e.g. shapefile), a sufficiently large database of crime records provided by
the police department, and auxiliary data sources. To apply the procedures, the crime
records must have been registered at least with latitude, longitude, timestamp and crime
type. Also, in the Predspot framework, we propose using OpenStreetMap as the auxiliary
data source. It provides data of Points-of-Interest (PoI) of many cities in an open-source
manner, thus increasing the reproducibility potential of our approach. The data can be
extracted through the Overpass API.¹
The first processing step of model selection is "dataset preparation", illustrated in
Figure 7. It concerns (1) loading the data from Crimes DB and external sources, (2)
applying spatial filters using the City Shape and (3) separating crime data into Crime
Scenarios, according to a division of crime types. The spatial filtering process consists
of applying a spatial join operation taking georeferenced data of crimes and PoI that
are within the city boundaries. It is also important to drop duplicate records and to check for "default" location values, to which events without proper registration are improperly assigned. Since crime data is mostly acquired through human interactions, spatial bias can arise (KOCHEL; WEISBURD, 2017). We argue that exploring the dataset beforehand to clean invalid records is crucial to avoid harming the model with invalid data.

¹https://wiki.openstreetmap.org/wiki/Overpass_API
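A simplified sketch of such cleaning with pandas; the column names, placeholder location and bounding box are illustrative, and a real pipeline would use the city shapefile with a spatial join instead:

```python
import pandas as pd

# Toy crime records; in practice these come from the Crimes DB. The column
# names are illustrative, not the department's schema.
crimes = pd.DataFrame({
    "lat":  [-5.79, -5.81, -5.81, 0.0, -5.83],
    "lon":  [-35.20, -35.21, -35.21, 0.0, -35.25],
    "timestamp": pd.to_datetime(
        ["2019-01-01", "2019-01-02", "2019-01-02",
         "2019-01-03", "2019-01-04"]),
    "crime_type": ["burglary", "burglary", "burglary",
                   "robbery", "robbery"],
})

# (1) drop exact duplicates, (2) drop "default" (0, 0) placeholder
# locations, (3) keep only points inside a bounding box around the city.
clean = crimes.drop_duplicates()
clean = clean[(clean.lat != 0) | (clean.lon != 0)]
clean = clean[clean.lat.between(-6.0, -5.6) & clean.lon.between(-35.4, -35.0)]
```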
Figure 7: Model selection begins with loading and preparing the datasets. Required data sources include a crime database for model training and a connection to the OpenStreetMap API to load PoI data. Also, the city's shapefile is important for filtering the data registered within its borders.
In addition, to manage the separation of crime scenarios, it is important to follow
the division made by the local authorities presented in the data. For example, if the
police department is concerned with burglary crimes and the data has many types of
burglaries, the aggregation of all burglary types must be made carefully and aligned
with the police department's opinion. Otherwise, we suggest following the default division
of the data rigorously. Supported by empirical evidence (ANDRESEN; LINNING, 2012), we
argue that it is not appropriate to aggregate crime types into a broader category. The
sum of spatial contributions of different sources of crime, such as residential burglary and
drug-related offenses, can generate areas in the middle of these two types of events where
no crime happens at all. Also, Natal's police department suggested dividing the data
into daytime and nighttime scenarios for each crime type, since different patterns can
arise. In the following steps of Predspot, we use each of these crime scenarios (crime type
and day period) as separate datasets.
Regarding PoI data extraction, one can easily download data from OpenStreetMap
querying from the Overpass API or using the web-based tool Overpass Turbo.2 The
data is categorized into map features: streets, traffic signs, and intersections belonging
to the "highway" category; hospitals, schools, restaurants and other facilities belonging
to "amenity"; other interesting categories are "leisure", "tourism" and "nature".3 Still,
choosing PoI categories may not be a trivial task, and we argue that it can be done by
analyzing geographic risk factors with the help of policing experts, as presented in Section
2.2.3, or by analyzing related studies. For example, Davies and Johnson (2015) argue that
the street network showed relevance in their crime prediction study. The selection of the
most relevant PoI layers is discussed later in this chapter as a feature selection problem.

2 https://overpass-turbo.eu/
3 https://wiki.openstreetmap.org/wiki/Map_Features
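As an illustration of the PoI extraction discussed above, the sketch below assembles an Overpass QL query for one map feature within a bounding box. The endpoint URL, the bounding box around Natal and the helper name `build_poi_query` are illustrative assumptions, not part of the Predspot package.

```python
# Sketch of querying PoI data from the Overpass API; names and the
# bounding box are illustrative assumptions.
import urllib.parse

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def build_poi_query(category, value, bbox):
    """Build an Overpass QL query for PoIs of a given map feature
    (e.g. amenity=school) within a (south, west, north, east) bbox."""
    s, w, n, e = bbox
    return (
        "[out:json][timeout:60];"
        f'node["{category}"="{value}"]({s},{w},{n},{e});'
        "out center;"
    )

# Example: schools inside a bounding box roughly around Natal, Brazil.
query = build_poi_query("amenity", "school", (-5.92, -35.30, -5.70, -35.15))
request_url = OVERPASS_URL + "?" + urllib.parse.urlencode({"data": query})
```

The resulting URL can be fetched with any HTTP client; the JSON response lists one node per PoI with its coordinates.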
3.1.2 Feature Ingest
Before the "feature ingest" step begins, it is necessary to choose the spatial and
temporal units of predictions. First, the choice of the spatial unit of analysis can be made
according to police department patrolling policies, or according to current practices of
crime mapping. If the police department wants predictions by neighborhood, the grid is
made up of neighborhoods in the city. If there is no such restriction and spatial resolution
is a priority, artificial grids from a crime mapping method may be a suitable alternative.
For example, KDE works with a grid of points, and KGrid with a set of convex hull
polygons. Artificial grids have the advantage of configuring spatial resolution through a
parameter, for example K in KGrid or the spacing between grid points in KDE. As
a drawback, high-resolution grids generate sparser time series that are more challenging to
predict (MALIK et al., 2014; BORGES et al., 2017). Therefore, one cannot merely increase
the grid resolution without further inspecting the results. On the other hand, the second
choice is the temporal sample frequency. Similarly to grid cells, too small aggregation
intervals may also result in sparse time series. Even though police may prefer
hourly sampled predictions, we agree with Malik et al. (2014), who suggested that weekly or
monthly aggregates are more appropriate, depending on the sample size.
With the spatial and temporal units defined, the "feature ingest" step’s workflow con-
sists of the following procedure (illustrated in Figure 8). The first data item necessary is
the set of crime events of a given crime scenario (described as tuples of latitude, longitude
and timestamp). As these events have spatiotemporal attributes, we start the ingest by aggregating
crimes spatially, according to the Grid disposition derived from the Mapping
Method, and then temporally, by sampling time series for each place and time interval,
completing the so-called Spatiotemporal Aggreg. This results in the Time Series Cij,
indexed by the grid cell i and the time interval j.
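The spatiotemporal aggregation can be sketched as below. A naive lat/lon rounding grid stands in for the crime mapping method, and all names and values are illustrative assumptions, not the package's API.

```python
# Sketch of the spatiotemporal aggregation producing the time series Cij:
# crimes are binned into grid cells (a naive lat/lon rounding grid stands
# in for the crime mapping method) and resampled monthly per cell.
import pandas as pd

def aggregate(crimes, cell_size=0.01, freq="MS"):
    df = crimes.copy()
    df["cell"] = ((df["lat"] // cell_size).astype(int).astype(str) + "_"
                  + (df["lon"] // cell_size).astype(int).astype(str))
    df["t"] = pd.to_datetime(df["timestamp"])
    counts = (df.groupby(["cell", pd.Grouper(key="t", freq=freq)])
                .size().rename("count").reset_index())
    # Pivot to a (time interval j) x (grid cell i) matrix, zero-filled.
    return counts.pivot(index="t", columns="cell", values="count").fillna(0)

crimes = pd.DataFrame({
    "lat": [-5.81, -5.81, -5.79], "lon": [-35.21, -35.21, -35.20],
    "timestamp": ["2018-01-03", "2018-02-10", "2018-01-20"]})
C = aggregate(crimes)  # two months x two cells, three events in total
```

Each column of `C` is one series C1j, C2j, ..., ready for the temporal feature extraction described next.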
Then, we use these time series to start the Temporal Feature Extraction, illus-
trated in Figure 9. In Section 2.2.2 we described two time series decomposition methods,
namely ARIMA and STL. Although each method has its way of setting parameters to
make predictions, we take advantage of the derived components as the basis for the feature
Figure 8: Between data loading and model building, the feature ingest process is responsible for assembling the independent variables and the variable to be predicted. This starts with the spatiotemporal aggregation of crimes, through the crime mapping method and time series manipulation, and the spatial aggregation of PoI data. The grid is a supporting element in this step and will be used as the set of places where criminal incidence will be predicted. Time series Cij extracted from grid cells indicate a set of values from a grid cell i in period j, and PoI features G^k_i represent the density of a PoI category k located in a grid cell i.
extraction process to feed supervised machine learning algorithms. Then, we consider as
learning target the time series of original values one period ahead. Assuming that each
region i of the Grid has particular trend, seasonality and differentiation aspects, we apply
STL and Diff. operations to C1j, C2j, ..., Cnj (n being the size of the Grid) separately.
Finally, we take k lags, according to the temporal sample frequency, to represent past
observations of each component as T^k_ij, S^k_ij, D^k_ij. The objective is to represent the corresponding
crime incidence level of time j (target) with independent variables (features)
expressed as past observations of trend, seasonality and differentiation at times j − 1,
j − 2, ..., j − k.
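A simplified sketch of this extraction is shown below. A rolling-mean trend and a periodic-mean seasonality stand in for STL, and the helper name and toy series are assumptions for illustration only.

```python
# Simplified sketch of the temporal feature extraction: a rolling-mean
# trend and a periodic-mean seasonality stand in for STL, followed by
# differencing and k lagged columns per component.
import pandas as pd

def temporal_features(c, period=12, k=3):
    trend = c.rolling(period, min_periods=1).mean()
    seasonal = (c - trend).groupby(c.index % period).transform("mean")
    diff = c.diff()
    features = {}
    for name, comp in [("T", trend), ("S", seasonal), ("D", diff)]:
        for lag in range(1, k + 1):
            features[f"{name}{lag}"] = comp.shift(lag)  # past observations
    X = pd.DataFrame(features)
    # Target y: the crime incidence one step ahead of the lagged features.
    return X.dropna().join(c.rename("y"))

c = pd.Series(range(30))        # toy time series Cij for a single cell
data = temporal_features(c)     # columns T1..T3, S1..S3, D1..D3, y
```

Each row of `data` pairs the incidence at time j with trend, seasonality and differentiation values observed at j − 1, ..., j − k.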
Further, we propose to complement such temporal representation of crime incidence,
using PoI data to help to describe each place i of the Grid geographically in terms of
the facilities nearby. To do so, it is necessary to apply a Spatial Aggreg. operation.
In Section 2.2.3, we discussed that related studies have considered counting PoI items
within grid cells or measuring the distance to the closest PoI (WANG; BROWN; GERBER, 2012;
LIN; YEN; YU, 2018). We argued that KDE can include both aspects by weighting
items with a kernel that considers distance decay. Therefore, we propose using KDE as
the Spatial Aggreg. method. The objective of such aggregation is to produce PoI
density values G_hi for each region i and each PoI category h extracted.
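The KDE-based spatial aggregation can be sketched as follows: each grid-cell center receives a density value summing Gaussian-weighted contributions from the PoIs of one category. The bandwidth, coordinates and helper name are illustrative assumptions.

```python
# Sketch of the PoI spatial aggregation via KDE: each grid-cell center
# receives a density value summing distance-decayed contributions from
# the PoIs of one category (bandwidth h is an illustrative assumption).
import math

def poi_density(cell_centers, poi_points, h=0.5):
    """Return one density value per cell center (the G_hi values)."""
    densities = []
    for cx, cy in cell_centers:
        total = 0.0
        for px, py in poi_points:
            d2 = (cx - px) ** 2 + (cy - py) ** 2
            total += math.exp(-d2 / (2 * h * h))  # Gaussian distance decay
        densities.append(total)
    return densities

cells = [(0.0, 0.0), (5.0, 5.0)]
pois = [(0.1, 0.0), (0.0, 0.2)]   # e.g. two schools near the first cell
g = poi_density(cells, pois)      # first cell far denser than the second
```

Unlike a plain count within the cell, nearby PoIs just outside the boundary still contribute, with weight decaying with distance.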
Figure 9: The extraction of the independent variables (features) of the time series in Predspot is conducted through trend and seasonality decomposition (through STL) and series differentiation. Each decomposition generates a new series, and the variables are composed of k lags from each of these series.
Ultimately, the "feature ingest" step ends by joining T , S, D and G to produce the
feature matrix X, and the corresponding crime incidence levels target y that can be used
to fit a machine learning model. In Figure 10, we illustrate an artificial example of how
the temporal features can be arranged. Besides, we also consider timestamp attributes,
such as the corresponding year and month, and the geographic features, to compose the
final Feature Set.
[Figure 10 content: a table with columns Place, Time, T1–T3, S1–S3, D1–D3, G1–G5 and target y, showing example feature values for two places over three time intervals.]
Figure 10: An artificial example of temporal features for two places and three time intervals.
3.1.3 Machine Learning Modelling
The input required to start the machine learning modelling step is the feature set.
As we discussed in Section 2.3.3, feature dimensionality can influence the performance
of the predictive algorithm, introducing complexity in adjusting the model weights and
even adding noise to the data. Thus, to filter the features of each crime scenario, feature
selection is necessary before adjusting the models. We propose it to be done as part of
the "machine learning modeling" step because we use machine learning-based strategies
for feature selection, as suggested by Kniberg and Nokto (2018).
With noisy features properly filtered, different supervised learning algorithms can be
adjusted to predict crime hotspots, following the flow illustrated in Figure 11. There is not
a clear consensus on the best machine learning algorithm for predicting crime hotspots.
As in the previous version of our framework, we do not propose the use of a single learning
algorithm, but experimenting with several of them to select the one with the best score
for each crime scenario. In this sense, our framework takes a model-agnostic approach
to use the one that fits better, according to appropriate assessment. Considering that
a substantial volume of machine learning algorithms has recently emerged, we argue
that the search for the best algorithm could displace the focus from the modelling approach
proposed in this study.4
Figure 11: In the machine learning modelling step, the best qualified features are selected and fed into the algorithms. By adjusting various algorithms, we can evaluate the models and use the one with the best predictive performance.
Still, we consider it relevant to use a Tuning strategy to optimize machine learning
hyperparameters. Previous studies have shown that applying it improves algorithm performance
(BERGSTRA; BENGIO, 2012); Olson et al. (2017) suggested that hyperparameter
tuning led to improvements of up to 50% in CV score. In Predspot, as we are training
a model with lag components of time series, we propose using the K-Fold Time Series
Cross-Validator (BERGMEIR; HYNDMAN; KOO, 2018), a variation of K-Fold, for each hyperparameter
configuration experimented. The feature set extracted from the training
data is divided into K folds; the kth split returns the first k folds as the train set and the
(k + 1)th fold as the test set.

4 Automated machine learning (FEURER et al., 2015) has shown to be a promising model-agnostic approach which we will explore in future work.
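The splitting scheme just described can be sketched in a few lines; the function below is an illustrative stand-in (similar in spirit to scikit-learn's TimeSeriesSplit), not the exact validator cited.

```python
# Sketch of the K-Fold Time Series Cross-Validator described above: the
# k-th split uses the first k folds for training and fold k+1 for testing,
# so the test set never precedes its training window.
def time_series_kfold(n_samples, n_folds=5):
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_idx = list(range(0, k * fold_size))
        test_idx = list(range(k * fold_size, (k + 1) * fold_size))
        yield train_idx, test_idx

splits = list(time_series_kfold(60, n_folds=5))
# Each successive training window grows by one fold.
```

During tuning, each hyperparameter configuration is scored over these splits and the configuration with the best average CV score is kept.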
Finally, the selection of the so-called Golden Model is made by considering the one
with the best cross-validation score. This golden model should be saved for operational use
in the prediction phase that we discuss next. Another piece of information that should be saved in
this process is which features were selected for each scenario by the Feature Selection
Algorithm, since the Golden Models were trained with separate Feature Sets. It
is worth recalling that all the steps in "feature ingest" and "machine learning modelling"
should be repeated for each crime scenario separately.
The score retrieved in the model selection phase indicates how good predictions can
be, but it does not represent the effectiveness of patrolling policies that use them. Also,
this score alone does not indicate how beneficial using model predictions is in contrast
to using traditional approaches. To evaluate the models produced, we suggest
adopting a comparison with the naïve approach of single lag autoregressive models, as
considered in related studies (BROWN; OXFORD, 2001; COHEN; GORR; OLLIGSCHLAEGER,
2007).
3.2 Prediction Service
Towards predictive hotspot policing operations, robust software must connect the
crime database to the pipeline we described, and retrieve geographical information on future
crime hotspots. In the model selection phase, we trained an efficient model to generate
predictions. In the next phase, namely prediction service, we describe how the predictions
can be taken for new data using the trained model (the "prediction pipeline" step) and
how this processing can be structured in a conceptual software architecture (the "web
service" step). We provide a high-level description in which the interaction between the
service and the prediction consumer is detailed using the routines we described previously
in this chapter. With the service architecture detailed, we do not aim to provide a highly
scalable processing workflow, but rather to yield a software interface for predicting crime
hotspots. Additionally, we discuss the reasons why a service may be more suitable than a
dashboard tool and indicate prediction output format encoding.
3.2.1 Prediction Pipeline
The process of extracting features and using the trained model to return predictions
for each new period automatically depends on a connection to the crime database. In
contrast to the "dataset preparation" step that uses a static training dataset, the first
task of the "prediction pipeline" step (illustrated in Figure 12) should process incoming
data, loaded from such updated crime database. For each new period, the pipeline must
trigger a data load, incorporating new crime events to the previously stored data. Then,
the same preparations described in the "dataset preparation" phase should be applied,
such as spatially filtering the data and dividing the crimes into crime scenarios.
Figure 12: In the prediction pipeline, one must tailor the model selection steps to return predictions using previously trained models. This means no longer loading the entire dataset, just the crimes from the last period, then extracting temporal features, filtering previously selected features, and requesting predictions from the trained model.
After that, to apply the feature ingest procedures, there must be a sufficient subset
of data to extract the previously configured k lags. For example, if the prediction is made
monthly and k = 12 lags were extracted, the data needed should include all criminal events
in the given scenario from the 12 months before that period. With a sufficient set of data, the
only "feature ingest" task that must be repeated is the spatiotemporal aggregation, because
it is not necessary to reapply the spatial aggregation of PoI, given their temporal immutability.
Still, the geographic features must be rejoined with the temporal features extracted.
After extracting all the features described, it is necessary to filter the same set of
features selected previously for each crime scenario and use them to feed the trained
models. No training tasks, such as tuning and model evaluation, should be applied during
this process, since the adjusted model has already been considered the most efficient. To
complete the "prediction pipeline" step, one must ensure that (i) the prediction output consists
of crime incidence levels yi for each place i one step ahead in time, and that (ii) this
process can be repeated for each new period.
3.2.2 Web Service
In the next step, the methods applied before are organized into a software architecture.
Although no previous work has discussed such an implementation issue, to the
best of our knowledge, we argue that efforts for improving the transparency of predictive
policing algorithms should consider how these systems would be deployed in the pool of
existing software of the police department.
In this step, we have put together, in a high-level description, the model selection
phase and the "prediction pipeline" step, detailing how this articulation can be assisted
by auxiliary components (see Figure 13). A first component to support the described
operations is the Volume. It consists of a file system where the following should be stored: (i) the
recently loaded crime data from the updated database, (ii) the trained model, (iii) the list
of features selected for each crime scenario, and (iv) all the predictions retrieved, so that
queries can readily request them again. The second component consists of an Extract,
Transform and Load (ETL) Controller routine dedicated to (i) triggering the prediction
pipeline for each new period, as well as (ii) managing the GIS client requests and the
access to the Volume.
As mentioned, we do not intend to describe an optimized architecture, but rather
to outline how information can flow from the crime database to the trained model. We
propose this workflow to be designed in the form of a web service, decoupled from other
police department software. The reason is that a police department
may have several dashboard tools for interacting with georeferenced data. Thus, adding
one more separate tool only to visualize hotspot predictions may isolate information in
different environments, hindering the decision-making process. Therefore, we argue that a
decoupled service that makes predictions available on demand to other software through
an application interface increases the usefulness of the latter without competing with it
in terms of usability. We suggest GeoJSON as the output format of the predictions,
so that requests are met with information already encoded in a simple geographic
representation format that can readily be visualized in a GIS user interface.
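The encoding step can be sketched as follows; the helper name, the polygon ring and the incidence value are illustrative, not the service's actual output.

```python
# Sketch of encoding hotspot predictions as GeoJSON, the output format
# suggested for the service (cell polygon and values are illustrative).
import json

def predictions_to_geojson(cells, levels):
    """cells: list of polygon rings; levels: predicted incidence per cell."""
    features = [
        {"type": "Feature",
         "geometry": {"type": "Polygon", "coordinates": [ring]},
         "properties": {"crime_incidence": level}}
        for ring, level in zip(cells, levels)
    ]
    return json.dumps({"type": "FeatureCollection", "features": features})

# A single closed ring (lon, lat pairs) around one grid cell.
ring = [[-35.21, -5.81], [-35.20, -5.81], [-35.20, -5.80],
        [-35.21, -5.80], [-35.21, -5.81]]
geojson = predictions_to_geojson([ring], [0.73])
```

Any GIS client that understands GeoJSON can render such a FeatureCollection directly, coloring cells by the `crime_incidence` property.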
Figure 13: The web service comprises managing the prediction pipeline to attend to online requests. This requires the use of a file system, which we call the Volume, for caching the prediction results, and an ETL process controller to trigger prediction generation for each new period.
3.3 Implementation
In this section, we overview the implementation we made for the Predspot framework.
The first part of the implementation is in the form of a python-package that supports
model selection operations. The package is organized into modules according to the steps
described. The second part of the implementation deals with a prototype of a web service,
providing some API routes and its respective inputs and responses.
3.3.1 Predspot python-package
As a solution for our second research objective (design open-source software and detail
its procedures to estimate future hotspots), in this section we describe the implementation
of Predspot regarding the model selection phase. The predspot python-package
is based on heavily used data analytics libraries, such as Numpy (WALT; COLBERT; VAROQUAUX, 2011),
Pandas (MCKINNEY et al., 2010), Shapely (GILLIES, 2013), GeoPandas
(JORDAHL, 2014) and Scikit-learn (PEDREGOSA et al., 2011). With such an implementation,
we intend (i) to evaluate our framework, by testing different crime mapping methods and
machine learning algorithms, and (ii) to support crime analysts building their own
prediction service, by providing the model selection routines we have implemented. This implementation
is therefore also crucial for achieving our first objective, since we need to evaluate our approach.
For more implementation details on the classes, methods and attributes, we refer to the
package documentation available online.5
3.3.2 Predspot service
For the implementation of a service capable of providing predictions for each new
period using the Predspot framework, here we detail some configurations and routines
needed. Aligned with the prediction service phase of the framework, there are some additional
components one has to implement. The first thing to do is to set a configuration file
with the volume path, temporal sample frequency, crime scenario tags and the methods'
parameters, for example defining that the predictions are made monthly, for burglary
and drug crimes, the bandwidth of KDE etc.
Having set all parameters, it would be necessary to implement connections to data
sources and import crime data from the operational database. Because the connection to
the Natal Police Department database must be protected for privacy reasons, we do not
detail further procedures involved in this step nor provide source code to support this
step as a whole.
Still, the data flow must allow the automation of prediction generation using the service
through a simple interface. As we described, the controller is the component that must be
implemented to automate the import and forecast generation for each new period, saving
the prediction results in the volume and returning them in response to the requests made. We have
implemented the service as an Application Programming Interface (API), with the following
routes.
• /import: to load new data;
• /train: to start the model selection process and save the trained models;
• /predict: to generate predictions for all crime scenarios and save them in the volume;

5 https://adaj.github.io/predspot
• /get_predictions: to return to the client the prediction results in a geographic file
format, such as GeoJSON.
4 Evaluation
The implementation of a predictive hotspot crime analysis system is based on the
premise that guiding policing with predictions can be more effective than relying solely on
the knowledge of police managers. To ensure that this happens, Moses and Chan (2018)
mentioned that two forms of assessment are required for predictive policing systems. First,
the accuracy or error of predictive algorithms should be assessed to establish theoretical
significance. Second, they mention that, just as importantly, the practical significance
of using these predictions in crime reduction should be assessed. In this study, we
focus on assessing prediction performance as a first step in the timeline of our research.
Thus, in line with the purpose of this study, this chapter will present empirical evidence
of the efficiency of the Predspot methodology.
We derive twelve divisions of crime scenarios from two datasets as our experimental
samples. An exploration of the datasets is done to illustrate spatial and temporal patterns
that were somewhat obscured by the theoretical detail of previous chapters.
Differences between sample sizes will bring interesting conclusions. PoI data extracted
from OpenStreetMap for both cities is also presented, and the extraction result of some
geographic features is depicted. This contextualizes the spatiotemporal modelling we
describe in the analysis of the results.
In addition, in this chapter we describe our approach to experimentation, including a
description of the methods and metrics involved. The details of the baseline approach are
explained and an alternative metric is proposed for assessing the predictive efficiency of
models based on different crime mapping methods. Finally, we present the performance
results of the adjusted models and discuss the efficiency of the previously proposed vari-
ables, measuring their importance in the different predictive approaches applied in the
twelve crime scenarios.
4.1 Datasets
The evaluation of the Predspot methodology is conducted using datasets from two
different cities, Natal (Brazil) and Boston (US). To the best of our knowledge, no previous
study used datasets from more than one city on its evaluation. The first dataset is from
Natal, Brazil, which was made available by the Natal police department for our research.
Privacy terms do not allow us to display the spatial distribution of this data. On the other
hand, the second dataset we use is from Boston, US, which is available online from the
city’s open data portal.1 Natal data are from January 2016 to the end of November 2018
(35 months). The data we used from Boston begins in June 2015 and ends in September
2018 (37 months).
According to Andresen and Linning (2012), hotspot analyses should consider disso-
ciating crime types of different sources, as we discussed in Section 3.1.1. The description
of the types of crimes used in this work is as follows. In the Natal data, there are groups
of similar crime types. According to direct consultation with police officers in Natal, the
most useful crime type groups for the police department are (i) property crimes (CVP),
such as robbery, theft, burglary etc., (ii) violent or lethal crimes (CVLI), and (iii) drug-related
crimes (TRED). On the other hand, the Boston data has no similar groupings.
Thus, we selected three types of crimes to analyze, namely Residential Burglary, Simple
Assault, and Drug Violation.
Moreover, in consultation with the police department of Natal, we found the need to
analyze separately the patterns of crimes that happen during the day and at night. Assuming that
there is a difference in the pattern of criminal incidence between these periods, we divide
each of the crime types mentioned into day and night versions. From now on, we refer to
the samples of a given type of crime at a specific part of the day as a crime scenario. This
division is used in the remainder of this work to delimit the sample sets used in our
experiments.
4.2 Experiment Methods and Metrics
Assuming that the performance of predictive approaches can be influenced by several
factors, experimental analysis needs to comprise more than one dimension. A first factor
analysed is the crime mapping method, of which we describe two in Section 2.2.1, namely
KGrid and KDE.

1 https://data.boston.gov

For each of these methods, different patterns may emerge, and we need to test their
capabilities to help predictive algorithms produce better results. The choice of the machine
learning algorithm is another aspect worth comparing. We analyze two decision tree ensemble machine learning algorithms, Random
Forest and Gradient Boosting. As mentioned in Section 2.3, it is not our goal to draw
conclusions about which one is best among the various machine learning algorithms for
each problem. Instead, we intend to measure whether there is a significant difference in
the use of a particular algorithm for a given crime mapping method. Thus, with these two
dimensions of analysis, we evaluate KGrid, as well as KDE, with Random Forest
and Gradient Boosting. We refer to the four corresponding prediction approaches as
KDE-RF, KDE-GB, KGrid-RF and KGrid-GB.
To manage a more robust evaluation, we compare our methodology with a baseline
method reference. For this work, we consider as the baseline method estimating the
next period using the previous one. The reason is that we want to compare a traditional
approach used by police managers against prediction models based on machine learning.
This approach was applied in many related studies with adaptations, e.g. (BROWN;
OXFORD, 2001; COHEN; GORR; OLLIGSCHLAEGER, 2007). As we mentioned, each crime
mapping method produces time series with different units of measures (KGrid counts
crime events within grid cells and KDE weights them by their distance to the grid cell
center). Fortunately, the baseline approach we describe can be applied to both crime
mapping methods, since it consists of a simple autoregressive lag. Therefore, we use the
performance of the baseline to normalize the performance of the prediction algorithm,
retrieving a dimensionless and relative metric of performance.
The evaluation metrics frequently used in related studies depend on the prediction
task involved. For prediction tasks based on the classification of hotspot or non-
hotspot, related studies, e.g. (HART; ZANDBERGEN, 2014; MOHLER et al., 2015), have used
hit rate and prediction accuracy index (PAI), which are based on the relationship between
predicted places as hotspots, their areas and the amount of events within them (CHAINEY;
TOMPSON; UHLIG, 2008). Since we are predicting ordinal values (crime incidence levels
within an area), we are applying regression, thus using error metrics such as the mean squared
error (MSE) used in several related studies, e.g. (BROWN; OXFORD, 2001; KADAR;
MACULAN; FEUERRIEGEL, 2019; ARAUJO et al., 2018). However, error metrics are strongly
dependent on the unit of measurement of the method applied. One exception is the mean
absolute percentage error (MAPE), a percentage-based metric, but its formula is not
applicable when the samples contain zeros, which is the case for some of our KGrid
time series. Another is the coefficient of determination (R²), but we argue it is not
suitable for comparing two crime mapping methods, because each involves a different
scale of aggregation, producing different variance patterns. Using R² to compare
two crime mapping methods would be similar to comparing the explained variance of a model
adjusted on a log scale with one adjusted on a square-root scale.
In this work, we introduce the prediction ratio (PR) as an alternative approach to
compare different predictive approaches, especially crime mapping methods. It is based
on using an error metric to derive the ratio between the performances of the prediction
approach and the baseline. In our experiments, we use the root mean squared error (RMSE) as
the base metric because it produces estimates in the original scale of the samples, unlike
MSE, which produces squared results. Thus, we define the prediction ratio of root mean
squared error (PRRMSE) as in Equation 4.1. We formulate the PRRMSE to answer, in
a single indicator: "how many times is the predictive approach better or worse than
the baseline?". When PRRMSE > 1, the prediction algorithm produces better estimates.
Also, we argue that having a single indicator that measures how good the predictions
are, in relative terms, may be important to evaluate whether the practical policing impact
matches the expectations.
PRRMSE = √(MSEbaseline) / √(MSEpredictions)    (4.1)
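Equation 4.1 can be computed directly from the two sets of predictions; the toy values below are purely illustrative.

```python
# Sketch of the prediction ratio PR_RMSE (Equation 4.1): the baseline
# RMSE divided by the model RMSE, so values above 1 favour the model.
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def prediction_ratio(y_true, y_baseline, y_model):
    return rmse(y_true, y_baseline) / rmse(y_true, y_model)

y_true = [3, 5, 4, 6]
y_baseline = [2, 4, 5, 5]   # e.g. the previous period (single lag)
y_model = [3, 5, 5, 6]      # machine learning predictions
pr = prediction_ratio(y_true, y_baseline, y_model)  # pr = 2.0 here
```

Here the baseline RMSE is 1.0 and the model RMSE is 0.5, so the model's estimates are twice as good under this metric.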
To investigate whether there is a significant difference in PRRMSE among the four approaches
mentioned, we conduct a Friedman test with post-hoc analysis. The Friedman
test is a non-parametric statistical test that ranks groups of samples to measure
whether there are significant differences between them. Unlike ANOVA, it requires
neither sample normality nor variance equality. For the post-hoc analysis, we perform the Nemenyi
test for all pairwise combinations to retrieve the best configuration among those
we are testing. Since the algorithms we are using are not deterministic, our evaluation
samples are taken from five trials of CV evaluation using the selected models. Therefore,
each evaluation sample is a CV score of a prediction approach trial in one of the twelve
crime scenarios (six from each city). In the following section, we present the datasets and
their crime scenarios, as well as the models' performances.
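To make the ranking step concrete, the sketch below computes the Friedman chi-square statistic from PR_RMSE scores of four approaches over a handful of scenarios. The score values are toy assumptions, ties are ignored for simplicity, and in practice a library routine such as scipy.stats.friedmanchisquare would be used.

```python
# Sketch of the Friedman test statistic: scores are ranked within each
# crime scenario and the chi-square statistic is computed from the mean
# ranks (no tie correction in this simplified illustration).
def friedman_statistic(scores):
    """scores: one row per crime scenario, one column per approach."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank            # rank 1 = best PR_RMSE
    mean_ranks = [s / n for s in rank_sums]
    stat = (12 * n / (k * (k + 1))) * sum((r - (k + 1) / 2) ** 2
                                          for r in mean_ranks)
    return stat, mean_ranks

# Toy PR_RMSE scores for 4 approaches over 5 crime scenarios.
scores = [[1.2, 1.1, 0.9, 1.0],
          [1.3, 1.0, 0.8, 1.1],
          [1.1, 1.2, 0.9, 1.0],
          [1.4, 1.1, 0.7, 1.0],
          [1.2, 1.0, 0.9, 1.1]]
stat, mean_ranks = friedman_statistic(scores)
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom; a significant result then justifies the pairwise Nemenyi post-hoc comparisons.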
4.3 Results
In this section we present the empirical materials of our study. Starting with the
exploration of the two datasets used, we capture some patterns in the categorical, spatial
and temporal distributions of the analyzed crime scenarios. This will be important for
interpreting the results of the parameter selection we apply to the crime mapping method,
geographic feature extraction, and machine learning algorithms. With these parameters
chosen, we will also be able to analyze the results of the four predictive approaches adopted
to meet the objectives outlined at the beginning of this study. The results presented
in this section are important to validate the Predspot methodology in terms
of efficiency when compared to the baseline approach. In addition, we seek to show
empirical evidence about the difference between the two types of crime
mapping mentioned, as well as to explore the difference between the two machine
learning algorithms. These differences are explored through the Friedman test and Nemenyi
test post-hoc analysis, which will support important contributions of this work. Another
contribution derives from a feature importance analysis, where we discuss
which features among those defined in the previous chapter contribute most to efficient
predictions.
4.3.1 Exploratory Data Analysis
Exploring the categorical, temporal and spatial distributions of the data is essential for a few reasons. First, examining the number of samples and looking for outliers helps in interpreting the results. It is difficult to understand the reason for the poor performance of a given model without knowing if there are enough samples to fit it or without knowing how the data is distributed. Second, exploring the data allows more perspectives to be considered by the analyst. According to Yu (1977), an exploratory analysis of data does not consist of fishing or torturing the data until it confesses, but of investigating the information from multiple perspectives. Third, greater familiarity with the data can enable new ideas for data pre-processing and cleansing.
To start with, we show the number of samples in each crime scenario in Figure 14. In Natal, the crime scenarios have different sample sizes, with a much higher number of CVP crimes than CVLI, and TRED crimes in between. On the other hand, Boston crime scenarios are more equally distributed, not differing by more than an order of magnitude. Also, there is no clear pattern in day and night crime intensity: some crime scenarios are more frequent during the day and others at night. The differences in crime scenario sample sizes will allow a better interpretation of the influence of sample size on the performance of the fitted models as well as on the parameter selection results.
[Figure: bar charts of sample sizes per crime scenario. Natal — CVP@day: 11814; CVP@night: 15277; CVLI@day: 650; CVLI@night: 730; TRED@day: 2215; TRED@night: 2173. Boston — Residential-Burglary@day: 3096; Residential-Burglary@night: 2313; Simple-Assault@day: 6598; Simple-Assault@night: 6639; Drug-Violation@day: 6142; Drug-Violation@night: 4528.]

Figure 14: The sample sizes of the twelve crime scenarios. Natal has more heterogeneous crime scenarios compared to Boston.
In addition to the categorical division by crime scenarios, we discussed in Section 3.1.1 the need to define the temporal sampling frequency as well as the spatial unit, usually represented by the crime mapping method. In this work, we use a monthly sampling frequency to avoid sparse time series problems. Figure 15 shows time series of the amount of crime in each scenario per month in each city. Overall, each crime scenario time series presents its own peculiarities, but without significant outliers. As we use monthly sampled time series, we expect more parsimonious series.
In Section 2.2.2, we discussed methods for decomposing time series. To illustrate the decomposition used as temporal features, we show in Figure 16 the trend and seasonality components extracted using STL, as well as the differentiation of the original time series of Residential Burglary crimes across the city, during the day (left) and night (right) periods. It can be seen that the trend component in the two periods follows a similar downward behavior, but the seasonality components differ considerably from one scenario to the other. Differentiation expresses the positive and negative variations of each series, and these do not resemble each other either.
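These temporal components can be sketched with a simplified, pandas-only decomposition; note this is a classical-decomposition stand-in for illustration (the dissertation itself uses STL), and the monthly series below is synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic monthly crime counts: downward trend plus yearly seasonality.
idx = pd.date_range("2016-01-01", periods=36, freq="MS")
counts = pd.Series(100 + np.linspace(0, -20, 36)
                   + 10 * np.sin(2 * np.pi * np.arange(36) / 12), index=idx)

# A centered 12-month moving average approximates the trend component.
trend = counts.rolling(window=12, center=True).mean()
# Average deviation from the trend per calendar month approximates seasonality.
seasonal = (counts - trend).groupby(idx.month).transform("mean")
# The first difference expresses the month-over-month variation.
diff = counts.diff()
```

The trend, seasonal and difference features discussed later (trend_k, seasonal_k, diff_k) are simply these components taken at a lag of k months.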
[Figure: line plots of monthly crime counts per scenario, 2016–2018, for Natal (top) and Boston (bottom).]

Figure 15: Monthly sampled time series of the twelve crime scenarios.
To manage the spatial aggregation, we discussed in Section 2.2.1 the two crime mapping methods that are used in our experiments. Using the same day and night Residential Burglary scenarios in Boston, we illustrate in Figure 17 the difference in spatial distributions using KDE and KGrid. Also, the grid resolution has to be set: we define our KDE grids with a resolution of 500 m (distance between cells), and the KGrid with 100 cells, following previous recommendations (ARAUJO et al., 2018).
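A minimal sketch of the two mappings on synthetic projected coordinates (in km); scikit-learn's KernelDensity is an assumed stand-in for the KDE implementation used, and the 10 km × 10 km area is illustrative:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
# Synthetic crime locations in a 10 km x 10 km projected area.
crimes = rng.uniform(0, 10, size=(500, 2))

# KDE mapping: smooth crime density evaluated on a 500 m resolution grid.
kde = KernelDensity(bandwidth=0.5, kernel="exponential").fit(crimes)
xs, ys = np.meshgrid(np.arange(0, 10, 0.5), np.arange(0, 10, 0.5))
cells = np.column_stack([xs.ravel(), ys.ravel()])
density = np.exp(kde.score_samples(cells))  # intensity at each of the 400 cells

# KGrid mapping: raw event counts over K = 100 equal cells (10 x 10 grid).
counts, _, _ = np.histogram2d(crimes[:, 0], crimes[:, 1],
                              bins=10, range=[[0, 10], [0, 10]])
```

The contrast between the two methods shows up directly: `density` varies smoothly across neighboring cells, while `counts` treats each cell independently.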
[Figure: four-row plots (original, trend, seasonal, differentiation) for the day and night scenarios, 2016–2018.]

Figure 16: Time series decomposition for Residential-Burglary (in Boston) daytime and nighttime scenarios. The second row shows the trend series, a smoothed version of the original. The third row shows the seasonal patterns, clearly distinct between day and night. The fourth row represents the differentiated component.
Figure 17: Spatial representations of the two crime mapping methods for Residential-Burglary (in Boston) daytime and nighttime crime scenarios. In the first row, KGrid aggregation, and in the second row, KDE.
As ancillary data, we use PoI data extracted from OpenStreetMap, available for both Natal and Boston. Data was extracted using the Overpass API, as detailed in Section 3.1.1. The choice of the PoI categories considered was partly based on the outcome of the Natal police opinion poll presented in Section 2.2.3. Figure 18 indicates which PoI categories we use and shows the location of the PoI data in both cities. Note that we treat line-based items (primary, secondary and residential from the highway tag) as sets of points. As we mentioned, not all PoI categories will be selected as important in all crime scenarios, since we apply a feature selection step before model fitting. Still, we have to extract features for all of them.
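The extraction step can be sketched with a minimal Overpass QL query builder; the category filters and the Natal bounding box below are illustrative assumptions, not the authors' exact query:

```python
# Hypothetical Overpass QL filters for the PoI categories listed above.
POI_FILTERS = [
    'node["amenity"~"hospital|school|police|place_of_worship|restaurant"]',
    'node["leisure"]',
    'node["tourism"]',
    'way["highway"~"primary|secondary|residential"]',
]

def build_overpass_query(south, west, north, east):
    """Return an Overpass QL query for all PoI filters inside a WGS84 bbox."""
    body = "".join(f"{f}({south},{west},{north},{east});" for f in POI_FILTERS)
    return f"[out:json][timeout:60];({body});out center;"

# Approximate bounding box for Natal (assumed for illustration).
query = build_overpass_query(-5.88, -35.30, -5.70, -35.15)
# The query can then be POSTed to an Overpass endpoint such as
# https://overpass-api.de/api/interpreter, with way geometries resampled
# into point sets as described above.
```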
Figure 18: PoI data from Natal (left) and Boston (right) extracted from OpenStreetMap.
In Section 3.1.2, we propose the use of KDE as a spatial aggregation methodology that considers both density and the spatial decay over distance through its parameters (see Section 2.2.1). The results of feature extraction can be illustrated with contour plots, as in Figure 19, for hospitals, residential streets, and tourism places in both cities. Note that the feature extraction process produces a smooth surface of PoI density over the city, thus accounting for spatial autocorrelation. Still, it is important to define the bandwidth and kernel for each PoI category. We detail the parameters used in the extraction of geographic features below, as well as the parameters of the crime mapping and machine learning algorithms.
Figure 19: A representation of the geographic features taken from three PoI categories, hospitals (first row), residential streets (second row) and touristic places (third row) of Natal (left) and Boston (right).
4.3.2 Parameter Selection
The methods we described have several parameters that must be tuned. For the crime mapping methods presented, KGrid has the K parameter to control grid resolution, and KDE is greatly influenced by the selected bandwidth and kernel. In addition, we chose to tune three hyperparameters of the machine learning algorithms chosen for this work (Random Forest and Gradient Boosting) that may influence their performance: the number of trees (n_estimators), the maximum tree depth (max_depth) and the percentage of features distributed to each tree (max_features).
As recommended in the model selection phase, we tuned these methods by selecting the parameters that optimize their performance. We apply the grid-search strategy with the 10-Fold Time Series Cross-Validator (BERGMEIR; HYNDMAN; KOO, 2018) using the parameter space defined in Table 1. For KDE, the optimal setting is selected by the maximum log-likelihood criterion; different kernels and bandwidths from 0.01 km to 1 km were tested. In the case of KGrid, we follow the results of our previous experiments (ARAUJO et al., 2018), which suggested values of K in the order of 100. We found that the average cell area with this KGrid resolution is in the same order of magnitude as the optimized KDE cells, considering the selected bandwidth. So, for all crime scenarios, we take K = 100 for KGrid. For the supervised algorithms, Random Forest and Gradient Boosting, the optimal setting is the one with the lowest MSE among all combinations.
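This tuning step can be sketched with scikit-learn, using TimeSeriesSplit as a stand-in for the 10-Fold Time Series Cross-Validator; the data are synthetic and the grid is reduced from Table 1 for brevity:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 8))            # hypothetical spatiotemporal features
y = 2 * X[:, 0] + rng.normal(size=120)   # hypothetical cell crime intensity

param_grid = {"n_estimators": [50, 100],     # Table 1 uses 500..4500
              "max_depth": [3, 6],
              "max_features": [0.4, 1.0]}

# Grid search scored by (negated) MSE over time-ordered splits.
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=TimeSeriesSplit(n_splits=10),
                      scoring="neg_mean_squared_error")
search.fit(X, y)
best = search.best_params_   # the lowest-MSE combination, as in the text
```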
Table 1: KDE parameters and machine learning hyperparameters considered in the grid search tuning.

Algorithm                         Hyperparameter  Values
KDE                               bandwidth       from 0.01 km to 1 km, n=200
                                  kernel          gaussian, linear, exponential, tophat, epanechnikov
Random Forest, Gradient Boosting  n_estimators    500, 1500, 2500, 3500, 4500
                                  max_depth       3, 6, 9, 12
                                  max_features    0.4, 0.6, 0.8, 1
The results of KDE parameter selection applied to crime scenario mapping are described in Table 2. Only the two CVLI crime scenarios in Natal had bandwidths at the boundary of the range defined in the parameter space, reaching the maximum of 1 km as the selected parameter. This may be related to the fact that CVLI crimes were the set with the smallest sample size we take. In the case of the Boston crime scenarios, bandwidths were relatively close, ranging from 0.229 km to 0.522 km. One of the reasons for this is the similarity of the sample sizes considered (see Figure 14), since smaller samples end up being aggregated with larger bandwidths, as suggested by Silverman (2018). On the other hand, the kernel selection was surprising in the regularity of the choice of the exponential kernel: in only one of the twelve crime scenarios was the Gaussian kernel selected. Interestingly, none of the previous related studies that experimented with optimal KDE configurations proposed the exponential kernel as an effective option for crime mapping. For example, Hart and Zandbergen (2014) proposed using linear kernels.
Table 2: Selected KDE parameters for the crime mapping methods applied in each crime scenario.

City    Crime Scenario              bandwidth (km)  kernel
Natal   CVP@day                     0.289           exponential
        CVP@night                   0.199           exponential
        CVLI@day                    1               exponential
        CVLI@night                  1               exponential
        TRED@day                    0.821           gaussian
        TRED@night                  0.607           exponential
Boston  Residential-Burglary@day    0.522           exponential
        Residential-Burglary@night  0.488           exponential
        Simple-Assault@day          0.353           exponential
        Simple-Assault@night        0.353           exponential
        Drug-Violation@day          0.229           exponential
        Drug-Violation@night        0.274           exponential
Similarly, we apply KDE parameter selection to geographic feature extraction. The parameters selected for each feature in both cities are described in Table 3. Again, the exponential kernel dominates in the overwhelming majority of cases. For hospital-related PoIs, the 1 km bandwidth was chosen in both cities. Still, we note that most PoI categories can be mapped using bandwidths in the order of 100 m. It is noteworthy that the selected bandwidth values depend on the selected kernel; had the exponential kernel not been used, other values would have been selected.
The results of the machine learning model hyperparameter selection applied to each crime scenario are presented in Appendix A. Unlike the KDE bandwidth, the Random Forest and Gradient Boosting hyperparameters did not explicitly depend on the sample sizes. Still, we observe a clear dependence of the parameters on the crime mapping method used. The results of the model evaluation configured with this set of hyperparameters are presented next.
Table 3: Selected KDE parameters for geographic feature extraction applied for PoI data aggregation.

                          Natal                        Boston
Feature                   bandwidth (km)  kernel       bandwidth (km)  kernel
amenity_hospital          1               gaussian     1               exponential
amenity_school            0.299           exponential  0.343           exponential
amenity_police            0.920           exponential  1               exponential
amenity_place_of_worship  0.269           exponential  0.338           exponential
amenity_restaurant        0.184           exponential  0.124           exponential
leisure_*                 0.085           exponential  0.120           exponential
tourism_*                 0.189           exponential  0.279           exponential
highway_primary           0.065           exponential  0.090           exponential
highway_secondary         0.105           exponential  0.239           exponential
highway_residential       0.299           exponential  0.433           exponential
4.3.3 Model Performance
According to the selected parameters, we first evaluate the predictive performance of the fitted models according to their cross-validation MSE scores. Figures 20 and 21, for KGrid and KDE respectively, show the performances of the baseline, the Random Forest (RF) and the Gradient Boosting (GB) algorithms for the twelve crime scenarios described. Note that each scenario is represented on a different scale that depends on the number of samples available. Also note that the performance of the algorithms in the Natal crime scenarios (in red) behaves similarly to the Boston crime scenarios (in blue) in both crime mapping methods. However, in KGrid, it is possible to observe a slight superiority of RF, which happens consistently in all twelve crime scenarios analyzed. In contrast, we observe the same consistent superiority of GB in KDE. Moreover, in a comparison between Figures 20 and 21, one can notice that the difference between the baseline and the predictive algorithms (RF and GB) is greater in approaches using KDE. Later in this section, we investigate whether this difference is statistically significant.
[Figure: bar charts of CV MSE per scenario (Baseline / RF / GB) for KGrid. CVP@day: 7.236 / 2.514 / 2.527; CVLI@day: 0.197 / 0.060 / 0.073; TRED@day: 1.039 / 0.353 / 0.407; CVP@night: 10.057 / 3.334 / 3.419; CVLI@night: 0.262 / 0.079 / 0.091; TRED@night: 0.993 / 0.317 / 0.376; Residential-Burglary@day: 1.446 / 0.389 / 0.461; Simple-Assault@day: 2.819 / 0.974 / 1.028; Drug-Violation@day: 4.029 / 1.589 / 1.802; Residential-Burglary@night: 0.949 / 0.337 / 0.376; Simple-Assault@night: 3.273 / 1.241 / 1.295; Drug-Violation@night: 2.715 / 1.268 / 1.361.]

Figure 20: Cross-validation MSE results for KGrid-based models for Natal (in red) and Boston (in blue) crime scenarios.
[Figure: bar charts of CV MSE per scenario (Baseline / RF / GB) for KDE. CVP@day: 1.004 / 0.294 / 0.245; CVLI@day: 1.766 / 0.171 / 0.081; TRED@day: 4.269 / 0.536 / 0.334; CVP@night: 1.636 / 0.475 / 0.402; CVLI@night: 1.992 / 0.154 / 0.076; TRED@night: 1.151 / 0.207 / 0.127; Residential-Burglary@day: 3.107 / 0.553 / 0.308; Simple-Assault@day: 2.397 / 0.551 / 0.354; Drug-Violation@day: 6.860 / 1.520 / 1.082; Residential-Burglary@night: 4.625 / 0.740 / 0.370; Simple-Assault@night: 2.877 / 0.639 / 0.415; Drug-Violation@night: 9.674 / 1.527 / 1.016.]

Figure 21: Cross-validation MSE results for KDE-based models for Natal (in red) and Boston (in blue) crime scenarios.
We discussed in the previous section that PRRMSE is an alternative approach to analyzing how many times the predictive model is better than the baseline estimates. To account for PRRMSE, we reassess the trained models for each crime scenario five times (again with the 10-Fold Time Series Cross-Validator) to ensure that error variances are captured. Since we want to compare crime mapping methods and predictive algorithm performance, we plot the results of PRRMSE in Figure 22. On the left, we note that KDE is clearly superior in all crime scenarios, though less markedly for the CVP crimes in Natal. Besides, the crime scenarios with the highest PRRMSE were CVLI, also in Natal.

Interestingly, note that these two crime scenarios, CVP and CVLI, are the ones with the largest and smallest samples respectively (see Figure 14). We deduce that using KDE seems to be most effective when a small number of samples is available. With a significant amount of samples (considering CVP@day and CVP@night, n > 10000 samples), KDE's performance becomes closer to KGrid's in terms of PRRMSE, but it is still better. The reason may be that with more samples, KGrid forms less sparse and more representative time series.
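As a sketch, assuming PRRMSE is the ratio of the baseline RMSE to the model RMSE (so a value of 2 means the model's error is half the baseline's, consistent with reading it as "how many times better"):

```python
import numpy as np

def prrmse(y_true, y_baseline, y_model):
    """Hypothetical PRRMSE: baseline RMSE divided by model RMSE
    (values above 1 mean the predictive model beats the baseline)."""
    def rmse(pred):
        return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(pred)) ** 2))
    return rmse(y_baseline) / rmse(y_model)

# A model whose errors are half the baseline's yields PRRMSE = 2.
y_true = np.array([10.0, 12.0, 9.0, 11.0])
ratio = prrmse(y_true, y_true + 2.0, y_true + 1.0)  # -> 2.0
```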
[Figure: box plots of PR-RMSE per crime scenario; the left panel compares the KDE and KGrid crime mapping methods, the right panel the RF and GB algorithms.]

Figure 22: PRRMSE results of five trials evaluating trained models for each crime scenario. On the left side, the two crime mapping methods are compared and, on the right, the machine learning algorithms. Note that KDE outperforms KGrid in all crime scenarios, but more sharply in crime scenarios that have fewer data points, such as CVLI. On the other side, one can note that GB models have higher percentiles, but with much more variance.
On the other hand, the right side of Figure 22 shows that there is a balance between the RF and GB predictive algorithms, as we noted earlier. Note that the median of GB is higher in all scenarios, but its performance is clearly more variable than RF's. One may notice that the predictive performance of the algorithms is affected by the size of the available samples, just as before. With more samples, for example in the CVP crime scenarios, the PRRMSE appears smaller, as does the variance of its estimate. Although it is intuitive to think that with more samples machine learning algorithms tend to estimate better, interpretation of these results leads us to conclude that the baseline also benefits from the increased amount of data available, since PRRMSE is a relative metric.
The averages and standard deviations of PRRMSE for the four predictive approaches are presented in Table 4. We note that on average, the predictive approaches are 1.6 to 3.1 times better than the baseline. To find out whether there is a statistically significant difference between these four treatments, we run the Friedman test. Considering a significance level of α = 0.05, the result indicated the rejection of the null hypothesis: there is indeed a significant difference between the approaches (p < 0.001). Thus, since the most notable difference in averages is between crime mapping methods, we empirically show that choosing the crime mapping method can greatly influence prediction performance.
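This statistical step can be sketched with SciPy; the scores below are synthetic stand-ins for the 60 evaluation samples (five trials × twelve scenarios), not the dissertation's data:

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
# Synthetic PRRMSE samples: 60 rows (5 trials x 12 scenarios), one column
# per predictive approach (KGrid-RF, KGrid-GB, KDE-RF, KDE-GB).
scores = rng.normal(loc=[1.7, 1.6, 2.4, 3.1], scale=0.2, size=(60, 4))

# Friedman test ranks the approaches within each row (block) and checks
# whether the rank distributions differ significantly.
stat, p = friedmanchisquare(*scores.T)
significant = p < 0.05  # if True, proceed to the Nemenyi post-hoc comparison
```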
Table 4: Average and standard deviation of PRRMSE for each predictive approach, considering five trials of the twelve crime scenarios.

Predictive Approach  PRRMSE Average  PRRMSE Standard Deviation
KGrid-RF             1.704           0.118
KGrid-GB             1.622           0.089
KDE-RF               2.446           0.512
KDE-GB               3.123           0.882
Moreover, it is important to assess whether there is a difference between the predictive algorithms used. We evaluate the approaches in a paired way using post-hoc analysis with the Nemenyi test. The results in Table 5 show that the difference we believed to be subtle between KGrid-RF and KGrid-GB, as well as between KDE-RF and KDE-GB, actually turns out to be statistically significant (p ≤ 0.001). It means that choosing the predictive algorithm has a considerable impact on predictive performance. Thus, among the predictive approaches tested in this study, the combination of KDE and GB provides the best estimates (average PRRMSE = 3.12) across all crime scenarios. This difference may be caused by the fact that KGrid generates sparser time series compared to KDE, which are thus more difficult to predict, as we discussed in Chapter 2. KDE provides a more homogeneous time series formation, as its spatial aggregation considers events weighted by distance from the center of the grid cell, while KGrid counts events within a cell without considering the neighboring ones. We observe that this spatial autocorrelation effect is important when translating crime events into spatiotemporal variables.
Table 5: A pairwise statistical comparison of the four predictive approaches, considering the results from post-hoc analysis.

Difference of levels   Difference of means  SE of means  p-Value
KGrid-RF − KGrid-GB     0.082               0.066        0.006
KGrid-RF − KDE-RF      -0.742               0.480        0.001
KGrid-RF − KDE-GB      -1.419               0.838        0.001
KGrid-GB − KDE-RF      -0.823               0.516        0.001
KGrid-GB − KDE-GB      -1.501               0.881        0.001
KDE-GB − KDE-RF         0.678               0.388        0.001
However, with superior predictive performance also comes the computational cost of generating estimates. We use a complete 64-core node (Intel Xeon processor, 64 MB of RAM) from the UFRN supercomputer (NPAD/IMD). We measure the time spent throughout the model selection step for each dataset and crime mapping method, and show it in Table 6. Note that using KDE, given the 500 m resolution and parameter setting, is up to fifteen times slower than using the 100-cell KGrid. However, it is noteworthy that the time required for the model selection phase does not equal the time required to generate on-the-fly estimates in the prediction service. Since the models are already trained, a prediction is generated within seconds. Thus, while KDE requires a larger computational budget for training, its estimates may be much better, especially in cases where data are scarce, as seen in the CVLI crime scenarios.
Table 6: Wall time spent on the model selection phase for each dataset.

          Total Time for Model Selection (s)
Dataset   KGrid    KDE
Natal     2893     40911
Boston    4705     50457
4.3.4 Feature Importance Analysis
To evaluate which variables most influence the predictions made by the fitted models, we conducted a feature importance analysis for each crime scenario in the four predictive approaches used. The goal is to understand what the models are basing their predictions on and to evaluate which of the features we model really contribute to efficient predictions. The so-called feature importance metric is an important explanation tool to show which factors the models mostly rely on.² It can be calculated by successive permutations of the values of a given feature. In this method, called feature permutation, the feature with the smallest variation in predictive performance is considered the least important one for generating predictions. Fortunately, the scikit-learn implementation (PEDREGOSA et al., 2011) that we use for the RF and GB estimators already provides feature importance values after the models have been fitted.
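A minimal sketch on synthetic data; note that scikit-learn's built-in feature_importances_ are impurity-based, while permutation_importance implements the permutation method described (the feature semantics in the comments are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
# Columns stand for, e.g., trend_1, seasonal_12 and a PoI density feature;
# only the first actually drives the synthetic target.
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
impurity_imp = model.feature_importances_  # built-in, available after fitting
perm_imp = permutation_importance(model, X, y, n_repeats=5,
                                  random_state=0).importances_mean
```

Both measures single out the first feature here; on the dissertation's data, the same ranking logic produces Figures 23 to 26.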
We plot the feature importances of the four predictive approaches for the Natal crime scenarios in Figures 23 to 26, regarding the KGrid-RF, KGrid-GB, KDE-RF and KDE-GB models respectively. Among the features we presented were those based on temporal components (trend, seasonal and difference), geographic information (from PoIs extracted from OpenStreetMap) and others such as the corresponding month and year. Each temporal feature has a number that corresponds to the respective lag. For example, in the KGrid-RF models (Figure 23), the most important feature in four of the six crime scenarios (and the second most important in the other two) is the seasonal_12 component, corresponding to the seasonal factor twelve months earlier. This is aligned with our expectations, since it suggests that the same month of the previous year greatly influences the predictions of these models. In the two crime scenarios in which seasonal_12 is not the top feature, the most important one is trend_1, which corresponds to the trend value of the previous month. We also notice that the features diff_11 and diff_12 are important in most crime scenarios. Interestingly, the other three sets of models also favor temporal features, having seasonal_12 as the top feature in KGrid-GB, and trend_1 in KDE-RF and KDE-GB. We observe that these features are frequent in almost all models.
Unfortunately, if we inspect the geographic features, we find that they have little impact on predictive performance. There are few cases where a set of geographic features was selected by the feature selection algorithm, and even when selected, their importance is mostly close to zero. We believe this may be due to factors associated with geographical heterogeneity. Our models were fitted to a global behavior of the city. However, geographic patterns can happen locally, combined with contributions from other factors. For example, the density of schools in one region may attract drug-related crimes because of poor demographic indicators, and drive these crimes away in another region for other reasons. To manage the geographic patterns that influence crime in machine learning models, further study should be conducted. Considering demographic features may be a solution, but it will not solve the problem of local heterogeneity.

²We understand that there is a growing demand for explanations of predictions made by machine learning models, and a public safety decision-making tool should provide such explanations to the policing manager. Currently, model-agnostic machine learning interpretation frameworks have emerged, such as LIME (RIBEIRO; SINGH; GUESTRIN, 2016) and SHAP (LUNDBERG; LEE, 2017), and we intend to investigate their usefulness in the context of crime hotspot prediction in future work.
[Figure: ranked feature importance bars for the six Natal crime scenarios; temporal features such as seasonal_12, trend_1, diff_11 and diff_12 dominate the rankings.]

Figure 23: Feature importance of KGrid-RF models.
[Figure: ranked feature importance bars for the six Natal crime scenarios; seasonal_12 and the trend features dominate the rankings.]

Figure 24: Feature importance of KGrid-GB models.
[Figure: ranked feature importance bars for the six Natal crime scenarios; trend_1 leads, and a few kde:-prefixed geographic features (e.g. kde:amenity_hospital) appear with low importance.]

Figure 25: Feature importance of KDE-RF models.
[Figure: ranked feature importance bars for the six Natal crime scenarios; trend_1 leads, and a few kde:-prefixed geographic features (e.g. kde:amenity_hospital) appear with low importance.]

Figure 26: Feature importance of KDE-GB models.
5 Concluding Remarks
Public safety is one of the significant challenges of smart cities, especially in underdeveloped countries with high crime rates. To reduce crime incidence, scientists have found evidence of the effectiveness of hotspot policing strategies (BRAGA; PAPACHRISTOS; HUREAU, 2014; WEISBURD; ECK, 2004). Towards innovation, predictive algorithms have been used to estimate crime hotspots. Although these predictive policing approaches have been increasingly applied in police departments, we find that previous studies fall short in several vital aspects. First, it is difficult to find a study that, at the same time, transparently presents the processing steps involved in producing future crime hotspots and shows their effectiveness against traditional strategies. Second, the tools that support such studies are proprietary, and thus their results are difficult to reproduce. Third, using the existing literature, it is difficult to measure how the choice of a particular crime mapping method or machine learning algorithm affects predictive performance.
This study first sought to improve our previously proposed prediction framework
through alternative crime mapping and feature engineering approaches, filling the gaps
mentioned. In the previous version of our framework, we detailed the processing steps
involved to implement a hotspot prediction system (ARAUJO et al., 2018). The proposed
approach required some adaptations to include (i) the adjustment of a single city-wide
model, (ii) the possibility of using KDE as a crime mapping method, as well as (iii) the
extraction of trend and seasonality features, and (iv) data from exogenous sources such
as OpenStreetMap. This study also sought to turn our implementation into a Python package called predspot, so that our approach can be reproduced in cities that do not have access to the expensive tools available.
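To illustrate adaptation (ii), the sketch below computes a KDE crime surface over a regular grid. It is a minimal example with synthetic coordinates, using SciPy's `gaussian_kde` rather than the actual predspot internals, and the coordinates and grid resolution are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic incident coordinates (lon, lat); a real run would load them
# from the crime database.
rng = np.random.default_rng(42)
incidents = rng.normal(loc=(-35.2, -5.8), scale=0.01, size=(200, 2))

# Fit a 2-D kernel density estimate over incident locations
# (bandwidth chosen by Silverman's rule).
kde = gaussian_kde(incidents.T, bw_method="silverman")

# Evaluate the density on a regular grid covering the study area;
# each value is the estimated crime intensity of one cell.
lon = np.linspace(incidents[:, 0].min(), incidents[:, 0].max(), 50)
lat = np.linspace(incidents[:, 1].min(), incidents[:, 1].max(), 50)
grid = np.vstack([g.ravel() for g in np.meshgrid(lon, lat)])
density = kde(grid)
```

The resulting density values can then be discretized or used directly as the regression target for each grid cell.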
In Chapter 3, we presented the Predspot methodology as the new version of our
framework divided into two processing phases. First, model selection preprocesses data,
models criminal incidence in spatiotemporal patterns (features) and trains machine learning models using them. We also included geographic features from OpenStreetMap open
data, seeking to improve predictive performance. The second phase of Predspot comprises
the prediction service, in which predictions are generated in a web service workflow, in
contrast to the adoption of a separate dashboard application. For this work, the predspot
package supports only model selection analysis, but we provided some advice for implementing the web service using it.
The empirical findings of our approach were explored in Chapter 4. We
evaluated two crime mapping methods, namely KGrid and KDE, and two extensively used
machine learning algorithms, Random Forest and Gradient Boosting, against a baseline
method based on a single autoregressive aggregation. The purposes of our experiments were (i) to measure how the predictions compare to the baseline in estimation performance, and (ii) to measure whether different predictive methods yield significantly different estimation performance. To summarize the results in a single measure, we proposed PRRMSE, an alternative regression metric that indicates how much better predictions are than the baseline.
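The metric's formal definition is given in the experimental chapter; the toy function below captures one plausible reading, consistent with the values reported later (e.g., a PRRMSE around 3 meaning roughly three times better than the baseline): the ratio between the baseline's RMSE and the model's RMSE. The function name and exact formula here are an illustration, not the dissertation's definition.

```python
import numpy as np

def prrmse(y_true, y_pred, y_baseline):
    """Baseline RMSE divided by model RMSE: values above 1 mean the
    model beats the baseline (illustrative reading of the metric)."""
    def rmse(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return np.sqrt(np.mean((a - b) ** 2))
    return rmse(y_true, y_baseline) / rmse(y_true, y_pred)

# Toy check: a model whose errors are half the baseline's scores about 2.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(prrmse(y, y + 0.1, y + 0.2))  # ≈ 2.0
```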
In our experiments, we considered twelve crime scenarios from datasets of two cities,
Natal in Brazil and Boston in the US. For each crime scenario, we fitted four predictive
models (KGrid-RF, KGrid-GB, KDE-RF and KDE-GB) and the baseline. The results
indicated that KDE-GB was the best approach in all crime scenarios (p < 0.001), with an
average PRRMSE of 3.123, followed by KDE-RF, KGrid-RF and finally KGrid-GB. In
addition, crime scenarios had different sample sizes, which helped us draw other patterns
from the results. We observed that KDE-based models showed interesting results in crime
scenarios with few samples. Besides, we found that as sample sizes increased, the performance of the KDE- and KGrid-based models grew closer.
Although these findings were within the range of our expectations, some results were
surprising. First, the feature importance analysis revealed that temporal features were massively preferred in almost all models. Seasonal and trend components proved to be efficient features for all the models adjusted. On the other hand, geographic features had little participation, even with our proposed KDE modelling approach. We argued that this might be connected with the fact that our predictive models do not consider local geographic heterogeneity. More sceptically, it can be explained by crime incidence itself being much more important for determining future crime incidence than geographic factors are. We advise future work to improve geographic features, for instance by reducing the dimensionality of the geographic feature space.
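One way to follow that advice, sketched below with synthetic data, is to project the many correlated kde:* geographic densities onto a handful of principal components before training; the matrix shapes and feature semantics here are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical matrix of geographic KDE features: one row per grid cell,
# one column per OpenStreetMap amenity density (kde:amenity_*, etc.).
rng = np.random.default_rng(0)
geo_features = rng.random((400, 30))

# Keep only the components explaining 95% of the variance, so the
# learner sees a few strong geographic signals instead of 30 weak ones.
pca = PCA(n_components=0.95)
geo_compact = pca.fit_transform(geo_features)
```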
Another limitation of our study was the fact that we tested only a few machine learning
algorithms. As we mentioned, our model-agnostic approach can benefit from choosing better algorithms. Future work could analyze the role of automated machine learning
in our Predspot methodology since it seeks the optimal algorithm and hyperparameter
configuration (FEURER et al., 2015).
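Short of full AutoML, even a randomized search over grids like those of Appendix A would automate part of this choice. The sketch below uses scikit-learn with synthetic data and a deliberately reduced grid so that it runs quickly; the data and grid values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-ins for the feature matrix and hotspot intensities.
rng = np.random.default_rng(1)
X, y = rng.random((300, 10)), rng.random(300)

# Randomized search over a (reduced) grid; TimeSeriesSplit preserves the
# temporal order of observations, as autoregressive features require.
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={
        "n_estimators": [100, 300, 500],
        "max_depth": [3, 6, 9],
        "max_features": [0.2, 0.4, 0.8],
    },
    n_iter=4,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
best = search.best_params_
```

Tools such as auto-sklearn (FEURER et al., 2015) extend this idea by also searching over algorithms and preprocessing steps.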
The horizon of our research points to practical applications of the framework presented. These applications will be based on the deployment of a web service connected to a database of criminal occurrences, automatically providing hotspot maps for a period in the future. To support this, we provide open-source software that may help criminal analysts adjust machine learning models, encapsulate them and deploy them in web services. Besides, police managers can rely on our methodology, whose estimates are 1.6 to 3.1 times better than the traditional hotspot estimation approach, perhaps producing higher impact with the patrol resources available.
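A minimal shape for such a service, using only the Python standard library, could look like the sketch below; the endpoint, the JSON payload and the `predict_hotspots` stub are hypothetical placeholders for a fitted predspot model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_hotspots():
    # Placeholder: a real deployment would load a serialized model and
    # query the incident database before predicting the next period.
    return [{"lat": -5.79, "lon": -35.21, "risk": 0.91},
            {"lat": -5.81, "lon": -35.20, "risk": 0.77}]

class HotspotHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Any GET request returns the predicted hotspots as JSON.
        body = json.dumps(predict_hotspots()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8080), HotspotHandler).serve_forever()
```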
5.1 Future work
Considering the limitations of our study, we identified improvements that may be the subject of future work. First, the implementation of the Predspot service in Natal is an immediate demand, so that we can evaluate our approach for practical crime reduction, as suggested by Moses and Chan (2018). Work in this direction must take into account the real difficulties in translating predictions into police operations. Hunt, Saunders and Hollywood (2014) suggest that without proper planning, the results of a predictive policing operation can be ineffective.
Second, the evolution of Predspot's model selection can follow the integration of explanation models to interpret the predictions made. Interpretable machine learning frameworks have emerged as a requirement for deploying predictive systems (RIBEIRO; SINGH;
GUESTRIN, 2016; LUNDBERG; LEE, 2017). The explanation of why a given place has been
predicted as a hotspot can shed light for patrol managers on various ethical aspects of the predictions. For example, if a poor community is always considered a hotspot, these explanatory models can bring demographic coefficients into the analysis.
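While LIME and SHAP provide local, per-prediction explanations, even a simple global measure such as scikit-learn's permutation importance hints at what a model relies on. The sketch below uses synthetic data and feature names borrowed from Chapter 4; it is a stand-in for, not an implementation of, the cited frameworks.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic features; names mirror those used in the dissertation.
names = ["trend_1", "seasonal_12", "diff_1", "kde:amenity_police"]
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 200)  # trend_1 drives the target

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

# Features ranked by how much shuffling them hurts the model.
ranking = sorted(zip(names, result.importances_mean), key=lambda t: -t[1])
```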
Third, integration between the Predspot framework and police operations can take
place in many ways, and a study of patrol vehicle routing can be based on predictions
made. A patrol program can use the Predspot output as input data and generate the
sequence of places that vehicles should visit. Integrating this solution with the Predspot
framework can facilitate many patrol police operations.
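As a toy example of that integration, a patrol program could order hotspot visits with a simple nearest-neighbour heuristic over the predicted hotspot centroids; the coordinates below are hypothetical, and a production router would use street-network distances rather than straight lines.

```python
import math

# Hypothetical hotspot centroids (lat, lon) output by the predictor.
hotspots = [(-5.79, -35.21), (-5.81, -35.20), (-5.78, -35.25), (-5.80, -35.19)]

def greedy_route(start, places):
    """Visit each place, always moving to the nearest unvisited one.
    A quick heuristic, not an optimal travelling-salesman tour."""
    route, remaining, current = [], list(places), start
    while remaining:
        current = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(current)
        route.append(current)
    return route

route = greedy_route((-5.80, -35.21), hotspots)
```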
Still, the development of predictive systems for producing future crime hotspots will pass through many advances. The rise of data-driven methods and large investments in artificial intelligence will bring significant innovations to aid crime reduction, and prediction performance can improve dramatically. We argue that Predspot is an effort to standardize the vocabulary of the routines involved in machine learning modelling for such problems, keeping the knowledge produced reusable and continually growing. We believe our methodology may help formulate even more innovative policing strategies.
References
ANDRESEN, M. A.; LINNING, S. J. The (in)appropriateness of aggregating across crime types. Applied Geography, Elsevier, v. 35, n. 1-2, p. 275–282, 2012.
ANDRESEN, M. A.; WEISBURD, D. Place-based policing: new directions, new challenges. Policing: An International Journal of Police Strategies & Management, Emerald Publishing Limited, v. 41, n. 3, p. 310–313, 2018.
ANGELIDOU, M. Smart city policies: A spatial approach. Cities, Elsevier, v. 41, p. S3–S11, 2014.
ARAUJO, A. et al. Towards a crime hotspot detection framework for patrol planning. In: IEEE. IEEE 16th International Conference on Smart City. [S.l.], 2018. p. 1256–1263.
ARAUJO, A. et al. A predictive policing application to support patrol planning in smart cities. In: IEEE. Smart Cities Conference (ISC2), 2017 International. [S.l.], 2017.
BATTY, M. The new science of cities. [S.l.]: MIT Press, 2013.
BBC. Estas são as 50 cidades mais violentas do mundo (e 17 estão no Brasil). 2018. http://www.bbc.com/portuguese/brasil-43309946.
BERGMEIR, C.; HYNDMAN, R. J.; KOO, B. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, Elsevier, v. 120, p. 70–83, 2018.
BERGSTRA, J.; BENGIO, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, v. 13, n. Feb, p. 281–305, 2012.
BOGOMOLOV, A. et al. Once upon a crime: towards crime prediction from demographics and mobile data. In: ACM. Proceedings of the 16th International Conference on Multimodal Interaction. [S.l.], 2014. p. 427–434.
BORGES, J. et al. Feature engineering for crime hotspot detection. In: IEEE. 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). [S.l.], 2017. p. 1–8.
BOX, G. E. et al. Time series analysis: forecasting and control. [S.l.]: John Wiley & Sons, 2015.
BRAGA, A. A. The effects of hot spots policing on crime. The ANNALS of the American Academy of Political and Social Science, Sage Publications Sage CA: Thousand Oaks, CA, v. 578, n. 1, p. 104–125, 2001.
BRAGA, A. A.; PAPACHRISTOS, A. V.; HUREAU, D. M. The effects of hot spots policing on crime: An updated systematic review and meta-analysis. Justice Quarterly, Taylor & Francis, v. 31, n. 4, p. 633–663, 2014.
BREIMAN, L. Random forests. Machine Learning, Springer, v. 45, n. 1, p. 5–32, 2001.
BROWN, D. E.; OXFORD, R. B. Data mining time series with applications to crime analysis. In: IEEE. 2001 IEEE International Conference on Systems, Man and Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat. No. 01CH37236). [S.l.], 2001. v. 3, p. 1453–1458.
CAPLAN, J. M.; KENNEDY, L. W. Risk terrain modeling compendium. Rutgers Center on Public Security, Newark, 2011.
CARAGLIU, A.; BO, C. D.; NIJKAMP, P. Smart cities in Europe. Journal of Urban Technology, Taylor & Francis, v. 18, n. 2, p. 65–82, 2011.
CHAINEY, S. Examining the influence of cell size and bandwidth size on kernel density estimation crime hotspot maps for predicting spatial patterns of crime. Bulletin of the Geographical Society of Liège, v. 60, p. 7–19, 2013.
CHAINEY, S.; TOMPSON, L.; UHLIG, S. The utility of hotspot mapping for predicting spatial patterns of crime. Security Journal, Springer, v. 21, n. 1-2, p. 4–28, 2008.
CLEVELAND, R. B. et al. STL: A seasonal-trend decomposition. Journal of Official Statistics, v. 6, n. 1, p. 3–73, 1990.
COHEN, J.; GORR, W. L.; OLLIGSCHLAEGER, A. M. Leading indicators and spatial interactions: A crime-forecasting model for proactive police deployment. Geographical Analysis, Wiley Online Library, v. 39, n. 1, p. 105–127, 2007.
DAVIES, T.; JOHNSON, S. D. Examining the relationship between road structure and burglary risk via quantitative network analysis. Journal of Quantitative Criminology, Springer, v. 31, n. 3, p. 481–507, 2015.
DOMINGOS, P. A few useful things to know about machine learning. Communications of the ACM, Association for Computing Machinery, v. 55, n. 10, p. 78–87, 2012.
ECK, J. et al. Mapping crime: Understanding hotspots. National Institute of Justice, 2005.
FEURER, M. et al. Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems. [S.l.: s.n.], 2015. p. 2962–2970.
GAU, J. M.; BRUNSON, R. K. Procedural justice and order maintenance policing: A study of inner-city young men's perceptions of police legitimacy. Justice Quarterly, Taylor & Francis, v. 27, n. 2, p. 255–279, 2010.
GERBER, M. S. Predicting crime using Twitter and kernel density estimation. Decision Support Systems, Elsevier, v. 61, p. 115–125, 2014.
GILLIES, S. The Shapely user manual. [S.l.]: Version, 2013.
GORR, W.; HARRIES, R. Introduction to crime forecasting. International Journal of Forecasting, Elsevier, v. 19, n. 4, p. 551–555, 2003.
HART, T.; ZANDBERGEN, P. Kernel density estimation and hotspot mapping: Examining the influence of interpolation method, grid cell size, and bandwidth on crime forecasting. Policing: An International Journal of Police Strategies & Management, Emerald Group Publishing Limited, v. 37, n. 2, p. 305–323, 2014.
HUNT, P.; SAUNDERS, J.; HOLLYWOOD, J. S. Evaluation of the Shreveport predictive policing experiment. [S.l.]: Rand Corporation, 2014.
JORDAHL, K. GeoPandas: Python tools for geographic data. URL: https://github.com/geopandas/geopandas, 2014.
KADAR, C.; MACULAN, R.; FEUERRIEGEL, S. Public decision support for low population density areas: An imbalance-aware hyper-ensemble for spatio-temporal crime prediction. arXiv preprint arXiv:1902.03237, 2019.
KNIBERG, A.; NOKTO, D. A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems. 2018.
KOCHEL, T. R. Constructing hot spots policing: Unexamined consequences for disadvantaged populations and for police legitimacy. Criminal Justice Policy Review, Sage Publications Sage CA: Los Angeles, CA, v. 22, n. 3, p. 350–374, 2011.
KOCHEL, T. R.; WEISBURD, D. Assessing community consequences of implementing hot spots policing in residential areas: findings from a randomized field trial. Journal of Experimental Criminology, Springer, v. 13, n. 2, p. 143–170, 2017.
LEE, S. H. et al. Towards ubiquitous city: concept, planning, and experiences in the Republic of Korea. In: Knowledge-based urban development: Planning and applications in the information era. [S.l.]: IGI Global, 2008. p. 148–170.
LIN, Y.-L.; YEN, M.-F.; YU, L.-C. Grid-based crime prediction using geographical features. ISPRS International Journal of Geo-Information, Multidisciplinary Digital Publishing Institute, v. 7, n. 8, p. 298, 2018.
LUNDBERG, S. M.; LEE, S.-I. A unified approach to interpreting model predictions. In: GUYON, I. et al. (Ed.). Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017. p. 4765–4774. Available at: <http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf>.
MALIK, A. et al. Proactive spatiotemporal resource allocation and predictive visual analytics for community policing and law enforcement. IEEE Transactions on Visualization & Computer Graphics, IEEE, n. 12, p. 1863–1872, 2014.
MCKINNEY, W. et al. Data structures for statistical computing in Python. In: AUSTIN, TX. Proceedings of the 9th Python in Science Conference. [S.l.], 2010. v. 445, p. 51–56.
MOHLER, G. O. et al. Self-exciting point process modeling of crime. Journal of the American Statistical Association, Taylor & Francis, v. 106, n. 493, p. 100–108, 2011.
MOHLER, G. O. et al. Randomized controlled field trials of predictive policing. Journal of the American Statistical Association, Taylor & Francis, v. 110, n. 512, p. 1399–1411, 2015.
MOSES, L. B.; CHAN, J. Algorithmic prediction in policing: assumptions, evaluation, and accountability. Policing and Society, Taylor & Francis, v. 28, n. 7, p. 806–822, 2018.
OLSON, R. S. et al. Data-driven advice for applying machine learning to bioinformatics problems. arXiv preprint arXiv:1708.05070, World Scientific, 2017.
PEDREGOSA, F. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, v. 12, n. Oct, p. 2825–2830, 2011.
PERRY, W. L. Predictive policing: The role of crime forecasting in law enforcement operations. [S.l.]: Rand Corporation, 2013.
RATCLIFFE, J. What is the future. . . of predictive policing. Practice, v. 6, n. 2, p. 151–166, 2015.
REPPETTO, T. A. Crime prevention and the displacement phenomenon. Crime & Delinquency, Sage Publications Sage CA: Thousand Oaks, CA, v. 22, n. 2, p. 166–177, 1976.
RIBEIRO, M. T.; SINGH, S.; GUESTRIN, C. Why should I trust you?: Explaining the predictions of any classifier. In: ACM. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. [S.l.], 2016. p. 1135–1144.
ROSENBAUM, D. P. The limits of hot spots policing. Police innovation: Contrasting perspectives, Cambridge University Press, New York, NY, p. 245–263, 2006.
SHERMAN, L. W.; GARTIN, P. R.; BUERGER, M. E. Hot spots of predatory crime: Routine activities and the criminology of place. Criminology, Wiley Online Library, v. 27, n. 1, p. 27–56, 1989.
SILVERMAN, B. W. Density estimation for statistics and data analysis. [S.l.]: Routledge, 2018.
UNODC. Global Study on Homicide. 2013. https://www.unodc.org/documents/gsh/pdfs/2014_GLOBAL_HOMICIDE_BOOK_web.pdf.
VERLEYSEN, M.; FRANÇOIS, D. The curse of dimensionality in data mining and time series prediction. In: SPRINGER. International Work-Conference on Artificial Neural Networks. [S.l.], 2005. p. 758–770.
VOMFELL, L.; HÄRDLE, W. K.; LESSMANN, S. Improving crime count forecasts using Twitter and taxi data. Decision Support Systems, Elsevier, v. 113, p. 73–85, 2018.
WALT, S. V. D.; COLBERT, S. C.; VAROQUAUX, G. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, IEEE Computer Society, v. 13, n. 2, p. 22, 2011.
WANG, X.; BROWN, D. E.; GERBER, M. S. Spatio-temporal modeling of criminal incidents using geographic, demographic, and Twitter-derived information. In: IEEE. Intelligence and Security Informatics (ISI), 2012 IEEE International Conference on. [S.l.], 2012. p. 36–41.
WEISBURD, D. The law of crime concentration and the criminology of place. Criminology, Wiley Online Library, v. 53, n. 2, p. 133–157, 2015.
WEISBURD, D.; BRAGA, A. A. Hot spots policing as a model for police innovation. Police innovation: Contrasting perspectives, Cambridge University Press, Cambridge, p. 225–244, 2006.
WEISBURD, D.; ECK, J. E. What can police do to reduce crime, disorder, and fear? The Annals of the American Academy of Political and Social Science, Sage Publications, v. 593, n. 1, p. 42–65, 2004.
WEISBURD, D. et al. Does crime just move around the corner? A controlled study of spatial displacement and diffusion of crime control benefits. Criminology, Wiley Online Library, v. 44, n. 3, p. 549–592, 2006.
YU, C. H. Exploratory data analysis. Methods, v. 2, p. 131–160, 1977.
YU, C.-H. et al. Crime forecasting using data mining techniques. In: IEEE. 2011 IEEE 11th International Conference on Data Mining Workshops. [S.l.], 2011. p. 779–786.
ZIEHR, D. Leveraging Spatio-Temporal Features for Improving Predictive Policing. [S.l.]: MSc thesis, Karlsruhe Institute of Technology, Germany, 2017.
APPENDIX A -- Selected Parameters of Machine Learning Models
Table 7: Selected hyperparameters of the machine learning models for the Natal crime scenarios.
Mapping   Crime            Random Forest                              Gradient Boosting
Method    Scenario         n_estimators  max_depth  max_features      n_estimators  max_depth  max_features

KGrid     CVP@day              1500          9          0.8                500          3          0.2
          CVP@night             500          9          0.8                500          3          1
          CVLI@day             1500          3          0.8               2500          6          0.4
          CVLI@night            500          3          0.8                500          3          0.4
          TRED@day              500          6          0.8                500          3          1
          TRED@night            500          6          0.8                500          3          0.8

KDE       CVP@day              4500         15          0.8               1500          3          0.4
          CVP@night            3500         15          0.8                500          3          0.6
          CVLI@day             1500         15          0.8               1500          6          0.2
          CVLI@night           1500         15          0.8               2500          6          0.4
          TRED@day             2500         15          0.8               2500          6          0.8
          TRED@night           3500         15          0.8               1500          6          0.4
Table 8: Selected hyperparameters of the machine learning models for the Boston crime scenarios.
Mapping   Crime                            Random Forest                              Gradient Boosting
Method    Scenario                         n_estimators  max_depth  max_features      n_estimators  max_depth  max_features

KGrid     Residential-Burglary@day              500          6          0.6               4500          9          0.4
          Residential-Burglary@night           1500          6          0.6               4500          9          0.4
          Simple-Assault@day                   1500          6          0.6                500          3          0.2
          Simple-Assault@night                 1500          6          0.8                500          3          0.4
          Drug-Violation@day                    500          6          0.8                500          3          0.2
          Drug-Violation@night                  500          9          0.4                500          3          0.2

KDE       Residential-Burglary@day             4500         15          0.8               2500          6          0.4
          Residential-Burglary@night           4500         15          0.2               2500          6          0.2
          Simple-Assault@day                   2500         15          0.8               4500          6          0.4
          Simple-Assault@night                 3500         15          0.2               2500          6          0.4
          Drug-Violation@day                   2500         15          0.6               3500          6          0.2
          Drug-Violation@night                 2500         15          0.6               2500          6          0.4