Federal University of Rio Grande do Norte
Center of Exact and Earth Sciences
Systems and Computing - PPgSC/UFRN
Predspot: Predicting Crime Hotspots with Machine Learning
Adelson Araújo jr
Natal-RN
September 2019
Adelson Araújo jr
Predspot: Predicting Crime Hotspots with Machine Learning
Master's dissertation presented to the Program of Postgraduate Studies in Systems and Computing (PPgSC) of the Federal University of Rio Grande do Norte as a requirement for the M.Sc. degree.
Supervisor
Prof. Dr. Nélio Alessandro Azevedo Cacho
Federal University of Rio Grande do Norte – UFRN
Natal-RN
September 2019
Araújo jr, Adelson. Predspot: predicting crime hotspots with machine learning / Adelson Dias de Araújo Júnior. - 2019. 80f.: il.
Dissertação (Mestrado) - Universidade Federal do Rio Grande do Norte, Centro de Ciências Exatas e da Terra, Programa de Pós-Graduação em Sistemas e Computação. Natal, 2019. Orientador: Nélio Alessandro Azevedo Cacho. Coorientador: Leonardo Bezerra.
1. Computação - Dissertação. 2. Policiamento preditivo - Dissertação. 3. Manchas criminais - Dissertação. 4. Aprendizado de máquina - Dissertação. I. Cacho, Nélio Alessandro Azevedo. II. Bezerra, Leonardo. III. Título.
RN/UF/CCET CDU 004
Universidade Federal do Rio Grande do Norte - UFRN
Sistema de Bibliotecas - SISBI
Catalogação de Publicação na Fonte. UFRN - Biblioteca Setorial Prof. Ronaldo Xavier de Arruda - CCET
Elaborado por Joseneide Ferreira Dantas - CRB-15/324
Dedicated to all potiguar people.
Acknowledgment
I thank God.
I thank the support of my extraordinary and close family, Flávia Bezerril, Adelson
Araújo, Fernando Bezerril, Graça Bezerril, Paula Sette, Fernando Bezerril Neto, Matheus
Coutinho, Fernanda Bezerril, Marina Bezerril, Maria Fernanda Bezerril, Mariana Bezerril
and Alexandre Serafim.
I thank my closest academic educators, Nélio Cacho, Leonardo Bezerra, Renzo Tor-
recuso and many others. Also those that financially contributed with my research, the
SmartMetropolis project and Google Latin America.
I thank my closest friends Giovani Tasso, Pedro Araújo, José Lucas Ribeiro, Mickael
Figueredo, João Marcos do Valle, Beatriz Vieira, Júlio Freire, Ottony Chamberlaine and
Lúcio Soares.
It is sometimes difficult to avoid the impression that
there is a sort of foreknowledge of the coming series of events.
Carl Jung
Predspot: Predicting Crime Hotspots with Machine Learning
Author: Adelson Dias de Araújo Júnior
Supervisor: Prof. Dr. Nélio Alessandro Azevedo Cacho
Abstract
Smart cities are increasingly adopting data infrastructure and analysis to improve the
decision-making process for public safety issues. Although traditional hotspot policing
methods have shown benefits in reducing crime, previous studies suggest that the adop-
tion of predictive techniques can produce more accurate estimates for future crime con-
centration. In previous work, we proposed a framework to generate near-future hotspots
using spatiotemporal features. In this work, we redesign the framework to support (i) the
widely used crime mapping method kernel density estimation (KDE); (ii) geographic feature extraction with data from OpenStreetMap; (iii) feature selection; and (iv) gradient boosting regression. Furthermore, we provide an open-source implementation of
the framework to support efficient hotspot prediction for police departments that can-
not afford proprietary solutions. To evaluate the framework, we consider data from two
cities, namely Natal (Brazil) and Boston (US), comprising twelve crime scenarios. We take
as baseline the common police prediction methodology also employed in Natal. Results
indicate that our predictive approach estimates hotspots 1.6-3.1 times better than the
baseline, depending on the crime mapping method and machine learning algorithm used.
From a feature importance analysis, we found that features from trend and seasonality
were the most essential components to achieve better predictions.
Keywords : predictive policing, hotspot prediction, machine learning, crime forecasting.
Predspot: Predizendo Hotspots Criminais com Aprendizado de Máquina
Autor: Adelson Dias de Araújo Júnior
Orientador: Prof. Dr. Nélio Alessandro Azevedo Cacho
Resumo
As cidades inteligentes estão adotando cada vez mais infraestrutura e análise de da-
dos para melhorar o processo de tomada de decisões em questões de segurança pública.
Embora os métodos tradicionais de policiamento de hotspot tenham se mostrado eficazes
na redução do crime, estudos anteriores sugerem que a adoção de técnicas preditivas
pode produzir estimativas mais precisas para a concentração espacial de crimes de um
futuro próximo. Em nossas pesquisas anteriores, propusemos uma metodologia para gerar
hotspots do futuro usando variáveis espaço-temporais. Neste trabalho, redesenhamos a es-
trutura do framework para suportar (i) o método de mapeamento de crimes amplamente
utilizado - estimativa de densidade de kernel (KDE); (ii) extração de características ge-
ográficas com dados do OpenStreetMap; (iii) seleção de atributos e; (iv) regressão com
o algoritmo Gradient Boosting. Além disso, fornecemos uma implementação de código
aberto da estrutura para suportar a predição eficiente de hotspots. Para avaliar nossa
abordagem, consideramos dados de duas cidades, Natal (Brasil) e Boston (EUA), com-
preendendo doze divisões de tipo de crime. Tomamos como método base de comparação
uma metodologia comumente utilizada e também empregada em Natal. Os resultados
indicam que nossa abordagem preditiva estima hotspots em média 1,6 a 3,1 vezes mel-
hor que a abordagem tradicional, dependendo do método de mapeamento do crime e do
algoritmo de aprendizado de máquina usado. A partir de uma análise de importância de
atributos, descobrimos que tendência e sazonalidade eram os componentes mais essenciais
para obter melhores previsões.
Palavras-chave: policiamento preditivo, previsão de hotspot, aprendizado de máquina,
previsão de crimes.
List of Figures
1 An illustration of KGrid. The events are clustered by using the K-Means
algorithm. Each cluster has external edges, which form a convex polygon.
These polygons are the topological separation of the city into subregions
or cells of a grid. By aggregating the count values in each cell, one can
map hotspots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 22
2 An illustration of Kernel Density Estimation and its parameters. Each
bold point (grid cell) represents an arbitrary place in which a kernel
function applies a density estimation around a bandwidth. For a set of
events/points, this procedure returns an array of KDE values indexed by
the cells identifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 23
3 KDE results for different bandwidth values. A clear difference in resolution is observed between 0.1 and 0.01 miles (bottom). We observe underfitting in the top left and overfitting in the bottom right situations. . . . p. 24
4 Seasonal and trend components of a time series. In blue, the original
series show a varying behaviour which can be further explained by a
trend (in black) and a seasonality (in orange). The trend follows the
moving average of the series and the seasonality represents the cyclical
aspect of the original series in monthly oscillation. . . . . . . . . . . . . p. 25
5 Results of the survey with 54 police sergeants about the rating of different landmarks and demographic aspects for determining hotspots. At the top, the rating for property crimes (CVP); in the middle, for lethal or violent crimes (CVLI); and at the bottom, for drug-related crimes (TRED). . . . p. 27
6 Geographic feature layers from Natal generated using KDE of residential
streets (left) and schools (right). Note that residential streets are denser
and concentrated in the north of the city, but still widespread in other
places. Schools are more concentrated in the center of the city, extending
to the south, but with some concentration in the north. . . . . . . . . . p. 29
7 Model selection begins with loading and preparing datasets. Required
data sources include a crime database for model training and connection
to the OpenStreetMap API to load PoI data. Also, the city’s shapefile is
important for filtering the data that falls within its borders. . . . . . . p. 35
8 Between data loading and model building, the feature ingest process is
responsible for assembling the independent variables and the variable to
be predicted. This starts with the spatiotemporal aggregations of crimes,
through crime mapping method and time series manipulation, and PoI
data spatial aggregation. The grid is a supporting element in this step
and will be used as the places where criminal incidence will be predicted.
Time series Cij extracted from grid cells indicate a set of values from a
grid cell i in period j, and PoI features Gki represent the density of PoI
category k located in grid cell i. . . . . . . . . . . . . . . . . . . . . . p. 37
9 The extraction of the independent variables (features) of the time series
in Predspot is conducted through a trend and seasonality decomposition
(through STL) and series differentiation. Each decomposition will gener-
ate a new series, and the variables will be composed of k lags from each
of these series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 38
10 An artificial example of temporal features for two places and three time
intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 38
11 In the machine learning modeling step, the best qualified features are
selected and fed into the algorithms. By adjusting various algorithms,
we can evaluate the models to use the one that has the best predictive
performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 39
12 In the prediction pipeline, one must tailor the model selection steps to
return predictions using previously trained models. This means no longer
loading the entire dataset, just the crimes from the last period. Then,
extracting temporal features, filtering previously selected features, and
requesting the trained model for predictions. . . . . . . . . . . . . . . . p. 41
13 The web service manages the prediction pipeline to serve online
requests. This requires the use of a file system, which we call
Volume, for caching the prediction results and an ETL process controller
to trigger prediction generation for each new period. . . . . . . . . . . . p. 43
14 The sample sizes of the twelve crime scenarios. Natal has more hetero-
geneous crime scenarios compared to Boston. . . . . . . . . . . . . . . . p. 51
15 Monthly sampled time series of the twelve crime scenarios. . . . . . . . p. 52
16 Time series decomposition for Residential-Burglary (in Boston) daytime
and nighttime scenarios. The second row is composed of trend series,
which are a smoothed version of the original. The third row shows the
seasonal patterns, clearly distinct between day and night. The fourth
row represents the differentiated component. . . . . . . . . . . . . . . . p. 53
17 Spatial representations of the two crime mapping methods for Residential-
Burglary (in Boston) daytime and nighttime crime scenarios. In the first
row, KGrid aggregation, and in the second row KDE. . . . . . . . . . . p. 53
18 PoI data from Natal (left) and Boston (right) extracted from Open-
StreetMap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 54
19 A representation of the geographic features taken from three PoI cate-
gories, hospitals (first row), residential streets (second row) and touristic
places (third row) of Natal (left) and Boston (right). . . . . . . . . . . p. 55
20 Cross-validation MSE results for KGrid-based models for Natal (in red)
and Boston (in blue) crime scenarios. . . . . . . . . . . . . . . . . . . . p. 59
21 Cross-validation MSE results for KDE-based models for Natal (in red)
and Boston (in blue) crime scenarios. . . . . . . . . . . . . . . . . . . . p. 60
22 PRRMSE results of five trials evaluating trained models for each crime
scenario. On the left side, the two crime mapping methods are compared
and, on the right, the machine learning algorithms. Note that KDE
outperforms KGrid in all crime scenarios, but more sharply in crime
scenarios with fewer data points, such as CVLI. On the other hand, one can
note that GB models have higher percentiles, but with much more variance. p. 61
23 Feature importance of KGrid-RF models. . . . . . . . . . . . . . . . . . p. 66
24 Feature importance of KGrid-GB models. . . . . . . . . . . . . . . . . . p. 67
25 Feature importance of KDE-RF models. . . . . . . . . . . . . . . . . . p. 68
26 Feature importance of KDE-GB models. . . . . . . . . . . . . . . . . . p. 69
List of Tables
1 KDE parameters and machine learning hyperparameters considered in
the grid search tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 56
2 Selected KDE parameters for the crime mapping methods applied in each
crime scenario. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 57
3 Selected KDE parameters for geographic feature extraction applied for
PoI data aggregation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 58
4 Average and standard deviation of PRRMSE for each predictive approach,
considering five trials of the twelve crime scenarios. . . . . . . . . . . . p. 62
5 A pairwise statistical comparison of the four predictive approaches, con-
sidering the results from post-hoc analysis. . . . . . . . . . . . . . . . . p. 63
6 Wall time spent on model selection phase for each dataset. . . . . . . . p. 63
7 Selected hyperparameters of the machine learning models for the Natal
crime scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 79
8 Selected hyperparameters of the machine learning models for the Boston
crime scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 80
Contents
1 Introduction p. 13
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 14
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 15
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 16
2 Background p. 18
2.1 Predictive Hotspot Policing . . . . . . . . . . . . . . . . . . . . . . . . p. 18
2.2 Spatiotemporal Modelling . . . . . . . . . . . . . . . . . . . . . . . . . p. 21
2.2.1 Crime Mapping Methods . . . . . . . . . . . . . . . . . . . . . . p. 21
KGrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 21
Kernel Density Estimation (KDE) . . . . . . . . . . . . . p. 22
2.2.2 Time Series Decompositions . . . . . . . . . . . . . . . . . . . . p. 23
2.2.3 Geographic Features . . . . . . . . . . . . . . . . . . . . . . . . p. 26
2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 28
2.3.1 Prediction Task . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 30
2.3.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . p. 31
3 The Predspot Framework p. 33
3.1 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 34
3.1.1 Dataset Preparation . . . . . . . . . . . . . . . . . . . . . . . . p. 34
3.1.2 Feature Ingest . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 36
3.1.3 Machine Learning Modelling . . . . . . . . . . . . . . . . . . . . p. 39
3.2 Prediction Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 40
3.2.1 Prediction Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . p. 41
3.2.2 Web Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 42
3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 43
3.3.1 Predspot python-package . . . . . . . . . . . . . . . . . . . . . . p. 43
3.3.2 Predspot service . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 44
4 Evaluation p. 46
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 47
4.2 Experiment Methods and Metrics . . . . . . . . . . . . . . . . . . . . . p. 47
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 50
4.3.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . p. 50
4.3.2 Parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . p. 56
4.3.3 Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . p. 58
4.3.4 Feature Importance Analysis . . . . . . . . . . . . . . . . . . . . p. 64
5 Concluding Remarks p. 70
5.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 72
References p. 74
Appendix A -- Selected Parameters of Machine Learning Models p. 79
1 Introduction
Cities are home to an increasing majority of people, and the management
of their resources has become a complex task. With the growing impact of Information and
Communication Technologies (ICT), coupled with the fact that information has become as
valuable as energy (BATTY, 2013), city governance is undergoing a technological paradigm
shift to the so-called smart cities. This demand, justified by the current belief that ICT can
be a great facilitator of city management (ANGELIDOU, 2014; LEE et al., 2008), expresses
the need for the constitution of more sustainable city models. We agree with Caragliu, Bo
and Nijkamp (2011) who define a smart city as one where investments in social and human
capital, as well as in traditional and modern infrastructure (ICT), lead to sustainable
economic growth, high quality of life and the management of city’s resources through
participatory governance.
However, improvements in quality of life are unlikely to be effective if they disregard
crime incidence levels. According to a recent BBC report (2018), among the 50 most
dangerous cities in the world (ranked by homicides per capita), only eight are outside
Latin America. Brazilian cities stand out in this ranking, with 17 cities. The city that hosts
this study, Natal, Brazil, holds the fourth position with slightly more than 100 homicides
per 100,000 inhabitants, a rate fifteen times higher than the global average measured
by the United Nations Office on Drugs and Crime (UNODC, 2013). Certainly,
these regions call for changes in the way police resources are allocated.
Using Geographical Information Systems (GIS) to support the patrol vehicle dispatch
has become routine in many law enforcement agencies in the world, in a so-called hotspot
policing manner (SHERMAN; GARTIN; BUERGER, 1989). Based on the empirical evidence
that many crimes concentrate in few places (WEISBURD, 2015), an increasing body of
research has explored the effects of sending patrol units to spots of high crime incidence.
Braga, Papachristos and Hureau (2014) have found strong evidence that it does help
reduce crimes and Weisburd et al. (2006) suggested other benefits aside crime reduction.
Traditional hotspot policing literature focused on the historical aggregation of crime data
until the first prediction efforts appeared in America in the late 1990s (GORR; HARRIES,
2003).
Nowadays, smart cities are gradually adopting predictive data analysis to enhance
decision-making in public safety. Related works have reported many examples of such
quantitative techniques to support patrol planning as "predictive policing" applications
(PERRY, 2013; MOSES; CHAN, 2018). Machine learning is one of the techniques that has
gained momentum in such context due to the accuracy of its estimation and the flexibility
to explore patterns from a range of data, such as geographic (LIN; YEN; YU, 2018), demographic (BOGOMOLOV et al., 2014) and social media (GERBER, 2014). However, crime
mapping methods (ECK et al., 2005) and spatiotemporal modeling may be crucial to
building efficient predictive models.
1.1 Problem Statement
Previous studies have shown that statistical models can produce estimates of crime
incidence over space and time that are more accurate than human predictions
(MOHLER et al., 2015). Nevertheless, Moses and Chan (2018) suggest a considerable body
of research lacks accountability and transparency, and that these models should not be
implemented without an adequate description of the processing steps involved. Indeed,
we have found no research that at the same time (i) evaluates its efficiency against a
traditional hotspot policing approach implemented by the police and (ii) provides a clear
breakdown of the processing steps involved to implement such a predictive system.
A considerable body of research (LIN; YEN; YU, 2018; MALIK et al., 2014) developed
steps, guidelines and frameworks to implement a crime hotspot prediction model, with a
variety of standards. In a previous work (ARAUJO et al., 2018), we designed a framework to
model machine learning with time series autoregressive features based on a crime mapping
method named KGrid (BORGES et al., 2017; ZIEHR, 2017). Our ambitious purpose was to
standardize the processing steps involved in building machine learning models
to predict hotspots, but this first effort fell short in some respects. First, despite having tested
several machine learning algorithms, there was still room for improvement in the results
found, perhaps because we considered a separate model for each place. Second, we did
not consider the kernel density estimation (KDE), extensively recommended by related
works (ECK et al., 2005; CHAINEY, 2013; HART; ZANDBERGEN, 2014), in our experiments.
We also consider as part of the transparency problem mentioned by Moses and Chan
(2018) the lack of open-source tools to enable efficient crime hotspot prediction. Large
police departments may not feel this as much, as they have the budget to purchase commercial
solutions that meet their needs. However, small police departments, which often face more
worrying levels of violence, may not be able to afford more efficient tools. Building
a prediction system in-house can cost even more than buying one and can take
considerable time. We argue that an open-source programming interface can ease
the implementation of a web service to be deployed in a low-budget police department.
1.2 Objectives
The purpose of this work is to improve our previously proposed prediction framework
through alternative crime mapping and feature engineering approaches, and provide an
open-source implementation that police analysts can use to deploy more effective predictive
policing.
Our first specific objective is to improve the efficacy of our previously proposed frame-
work. To do so, we consider alternative crime mapping and prediction algorithms, respec-
tively kernel density estimation (KDE) and gradient boosting regression (GB). We eval-
uate our expanded framework on two datasets, from Natal (Brazil) and Boston (US). We
compare our results with the traditional approach used by criminal analysts to generate
hotspots. Natal’s police department aggregates data from the previous month's time
window to build its patrol plans, and we noted that this is a common baseline practice
in related works (BROWN; OXFORD, 2001; COHEN; GORR; OLLIGSCHLAEGER, 2007).
By improving our framework, we also intend to describe better the process of producing
a crime hotspot predictive pipeline, from the model training to its online deployment.
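The previous-month baseline described above can be sketched in a few lines. This is a hedged illustration only: the tabular layout (one row per incident, with a grid-cell id and a month index) and every name in it are assumptions for the example, not the actual data schema used by the Natal police or by Predspot.

```python
import pandas as pd

def previous_month_baseline(crimes: pd.DataFrame, target_month: int) -> pd.Series:
    """Predict each cell's count for `target_month` as its count in the
    immediately preceding month (the map analysts reuse for patrol plans)."""
    last = crimes[crimes["month"] == target_month - 1]
    return last.groupby("cell").size()

# Toy incident table: one row per recorded crime.
crimes = pd.DataFrame({
    "cell":  [0, 0, 1, 2, 2, 2, 1],
    "month": [1, 1, 1, 1, 2, 2, 2],
})

# Baseline "prediction" for month 3 is simply the month-2 counts per cell.
pred = previous_month_baseline(crimes, target_month=3)
```

Any predictive approach evaluated in this work must beat this simple reuse of last month's hotspot map to be worth deploying.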
We also investigate challenges concerning the tools available and the methodology
involved. The procedures of translating criminal events into attributes that spatially de-
scribe the concentration of crime and adjusting a predictive algorithm may involve many
different tools. Proprietary solutions such as PredPol and HunchLab are often robust, but
they mainly serve cities with higher purchasing power. Also, these tools focus on
ease of use and can be rigid when configuring particular procedures, such
as crime mapping and prediction algorithms. The democratization of such technologies
emerges as a real necessity, since many of the poorer and more dangerous cities cannot
afford such solutions. Therefore, our second objective is to design open-source software
and detail its procedures to estimate future hotspots.
1.3 Contributions
In this study, we present a set of relevant contributions to predictive policing litera-
ture. First, we present qualitative improvements towards a transparent process necessary
to model machine learning for future hotspot crime estimation. We connect a broad set
of techniques and organize them into purpose-related processing steps for adjusting effi-
cient models and deploy them with a web service. As a second contribution, we provide
an open-source python-package implementation of our Predspot framework, to ensure re-
producibility of our approach and code reuse. The structure of the Predspot framework
enables the use of different crime mapping approaches and machine learning algorithms.
Thus, we argue that future work may use our standard framework to improve model
performance even more.
A third contribution is empirical evidence that the predictive models we built
estimate hotspots better than the traditional baseline implemented by the Natal police. Depending on
the crime mapping method and machine learning algorithm chosen, predictive approaches
average 1.6 to 3.1 times better than the baseline. Fourth, we find that KDE is a robust
technique to predict crime incidence when fewer samples are available. In the smallest
samples we analyzed (lethal crimes in Natal), the predictive approaches with KDE
outperformed the baseline by a wider margin than in the cases with more samples (property crimes
in Natal). Conversely, KGrid approaches estimated better when more samples are
available.
Finally, a fifth contribution is related to the importance of analyzing the spatiotem-
poral features used. We modelled features based on trend and seasonality components of
time series and also geographic features from OpenStreetMap data. In our feature impor-
tance analysis, we find that trend and seasonality lags were the features that contributed
most in the adjusted models. To our knowledge, no previous work has investigated such
temporal patterns alongside machine learning algorithms for crime hotspot modelling.
This study is divided as follows. In Chapter 2, we present the necessary theoretical
framework to follow the proposals made in this work, from the hotspot policing discus-
sions and critique, through the spatiotemporal modeling methods to finally the machine
learning background necessary for this work. In Chapter 3, we introduce our framework,
presenting the adaptations made to the previous version, dividing the purpose-related
steps involved and presenting our open-source implementation. In Chapter 4, we
show the datasets used in our experiments, explaining the evaluation process conducted
and present the results. In Chapter 5, we conclude our study with a resumption of our
objectives, methods and experimental results, and provide further recommendations for
future work.
2 Background
A broad and complex spectrum of concepts underlies a predictive policing solution.
In this chapter, we review some theoretical aspects, starting from traditional hotspot
policing literature and the application of predictive algorithms to estimate crime hotspots.
Specifically, Weisburd’s law of crime concentration in places is used as a premise for the
implementation of spatially focused policing strategies, and we explore related discussions
in criminology and hotspot prediction to implement predictive policing.
Proper modeling of crime variables can impact prediction model success, and efforts
to translate crime events into independent variables (features) are necessary. This spatiotemporal modeling process, also called feature engineering, starts with a crime mapping
method and can be assisted temporally with time series decomposition. In crime hotspot
prediction literature, it is also common to use ancillary variables to help describe crime
spatial patterns, such as demographic (BOGOMOLOV et al., 2014), geographic (LIN; YEN;
YU, 2018) and from social media data (GERBER, 2014). Particularly for geographic vari-
ables, we present some strategies to use data from points-of-interest (PoI) in the city as
an alternative to help in the predictions.
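As a rough illustration of the temporal side of this feature engineering, the sketch below splits a synthetic monthly crime series into trend and seasonal components using a centered moving average and per-calendar-month means. It is a simplified stand-in for the STL decomposition used in this work, and every name and number in it is an illustrative assumption, not Predspot code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.period_range("2015-01", periods=48, freq="M")

# Synthetic monthly crime counts: upward trend + yearly cycle + noise.
series = pd.Series(
    10 + 0.1 * np.arange(48)
    + 2 * np.sin(2 * np.pi * np.arange(48) / 12)
    + rng.normal(0, 0.3, 48),
    index=months,
)

# Trend: centered 12-month moving average of the series.
trend = series.rolling(12, center=True, min_periods=1).mean()
# Seasonality: average detrended value for each calendar month.
seasonal = (series - trend).groupby(series.index.month).transform("mean")
# Remainder after removing trend and seasonality.
residual = series - trend - seasonal
```

Lagged values of the trend and seasonal series (rather than of the raw counts) are the kind of temporal features the framework feeds to its predictive models.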
Last, an increasing body of research (VOMFELL; HÄRDLE; LESSMANN, 2018; PERRY,
2013; BOGOMOLOV et al., 2014) suggests that machine learning algorithms fit very well with
predictive policing. Still, the many facets of machine learning, such as the algorithms and
prediction tasks, make it necessary to discuss its application more thoroughly in the
context of spatiotemporal crime analysis.
2.1 Predictive Hotspot Policing
Criminologists point to different strategies for reducing crime, disorder and fear (WEIS-
BURD; ECK, 2004). Among the methodologies addressed, there is strong evidence of the
effectiveness in patrolling micro-regions of crime concentration (WEISBURD; ECK, 2004;
PERRY, 2013; BRAGA, 2001). Perhaps this can be justified by the fact that policing strate-
gies have considered Weisburd’s law of crime concentration in places (WEISBURD, 2015),
which states that "for a defined measure of crime at a specific microgeographic unit, the
concentration of crime will fall within a narrow bandwidth of percentages for a defined
cumulative proportion of crime". For example, experiments suggest that criminal occur-
rences are concentrated in around 10% of places in the cities (ANDRESEN; LINNING, 2012;
ANDRESEN; WEISBURD, 2018), with variations for each crime type and study region. A
2006 US national survey (KOCHEL, 2011) reported that 90% of large policing departments
have considered this crime concentration pattern to draw up the so-called hotspot policing
operations (SHERMAN; GARTIN; BUERGER, 1989). Researchers are continually reviewing
experiments on hotspot policing efficiency (BRAGA; PAPACHRISTOS; HUREAU, 2014),
suggesting that 20 out of 25 experiments observed benefits in reducing crime and
social disorder, and an overall improvement in the perception of community safety.
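The concentration pattern cited above can be made concrete with a small computation: given per-place incident counts, find the fraction of places needed to account for a chosen share of all crime. The function and numbers below are an illustrative toy, not data from any of the cited studies.

```python
import numpy as np

def concentration(counts, share=0.5):
    """Fraction of places needed to account for `share` of all incidents."""
    c = np.sort(np.asarray(counts, dtype=float))[::-1]  # busiest places first
    cum = np.cumsum(c) / c.sum()                        # cumulative crime share
    k = int(np.searchsorted(cum, share)) + 1            # places required
    return k / len(c)

# Toy per-place incident counts for ten micro-places.
counts = [50, 20, 10, 5, 5, 4, 3, 2, 1, 0]
frac = concentration(counts, share=0.5)  # -> 0.1: 10% of places hold 50% of crime
```

In Weisburd's terms, when this fraction stays within a narrow band across cities and years, crime is said to obey the law of crime concentration in places.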
Hotspot policing has received substantial interest and criticism, such as the claim
that crime displacement is a direct consequence of the former (REPPETTO, 1976). Conversely, according to Weisburd et al. (2006), the idea of inevitable crime displacement has
been questioned because displacement is rarely total, and most of the time irrelevant.
Moreover, the author argued that focused patrolling leads to the diffusion of "other benefits not related to crime" and also to discouraging criminals. Another criticism of
hotspot policing is that most hotspots are related to poverty and race issues, hence increasing inequality and even creating an environment of lowered police legitimacy (KOCHEL,
2011). In addition, Rosenbaum (2006) suggested that most police activity in hotspots is
enforcement-oriented and that aggressive strategies can increase negative contact with
citizens, mostly where perceptions of crime tend to be worse (GAU; BRUNSON, 2010). Al-
though reporting some short-term adverse effects, and pointing guidelines on minimizing
the latter, Kochel and Weisburd (2017) presented experimental results showing that there
is no long-term harm to communities’ public opinion when supported by continuous policing. However, the impossibility of recording every crime in databases remains an unsolved
problem, and victims affected by this data-gathering limitation will often continue to be ignored
by law enforcement (MOSES; CHAN, 2018).
In contrast to the criticism discussed above, police scholars have argued that hotspot
policing is a model for police innovation (WEISBURD; BRAGA, 2006). Further, in the pursuit
of innovation in hotspot policing, predictive algorithms have been used to support more precise estimators (MOHLER et al., 2015). According to Gorr and Harries (2003), crime forecasting stopped being considered infeasible at the beginning of the 2000s, after
a major success of crime mapping systems, when the US National Institute of Justice (NIJ) awarded five grants for studies to extend the accuracy of short-term forecasts. The aim was to estimate spatial crime concentration precisely, as a first step towards practical intervention. A decade later, the term predictive policing was coined and became a trend, reflecting "the application of analytical techniques to identify promising targets for police intervention and prevent or solve crimes", according to Perry (2013). From another perspective, Ratcliffe (2015) suggests that predictive policing involves "the use
of historical data to create a spatiotemporal forecast of areas of criminality or crime hot
spots that will be the basis for police resource allocation decisions with the expectation
that having officers at the proposed places and time will deter or detect criminal activity".
Still, predictive policing may require practical policing planning, in contrast to the role
of spatial crime forecast per se, as discussed by Gorr and Harries (2003).
Indeed, few studies have assessed the effect of predictive hotspots against traditional
crime mapping with GIS to reduce crime incidence (HUNT; SAUNDERS; HOLLYWOOD,
2014), and Moses and Chan (2018) suggested that there are two ways of evaluating a
predictive policing solution. The first is by reporting the "drops in particular categories
of crime in particular jurisdictions employing its software" and the second by measuring
the "predictive accuracy of particular tools". One may argue that the former is closer to the predictive policing definition and the latter to spatial crime forecasting analysis. Some studies have evaluated both: e.g., Hunt, Saunders and Hollywood (2014)
showed a null effect of applying predictive hotspot policing, attributing an important role to patrol program implementation failures and the low statistical power of the tests
conducted. On the other hand, an experiment in the Los Angeles Police Department
(MOHLER et al., 2015) showed promising results on both incidence decrease in particular
categories of crimes and predictive performance. In their study, models predicted 1.4
to 2.2 times better than trained crime analysts (accuracy evaluation), leading to 7.4%
crime reduction, compared with 3.5% for the baseline treatment (impact evaluation). The baseline treatment let crime analysts manually assign a risk value to the delimited regions based on their own knowledge.
In consultation with the Natal Police Department, we note that they use a strategy
for estimating hotspots using historical data. They apply a spatial aggregation using data
from the previous month to plan the next patrol. Such a methodology assumes that the
immediate past may indicate a good measure of the future, and has already been used in
related studies (BROWN; OXFORD, 2001; COHEN; GORR; OLLIGSCHLAEGER, 2007).
2.2 Spatiotemporal Modelling
To deliver accurate predictions, a hotspot prediction analysis must be based on solid
spatiotemporal modeling. Previous studies (ECK et al., 2005; CHAINEY; TOMPSON; UHLIG,
2008) have reported different methods of aggregating crime spatially, using crime mapping
methods. Here, we review two such methods, namely KGrid and KDE. We then discuss
strategies for representing temporal patterns using time series decomposition methods
and how the results of such transformation can be useful to map crimes in a rich set
of spatiotemporal variables. To complement them, ancillary geographic variables can be
derived from OpenStreetMap data of the city, and we discuss an alternative strategy to
do so.
2.2.1 Crime Mapping Methods
The literature of criminal spatial analysis commonly refers to the procedures of di-
viding the city into subregions and aggregating criminal events as crime mapping. This
aggregation creates the relationship between a subregion and a crime incidence level. In
a prediction task, this is the first step to generate the feature set, composed of crime incidence levels for each place in each time interval of available data.
Describing several techniques, Eck et al. (2005) discuss their qualitative pros and cons. Some of these methods divide the city based on the distribution of criminal events (e.g. forming spatial ellipses) and others based on regularly spaced geometries (e.g. rectangles or points). The forms of aggregating crimes may be divided into (1) counting crimes within
the area bounded by each subregion, and (2) calculating the weighted sum of the crimes
by their distances to the centre of each subregion. In this work, we compare two crime
mapping methods, KGrid and KDE, when translating crime events into features to fit
machine learning models.
KGrid Techniques that divide crime into spatial ellipses assume that crime clusters spatially into geographic units over a time window. One way to build these groups of places is by clustering. Recently, Borges et al. (2017) proposed a division of the city based on the construction of convex polygons drawn from the spatial grouping of criminal events. This technique, called KGrid, consists of applying the K-Means algorithm to the crimes' location attributes to define a grid. In Figure 1, we present an illustration of
crime mapping in KGrid. The spatial aggregation of crimes is made by counting events
Figure 1: An illustration of KGrid. The events are clustered using the K-Means algorithm. Each cluster has external edges, which form a convex polygon. These polygons are the topological separation of the city into subregions or cells of a grid. By aggregating the count values in each cell, one can map hotspots.
within the polygons, also called grid cells. The authors also mention that this method has the advantage of considering the topology of criminal incidence when projecting the regions, and that the resolution can be adjusted to meet different analytic demands. The parameter K controls this resolution and can affect the performance of the algorithms, as
verified by Araujo et al. (2018).
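As an illustrative sketch of the KGrid idea (not the implementation of Borges et al. (2017); the coordinates and the value of K below are synthetic), the mapping can be expressed with scikit-learn's K-Means:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic crime locations (longitude, latitude); in practice these would
# come from the police records discussed in Chapter 3.
rng = np.random.default_rng(0)
coords = rng.normal(loc=[-35.2, -5.8], scale=0.05, size=(500, 2))

# KGrid: K-Means over event coordinates defines K grid cells; each event is
# assigned to the nearest centroid, and counting events per cell yields the
# crime incidence level of that cell.
K = 10
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(coords)
cell_ids = kmeans.labels_
incidence = np.bincount(cell_ids, minlength=K)

assert incidence.sum() == len(coords)  # each event falls in exactly one cell
```

The incidence vector per time window then becomes the target (and lagged feature) values of each grid cell.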
Kernel Density Estimation (KDE) Among crime mapping methods, there is evi-
dence that KDE is the most appropriate technique for hotspot mapping (CHAINEY; TOMP-
SON; UHLIG, 2008). The practical reasons for that are the visual effect of meaningful heatmaps and the inherent spatial correlation considered in the data aggregation (HART; ZANDBERGEN, 2014). As illustrated in Figure 2, it consists of creating a grid of points regularly spaced and applying a kernel function to return a density estimation
for each point. Equation 2.1 formally defines it for bidimensional analysis, where h is the bandwidth parameter of the kernel function K, and d_{x,y}(i) is the distance between incident i and the centre of the grid point described by its coordinates x, y. To
analyze it temporally, one can (i) add the third dimension to the formula, or (ii) generate
a density estimation for each time window, forming a time series.
f(x, y \mid h) = \frac{1}{n h^2} \sum_{i=1}^{n} K\!\left(\frac{d_{x,y}(i)}{h}\right) \qquad (2.1)
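Equation 2.1 can be computed directly; the sketch below assumes a Gaussian kernel and synthetic event coordinates (the function and variable names are illustrative):

```python
import numpy as np

def kde_at_point(x, y, events, h):
    """Density at grid point (x, y) per Eq. 2.1: f = (1/nh^2) * sum K(d_i/h),
    with K a standard bivariate Gaussian kernel and d_i the distance from
    event i to the grid point."""
    n = len(events)
    d = np.hypot(events[:, 0] - x, events[:, 1] - y)
    K = np.exp(-0.5 * (d / h) ** 2) / (2 * np.pi)
    return K.sum() / (n * h ** 2)

# Synthetic events concentrated around the origin.
rng = np.random.default_rng(1)
events = rng.normal(size=(200, 2))
density = kde_at_point(0.0, 0.0, events, h=0.5)
```

Points near the mass of events receive higher densities than distant ones, which is the distance-decay behaviour discussed above.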
Figure 2: An illustration of Kernel Density Estimation and its parameters. Each bold point (grid cell) represents an arbitrary place at which a kernel function computes a density estimate within a bandwidth. For a set of events/points, this procedure returns an array of KDE values indexed by the cell identifiers.
As discussed, KDE presents user-defined parameters to be configured. Among them
are the kernel function, its bandwidth and the grid resolution. Previous studies have in-
vestigated the effect of these parameters on crime hotspot mapping precision (CHAINEY,
2013; HART; ZANDBERGEN, 2014), suggesting that kernel and bandwidth are the most rel-
evant factors to be analyzed. Figure 3 illustrates that selecting an appropriate bandwidth
has severe implications for crime incidence representation when using KDE. We can see
that when bandwidth is equal to 1 mile, it results in an underfitting situation and the
bandwidth of 0.01 mile gives an overfitting distribution.
A simple way to select bandwidth is to use Silverman’s rule of thumb (SILVERMAN,
2018). However, it is only applicable to estimations using the Gaussian kernel. By ana-
lyzing both bandwidth and kernel, Hart and Zandbergen (2014) showed that Gaussian
kernels are not the optimal choices for mapping crime occurrences, suggesting linear kernels instead. To retrieve the parameter combination that maximizes the estimated likelihood, Mohler et al. (2011) suggest running a grid search with cross-validation.
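Such a search can be sketched with scikit-learn, whose KernelDensity estimator scores held-out points by their total log-likelihood; the data and parameter ranges below are illustrative, not those of the cited studies:

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Synthetic event coordinates.
rng = np.random.default_rng(2)
events = rng.normal(size=(300, 2))

# Cross-validated search over kernel and bandwidth: each fold fits a KDE on
# the training split and scores the log-likelihood of the held-out split, so
# the best estimator maximizes out-of-sample likelihood.
search = GridSearchCV(
    KernelDensity(),
    param_grid={"kernel": ["gaussian", "linear"],
                "bandwidth": [0.1, 0.3, 0.5, 1.0]},
    cv=5,
)
search.fit(events)
best = search.best_params_
```

The selected kernel/bandwidth pair is then reused to build the density features for every time window.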
2.2.2 Time Series Decompositions
Temporal patterns have been explored by the crime prediction literature since researchers started to forecast crime one period ahead, which Gorr and Harries (2003) date to the 1990s. Roughly speaking, when humans want to predict something, they look for
Figure 3: KDE results for different bandwidth values. The clear difference in resolution is observed between 0.1 and 0.01 miles (bottom). We observe underfitting in the top-left and overfitting in the bottom-right situations.
different items in the past and estimate what they think will happen next. The forecasting
methods we discuss here were based on a similar intuition, using past observations (lags) of
time series to estimate a value one or more steps ahead. In this sense, autoregressive (AR)
models were extensively used, and Brown and Oxford (2001) suggested them as suitable
methods for baseline comparison. In a more comprehensive formulation, autoregressive
integrated moving average (ARIMA) models were proposed to deal with non-stationary
series (BOX et al., 2015). Besides extracting p AR lags, ARIMA works with q moving average components (smoothed versions of the original series) and d series differencings (see Eq. 2.2), which are much more likely to be stationary. After decomposing the series into these components, Box et al. (2015) suggest applying parameter estimation using non-linear methods.
Y_d(t) = Y(t) - Y(t-1) \qquad (2.2)
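The differencing in Eq. 2.2 is a one-line operation; a small sketch on a synthetic trended series:

```python
import numpy as np

# Non-stationary series: a linear trend plus a bounded oscillation.
t = np.arange(24)
Y = 100 + 3 * t + np.sin(t)

# First-order differencing, Y_d(t) = Y(t) - Y(t-1) (Eq. 2.2).
Yd = np.diff(Y)

# Differencing removes the linear trend: the mean of Yd is close to the
# slope (3) and its variance is far smaller than that of Y.
assert Yd.var() < Y.var()
```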
Seasonal-trend decomposition by loess (STL) (CLEVELAND et al., 1990) was also used
in forecasting models for crime prediction (BORGES et al., 2017; MALIK et al., 2014). By rep-
resenting series in an additive configuration of trend, seasonality and residuals (Equation
2.3), this decomposition method can reveal other temporal patterns. We depict seasonal
and trend components of an illustrative time series in Figure 4.
Y(v) = T(v) + S(v) + R(v) \qquad (2.3)
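To illustrate the additive form of Eq. 2.3, the sketch below uses a naive decomposition (a moving-average trend and a mean seasonal profile); STL proper replaces both with loess smoothing, and all data here are synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic monthly crime counts: trend + 12-month seasonality + noise.
rng = np.random.default_rng(3)
n, period = 48, 12
t = np.arange(n)
series = pd.Series(100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / period)
                   + rng.normal(scale=2, size=n))

# T(v): centered moving average; S(v): mean of the detrended values per
# month of the year; R(v): whatever remains.
trend = series.rolling(period + 1, center=True).mean()
detrended = series - trend
seasonal = detrended.groupby(t % period).transform("mean")
residual = series - trend - seasonal

# Y(v) = T(v) + S(v) + R(v) holds exactly wherever the trend is defined.
recon = (trend + seasonal + residual).dropna()
assert np.allclose(recon, series[recon.index])
```

Each component (and its lags) can then be fed to the predictor as a separate feature.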
Figure 4: Seasonal and trend components of a time series. In blue, the original series shows a varying behaviour which can be further explained by a trend (in black) and a seasonality (in orange). The trend follows the moving average of the series and the seasonality represents the cyclical aspect of the original series as a monthly oscillation.
Further, temporal modelling that uses previous observations requires an adequate
selection of the number of lags to be considered. In the case of ARIMA components, we did
not find an optimal methodology behind the selection of its lags, only practical guidelines
considering autocorrelation functions, but the seminal study of Box et al. (2015) suggested prioritizing parsimony (less complex models) to avoid overfitting. It is reasonable to think that lag selection should depend on the time series sample frequency. STL theorists recommend that the number of lags of the seasonal component be related to the time series sample frequency, e.g. taking 12 lags in monthly series (CLEVELAND et al., 1990). The underlying assumption is that the 13th month is highly correlated with the 1st, so the features introduced beyond that point would not add seasonal information in proportion to the complexity of adding one more feature. Still, while fewer features may keep sufficient information, we do not know which specific lags contribute most to each time series: one crime type might benefit most from the first three lags, while another benefits most from the last ones. To address this, we discuss later in this chapter feature selection techniques
based on machine learning.
2.2.3 Geographic Features
After applying crime mapping and extracting spatiotemporal features from crime
points, other secondary factors can help to explain the urban places from which the subregions were derived. Related studies have proposed joining exogenous information to help in crime prediction tasks, such as social media traffic (GERBER, 2014), demographic aspects (BOGOMOLOV et al., 2014) and the geographic location of PoI data (LIN; YEN; YU, 2018). We argue that these strategies are difficult to reproduce in an arbitrary crime prediction application, since data availability can be a problem. Among the three types of information mentioned, we find that the latter can be acquired through volunteered geographic information systems such as OpenStreetMap. Thus, our secondary set of features, namely geographic features, will be derived from the PoI data available on OpenStreetMap to ensure greater reproducibility potential across other cities.
To identify relevant PoI categories that help describe hotspots, we conducted an opinion survey in Natal's police department (SESED/RN) with police officers who work on patrol. A total of 54 interviewed officers were asked to assign a value for the relative importance (an integer between 1 and 5) of different PoI categories that may spatially explain the incidence of three crime types: property crimes (CVP in the nomenclature of Natal's Police Department), violent or lethal crimes (CVLI) and drug-related crimes (TRED). We instructed them to assign 5 to the items that most contribute to (attracting or repulsing) crime incidence and 1 to items that, in their opinion, have no influence. The results are
shown in Figure 5.
Figure 5: Results of the survey with 54 police sergeants rating different landmarks and demographic aspects for determining hotspots. At the top, the rating for property crimes (CVP); in the middle, for lethal or violent crimes (CVLI); and at the bottom, for drug-related crimes (TRED).
The officers believe "Gangs location", "Street lighting level" and "Public squares" are crucial aspects for the three crime categories. Particularly for TRED crimes, "Schools" and "Touristic places" arise as essential features. Demographic elements, such as the neighborhood's population and its per capita income, are other aspects they regard as relevant when describing dangerous places. Another interesting finding is that residential streets are highlighted for CVP crimes, as found in related work (DAVIES; JOHNSON, 2015). From our perspective, the results of the questionnaire do not necessarily reflect the aspects that determine which places are dangerous, but they give us a direction for selecting PoI categories among the broad number of OpenStreetMap features. We must also note that the purpose of the survey was not to compare the officers' opinions with algorithm results, but to consult their opinions regarding geographic risk factors and then model features considering data availability.
Related works have considered geographic features to model crime incidence using
PoI data aggregation. Caplan and Kennedy (2011) suggested that the density of some
facilities in city blocks, considering a bandwidth, can represent their spatial concentration.
They also indicated that, for violent crimes, the distance to the closest facilities may be another correlated spatial pattern. Wang, Brown and Gerber (2012) have used both
mentioned methods, counting PoI within city blocks and taking the distance from the
city block center to the closest PoI, generating spatial information regarding each layer of
PoI. Differently, Lin, Yen and Yu (2018) used the counting strategy but also weighted neighboring city blocks to increase the spatial autocorrelation of their aggregation.
We argue that these approaches can be adapted, using KDE, to capture both density and distance decay in a single variable (example in Figure 6), as we will discuss later. In the next chapter, we also present our approach to selecting the subset of PoI to be included in the predictions of each crime type, considering the particular correlations that each facility type may present.
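A sketch of this single-variable adaptation, assuming scikit-learn's KernelDensity and hypothetical PoI and grid coordinates:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical PoI layer (e.g. schools) and grid cell centres on a toy map.
rng = np.random.default_rng(4)
schools = rng.uniform(0, 10, size=(40, 2))
cells = np.array([[x, y] for x in range(10) for y in range(10)], dtype=float)

# One KDE per PoI layer: each cell gets a single value that encodes both
# how many facilities are nearby (density) and how far they are
# (distance decay through the kernel).
kde = KernelDensity(kernel="linear", bandwidth=2.0).fit(schools)
school_feature = np.exp(kde.score_samples(cells))  # one density per cell
```

This static layer is then joined to the spatiotemporal features of each grid cell.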
2.3 Machine Learning
In the previous section, we explored the methods behind translating crime events
and PoI locations into independent variables (features) for describing crime incidence in
space and time. We mentioned that the crime mapping method provides the spatial aggregation and that, given a temporal sample frequency, one can generate time series for each grid cell. Further, to extract a more diverse set of temporal patterns (such as trend and seasonality), we described time series decomposition methods, which we argue
Figure 6: Geographic feature layers from Natal generated using KDE of residential streets (left) and schools (right). Note that residential streets are denser and concentrated in the north of the city, but still widespread elsewhere. Schools are more concentrated in the center of the city, extending to the south, but with some concentration in the north.
are also useful as feature extraction methods. To complement the features with external information, we suggested using OpenStreetMap data and extracting geographic features based on PoI density, instead of the current practice of related studies. Nonetheless, the prediction methodology has not been discussed yet.
To forecast crime incidence levels several periods ahead using spatiotemporal features, supervised machine learning methods have emerged as efficient tools in many recent related studies (LIN; YEN; YU, 2018; VOMFELL; HÄRDLE; LESSMANN, 2018; ZIEHR, 2017; ARAUJO et al., 2017; BORGES et al., 2017). Within this class of heuristic algorithms, there is a set specifically designed to learn relationships between a group of inputs or features X and an output or target variable y. These algorithms are called supervised because they
iteratively adjust internal weights to minimize the error between the predictions and the
actual value. This process is called training and involves adjusting internal parameters
using features extracted from the dataset. After training a model using an algorithm, one
can use it to predict values for new data.
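The train-then-predict cycle can be sketched with scikit-learn; the lag-based feature layout below is illustrative, not the framework's exact schema:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic training set: each row is a grid cell at a time step, the
# features are lagged incidence values (X) and the target is the
# next-step incidence (y).
rng = np.random.default_rng(5)
lags = rng.poisson(lam=4.0, size=(300, 3)).astype(float)
y = lags.mean(axis=1) + rng.normal(scale=0.5, size=300)

# "Training": the algorithm adjusts internal parameters to minimize the
# error between its predictions and the actual values.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(lags, y)

# After training, the model predicts incidence for unseen feature rows.
preds = model.predict(rng.poisson(lam=4.0, size=(5, 3)).astype(float))
```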
In crime prediction studies, researchers have used many supervised machine learning
algorithms, such as Support Vector Machines (YU et al., 2011), Random Forest (VOMFELL; HÄRDLE; LESSMANN, 2018), Multilayer Perceptron (ARAUJO et al., 2018), approaches based on Deep Neural Networks (LIN; YEN; YU, 2018), and several others. To the best of our
knowledge, there is no consensus on the best algorithm for crime prediction tasks. In this work, we do not intend to search for the best algorithm among those mentioned; rather, we aim to evaluate how much performance can vary across different algorithms and different crime mapping approaches. In our experiments, we consider Random Forest and Gradient Boosting for two reasons. First, they were empirically suggested as efficient algorithms in crime prediction studies, respectively by Borges et al. (2017) and by Vomfell, Härdle and Lessmann (2018). Second, the two are similar to each other, being ensemble algorithms based on Decision Trees, i.e. they are constituted by a finite set of Decision Trees combined to provide a better prediction. The
assumption behind ensemble algorithms is that a group of weak learners forms a stronger
one (BREIMAN, 2001).
Still, each of these ensemble algorithms has its own way of combining learners, namely "bagging" in Random Forest and "boosting" in Gradient Boosting. In bagging, a model randomly chooses subsets of the data, with replacement, as training samples for the Decision Trees, fits them, and then returns the average of their predictions. In addition to this process, the so-called bootstrap aggregation, the Random Forest algorithm trains each of its trees with different features, randomly selected for each one. On the other hand, in boosting a model incrementally adds new learners, updating the weights (using gradient descent in the case of Gradient Boosting) of the samples with more mispredictions (VOMFELL; HÄRDLE; LESSMANN, 2018).
Besides the algorithm choice, there are other concerns when applying machine learning
for modelling hotspot predictions. First, supervised machine learning is often distinguished
between classification and regression tasks, and in crime analysis, this can change the
output considerably, as we will discuss. Second, we present feature selection methods, also based on machine learning, to filter the lag components of each temporal decomposition, as well as the PoI data layers, retaining those most important for a more accurate estimation of each type of crime.
2.3.1 Prediction Task
A prediction task is defined according to the target variable, which can be a class (hotspot or coldspot) or ordinal values (crime incidence levels). Previous work implemented both classifiers (BOGOMOLOV et al., 2014; ARAUJO et al., 2018; LIN; YEN; YU, 2018) and regressors (MALIK et al., 2014; BORGES et al., 2017; ARAUJO et al., 2017) to estimate dangerous places in the future. From our perspective, the latter strategy is more
appropriate, since classifying a place as a hotspot or not, or even as "low", "medium"
and "high" dangerousness, must follow an aggregation on an ordinal hierarchy derived
from crime incidence values. Thus, aggregating this value into classes would hide inherent
variance present in the data.
Another concern related to such aggregation is that, depending on the particular quantitative definition of a hotspot (e.g. more incidences than the average of the last four observations), class imbalance may be a problem (ARAUJO et al., 2018). A hotspot threshold may delimit only a few samples, leaving one class heavily overrepresented. Also, much of the sample variance is lost in this discretization. On the other hand, modelling regressors on highly variant samples requires further inspection and outlier filtering to prevent the model from making biased predictions. For instance, if samples are concentrated in lower values, the model will prefer to predict lower values to get better overall performance. We argue that the crime mapping parameter selection described in Section 2.2 is crucial to obtain parsimonious target variables and, consequently, more efficient models. Thus, the choice of the prediction task involves a trade-off between samples biased by the choice of a threshold and highly variant samples when considering raw crime incidence levels.
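The discretization trade-off can be seen on synthetic, right-skewed counts (the threshold below is illustrative):

```python
import numpy as np

# Synthetic, right-skewed crime counts per grid cell (Poisson-like data).
rng = np.random.default_rng(6)
counts = rng.poisson(lam=1.0, size=1000)

# Classification view: a cell is a "hotspot" when its count exceeds an
# illustrative threshold. The discretization hides within-class variance
# and can leave the positive class heavily under-represented.
threshold = 1
hotspot = counts > threshold
imbalance = hotspot.mean()   # fraction of positive (hotspot) samples
```

With skewed counts, the positive class is a small minority of the samples, which is exactly the imbalance problem discussed above.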
2.3.2 Feature Selection
As discussed, the selection of temporal lags and PoI layers leads to a leaner representation of spatiotemporal and geographic patterns. For instance, the trend component extracted from crime time series tends to be more correlated with the first lags, and the seasonal component with the last ones. Also, hospital density may be relevant to predict burglary, but not violent crimes. When a model uses many variables to predict a target, the fitting process becomes more complex; the model may end up with many more parameters than informative inputs, resulting in a lack of stability and overfitting (VERLEYSEN; FRANÇOIS, 2005). Sometimes adding variables can even disturb algorithm performance, because the algorithm will try to fit variables that are mere noise with respect to the predicted variable. Therefore, applying feature selection is an essential step to ensure the models use only the most informative of the available variables.
The feature selection task can be performed by machine learning algorithms in three different ways: using wrapper, embedded or filter methods. Wrapper methods combine a search strategy with a predictor to select the best subset of features, training a machine learning algorithm on randomly taken subsets of features and producing a set of models. The model with the best performance identifies the set of features to be selected. Embedded methods differ from wrapper methods because they analyze the model structure instead of its performance. They consider the weights assigned by the predictor to each feature as a measure of importance, excluding the least important ones. Finally, filter methods consider feature importance by using a correlation measure (e.g. χ2) with the target variable, and also calculate feature-to-feature correlation to avoid redundancy.
In this work, we followed the suggestion of Kniberg and Nokto (2018), who systematically evaluated several feature selection algorithms and provided useful guidelines. Although they did not identify a single algorithm that excels in both runtime and predictive performance, they suggested that a Decision Tree modelled as an embedded method gives reasonable results in both aspects. The idea of this embedded method is to calculate feature importance by applying random permutations to each feature and measuring the average performance drop when fitting the Decision Tree. Features with the lowest loss are assumed to be unimportant, because they do not influence the predictions as much as the others.
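A sketch of this permutation-based embedded selection, using scikit-learn's permutation_importance with a Decision Tree on synthetic features (only the first two actually drive the target):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.inspection import permutation_importance

# Synthetic features: only the first two drive the target; the last two
# are pure noise (stand-ins for irrelevant lag or PoI features).
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 4))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=400)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Randomly permute each feature and measure the average performance drop;
# features whose permutation barely hurts the score are deemed unimportant.
result = permutation_importance(tree, X, y, n_repeats=10, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]  # most important first
```

The least important features at the tail of the ranking are the candidates for exclusion from the final feature set.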
3 The Predspot Framework
Given that data-driven predictive analysis varies according to the developers’ exper-
tise, one can find different methodologies and frameworks to predict crime hotspots. For
instance, Malik et al. (2014) divided the stages of their processing into (1) geospatial division into subregions, (2) generation of time series, (3) prediction and (4) visualization of
results. Similarly, Lin, Yen and Yu (2018) proposed to (1) create a grid, (2) intersect the
grid in the city map, (3 to 6) extract data and grid features, (7) train a machine learning
algorithm and (8) assess the latter. The similarities across methodologies motivated us to
pursue a more generalized approach, namely Predspot, which we discuss in this chapter.
In previous work, we proposed a framework detailing the steps of spatiotemporal
modelling and machine learning for crime hotspot prediction (ARAUJO et al., 2018). Our
purpose was to improve the tasks’ transparency and the parameter selection involved.
In this chapter, we introduce a redesign of this framework to include a more generic approach for applying efficient crime mapping and a more detailed feature ingest procedure. We agree with Domingos (2012), who suggests that the success of a machine learning solution lies in feature engineering efforts. The framework is divided into two phases, namely
model selection and prediction service, analogously to the training and prediction steps of
machine learning algorithms. Each phase has its steps to achieve the final goal. This divi-
sion has the purpose of differentiating the model adjustment and its usage in operational
policing software.
Furthermore, in this chapter, we explain how our methodology was implemented.
We detail our python-package software to support model selection operations and a web
service interface to illustrate how the prediction service can be managed. These elements
shall guide the software routines involved in deploying the Predspot framework in a police department.
3.1 Model Selection
As discussed in Section 2.3, supervised machine learning algorithms need a training step to adjust internal parameters and to predict based on the patterns found in the training data. In this section, we overview how to prepare the dataset, apply a crime mapping method, extract features from time series and fit a model considering hyperparameter tuning. This workflow comprises three steps, namely "dataset preparation", "feature ingest" and "machine learning modelling". These three steps compose the so-called model selection phase, whose purpose is to train, evaluate and save an efficient model to be used operationally. We describe each step, providing explanations of the necessary inputs and parameters to be configured, as well as the corresponding workflow and how to evaluate the selected model. It is worth mentioning that the evaluation of the model selection phase is limited to assessing the predictions of the adjusted model. Thus, it is measured in terms of error or accuracy ratios and is not directly concerned with the practical impact of policing operations.
3.1.1 Dataset Preparation
From a systemic point of view, the inputs of the model selection phase are a vectorized
file of city map (e.g. shapefile), a sufficiently large database of crime records provided by
the police department, and auxiliary data sources. To apply the procedures, the crime
records must have been registered at least with latitude, longitude, timestamp and crime
type. Also, in the Predspot framework, we propose using OpenStreetMap as the auxiliary
data source. It provides data of Points-of-Interest (PoI) of many cities in an open-source
manner, thus increasing the reproducibility potential of our approach. The data can be
extracted through the Overpass API.¹
The first processing step of model selection is "dataset preparation", illustrated in
Figure 7. It concerns (1) loading the data from Crimes DB and external sources, (2)
applying spatial filters using the City Shape and (3) separating crime data into Crime
Scenarios, according to a division of crime types. The spatial filtering process consists
of applying a spatial join operation taking georeferenced data of crimes and PoI that
are within the city boundaries. It is also important to drop duplicate records and to check for "default" location values, to which events without proper registration are improperly assigned. Since crime data is mostly acquired through human interactions, spatial bias can arise (KOCHEL; WEISBURD, 2017). We argue that exploring the dataset beforehand to clean invalid records is crucial to avoid harming the model with invalid data.

¹https://wiki.openstreetmap.org/wiki/Overpass_API
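A simplified sketch of such cleaning with pandas; the column names, placeholder location and bounding box are illustrative, and a real pipeline would use the city shapefile with a spatial join instead:

```python
import pandas as pd

# Toy crime records; in practice these come from the Crimes DB. The column
# names are illustrative, not the department's schema.
crimes = pd.DataFrame({
    "lat":  [-5.79, -5.81, -5.81, 0.0, -5.83],
    "lon":  [-35.20, -35.21, -35.21, 0.0, -35.25],
    "timestamp": pd.to_datetime(
        ["2019-01-01", "2019-01-02", "2019-01-02",
         "2019-01-03", "2019-01-04"]),
    "crime_type": ["burglary", "burglary", "burglary",
                   "robbery", "robbery"],
})

# (1) drop exact duplicates, (2) drop "default" (0, 0) placeholder
# locations, (3) keep only points inside a bounding box around the city.
clean = crimes.drop_duplicates()
clean = clean[(clean.lat != 0) | (clean.lon != 0)]
clean = clean[clean.lat.between(-6.0, -5.6) & clean.lon.between(-35.4, -35.0)]
```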
Figure 7: Model selection begins with loading and preparing the datasets. Required data sources include a crime database for model training and a connection to the OpenStreetMap API to load PoI data. Also, the city's shapefile is important for filtering the data registered within its borders.
In addition, to manage the separation of crime scenarios, it is important to follow
the division made by the local authorities presented in the data. For example, if the
police department is concerned with burglary crimes and the data has many types of
burglaries, the aggregation of all burglary types must be made carefully and aligned
with the police department's opinion. Otherwise, we suggest following the default division
of the data rigorously. Supported by empirical evidence (ANDRESEN; LINNING, 2012), we
argue that it is not appropriate to aggregate crime types into a broader category. The
sum of spatial contributions of different sources of crime, such as residential burglary and
drug-related offenses, can generate areas in the middle of these two types of events where
no crime happens at all. Also, Natal's police department suggested dividing the data
into daytime and nighttime scenarios for each crime type, since different patterns can
arise. In the following steps of Predspot, we use each of these crime scenarios (crime type
and day period) as separate datasets.
Regarding PoI data extraction, one can easily download data from OpenStreetMap
querying from the Overpass API or using the web-based tool Overpass Turbo.2 The
data is categorized into map features: streets, traffic signs, and intersections belonging
to the "highway" category; hospitals, schools, restaurants and other facilities belonging
to "amenity"; other interesting categories are "leisure", "tourism" and "nature".3 Still,
choosing PoI categories may not be a trivial task, and we argue that it can be done by
analyzing geographic risk factors with the help of policing experts, as presented in Section
2.2.3, or by analyzing related studies. For example, Davies and Johnson (2015) argue that
the street network showed relevance in their crime prediction study. The selection of the
most relevant PoI layers is discussed later in this chapter as a feature selection problem.

2 https://overpass-turbo.eu/
3 https://wiki.openstreetmap.org/wiki/Map_Features
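As an illustration of the PoI extraction discussed above, the sketch below assembles an Overpass QL query for one map feature within a bounding box. The endpoint URL, the bounding box around Natal and the helper name `build_poi_query` are illustrative assumptions, not part of the Predspot package.

```python
# Sketch of querying PoI data from the Overpass API; names and the
# bounding box are illustrative assumptions.
import urllib.parse

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def build_poi_query(category, value, bbox):
    """Build an Overpass QL query for PoIs of a given map feature
    (e.g. amenity=school) within a (south, west, north, east) bbox."""
    s, w, n, e = bbox
    return (
        "[out:json][timeout:60];"
        f'node["{category}"="{value}"]({s},{w},{n},{e});'
        "out center;"
    )

# Example: schools inside a bounding box roughly around Natal, Brazil.
query = build_poi_query("amenity", "school", (-5.92, -35.30, -5.70, -35.15))
request_url = OVERPASS_URL + "?" + urllib.parse.urlencode({"data": query})
```

The resulting URL can be fetched with any HTTP client; the JSON response lists one node per PoI with its coordinates.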
3.1.2 Feature Ingest
Before the "feature ingest" step begins, it is necessary to choose the spatial and
temporal units of predictions. First, the choice of the spatial unit of analysis can be made
according to police department patrolling policies, or according to current practices of
crime mapping. If the police department wants predictions by neighborhood, the grid is
made up of neighborhoods in the city. If there is no such restriction and spatial resolution
is a priority, artificial grids from a crime mapping method may be a suitable alternative.
For example, KDE works with a grid of points, and KGrid with a set of convex hull
polygons. Artificial grids have the advantage of configuring spatial resolution through a
parameter, for example K in KGrid or the spacing between grid points in KDE. As
a drawback, high-resolution grids generate sparser time series that are more challenging to
predict (MALIK et al., 2014; BORGES et al., 2017). Therefore, one cannot merely increase
the grid resolution without further inspecting the results. On the other hand, the second
choice is the temporal sample frequency. Similarly to grid cells, too small aggregation
intervals may also result in sparse time series. Even though police may prefer
hourly sampled predictions, we agree with Malik et al. (2014), who suggested that weekly or
monthly aggregates are more appropriate, depending on the sample size.
With the spatial and temporal units defined, the "feature ingest" step’s workflow con-
sists of the following procedure (illustrated in Figure 8). The first data item necessary is
the set of crime events of a given crime scenario (described as tuples of latitude, longitude
and timestamp). As these events have spatiotemporal attributes, we start the ingest by aggregating
crimes spatially, according to the Grid disposition derived from the Mapping
Method, and then temporally, by sampling time series for each place and time interval,
completing the so-called Spatiotemporal Aggreg. This results in the Time Series Cij,
indexed by the grid cell i and the time interval j.
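The spatiotemporal aggregation can be sketched as below. A naive lat/lon rounding grid stands in for the crime mapping method, and all names and values are illustrative assumptions, not the package's API.

```python
# Sketch of the spatiotemporal aggregation producing the time series Cij:
# crimes are binned into grid cells (a naive lat/lon rounding grid stands
# in for the crime mapping method) and resampled monthly per cell.
import pandas as pd

def aggregate(crimes, cell_size=0.01, freq="MS"):
    df = crimes.copy()
    df["cell"] = ((df["lat"] // cell_size).astype(int).astype(str) + "_"
                  + (df["lon"] // cell_size).astype(int).astype(str))
    df["t"] = pd.to_datetime(df["timestamp"])
    counts = (df.groupby(["cell", pd.Grouper(key="t", freq=freq)])
                .size().rename("count").reset_index())
    # Pivot to a (time interval j) x (grid cell i) matrix, zero-filled.
    return counts.pivot(index="t", columns="cell", values="count").fillna(0)

crimes = pd.DataFrame({
    "lat": [-5.81, -5.81, -5.79], "lon": [-35.21, -35.21, -35.20],
    "timestamp": ["2018-01-03", "2018-02-10", "2018-01-20"]})
C = aggregate(crimes)  # two months x two cells, three events in total
```

Each column of `C` is one series C1j, C2j, ..., ready for the temporal feature extraction described next.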
Then, we use these time series to start the Temporal Feature Extraction, illus-
trated in Figure 9. In Section 2.2.2 we described two time series decomposition methods,
namely ARIMA and STL. Although each method has its way of setting parameters to
make predictions, we take advantage of the derived components as the basis for the feature
Figure 8: Between data loading and model building, the feature ingest process is responsible for assembling the independent variables and the variable to be predicted. This starts with the spatiotemporal aggregation of crimes, through the crime mapping method and time series manipulation, and the spatial aggregation of PoI data. The grid is a supporting element in this step and will be used as the set of places where criminal incidence will be predicted. Time series Cij extracted from grid cells indicate a set of values from a grid cell i in period j, and PoI features G^k_i represent the density of a PoI category k located in a grid cell i.
extraction process to feed supervised machine learning algorithms. Then, we consider as
learning target the time series of original values one period ahead. Assuming that each
region i of the Grid has particular trend, seasonality and differentiation aspects, we apply
STL and Diff. operations to C1j, C2j, ..., Cnj (n being the size of the Grid) separately.
Finally, we take k lags, according to the temporal sample frequency, to represent past
observations of each component as T^k_ij, S^k_ij, D^k_ij. The objective is to represent the corresponding
crime incidence level of time j (target) with independent variables (features)
expressed as past observations of trend, seasonality and differentiation at times j − 1,
j − 2, ..., j − k.
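A simplified sketch of this extraction is shown below. A rolling-mean trend and a periodic-mean seasonality stand in for STL, and the helper name and toy series are assumptions for illustration only.

```python
# Simplified sketch of the temporal feature extraction: a rolling-mean
# trend and a periodic-mean seasonality stand in for STL, followed by
# differencing and k lagged columns per component.
import pandas as pd

def temporal_features(c, period=12, k=3):
    trend = c.rolling(period, min_periods=1).mean()
    seasonal = (c - trend).groupby(c.index % period).transform("mean")
    diff = c.diff()
    features = {}
    for name, comp in [("T", trend), ("S", seasonal), ("D", diff)]:
        for lag in range(1, k + 1):
            features[f"{name}{lag}"] = comp.shift(lag)  # past observations
    X = pd.DataFrame(features)
    # Target y: the crime incidence one step ahead of the lagged features.
    return X.dropna().join(c.rename("y"))

c = pd.Series(range(30))        # toy time series Cij for a single cell
data = temporal_features(c)     # columns T1..T3, S1..S3, D1..D3, y
```

Each row of `data` pairs the incidence at time j with trend, seasonality and differentiation values observed at j − 1, ..., j − k.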
Further, we propose to complement such temporal representation of crime incidence,
using PoI data to help to describe each place i of the Grid geographically in terms of
the facilities nearby. To do so, it is necessary to apply a Spatial Aggreg. operation.
In Section 2.2.3, we discussed that related studies have considered counting PoI items
within grid cells or measuring the distance to the closest PoI (WANG; BROWN; GERBER, 2012;
LIN; YEN; YU, 2018). We argued that KDE can include both aspects by weighting
items with a kernel that considers distance decay. Therefore, we propose using KDE as
the Spatial Aggreg. method. The objective of such aggregation is to produce PoI
density values G_hi for each region i and each PoI category h extracted.
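The KDE-based spatial aggregation can be sketched as follows: each grid-cell center receives a density value summing Gaussian-weighted contributions from the PoIs of one category. The bandwidth, coordinates and helper name are illustrative assumptions.

```python
# Sketch of the PoI spatial aggregation via KDE: each grid-cell center
# receives a density value summing distance-decayed contributions from
# the PoIs of one category (bandwidth h is an illustrative assumption).
import math

def poi_density(cell_centers, poi_points, h=0.5):
    """Return one density value per cell center (the G_hi values)."""
    densities = []
    for cx, cy in cell_centers:
        total = 0.0
        for px, py in poi_points:
            d2 = (cx - px) ** 2 + (cy - py) ** 2
            total += math.exp(-d2 / (2 * h * h))  # Gaussian distance decay
        densities.append(total)
    return densities

cells = [(0.0, 0.0), (5.0, 5.0)]
pois = [(0.1, 0.0), (0.0, 0.2)]   # e.g. two schools near the first cell
g = poi_density(cells, pois)      # first cell far denser than the second
```

Unlike a plain count within the cell, nearby PoIs just outside the boundary still contribute, with weight decaying with distance.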
Figure 9: The extraction of the independent variables (features) of the time series in Predspot is conducted through trend and seasonality decomposition (through STL) and series differentiation. Each decomposition generates a new series, and the variables are composed of k lags from each of these series.
Ultimately, the "feature ingest" step ends by joining T , S, D and G to produce the
feature matrix X, and the corresponding crime incidence levels target y that can be used
to fit a machine learning model. In Figure 10, we illustrate an artificial example of how
the temporal features can be arranged. Besides, we also consider timestamp attributes,
such as the corresponding year and month, and the geographic features, to compose the
final Feature Set.
[Figure 10 content: a table with columns Place, Time, T1–T3, S1–S3, D1–D3, G1–G5 and target y, showing example feature values for two places over three time intervals.]
Figure 10: An artificial example of temporal features for two places and three time intervals.
3.1.3 Machine Learning Modelling
The input required to start the machine learning modelling step is the feature set.
As we discussed in Section 2.3.3, feature dimensionality can influence the performance
of the predictive algorithm, introducing complexity in adjusting the model weights and
even adding noise to the data. Thus, to filter the features of each crime scenario, feature
selection is necessary before adjusting the models. We propose it to be done as part of
the "machine learning modeling" step because we use machine learning-based strategies
for feature selection, as suggested by Kniberg and Nokto (2018).
With noisy features properly filtered, different supervised learning algorithms can be
adjusted to predict crime hotspots, following the flow illustrated in Figure 11. There is not
a clear consensus on the best machine learning algorithm for predicting crime hotspots.
As in the previous version of our framework, we do not propose the use of a single learning
algorithm, but experimenting with several of them to select the one with the best score
for each crime scenario. In this sense, our framework takes a model-agnostic approach
to use the one that fits better, according to appropriate assessment. Considering that
a substantial volume of machine learning algorithms has recently emerged, we argue
that the search for the best algorithm could displace the focus from the modelling approach
proposed in this study.4
Figure 11: In the machine learning modelling step, the best qualified features are selected and fed into the algorithms. By adjusting various algorithms, we can evaluate the models and use the one with the best predictive performance.
Still, we consider it relevant to use a Tuning strategy to optimize machine learning
hyperparameters. Previous studies have shown that applying it improves algorithm performance
(BERGSTRA; BENGIO, 2012); Olson et al. (2017) suggested that hyperparameter
tuning led to improvements of up to 50% in CV score. In Predspot, as we are training
a model with lag components of time series, we propose using the K-Fold Time Series
Cross-Validator (BERGMEIR; HYNDMAN; KOO, 2018), a variation of K-Fold, for each hyperparameter
configuration experimented. The feature set extracted from the training
data is divided into K folds; the kth split returns the first k folds as the train set and the
(k + 1)th fold as the test set.

4 Automated machine learning (FEURER et al., 2015) has shown to be a promising model-agnostic approach which we will explore in future work.
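The splitting scheme just described can be sketched in a few lines; the function below is an illustrative stand-in (similar in spirit to scikit-learn's TimeSeriesSplit), not the exact validator cited.

```python
# Sketch of the K-Fold Time Series Cross-Validator described above: the
# k-th split uses the first k folds for training and fold k+1 for testing,
# so the test set never precedes its training window.
def time_series_kfold(n_samples, n_folds=5):
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_idx = list(range(0, k * fold_size))
        test_idx = list(range(k * fold_size, (k + 1) * fold_size))
        yield train_idx, test_idx

splits = list(time_series_kfold(60, n_folds=5))
# Each successive training window grows by one fold.
```

During tuning, each hyperparameter configuration is scored over these splits and the configuration with the best average CV score is kept.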
Finally, the selection of the so-called Golden Model is made by considering the one
with the best cross-validation score. This golden model should be saved for operational use
in the prediction phase that we discuss next. Another piece of information that should be saved in
this process is which features were selected for each scenario by the Feature Selection
Algorithm, since the Golden Models were trained with separate Feature Sets. It
is worth recalling that all the steps in "feature ingest" and "machine learning modelling"
should be repeated for each crime scenario separately.
The score retrieved in the model selection phase indicates how good predictions can
be, but it does not represent the effectiveness of patrolling policies that use them. Also,
this score alone does not indicate how beneficial using model predictions is in contrast
to using traditional approaches. To evaluate the models produced, we suggest
adopting a comparison with the naïve approach of single lag autoregressive models, as
considered in related studies (BROWN; OXFORD, 2001; COHEN; GORR; OLLIGSCHLAEGER,
2007).
3.2 Prediction Service
Towards predictive hotspot policing operations, robust software must connect the
crime database to the pipeline we described, and retrieve geographical information on future
crime hotspots. In the model selection phase, we trained an efficient model to generate
predictions. In the next phase, namely prediction service, we describe how the predictions
can be taken for new data using the trained model (the "prediction pipeline" step) and
how this processing can be structured in a conceptual software architecture (the "web
service" step). We provide a high-level description in which the interaction between the
service and the prediction consumer is detailed using the routines we described previously
in this chapter. With the service architecture detailed, we do not aim to provide a highly
scalable processing workflow, but rather to yield a software interface for predicting crime
hotspots. Additionally, we discuss the reasons why a service may be more suitable than a
dashboard tool and indicate prediction output format encoding.
3.2.1 Prediction Pipeline
The process of extracting features and using the trained model to return predictions
for each new period automatically depends on a connection to the crime database. In
contrast to the "dataset preparation" step that uses a static training dataset, the first
task of the "prediction pipeline" step (illustrated in Figure 12) should process incoming
data, loaded from such updated crime database. For each new period, the pipeline must
trigger a data load, incorporating new crime events to the previously stored data. Then,
the same preparations described in the "dataset preparation" phase should be applied,
such as spatially filtering the data and dividing the crimes into crime scenarios.
Figure 12: In the prediction pipeline, one must tailor the model selection steps to return predictions using previously trained models. This means no longer loading the entire dataset, just the crimes from the last period, then extracting temporal features, filtering previously selected features, and requesting predictions from the trained model.
After that, to apply the feature ingest procedures, there must be a sufficient subset
of data to extract the previously configured k lags. For example, if the prediction is made
monthly and k = 12 lags were extracted, the data needed should include all criminal events
in the given scenario from the 12 months before that period. With a sufficient set of data, the
only "feature ingest" task that must be repeated is the spatiotemporal aggregation, because
it is not necessary to reapply the spatial aggregation of PoI, given their temporal immutability.
Still, the geographic features must be rejoined with the temporal features extracted.
After extracting all the features described, it is necessary to filter the same set of
features selected previously for each crime scenario and use them to feed the trained
models. No training tasks, such as tuning and model evaluation, should be applied during
this process, since the adjusted model has already been considered the most efficient. To
complete the "prediction pipeline" step, one must ensure that (i) the prediction output consists
of crime incidence levels yi for each place i one step ahead in time, and that (ii) this
process can be repeated for each new period.
3.2.2 Web Service
In the next step, the methods applied before are organized into a software architecture.
Although no previous work has discussed such an implementation issue, to the
best of our knowledge, we argue that efforts for improving the transparency of predictive
policing algorithms should consider how these systems would be deployed in the pool of
existing software of the police department.
In this step, we have put together, in a high-level description, the model selection
phase and the "prediction pipeline" step, detailing how this articulation can be assisted
by auxiliary components (see Figure 13). A first component to support the described
operations is the Volume. It consists of a file system where the following should be stored: (i) the
recently loaded crime data from the updated database, (ii) the trained model, (iii) the list
of features selected for each crime scenario, and (iv) all the predictions retrieved, so that
queries can readily request them again. The second component consists of an Extract,
Transform and Load (ETL) Controller routine dedicated to (i) triggering the prediction
pipeline for each new period, as well as (ii) managing the GIS client requests and the
access to the Volume.
As mentioned, we do not intend to describe an optimized architecture, but rather
to outline how information can flow from the crime database to the trained model. We
propose this workflow to be designed in the form of a web service, decoupled from other
police department software. The reason is that a police department
may have several dashboard tools for interacting with georeferenced data. Thus, adding
one more separate tool only to visualize hotspot predictions may isolate information in
different environments, hindering the decision-making process. Therefore, we argue that a
decoupled service that makes predictions available on demand to other software through
an application interface increases the usefulness of the latter without competing with it
in terms of usability. We suggest GeoJSON as the output format of the predictions,
so that requests are met with information already encoded in a simple geographic
representation format that can readily be visualized in a GIS user interface.
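The encoding step can be sketched as follows; the helper name, the polygon ring and the incidence value are illustrative, not the service's actual output.

```python
# Sketch of encoding hotspot predictions as GeoJSON, the output format
# suggested for the service (cell polygon and values are illustrative).
import json

def predictions_to_geojson(cells, levels):
    """cells: list of polygon rings; levels: predicted incidence per cell."""
    features = [
        {"type": "Feature",
         "geometry": {"type": "Polygon", "coordinates": [ring]},
         "properties": {"crime_incidence": level}}
        for ring, level in zip(cells, levels)
    ]
    return json.dumps({"type": "FeatureCollection", "features": features})

# A single closed ring (lon, lat pairs) around one grid cell.
ring = [[-35.21, -5.81], [-35.20, -5.81], [-35.20, -5.80],
        [-35.21, -5.80], [-35.21, -5.81]]
geojson = predictions_to_geojson([ring], [0.73])
```

Any GIS client that understands GeoJSON can render such a FeatureCollection directly, coloring cells by the `crime_incidence` property.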
Figure 13: The web service comprises managing the prediction pipeline to attend to online requests. This requires the use of a file system, which we call the Volume, for caching the prediction results, and an ETL process controller to trigger prediction generation for each new period.
3.3 Implementation
In this section, we overview the implementation we made for the Predspot framework.
The first part of the implementation is in the form of a python-package that supports
model selection operations. The package is organized into modules according to the steps
described. The second part of the implementation deals with a prototype of a web service,
providing some API routes and its respective inputs and responses.
3.3.1 Predspot python-package
As a solution for our second research objective (design open-source software and detail
its procedures to estimate future hotspots), in this section we describe the implementation
of Predspot regarding the model selection phase. The predspot python-package
is based on heavily used data analytics libraries, such as Numpy (WALT; COLBERT; VAROQUAUX, 2011),
Pandas (MCKINNEY et al., 2010), Shapely (GILLIES, 2013), GeoPandas
(JORDAHL, 2014) and Scikit-learn (PEDREGOSA et al., 2011). With such an implementation,
we intend (i) to evaluate our framework, by testing different crime mapping methods and
machine learning algorithms, and (ii) to support crime analysts building their own
prediction service, by providing the model selection routines we have implemented. This implementation
is therefore also crucial for achieving our first objective, since we need to evaluate our approach.
For more implementation details on the classes, methods and attributes, we refer to the
package documentation available online.5
3.3.2 Predspot service
For the implementation of a service capable of providing predictions for each new
period using the Predspot framework, here we detail some configurations and routines
needed. Aligned with the prediction service phase of the framework, there are some additional
components one has to implement. The first thing to do is to set a configuration file
with the volume path, temporal sample frequency, crime scenario tags and the methods'
parameters, for example defining that the predictions are made monthly, for burglary
and drug crimes, the bandwidth of KDE etc.
Having set all parameters, it would be necessary to implement connections to data
sources and import crime data from the operational database. Because the connection to
the Natal Police Department database must be protected for privacy reasons, we do not
detail further procedures involved in this step nor provide source code to support this
step as a whole.
Still, the data flow must allow the automation of prediction generation using the service
through a simple interface. As we described, the controller is the component that must be
implemented to automate the import and forecast generation for each new period, saving
the prediction results in the volume and returning them in response to the requests made. We have
implemented the service as an Application Programming Interface (API), with the following
routes.
• /import: to load new data;
• /train: to start the model selection process and save the trained models;
• /predict: to generate predictions for all crime scenarios and save them in the volume;

5 https://adaj.github.io/predspot
• /get_predictions: to return to the client the prediction results in a geographic file
format, such as GeoJSON.
4 Evaluation
The implementation of a predictive hotspot crime analysis system is based on the
premise that guiding policing with predictions can be more effective than relying solely on
the knowledge of police managers. To ensure that this happens, Moses and Chan (2018)
mentioned that two forms of assessment are required for predictive policing systems. First,
the accuracy or error of predictive algorithms should be assessed to establish theoretical
significance. Second, they mention that, just as importantly, the practical significance
of using these predictions in crime reduction should be assessed. In this study, we
focus on assessing prediction performance as a first step in the timeline of our research.
Thus, in line with the purpose of this study, this chapter will present empirical evidence
of the efficiency of the Predspot methodology.
We derive twelve divisions of crime scenarios from two datasets as our experimental
samples. An exploration of the datasets is done to illustrate spatial and temporal patterns
that were somewhat obscured by the theoretical detail of previous chapters.
Differences between sample sizes will bring interesting conclusions. PoI data extracted
from OpenStreetMap for both cities is also presented, and the extraction result of some
geographic features is depicted. This contextualizes the spatiotemporal modelling we
describe in the analysis of the results.
In addition, in this chapter we describe our approach to experimentation, including a
description of the methods and metrics involved. The details of the baseline approach are
explained and an alternative metric is proposed for assessing the predictive efficiency of
models based on different crime mapping methods. Finally, we present the performance
results of the adjusted models and discuss the efficiency of the previously proposed vari-
ables, measuring their importance in the different predictive approaches applied in the
twelve crime scenarios.
4.1 Datasets
The evaluation of the Predspot methodology is conducted using datasets from two
different cities, Natal (Brazil) and Boston (US). To the best of our knowledge, no previous
study used datasets from more than one city on its evaluation. The first dataset is from
Natal, Brazil, which was made available by the Natal police department for our research.
Privacy terms do not allow us to display the spatial distribution of this data. On the other
hand, the second dataset we use is from Boston, US, which is available online from the
city’s open data portal.1 Natal data are from January 2016 to the end of November 2018
(35 months). The data we used from Boston begins in June 2015 and ends in September
2018 (37 months).
According to Andresen and Linning (2012), hotspot analyses should consider disso-
ciating crime types of different sources, as we discussed in Section 3.1.1. The description
of the types of crimes used in this work is as follows. In the Natal data, there are groups
of similar crime types. According to direct consultation with police officers in Natal, the
most useful crime type groups for the police department are (i) property crimes (CVP),
such as robbery, theft, burglary etc., (ii) violent or lethal crimes (CVLI), and (iii) drug-related
crimes (TRED). On the other hand, the Boston data has no similar groupings.
Thus, we selected three types of crimes to analyze, namely Residential Burglary, Simple
Assault, and Drug Violation.
Moreover, in consultation with the police department of Natal, we found the need to
analyze separately the patterns of crimes that happen during the day and at night. Assuming that
there is a difference in the pattern of criminal incidence between these periods, we divide
each of the crime types mentioned into day and night versions. From now on, we refer to
the samples of a given type of crime at a specific part of the day as a crime scenario. This
division is used in the remainder of this work to delimit the sample sets used in our
experiments.
4.2 Experiment Methods and Metrics
Assuming that the performance of predictive approaches can be influenced by several
factors, experimental analysis needs to comprise more than one dimension. A first factor
analysed is the crime mapping method, of which we describe two in Section 2.2.1, namely
KGrid and KDE.

1 https://data.boston.gov

For each of these methods, different patterns may emerge, and we need to test their
capabilities to help predictive algorithms produce better results. The choice of the machine
learning algorithm is another aspect worth comparing. We analyze two decision tree ensemble machine learning algorithms, Random
Forest and Gradient Boosting. As mentioned in Section 2.3, it is not our goal to draw
conclusions about which one is best among the various machine learning algorithms for
each problem. Instead, we intend to measure whether there is a significant difference in
the use of a particular algorithm for a given crime mapping method. Thus, with these two
dimensions of analysis, we evaluate KGrid, as well as KDE, with Random Forest
and Gradient Boosting. We refer to the four corresponding prediction approaches as
KDE-RF, KDE-GB, KGrid-RF and KGrid-GB.
To manage a more robust evaluation, we compare our methodology with a baseline
method reference. For this work, we consider as the baseline method estimating the
next period using the previous one. The reason is that we want to compare a traditional
approach used by police managers against prediction models based on machine learning.
This approach was applied in many related studies with adaptations, e.g. (BROWN;
OXFORD, 2001; COHEN; GORR; OLLIGSCHLAEGER, 2007). As we mentioned, each crime
mapping method produces time series with different units of measures (KGrid counts
crime events within grid cells and KDE weights them by their distance to the grid cell
center). Fortunately, the baseline approach we describe can be applied to both crime
mapping methods, since it consists of a simple autoregressive lag. Therefore, we use the
performance of the baseline to normalize the performance of the prediction algorithm,
retrieving a dimensionless and relative metric of performance.
The evaluation metrics frequently used in related studies depend on the prediction
task involved. For prediction tasks based on the classification of hotspot or non-
hotspot, related studies, e.g. (HART; ZANDBERGEN, 2014; MOHLER et al., 2015), have used
hit rate and prediction accuracy index (PAI), which are based on the relationship between
predicted places as hotspots, their areas and the amount of events within them (CHAINEY;
TOMPSON; UHLIG, 2008). Since we are predicting ordinal values (crime incidence levels
within an area), we are applying regression, thus using error metrics such as the mean squared
error (MSE) used in several related studies, e.g. (BROWN; OXFORD, 2001; KADAR;
MACULAN; FEUERRIEGEL, 2019; ARAUJO et al., 2018). However, error metrics are strongly
dependent on the unit of measurement of the method applied. One exception is the mean
absolute percentage error (MAPE), a percentage-based metric, but its formula is not
applicable when the samples contain zeros, which is the case for some of our KGrid
time series. Another is the coefficient of determination (R²), but we argue it is not
suitable for comparing two crime mapping methods, because each involves a different
scale of aggregation, producing different variance patterns. Using R² to compare
two crime mapping methods would be similar to comparing the explained variance of a model
adjusted on a log scale with one adjusted on a square-root scale.
In this work, we introduce the prediction ratio (PR) as an alternative approach to
compare different predictive approaches, especially crime mapping methods. It is based
on using an error metric to derive the ratio between the performances of the prediction
approach and the baseline. In our experiments, we use the root mean squared error (RMSE) as
the base metric because it produces estimates in the original scale of the samples, unlike
MSE, which produces squared results. Thus, we define the prediction ratio of root mean
squared error (PRRMSE) as in Equation 4.1. We formulate the PRRMSE to answer, in
a single indicator: "how many times is the predictive approach better or worse than
the baseline?". When PRRMSE > 1, the prediction algorithm produces better estimates.
Also, we argue that having a single indicator that measures how good the predictions
are, in relative terms, may be important to evaluate whether the practical policing impact
matches the expectations.
PRRMSE = √(MSEbaseline) / √(MSEpredictions)    (4.1)
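Equation 4.1 can be computed directly from the two sets of predictions; the toy values below are purely illustrative.

```python
# Sketch of the prediction ratio PR_RMSE (Equation 4.1): the baseline
# RMSE divided by the model RMSE, so values above 1 favour the model.
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def prediction_ratio(y_true, y_baseline, y_model):
    return rmse(y_true, y_baseline) / rmse(y_true, y_model)

y_true = [3, 5, 4, 6]
y_baseline = [2, 4, 5, 5]   # e.g. the previous period (single lag)
y_model = [3, 5, 5, 6]      # machine learning predictions
pr = prediction_ratio(y_true, y_baseline, y_model)  # pr = 2.0 here
```

Here the baseline RMSE is 1.0 and the model RMSE is 0.5, so the model's estimates are twice as good under this metric.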
To investigate whether there is a significant difference in PRRMSE among the four approaches
mentioned, we conduct a Friedman test with post-hoc analysis. The Friedman
test is a non-parametric statistical test that ranks groups of samples to measure
whether there are significant differences between them. Unlike ANOVA, it requires
neither sample normality nor variance equality. For the post-hoc analysis, we perform the Nemenyi
test for all pairwise combinations to retrieve the best configuration among those
we are testing. Since the algorithms we are using are not deterministic, our evaluation
samples are taken from five trials of CV evaluation using the selected models. Therefore,
each evaluation sample is a CV score of a prediction approach trial in one of the twelve
crime scenarios (six from each city). In the following section, we present the datasets and
their crime scenarios, as well as the models' performances.
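To make the ranking step concrete, the sketch below computes the Friedman chi-square statistic from PR_RMSE scores of four approaches over a handful of scenarios. The score values are toy assumptions, ties are ignored for simplicity, and in practice a library routine such as scipy.stats.friedmanchisquare would be used.

```python
# Sketch of the Friedman test statistic: scores are ranked within each
# crime scenario and the chi-square statistic is computed from the mean
# ranks (no tie correction in this simplified illustration).
def friedman_statistic(scores):
    """scores: one row per crime scenario, one column per approach."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank            # rank 1 = best PR_RMSE
    mean_ranks = [s / n for s in rank_sums]
    stat = (12 * n / (k * (k + 1))) * sum((r - (k + 1) / 2) ** 2
                                          for r in mean_ranks)
    return stat, mean_ranks

# Toy PR_RMSE scores for 4 approaches over 5 crime scenarios.
scores = [[1.2, 1.1, 0.9, 1.0],
          [1.3, 1.0, 0.8, 1.1],
          [1.1, 1.2, 0.9, 1.0],
          [1.4, 1.1, 0.7, 1.0],
          [1.2, 1.0, 0.9, 1.1]]
stat, mean_ranks = friedman_statistic(scores)
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom; a significant result then justifies the pairwise Nemenyi post-hoc comparisons.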
4.3 Results
In this section we present the empirical materials of our study. Starting with the
exploration of the two datasets used, we capture some patterns in the categorical, spatial
and temporal distributions of the analyzed crime scenarios. This will be important for
interpreting the results of the parameter selection we apply to the crime mapping method,
geographic feature extraction, and machine learning algorithms. With these parameters
chosen, we will also be able to analyze the results of the four predictive approaches adopted
to meet the objectives outlined at the beginning of this study. The results presented
in this section are important to validate the Predspot methodology in terms
of efficiency when compared to the baseline approach. In addition, we seek to show
empirical evidence about the difference between the two types of crime
mapping mentioned, as well as to explore the difference between the two machine
learning algorithms. These differences are explored through the Friedman test and Nemenyi
test post-hoc analysis, which will support important contributions of this work. Another
contribution derives from a feature importance analysis, where we discuss
which features among those defined in the previous chapter contribute most to efficient
predictions.
4.3.1 Exploratory Data Analysis
Exploring the categorical, temporal and spatial distributions of the data is essential for a few reasons. First, examining the number of samples and looking for outliers helps in interpreting the results. It is difficult to understand the reason for the poor performance of a given model without knowing if there are enough samples to fit it or without knowing how the data is distributed. Second, exploring the data allows more perspectives to be considered by the analyst. According to Yu (1977), an exploratory analysis of data does not consist of fishing or torturing the data until it confesses, but of investigating the information from multiple perspectives. Third, greater familiarity with the data can enable new ideas for data pre-processing and cleansing.
To start with, we show the number of samples in each crime scenario in Figure 14. In Natal, the crime scenarios have different sample sizes, with a much higher number of CVP crimes than CVLI, and TRED crimes in between. On the other hand, Boston crime scenarios are more equally distributed, not differing by more than an order of magnitude. Also, there is no clear pattern in day and night crime intensity: some crime scenarios are more frequent during the day and others at night. The differences in crime scenario sample sizes will allow a better interpretation of the influence of sample size on the performance of the fitted models as well as on the parameter selection results.
[Figure: bar charts of sample sizes per crime scenario. Natal — CVP@day: 11814; CVP@night: 15277; CVLI@day: 650; CVLI@night: 730; TRED@day: 2215; TRED@night: 2173. Boston — Residential-Burglary@day: 3096; Residential-Burglary@night: 2313; Simple-Assault@day: 6598; Simple-Assault@night: 6639; Drug-Violation@day: 6142; Drug-Violation@night: 4528.]

Figure 14: The sample sizes of the twelve crime scenarios. Natal has more heterogeneous crime scenarios compared to Boston.
In addition to the categorical division by crime scenarios, we discussed in Section 3.1.1 the need to define the temporal sampling frequency as well as the spatial unit, usually represented by the crime mapping method. In this work, we use a monthly sampling frequency to avoid sparse time series problems. Figure 15 shows time series of the amount of crime in each scenario per month in each city. Overall, each crime scenario time series presents its own peculiarities, but without significant outliers. As we use monthly sampled time series, we expect more parsimonious series.
In Section 2.2.2, we discussed methods for decomposing time series. To illustrate the decomposition used as temporal features, we show in Figure 16 the trend and seasonality components extracted using STL, as well as the differentiation of the original time series of Residential Burglary crimes across the city, during the day (left) and night (right) periods. It can be seen that the trend component in the two periods follows a similar downward behavior, but the seasonality components differ considerably from one scenario to the other. Differentiation expresses the positive and negative variations of each series, and these do not resemble each other either.
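These temporal components can be sketched with a simplified, pandas-only decomposition; note this is a classical-decomposition stand-in for illustration (the dissertation itself uses STL), and the monthly series below is synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic monthly crime counts: downward trend plus yearly seasonality.
idx = pd.date_range("2016-01-01", periods=36, freq="MS")
counts = pd.Series(100 + np.linspace(0, -20, 36)
                   + 10 * np.sin(2 * np.pi * np.arange(36) / 12), index=idx)

# A centered 12-month moving average approximates the trend component.
trend = counts.rolling(window=12, center=True).mean()
# Average deviation from the trend per calendar month approximates seasonality.
seasonal = (counts - trend).groupby(idx.month).transform("mean")
# The first difference expresses the month-over-month variation.
diff = counts.diff()
```

The trend, seasonal and difference features discussed later (trend_k, seasonal_k, diff_k) are simply these components taken at a lag of k months.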
[Figure: line plots of monthly crime counts per scenario, 2016–2018, for Natal (top) and Boston (bottom).]

Figure 15: Monthly sampled time series of the twelve crime scenarios.
To manage the spatial aggregation, we discussed in Section 2.2.1 the two crime mapping methods that are used in our experiments. Using the same day and night Residential Burglary scenarios in Boston, we illustrate in Figure 17 the difference in spatial distributions using KDE and KGrid. Also, the grid resolution has to be set: we define our KDE grids with a resolution of 500 m (distance between cells), and the KGrid with 100 cells, following previous recommendations (ARAUJO et al., 2018).
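A minimal sketch of the two mappings on synthetic projected coordinates (in km); scikit-learn's KernelDensity is an assumed stand-in for the KDE implementation used, and the 10 km × 10 km area is illustrative:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
# Synthetic crime locations in a 10 km x 10 km projected area.
crimes = rng.uniform(0, 10, size=(500, 2))

# KDE mapping: smooth crime density evaluated on a 500 m resolution grid.
kde = KernelDensity(bandwidth=0.5, kernel="exponential").fit(crimes)
xs, ys = np.meshgrid(np.arange(0, 10, 0.5), np.arange(0, 10, 0.5))
cells = np.column_stack([xs.ravel(), ys.ravel()])
density = np.exp(kde.score_samples(cells))  # intensity at each of the 400 cells

# KGrid mapping: raw event counts over K = 100 equal cells (10 x 10 grid).
counts, _, _ = np.histogram2d(crimes[:, 0], crimes[:, 1],
                              bins=10, range=[[0, 10], [0, 10]])
```

The contrast between the two methods shows up directly: `density` varies smoothly across neighboring cells, while `counts` treats each cell independently.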
[Figure: four-row plots (original, trend, seasonal, differentiation) for the day and night scenarios, 2016–2018.]

Figure 16: Time series decomposition for Residential-Burglary (in Boston) daytime and nighttime scenarios. The second row shows the trend series, a smoothed version of the original. The third row shows the seasonal patterns, clearly distinct between day and night. The fourth row represents the differentiated component.
Figure 17: Spatial representations of the two crime mapping methods for Residential-Burglary (in Boston) daytime and nighttime crime scenarios. In the first row, KGrid aggregation, and in the second row, KDE.
As ancillary data, we use PoI data extracted from OpenStreetMap, available for both Natal and Boston. Data was extracted using the Overpass API, as detailed in Section 3.1.1. The choice of the PoI categories considered was partly based on the outcome of the Natal police opinion poll presented in Section 2.2.3. Figure 18 indicates which PoI categories we use and shows the location of the PoI data in both cities. Note that we treat line-based items (primary, secondary and residential from the highway tag) as sets of points. As we mentioned, not all PoI categories will be selected as important in all crime scenarios, since we apply a feature selection step before model fitting. Still, we have to extract features for all of them.
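The extraction step can be sketched with a minimal Overpass QL query builder; the category filters and the Natal bounding box below are illustrative assumptions, not the authors' exact query:

```python
# Hypothetical Overpass QL filters for the PoI categories listed above.
POI_FILTERS = [
    'node["amenity"~"hospital|school|police|place_of_worship|restaurant"]',
    'node["leisure"]',
    'node["tourism"]',
    'way["highway"~"primary|secondary|residential"]',
]

def build_overpass_query(south, west, north, east):
    """Return an Overpass QL query for all PoI filters inside a WGS84 bbox."""
    body = "".join(f"{f}({south},{west},{north},{east});" for f in POI_FILTERS)
    return f"[out:json][timeout:60];({body});out center;"

# Approximate bounding box for Natal (assumed for illustration).
query = build_overpass_query(-5.88, -35.30, -5.70, -35.15)
# The query can then be POSTed to an Overpass endpoint such as
# https://overpass-api.de/api/interpreter, with way geometries resampled
# into point sets as described above.
```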
Figure 18: PoI data from Natal (left) and Boston (right) extracted from OpenStreetMap.
In Section 3.1.2, we propose the use of KDE as a spatial aggregation methodology that considers both density and the spatial decay over distance through its parameters (see Section 2.2.1). The results of feature extraction can be illustrated with contour plots, as in Figure 19, for hospitals, residential streets, and tourism places in both cities. Note that the feature extraction process produces a smooth surface of PoI density over the city, thus accounting for spatial autocorrelation. Still, it is important to define the bandwidth and kernel for each PoI category. We detail the parameters used in the extraction of geographic features below, as well as the parameters of the crime mapping and machine learning algorithms.
Figure 19: A representation of the geographic features taken from three PoI categories, hospitals (first row), residential streets (second row) and touristic places (third row) of Natal (left) and Boston (right).
4.3.2 Parameter Selection
The methods we described have several parameters that must be tuned. For the crime mapping methods presented, KGrid has the K parameter to control grid resolution, and KDE is greatly influenced by the selected bandwidth and kernel. In addition, we chose to tune three hyperparameters of the machine learning algorithms chosen for this work (Random Forest and Gradient Boosting) that may influence their performance: the number of trees (n_estimators), the maximum tree depth (max_depth) and the percentage of features distributed to each tree (max_features).
As recommended in the model selection phase, we tuned these methods by selecting the parameters that optimize their performance. We apply the grid-search strategy with the 10-Fold Time Series Cross-Validator (BERGMEIR; HYNDMAN; KOO, 2018) using the parameter space defined in Table 1. For KDE, the optimal setting is selected by the maximum log-likelihood criterion; different kernels and bandwidths from 0.01 km to 1 km were tested. In the case of KGrid, we follow the results of our previous experiments (ARAUJO et al., 2018), which suggested values of K in the order of 100. We found that the average cell area with this KGrid resolution is in the same order of magnitude as the optimized KDE cells, considering the selected bandwidth. So, for all crime scenarios, we take K = 100 for KGrid. For the supervised algorithms, Random Forest and Gradient Boosting, the optimal setting is the one with the lowest MSE among all combinations.
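This tuning step can be sketched with scikit-learn, using TimeSeriesSplit as a stand-in for the 10-Fold Time Series Cross-Validator; the data are synthetic and the grid is reduced from Table 1 for brevity:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 8))            # hypothetical spatiotemporal features
y = 2 * X[:, 0] + rng.normal(size=120)   # hypothetical cell crime intensity

param_grid = {"n_estimators": [50, 100],     # Table 1 uses 500..4500
              "max_depth": [3, 6],
              "max_features": [0.4, 1.0]}

# Grid search scored by (negated) MSE over time-ordered splits.
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=TimeSeriesSplit(n_splits=10),
                      scoring="neg_mean_squared_error")
search.fit(X, y)
best = search.best_params_   # the lowest-MSE combination, as in the text
```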
Table 1: KDE parameters and machine learning hyperparameters considered in the grid search tuning.

Algorithm                         Hyperparameter  Values
KDE                               bandwidth       from 0.01 km to 1 km, n=200
                                  kernel          gaussian, linear, exponential, tophat, epanechnikov
Random Forest, Gradient Boosting  n_estimators    500, 1500, 2500, 3500, 4500
                                  max_depth       3, 6, 9, 12
                                  max_features    0.4, 0.6, 0.8, 1
The results of KDE parameter selection applied to crime scenario mapping are described in Table 2. Only the two CVLI crime scenarios in Natal had bandwidths at the boundary of the range defined in the parameter space, reaching the maximum of 1 km as the selected parameter. This may be related to the fact that CVLI crimes were the set with the smallest sample size we take. In the case of the Boston crime scenarios, bandwidths were relatively close, ranging from 0.229 km to 0.522 km. One of the reasons for this is the similarity of the sample sizes considered (see Figure 14), since smaller samples end up being aggregated with larger bandwidths, as suggested by Silverman (2018). On the other hand, the kernel selection was surprising in the regularity of the choice of the exponential kernel: in only one of the twelve crime scenarios was the Gaussian kernel selected. Interestingly, none of the previous related studies that experimented with optimal KDE configurations proposed the exponential kernel as an effective option for crime mapping. For example, Hart and Zandbergen (2014) proposed using linear kernels.
Table 2: Selected KDE parameters for the crime mapping methods applied in each crime scenario.

City    Crime Scenario              bandwidth (km)  kernel
Natal   CVP@day                     0.289           exponential
        CVP@night                   0.199           exponential
        CVLI@day                    1               exponential
        CVLI@night                  1               exponential
        TRED@day                    0.821           gaussian
        TRED@night                  0.607           exponential
Boston  Residential-Burglary@day    0.522           exponential
        Residential-Burglary@night  0.488           exponential
        Simple-Assault@day          0.353           exponential
        Simple-Assault@night        0.353           exponential
        Drug-Violation@day          0.229           exponential
        Drug-Violation@night        0.274           exponential
Similarly, we apply KDE parameter selection to geographic feature extraction. The parameters selected for each feature in both cities are described in Table 3. Again, the exponential kernel dominates in the overwhelming majority of cases. For hospital-related PoIs, the 1 km bandwidth was chosen in both cities. Still, we note that most PoI categories can be mapped using bandwidths in the order of 100 m. It is noteworthy that the selected bandwidth values depend on the selected kernel; had the exponential kernel not been used, other values would have been selected.
The results of the machine learning model hyperparameter selection applied to each crime scenario are presented in Appendix A. Unlike the KDE bandwidth, the Random Forest and Gradient Boosting hyperparameters did not explicitly depend on the sample sizes. Still, we observe a clear dependence of the parameters on the crime mapping method used. The results of the model evaluation configured with this set of hyperparameters are presented next.
Table 3: Selected KDE parameters for geographic feature extraction applied for PoI data aggregation.

                          Natal                        Boston
Feature                   bandwidth (km)  kernel       bandwidth (km)  kernel
amenity_hospital          1               gaussian     1               exponential
amenity_school            0.299           exponential  0.343           exponential
amenity_police            0.920           exponential  1               exponential
amenity_place_of_worship  0.269           exponential  0.338           exponential
amenity_restaurant        0.184           exponential  0.124           exponential
leisure_*                 0.085           exponential  0.120           exponential
tourism_*                 0.189           exponential  0.279           exponential
highway_primary           0.065           exponential  0.090           exponential
highway_secondary         0.105           exponential  0.239           exponential
highway_residential       0.299           exponential  0.433           exponential
4.3.3 Model Performance
According to the selected parameters, we first evaluate the predictive performance of the fitted models according to their cross-validation MSE scores. Figures 20 and 21, for KGrid and KDE respectively, show the performances of the baseline, the Random Forest (RF) and the Gradient Boosting (GB) algorithms for the twelve crime scenarios described. Note that each scenario is represented on a different scale that depends on the number of samples available. Also note that the performance of the algorithms in the Natal crime scenarios (in red) behaves similarly to the Boston crime scenarios (in blue) in both crime mapping methods. However, in KGrid, it is possible to observe a slight superiority of RF, which happens consistently in all twelve crime scenarios analyzed. In contrast, we observe the same consistent superiority of GB in KDE. Moreover, in a comparison between Figures 20 and 21, one can notice that the difference between the baseline and the predictive algorithms (RF and GB) is greater in approaches using KDE. Later in this section, we investigate whether this difference is statistically significant.
[Figure: bar charts of CV MSE per scenario (Baseline / RF / GB) for KGrid. CVP@day: 7.236 / 2.514 / 2.527; CVLI@day: 0.197 / 0.060 / 0.073; TRED@day: 1.039 / 0.353 / 0.407; CVP@night: 10.057 / 3.334 / 3.419; CVLI@night: 0.262 / 0.079 / 0.091; TRED@night: 0.993 / 0.317 / 0.376; Residential-Burglary@day: 1.446 / 0.389 / 0.461; Simple-Assault@day: 2.819 / 0.974 / 1.028; Drug-Violation@day: 4.029 / 1.589 / 1.802; Residential-Burglary@night: 0.949 / 0.337 / 0.376; Simple-Assault@night: 3.273 / 1.241 / 1.295; Drug-Violation@night: 2.715 / 1.268 / 1.361.]

Figure 20: Cross-validation MSE results for KGrid-based models for Natal (in red) and Boston (in blue) crime scenarios.
[Figure: bar charts of CV MSE per scenario (Baseline / RF / GB) for KDE. CVP@day: 1.004 / 0.294 / 0.245; CVLI@day: 1.766 / 0.171 / 0.081; TRED@day: 4.269 / 0.536 / 0.334; CVP@night: 1.636 / 0.475 / 0.402; CVLI@night: 1.992 / 0.154 / 0.076; TRED@night: 1.151 / 0.207 / 0.127; Residential-Burglary@day: 3.107 / 0.553 / 0.308; Simple-Assault@day: 2.397 / 0.551 / 0.354; Drug-Violation@day: 6.860 / 1.520 / 1.082; Residential-Burglary@night: 4.625 / 0.740 / 0.370; Simple-Assault@night: 2.877 / 0.639 / 0.415; Drug-Violation@night: 9.674 / 1.527 / 1.016.]

Figure 21: Cross-validation MSE results for KDE-based models for Natal (in red) and Boston (in blue) crime scenarios.
We discussed in the previous section that PRRMSE is an alternative approach to analyzing how many times the predictive model is better than the baseline estimates. To account for PRRMSE, we reassess the trained models for each crime scenario five times (again with the 10-Fold Time Series Cross-Validator) to ensure that error variances are captured. Since we want to compare crime mapping methods and predictive algorithm performance, we plot the results of PRRMSE in Figure 22. On the left, we note that KDE is clearly superior in all crime scenarios, though less markedly for the CVP crimes in Natal. Besides, the crime scenarios with the highest PRRMSE were CVLI, also in Natal.

Interestingly, note that these two crime scenarios, CVP and CVLI, are the ones with the largest and smallest samples respectively (see Figure 14). We deduce that using KDE seems to be most effective when a small number of samples is available. With a significant amount of samples (considering CVP@day and CVP@night, n > 10000 samples), KDE's performance becomes closer to KGrid's in terms of PRRMSE, but it is still better. The reason may be that with more samples, KGrid forms less sparse and more representative time series.
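As a sketch, assuming PRRMSE is the ratio of the baseline RMSE to the model RMSE (so a value of 2 means the model's error is half the baseline's, consistent with reading it as "how many times better"):

```python
import numpy as np

def prrmse(y_true, y_baseline, y_model):
    """Hypothetical PRRMSE: baseline RMSE divided by model RMSE
    (values above 1 mean the predictive model beats the baseline)."""
    def rmse(pred):
        return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(pred)) ** 2))
    return rmse(y_baseline) / rmse(y_model)

# A model whose errors are half the baseline's yields PRRMSE = 2.
y_true = np.array([10.0, 12.0, 9.0, 11.0])
ratio = prrmse(y_true, y_true + 2.0, y_true + 1.0)  # -> 2.0
```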
[Figure: box plots of PR-RMSE per crime scenario; the left panel compares the KDE and KGrid crime mapping methods, the right panel the RF and GB algorithms.]

Figure 22: PRRMSE results of five trials evaluating trained models for each crime scenario. On the left side, the two crime mapping methods are compared and, on the right, the machine learning algorithms. Note that KDE outperforms KGrid in all crime scenarios, but more sharply in crime scenarios that have fewer data points, such as CVLI. On the other side, one can note that GB models have higher percentiles, but with much more variance.
On the other hand, the right side of Figure 22 shows that there is a balance between the RF and GB predictive algorithms, as we noted earlier. Note that the median of GB is higher in all scenarios, but its performance is clearly more variable than RF's. One may notice that the predictive performance of the algorithms is affected by the size of the available samples, just as before. With more samples, for example in the CVP crime scenarios, the PRRMSE appears smaller, as does the variance of its estimate. Although it is intuitive to think that with more samples machine learning algorithms tend to estimate better, interpretation of these results leads us to conclude that the baseline also benefits from the increased amount of data available, since PRRMSE is a relative metric.
The averages and standard deviations of PRRMSE for the four predictive approaches are presented in Table 4. We note that on average, the predictive approaches are 1.6 to 3.1 times better than the baseline. To find out whether there is a statistically significant difference between these four treatments, we run the Friedman test. Considering a significance level of α = 0.05, the result indicated the rejection of the null hypothesis: there is indeed a significant difference between the approaches (p < 0.001). Thus, since the most notable difference in averages is between crime mapping methods, we empirically show that choosing the crime mapping method can greatly influence prediction performance.
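This statistical step can be sketched with SciPy; the scores below are synthetic stand-ins for the 60 evaluation samples (five trials × twelve scenarios), not the dissertation's data:

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
# Synthetic PRRMSE samples: 60 rows (5 trials x 12 scenarios), one column
# per predictive approach (KGrid-RF, KGrid-GB, KDE-RF, KDE-GB).
scores = rng.normal(loc=[1.7, 1.6, 2.4, 3.1], scale=0.2, size=(60, 4))

# Friedman test ranks the approaches within each row (block) and checks
# whether the rank distributions differ significantly.
stat, p = friedmanchisquare(*scores.T)
significant = p < 0.05  # if True, proceed to the Nemenyi post-hoc comparison
```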
Table 4: Average and standard deviation of PRRMSE for each predictive approach, considering five trials of the twelve crime scenarios.

Predictive Approach  PRRMSE Average  PRRMSE Standard Deviation
KGrid-RF             1.704           0.118
KGrid-GB             1.622           0.089
KDE-RF               2.446           0.512
KDE-GB               3.123           0.882
Moreover, it is important to assess whether there is a difference between the predictive algorithms used. We evaluate the approaches in a paired way using post-hoc analysis with the Nemenyi test. The results in Table 5 show that the difference we believed to be subtle between KGrid-RF and KGrid-GB, as well as between KDE-RF and KDE-GB, actually turns out to be statistically significant (p ≤ 0.001). It means that choosing the predictive algorithm has a considerable impact on predictive performance. Thus, among the predictive approaches tested in this study, the combination of KDE and GB provides the best estimates (average PRRMSE = 3.12) across all crime scenarios. This difference may be caused by the fact that KGrid generates sparser time series compared to KDE, which are thus more difficult to predict, as we discussed in Chapter 2. KDE provides a more homogeneous time series formation, as its spatial aggregation considers events weighted by distance from the center of the grid cell, while KGrid counts events within a cell without considering the neighboring ones. We observe that this spatial autocorrelation effect is important when translating crime events into spatiotemporal variables.
Table 5: A pairwise statistical comparison of the four predictive approaches, considering the results from post-hoc analysis.

Difference of levels   Difference of means  SE of means  p-Value
KGrid-RF − KGrid-GB     0.082               0.066        0.006
KGrid-RF − KDE-RF      -0.742               0.480        0.001
KGrid-RF − KDE-GB      -1.419               0.838        0.001
KGrid-GB − KDE-RF      -0.823               0.516        0.001
KGrid-GB − KDE-GB      -1.501               0.881        0.001
KDE-GB − KDE-RF         0.678               0.388        0.001
However, with superior predictive performance also comes the computational cost of generating estimates. We use a complete 64-core node (Intel Xeon processor, 64 MB of RAM) from the UFRN supercomputer (NPAD/IMD). We measure the time spent throughout the model selection step for each dataset and crime mapping method, and show it in Table 6. Note that using KDE, given the 500 m resolution and parameter setting, is up to fifteen times slower than using the 100-cell KGrid. However, it is noteworthy that the time required for the model selection phase does not equal the time required to generate on-the-fly estimates in the prediction service. Since the models are already trained, a prediction is generated within seconds. Thus, while KDE requires a larger computational budget for training, its estimates may be much better, especially in cases where data are scarce, as seen in the CVLI crime scenarios.
Table 6: Wall time spent on the model selection phase for each dataset.

          Total Time for Model Selection (s)
Dataset   KGrid    KDE
Natal     2893     40911
Boston    4705     50457
4.3.4 Feature Importance Analysis
To evaluate which variables most influence the predictions made by the fitted models, we conducted a feature importance analysis for each crime scenario in the four predictive approaches used. The goal is to understand what the models are basing their predictions on and to evaluate which of the features we model really contribute to efficient predictions. The so-called feature importance metric is an important explanation tool to show which factors the models mostly rely on.² It can be calculated by successive permutations of the values of a given feature. In this method, called feature permutation, the feature with the smallest variation in predictive performance is considered the least important one for generating predictions. Fortunately, the scikit-learn implementation (PEDREGOSA et al., 2011) that we use for the RF and GB estimators already provides feature importance values after the models have been fitted.
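A minimal sketch on synthetic data; note that scikit-learn's built-in feature_importances_ are impurity-based, while permutation_importance implements the permutation method described (the feature semantics in the comments are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
# Columns stand for, e.g., trend_1, seasonal_12 and a PoI density feature;
# only the first actually drives the synthetic target.
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
impurity_imp = model.feature_importances_  # built-in, available after fitting
perm_imp = permutation_importance(model, X, y, n_repeats=5,
                                  random_state=0).importances_mean
```

Both measures single out the first feature here; on the dissertation's data, the same ranking logic produces Figures 23 to 26.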
We plot the feature importances of the four predictive approaches for the Natal crime scenarios in Figures 23 to 26, regarding the KGrid-RF, KGrid-GB, KDE-RF and KDE-GB models respectively. Among the features we presented were those based on temporal components (trend, seasonal and difference), geographic information (from PoIs extracted from OpenStreetMap) and others such as the corresponding month and year. Each temporal feature has a number that corresponds to the respective lag. For example, in the KGrid-RF models (Figure 23), the most important feature in four of the six crime scenarios (and the second most important in the other two) is the seasonal_12 component, corresponding to the seasonal factor twelve months earlier. This is aligned with our expectations, since it suggests that the same month of the previous year greatly influences the predictions of these models. In the two crime scenarios in which seasonal_12 is not the top feature, the most important one is trend_1, which corresponds to the trend value of the previous month. We also notice that the features diff_11 and diff_12 are important in most crime scenarios. Interestingly, the other three sets of models also favor temporal features, having seasonal_12 as the top feature in KGrid-GB, and trend_1 in KDE-RF and KDE-GB. We observe that these features are frequent in almost all models.
Unfortunately, if we inspect the geographic features, we find that they have little impact on predictive performance. There are few cases where a set of geographic features was selected by the feature selection algorithm, and even when selected, their importance is mostly close to zero. We believe this may be due to factors associated with geographical heterogeneity. Our models were fitted to a global behavior of the city. However, geographic patterns can happen locally, combined with contributions from other factors. For example, the density of schools in one region may attract drug-related crimes because of poor demographic indicators, and drive these crimes away in another region for other reasons. To manage the geographic patterns that influence crime in machine learning models, further study should be conducted. Considering demographic features may be a solution, but it will not solve the problem of local heterogeneity.

²We understand that there is a growing demand for explanations of predictions made by machine learning models, and a public safety decision-making tool should provide such explanations to the policing manager. Currently, model-agnostic machine learning interpretation frameworks have emerged, such as LIME (RIBEIRO; SINGH; GUESTRIN, 2016) and SHAP (LUNDBERG; LEE, 2017), and we intend to investigate their usefulness in the context of crime hotspot prediction in future work.
[Figure: ranked feature importance bars for the six Natal crime scenarios; temporal features such as seasonal_12, trend_1, diff_11 and diff_12 dominate the rankings.]

Figure 23: Feature importance of KGrid-RF models.
[Figure: ranked feature importance bars for the six Natal crime scenarios; seasonal_12 and the trend features dominate the rankings.]

Figure 24: Feature importance of KGrid-GB models.
[Figure: ranked feature importance bars for the six Natal crime scenarios; trend_1 leads, and a few kde:-prefixed geographic features (e.g. kde:amenity_hospital) appear with low importance.]

Figure 25: Feature importance of KDE-RF models.
[Figure: ranked feature importance bars for the six Natal crime scenarios; trend_1 leads, and a few kde:-prefixed geographic features (e.g. kde:amenity_hospital) appear with low importance.]

Figure 26: Feature importance of KDE-GB models.
5 Concluding Remarks
Public safety is one of the significant challenges of smart cities, especially in underdeveloped countries with high crime rates. To reduce crime incidence, scientists have found evidence of the effectiveness of hotspot policing strategies (BRAGA; PAPACHRISTOS; HUREAU, 2014; WEISBURD; ECK, 2004). Towards innovation, predictive algorithms have been used to estimate crime hotspots. Although these predictive policing approaches have been increasingly applied in police departments, we find that previous studies fall short in several vital aspects. First, it is difficult to find a study that, at the same time, transparently presents the processing steps involved in producing future crime hotspots and shows their effectiveness against traditional strategies. Second, the tools that support such studies are proprietary, and thus their results are difficult to reproduce. Third, using the existing literature, it is difficult to measure how the choice of a particular crime mapping method or machine learning algorithm affects predictive performance.
This study first sought to improve our previously proposed prediction framework
through alternative crime mapping and feature engineering approaches, filling the gaps
mentioned. In the previous version of our framework, we detailed the processing steps
involved to implement a hotspot prediction system (ARAUJO et al., 2018). The proposed
approach required some adaptations to include (i) the adjustment of a single city-wide
model, (ii) the possibility of using KDE as a crime mapping method, as well as (iii) the
extraction of trend and seasonality features, and (iv) data from exogenous sources such
as OpenStreetMap. This study also sought to turn our implementation into a Python package called predspot, so that our approach can be reproduced in cities that do not have access to the expensive tools available.
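To illustrate adaptation (ii), the sketch below computes a KDE crime surface over a regular grid. It is a minimal example with synthetic coordinates, using SciPy's `gaussian_kde` rather than the actual predspot internals, and the coordinates and grid resolution are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic incident coordinates (lon, lat); a real run would load them
# from the crime database.
rng = np.random.default_rng(42)
incidents = rng.normal(loc=(-35.2, -5.8), scale=0.01, size=(200, 2))

# Fit a 2-D kernel density estimate over incident locations
# (bandwidth chosen by Silverman's rule).
kde = gaussian_kde(incidents.T, bw_method="silverman")

# Evaluate the density on a regular grid covering the study area;
# each value is the estimated crime intensity of one cell.
lon = np.linspace(incidents[:, 0].min(), incidents[:, 0].max(), 50)
lat = np.linspace(incidents[:, 1].min(), incidents[:, 1].max(), 50)
grid = np.vstack([g.ravel() for g in np.meshgrid(lon, lat)])
density = kde(grid)
```

The resulting density values can then be discretized or used directly as the regression target for each grid cell.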
In Chapter 3, we presented the Predspot methodology as the new version of our
framework divided into two processing phases. First, model selection preprocesses data,
models criminal incidence in spatiotemporal patterns (features) and trains machine learning models using them. We also included geographic features from OpenStreetMap open
data, seeking to improve predictive performance. The second phase of Predspot comprises
the prediction service, in which predictions are generated in a web service workflow, in
contrast to the adoption of a separate dashboard application. For this work, the predspot
package supports only model selection analysis, but we provided some advice for implementing the web service using it.
The empirical findings of our approach were explored in Chapter 4. We
evaluated two crime mapping methods, namely KGrid and KDE, and two extensively used
machine learning algorithms, Random Forest and Gradient Boosting, against a baseline
method based on a single autoregressive aggregation. The purposes of our experiments were (i) to measure how the predictions compare to the baseline in estimation performance, and (ii) to measure whether different predictive methods yield significantly different estimation performance. To summarize the results in a single measure, we proposed PRRMSE, an alternative regression metric that indicates how much better predictions are than the baseline.
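The metric's formal definition is given in the experimental chapter; the toy function below captures one plausible reading, consistent with the values reported later (e.g., a PRRMSE around 3 meaning roughly three times better than the baseline): the ratio between the baseline's RMSE and the model's RMSE. The function name and exact formula here are an illustration, not the dissertation's definition.

```python
import numpy as np

def prrmse(y_true, y_pred, y_baseline):
    """Baseline RMSE divided by model RMSE: values above 1 mean the
    model beats the baseline (illustrative reading of the metric)."""
    def rmse(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return np.sqrt(np.mean((a - b) ** 2))
    return rmse(y_true, y_baseline) / rmse(y_true, y_pred)

# Toy check: a model whose errors are half the baseline's scores about 2.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(prrmse(y, y + 0.1, y + 0.2))  # ≈ 2.0
```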
In our experiments, we considered twelve crime scenarios from datasets of two cities,
Natal in Brazil and Boston in the US. For each crime scenario, we fitted four predictive
models (KGrid-RF, KGrid-GB, KDE-RF and KDE-GB) and the baseline. The results
indicated that KDE-GB was the best approach in all crime scenarios (p < 0.001), with an
average PRRMSE of 3.123, followed by KDE-RF, KGrid-RF and finally KGrid-GB. In
addition, crime scenarios had different sample sizes, which helped us draw other patterns
from the results. We observed that KDE-based models showed interesting results in crime
scenarios with few samples. Besides, we found that as sample sizes increased, the performance of the KDE- and KGrid-based models grew closer.
Although these findings were within the range of our expectations, some results were
surprising. First, the feature importance analysis revealed that temporal features were massively preferred in almost all models. Seasonal and trend components proved to be efficient features for all the models adjusted. On the other hand, geographic features had little participation, even with our proposed KDE modelling approach. We argued that this might be connected with the fact that our predictive models do not consider local geographic heterogeneity. More sceptically, it can be explained by crime incidence itself being much more important for determining future crime incidence than geographic factors are. We advise future work to improve geographic features, for instance by reducing the dimensionality of the geographic feature space.
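One way to follow that advice, sketched below with synthetic data, is to project the many correlated kde:* geographic densities onto a handful of principal components before training; the matrix shapes and feature semantics here are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical matrix of geographic KDE features: one row per grid cell,
# one column per OpenStreetMap amenity density (kde:amenity_*, etc.).
rng = np.random.default_rng(0)
geo_features = rng.random((400, 30))

# Keep only the components explaining 95% of the variance, so the
# learner sees a few strong geographic signals instead of 30 weak ones.
pca = PCA(n_components=0.95)
geo_compact = pca.fit_transform(geo_features)
```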
Another limitation of our study was the fact that we tested only a few machine learning
algorithms. As we mentioned, our model-agnostic approach can benefit from choosing better algorithms. Future work could analyze the role of automated machine learning
in our Predspot methodology since it seeks the optimal algorithm and hyperparameter
configuration (FEURER et al., 2015).
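Short of full AutoML, even a randomized search over grids like those of Appendix A would automate part of this choice. The sketch below uses scikit-learn with synthetic data and a deliberately reduced grid so that it runs quickly; the data and grid values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-ins for the feature matrix and hotspot intensities.
rng = np.random.default_rng(1)
X, y = rng.random((300, 10)), rng.random(300)

# Randomized search over a (reduced) grid; TimeSeriesSplit preserves the
# temporal order of observations, as autoregressive features require.
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={
        "n_estimators": [100, 300, 500],
        "max_depth": [3, 6, 9],
        "max_features": [0.2, 0.4, 0.8],
    },
    n_iter=4,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
best = search.best_params_
```

Tools such as auto-sklearn (FEURER et al., 2015) extend this idea by also searching over algorithms and preprocessing steps.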
The horizon of our research points to practical applications of the framework presented. These applications will be based on the deployment of a web service connected to a database of criminal occurrences, automatically providing hotspot maps for a period in the future. To support this, we provide open-source software that may help criminal analysts adjust machine learning models, encapsulate them and deploy them in web services. Besides, police managers can rely on our methodology, whose estimates are 1.6 to 3.1 times better than the traditional hotspot estimation approach, perhaps producing higher impact with the patrol resources available.
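A minimal shape for such a service, using only the Python standard library, could look like the sketch below; the endpoint, the JSON payload and the `predict_hotspots` stub are hypothetical placeholders for a fitted predspot model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_hotspots():
    # Placeholder: a real deployment would load a serialized model and
    # query the incident database before predicting the next period.
    return [{"lat": -5.79, "lon": -35.21, "risk": 0.91},
            {"lat": -5.81, "lon": -35.20, "risk": 0.77}]

class HotspotHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Any GET request returns the predicted hotspots as JSON.
        body = json.dumps(predict_hotspots()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8080), HotspotHandler).serve_forever()
```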
5.1 Future work
Considering the limitations of our study, we identified improvements that may be the subject of future work. First, the implementation of the Predspot service in Natal is an immediate demand, so that we can evaluate our approach for practical crime reduction, as suggested by Moses and Chan (2018). Work in this direction must take into account the real difficulties in translating predictions into police operations. Hunt, Saunders and Hollywood (2014) suggest that without proper planning, the results of a predictive policing operation can be ineffective.
Second, the evolution of Predspot's model selection can follow the integration of explanation models to interpret the predictions made. Interpretable machine learning frameworks have emerged as a requirement for deploying predictive systems (RIBEIRO; SINGH;
GUESTRIN, 2016; LUNDBERG; LEE, 2017). The explanation of why a given place has been
predicted as a hotspot can shed light for patrol managers on various ethical aspects of the predictions. For example, if a poor community is always considered a hotspot, these explanatory models can bring demographic coefficients into the analysis.
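While LIME and SHAP provide local, per-prediction explanations, even a simple global measure such as scikit-learn's permutation importance hints at what a model relies on. The sketch below uses synthetic data and feature names borrowed from Chapter 4; it is a stand-in for, not an implementation of, the cited frameworks.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic features; names mirror those used in the dissertation.
names = ["trend_1", "seasonal_12", "diff_1", "kde:amenity_police"]
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 200)  # trend_1 drives the target

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

# Features ranked by how much shuffling them hurts the model.
ranking = sorted(zip(names, result.importances_mean), key=lambda t: -t[1])
```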
Third, integration between the Predspot framework and police operations can take
place in many ways, and a study of patrol vehicle routing can be based on predictions
made. A patrol program can use the Predspot output as input data and generate the
sequence of places that vehicles should visit. Integrating this solution with the Predspot
framework can facilitate many patrol police operations.
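As a toy example of that integration, a patrol program could order hotspot visits with a simple nearest-neighbour heuristic over the predicted hotspot centroids; the coordinates below are hypothetical, and a production router would use street-network distances rather than straight lines.

```python
import math

# Hypothetical hotspot centroids (lat, lon) output by the predictor.
hotspots = [(-5.79, -35.21), (-5.81, -35.20), (-5.78, -35.25), (-5.80, -35.19)]

def greedy_route(start, places):
    """Visit each place, always moving to the nearest unvisited one.
    A quick heuristic, not an optimal travelling-salesman tour."""
    route, remaining, current = [], list(places), start
    while remaining:
        current = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(current)
        route.append(current)
    return route

route = greedy_route((-5.80, -35.21), hotspots)
```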
Still, the development of predictive systems for producing future crime hotspots will pass through many advances. The rise of data-driven methods and large investments in artificial intelligence will bring significant innovations to aid crime reduction, and prediction performance can improve dramatically. We argue that Predspot is an effort to standardize the vocabulary of the routines involved in machine learning modelling for such problems, keeping the knowledge produced reusable and continually growing. We believe our methodology may help formulate even more innovative policing strategies.
References
ANDRESEN, M. A.; LINNING, S. J. The (in)appropriateness of aggregating across crime types. Applied Geography, Elsevier, v. 35, n. 1-2, p. 275–282, 2012.
ANDRESEN, M. A.; WEISBURD, D. Place-based policing: new directions, new challenges. Policing: An International Journal of Police Strategies & Management, Emerald Publishing Limited, v. 41, n. 3, p. 310–313, 2018.
ANGELIDOU, M. Smart city policies: A spatial approach. Cities, Elsevier, v. 41, p. S3–S11, 2014.
ARAUJO, A. et al. Towards a crime hotspot detection framework for patrol planning. In: IEEE. IEEE 16th International Conference on Smart City. [S.l.], 2018. p. 1256–1263.
ARAUJO, A. et al. A predictive policing application to support patrol planning in smart cities. In: IEEE. Smart Cities Conference (ISC2), 2017 International. [S.l.], 2017.
BATTY, M. The new science of cities. [S.l.]: MIT Press, 2013.
BBC. Estas são as 50 cidades mais violentas do mundo (e 17 estão no Brasil). 2018. http://www.bbc.com/portuguese/brasil-43309946.
BERGMEIR, C.; HYNDMAN, R. J.; KOO, B. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, Elsevier, v. 120, p. 70–83, 2018.
BERGSTRA, J.; BENGIO, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, v. 13, n. Feb, p. 281–305, 2012.
BOGOMOLOV, A. et al. Once upon a crime: towards crime prediction from demographics and mobile data. In: ACM. Proceedings of the 16th International Conference on Multimodal Interaction. [S.l.], 2014. p. 427–434.
BORGES, J. et al. Feature engineering for crime hotspot detection. In: IEEE. 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). [S.l.], 2017. p. 1–8.
BOX, G. E. et al. Time series analysis: forecasting and control. [S.l.]: John Wiley & Sons, 2015.
BRAGA, A. A. The effects of hot spots policing on crime. The ANNALS of the American Academy of Political and Social Science, Sage Publications Sage CA: Thousand Oaks, CA, v. 578, n. 1, p. 104–125, 2001.
BRAGA, A. A.; PAPACHRISTOS, A. V.; HUREAU, D. M. The effects of hot spots policing on crime: An updated systematic review and meta-analysis. Justice Quarterly, Taylor & Francis, v. 31, n. 4, p. 633–663, 2014.
BREIMAN, L. Random forests. Machine Learning, Springer, v. 45, n. 1, p. 5–32, 2001.
BROWN, D. E.; OXFORD, R. B. Data mining time series with applications to crime analysis. In: IEEE. 2001 IEEE International Conference on Systems, Man and Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat. No. 01CH37236). [S.l.], 2001. v. 3, p. 1453–1458.
CAPLAN, J. M.; KENNEDY, L. W. Risk terrain modeling compendium. Rutgers Center on Public Security, Newark, 2011.
CARAGLIU, A.; BO, C. D.; NIJKAMP, P. Smart cities in Europe. Journal of Urban Technology, Taylor & Francis, v. 18, n. 2, p. 65–82, 2011.
CHAINEY, S. Examining the influence of cell size and bandwidth size on kernel density estimation crime hotspot maps for predicting spatial patterns of crime. Bulletin of the Geographical Society of Liège, v. 60, p. 7–19, 2013.
CHAINEY, S.; TOMPSON, L.; UHLIG, S. The utility of hotspot mapping for predicting spatial patterns of crime. Security Journal, Springer, v. 21, n. 1-2, p. 4–28, 2008.
CLEVELAND, R. B. et al. STL: A seasonal-trend decomposition. Journal of Official Statistics, v. 6, n. 1, p. 3–73, 1990.
COHEN, J.; GORR, W. L.; OLLIGSCHLAEGER, A. M. Leading indicators and spatial interactions: A crime-forecasting model for proactive police deployment. Geographical Analysis, Wiley Online Library, v. 39, n. 1, p. 105–127, 2007.
DAVIES, T.; JOHNSON, S. D. Examining the relationship between road structure and burglary risk via quantitative network analysis. Journal of Quantitative Criminology, Springer, v. 31, n. 3, p. 481–507, 2015.
DOMINGOS, P. A few useful things to know about machine learning. Communications of the ACM, Association for Computing Machinery, v. 55, n. 10, p. 78–87, 2012.
ECK, J. et al. Mapping crime: Understanding hotspots. National Institute of Justice, 2005.
FEURER, M. et al. Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems. [S.l.: s.n.], 2015. p. 2962–2970.
GAU, J. M.; BRUNSON, R. K. Procedural justice and order maintenance policing: A study of inner-city young men's perceptions of police legitimacy. Justice Quarterly, Taylor & Francis, v. 27, n. 2, p. 255–279, 2010.
GERBER, M. S. Predicting crime using Twitter and kernel density estimation. Decision Support Systems, Elsevier, v. 61, p. 115–125, 2014.
GILLIES, S. The Shapely user manual. [S.l.]: Version, 2013.
GORR, W.; HARRIES, R. Introduction to crime forecasting. International Journal of Forecasting, Elsevier, v. 19, n. 4, p. 551–555, 2003.
HART, T.; ZANDBERGEN, P. Kernel density estimation and hotspot mapping: Examining the influence of interpolation method, grid cell size, and bandwidth on crime forecasting. Policing: An International Journal of Police Strategies & Management, Emerald Group Publishing Limited, v. 37, n. 2, p. 305–323, 2014.
HUNT, P.; SAUNDERS, J.; HOLLYWOOD, J. S. Evaluation of the Shreveport predictive policing experiment. [S.l.]: Rand Corporation, 2014.
JORDAHL, K. GeoPandas: Python tools for geographic data. URL: https://github.com/geopandas/geopandas, 2014.
KADAR, C.; MACULAN, R.; FEUERRIEGEL, S. Public decision support for low population density areas: An imbalance-aware hyper-ensemble for spatio-temporal crime prediction. arXiv preprint arXiv:1902.03237, 2019.
KNIBERG, A.; NOKTO, D. A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems. 2018.
KOCHEL, T. R. Constructing hot spots policing: Unexamined consequences for disadvantaged populations and for police legitimacy. Criminal Justice Policy Review, Sage Publications Sage CA: Los Angeles, CA, v. 22, n. 3, p. 350–374, 2011.
KOCHEL, T. R.; WEISBURD, D. Assessing community consequences of implementing hot spots policing in residential areas: findings from a randomized field trial. Journal of Experimental Criminology, Springer, v. 13, n. 2, p. 143–170, 2017.
LEE, S. H. et al. Towards ubiquitous city: concept, planning, and experiences in the Republic of Korea. In: Knowledge-based urban development: Planning and applications in the information era. [S.l.]: IGI Global, 2008. p. 148–170.
LIN, Y.-L.; YEN, M.-F.; YU, L.-C. Grid-based crime prediction using geographical features. ISPRS International Journal of Geo-Information, Multidisciplinary Digital Publishing Institute, v. 7, n. 8, p. 298, 2018.
LUNDBERG, S. M.; LEE, S.-I. A unified approach to interpreting model predictions. In: GUYON, I. et al. (Ed.). Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017. p. 4765–4774. Available at: <http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf>.
MALIK, A. et al. Proactive spatiotemporal resource allocation and predictive visual analytics for community policing and law enforcement. IEEE Transactions on Visualization & Computer Graphics, IEEE, n. 12, p. 1863–1872, 2014.
MCKINNEY, W. et al. Data structures for statistical computing in Python. In: AUSTIN, TX. Proceedings of the 9th Python in Science Conference. [S.l.], 2010. v. 445, p. 51–56.
MOHLER, G. O. et al. Self-exciting point process modeling of crime. Journal of the American Statistical Association, Taylor & Francis, v. 106, n. 493, p. 100–108, 2011.
MOHLER, G. O. et al. Randomized controlled field trials of predictive policing. Journal of the American Statistical Association, Taylor & Francis, v. 110, n. 512, p. 1399–1411, 2015.
MOSES, L. B.; CHAN, J. Algorithmic prediction in policing: assumptions, evaluation, and accountability. Policing and Society, Taylor & Francis, v. 28, n. 7, p. 806–822, 2018.
OLSON, R. S. et al. Data-driven advice for applying machine learning to bioinformatics problems. arXiv preprint arXiv:1708.05070, World Scientific, 2017.
PEDREGOSA, F. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, v. 12, n. Oct, p. 2825–2830, 2011.
PERRY, W. L. Predictive policing: The role of crime forecasting in law enforcement operations. [S.l.]: Rand Corporation, 2013.
RATCLIFFE, J. What is the future. . . of predictive policing. Practice, v. 6, n. 2, p. 151–166, 2015.
REPPETTO, T. A. Crime prevention and the displacement phenomenon. Crime & Delinquency, Sage Publications Sage CA: Thousand Oaks, CA, v. 22, n. 2, p. 166–177, 1976.
RIBEIRO, M. T.; SINGH, S.; GUESTRIN, C. Why should I trust you?: Explaining the predictions of any classifier. In: ACM. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. [S.l.], 2016. p. 1135–1144.
ROSENBAUM, D. P. The limits of hot spots policing. Police innovation: Contrasting perspectives, Cambridge University Press, New York, NY, p. 245–263, 2006.
SHERMAN, L. W.; GARTIN, P. R.; BUERGER, M. E. Hot spots of predatory crime: Routine activities and the criminology of place. Criminology, Wiley Online Library, v. 27, n. 1, p. 27–56, 1989.
SILVERMAN, B. W. Density estimation for statistics and data analysis. [S.l.]: Routledge, 2018.
UNODC. Global Study on Homicide. 2013. https://www.unodc.org/documents/gsh/pdfs/2014_GLOBAL_HOMICIDE_BOOK_web.pdf.
VERLEYSEN, M.; FRANÇOIS, D. The curse of dimensionality in data mining and time series prediction. In: SPRINGER. International Work-Conference on Artificial Neural Networks. [S.l.], 2005. p. 758–770.
VOMFELL, L.; HÄRDLE, W. K.; LESSMANN, S. Improving crime count forecasts using Twitter and taxi data. Decision Support Systems, Elsevier, v. 113, p. 73–85, 2018.
WALT, S. V. D.; COLBERT, S. C.; VAROQUAUX, G. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, IEEE Computer Society, v. 13, n. 2, p. 22, 2011.
WANG, X.; BROWN, D. E.; GERBER, M. S. Spatio-temporal modeling of criminal incidents using geographic, demographic, and Twitter-derived information. In: IEEE. Intelligence and Security Informatics (ISI), 2012 IEEE International Conference on. [S.l.], 2012. p. 36–41.
WEISBURD, D. The law of crime concentration and the criminology of place. Criminology, Wiley Online Library, v. 53, n. 2, p. 133–157, 2015.
WEISBURD, D.; BRAGA, A. A. Hot spots policing as a model for police innovation. Police innovation: Contrasting perspectives, Cambridge University Press, Cambridge, p. 225–244, 2006.
WEISBURD, D.; ECK, J. E. What can police do to reduce crime, disorder, and fear? The Annals of the American Academy of Political and Social Science, Sage Publications, v. 593, n. 1, p. 42–65, 2004.
WEISBURD, D. et al. Does crime just move around the corner? A controlled study of spatial displacement and diffusion of crime control benefits. Criminology, Wiley Online Library, v. 44, n. 3, p. 549–592, 2006.
YU, C. H. Exploratory data analysis. Methods, v. 2, p. 131–160, 1977.
YU, C.-H. et al. Crime forecasting using data mining techniques. In: IEEE. 2011 IEEE 11th International Conference on Data Mining Workshops. [S.l.], 2011. p. 779–786.
ZIEHR, D. Leveraging Spatio-Temporal Features for Improving Predictive Policing. [S.l.]: MSc thesis, Karlsruhe Institute of Technology, Germany, 2017.
APPENDIX A -- Selected Parameters of Machine Learning Models
Table 7: Selected hyperparameters of the machine learning models for the Natal crime scenarios.
Mapping   Crime            Random Forest                              Gradient Boosting
Method    Scenario         n_estimators  max_depth  max_features      n_estimators  max_depth  max_features

KGrid     CVP@day              1500          9          0.8                500          3          0.2
          CVP@night             500          9          0.8                500          3          1
          CVLI@day             1500          3          0.8               2500          6          0.4
          CVLI@night            500          3          0.8                500          3          0.4
          TRED@day              500          6          0.8                500          3          1
          TRED@night            500          6          0.8                500          3          0.8

KDE       CVP@day              4500         15          0.8               1500          3          0.4
          CVP@night            3500         15          0.8                500          3          0.6
          CVLI@day             1500         15          0.8               1500          6          0.2
          CVLI@night           1500         15          0.8               2500          6          0.4
          TRED@day             2500         15          0.8               2500          6          0.8
          TRED@night           3500         15          0.8               1500          6          0.4
Table 8: Selected hyperparameters of the machine learning models for the Boston crime scenarios.
Mapping   Crime                            Random Forest                              Gradient Boosting
Method    Scenario                         n_estimators  max_depth  max_features      n_estimators  max_depth  max_features

KGrid     Residential-Burglary@day              500          6          0.6               4500          9          0.4
          Residential-Burglary@night           1500          6          0.6               4500          9          0.4
          Simple-Assault@day                   1500          6          0.6                500          3          0.2
          Simple-Assault@night                 1500          6          0.8                500          3          0.4
          Drug-Violation@day                    500          6          0.8                500          3          0.2
          Drug-Violation@night                  500          9          0.4                500          3          0.2

KDE       Residential-Burglary@day             4500         15          0.8               2500          6          0.4
          Residential-Burglary@night           4500         15          0.2               2500          6          0.2
          Simple-Assault@day                   2500         15          0.8               4500          6          0.4
          Simple-Assault@night                 3500         15          0.2               2500          6          0.4
          Drug-Violation@day                   2500         15          0.6               3500          6          0.2
          Drug-Violation@night                 2500         15          0.6               2500          6          0.4