Challenge provided by Urbanalytica

Predicting a safety score for women in Costa Rica

“Where can I be safe” is an example of a crime index map that could decrease gender-based violence by reducing the number of street crimes.

According to a study by the United Nations Entity for Gender Equality and the Empowerment of Women [1], gender violence in cities, specifically in public spaces, has become an increasingly public issue, especially in Latin America. The lack of adequate urban infrastructure, policies, and governance models exacerbates it. Thus, addressing the main obstacles women face regarding their right to an inclusive and safe city becomes a priority.

Police statistics have shown that 70.6% of the complaints of street sexual harassment in Costa Rica in 2019 were submitted by women [2]. While no current strategy from the public authorities is in place, women are raising their voices and creating awareness groups on social media, for example, to report aggressions and missing persons. This is why women need a mapping tool to identify and report whenever they feel their right to enjoy public spaces without being harassed is being threatened.

Tools like this already exist in other parts of the world. For example, in India, the Red Dot Foundation created the SafeCity web app.


The goal of this challenge was to create a safety index to assess conditions, insecurity, and gender violence in public spaces affecting women and girls in Costa Rica and predict its trend.

United Nations SDG 

GOAL 11: Sustainable Cities and Communities

  • Target 11.7: Provide access to safe and inclusive green and public spaces


The following datasets were provided to the participants:

  • Police reports from 2010 to 2022, including the type of crime, location, and anonymized information about the victim, provided by Urbanalytica
  • Information about the location of POI, commercial and residential areas, road network, and other public infrastructure, provided by Urbanalytica
  • Demographic and crime rate data, provided by Urbanalytica
  • Google review data, provided by Google
  • Weather data, provided by OpenWeather
  • Open city data, provided by the National Statistics Institute of Costa Rica


While many teams recognized the richness of the datasets in terms of time span, the lack of geographical granularity was noted as one of their weak points. One team simulated what a sample of a more granular dataset would look like.

Another team used the Penal Code of Costa Rica as an additional data source on the severity of crimes - the longer the jail sentence, the greater the weight of that crime in the index.

Another team enriched the dataset by surveying young adults in their country of origin about their perception of safety, obtaining 153 responses. This same team also noted that it would be interesting to have additional data on the flow of people between different points of the city, as well as more demographic data.

A different approach was to gather additional data on public street lighting from OpenStreetMap in order to measure the correlation between public lighting and crimes committed.

Methods and Techniques

All teams started with some level of exploratory data analysis, mainly around the distribution of each type and subtype of crime, along with its prevalence per age, gender, location, and different time intervals.

One team started by forecasting gender-based crime. They considered a crime gender-based if its prevalence per gender was above the average proportion plus half of the standard deviation. This team then analyzed the autocorrelation and seasonality of the data and trained a forecasting model to predict the number of crimes using Facebook’s Prophet algorithm - which, compared to a naive model baseline, performed 10% better.
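The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `gender_based_crimes` is hypothetical, the population standard deviation is assumed (the team did not say which variant they used), and the shares below are illustrative values rather than the challenge data (only the femicide and domestic violence percentages echo figures quoted in this article).

```python
from statistics import mean, pstdev

def gender_based_crimes(female_share, k=0.5):
    """Flag crime types whose female-victim share exceeds mean + k * std.

    female_share maps a crime type to the proportion of its victims who
    are women. k=0.5 mirrors the "half of the standard deviation" rule.
    """
    shares = list(female_share.values())
    # Assumption: population standard deviation; the team's choice is unknown.
    threshold = mean(shares) + k * pstdev(shares)
    flagged = sorted(c for c, s in female_share.items() if s > threshold)
    return flagged, threshold

# Illustrative shares only, not the real challenge data.
female_share = {
    "theft": 0.30,
    "assault": 0.25,
    "robbery": 0.20,
    "vehicle_theft": 0.15,
    "domestic_violence": 0.59,
    "femicide": 0.96,
}

flagged, threshold = gender_based_crimes(female_share)
print(flagged, round(threshold, 3))
```

With these illustrative inputs the threshold lands at roughly 0.55, so only domestic violence and femicide are flagged as gender-based.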

After that, the team moved on to compute a safety score considering the following variables as positive indicators of safety: lighting, road width and type, visibility, population density, presence of security facilities, public transportation, and diversity of people. They then calculated the safety score for each polygon on a map and merged it with the crime data for that same polygon, in an effort to find which variables had the most influence on crime using an Ordinary Least Squares (OLS) regression. Most features were statistically significant in the model, and the ones that were not were removed. Finally, they checked spatial autocorrelation between neighboring polygons using Queen contiguity - under which each hexagon can have up to six neighbors - and found statistically significant positive spatial autocorrelation in the number of crimes, meaning that crimes were clustered among neighboring polygons.

Figure 1 - Maps showing the distribution of the number of crimes (left) and the spatial lag of the variable (right). Spatial lag explains the influence of the neighbors in the data. There are patterns in the data, clustering the crimes in certain zones of San Jose.
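To make the notions of spatial lag and spatial autocorrelation more concrete, here is a minimal pure-Python sketch of global Moran's I with row-standardised Queen contiguity weights. It is not the team's code: it assumes a square grid instead of hexagons (so a cell can have up to eight neighbours rather than six), the function names are hypothetical, the crime counts are made up, and the significance test the team performed is omitted.

```python
def queen_neighbors(rows, cols):
    """Map each grid cell (r, c) to the cells sharing an edge or a corner."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            nbrs[(r, c)] = [
                (r + dr, c + dc)
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
            ]
    return nbrs

def spatial_lag(values, nbrs):
    """Row-standardised spatial lag: the mean value over a cell's neighbours."""
    return {cell: sum(values[n] for n in ns) / len(ns) for cell, ns in nbrs.items()}

def morans_i(values, nbrs):
    """Global Moran's I; with row-standardised weights the n / S0 factor is 1."""
    avg = sum(values.values()) / len(values)
    z = {cell: v - avg for cell, v in values.items()}
    lag_z = spatial_lag(z, nbrs)
    numerator = sum(z[cell] * lag_z[cell] for cell in z)
    denominator = sum(d * d for d in z.values())
    return numerator / denominator

nbrs = queen_neighbors(4, 4)
# Made-up crime counts: all crimes concentrated in the two left-hand columns.
clustered = {(r, c): 10 if c < 2 else 0 for r in range(4) for c in range(4)}
# Made-up alternating pattern, used as a contrast case.
checker = {(r, c): (r + c) % 2 for r in range(4) for c in range(4)}

print(morans_i(clustered, nbrs))  # positive: similar values sit next to each other
print(morans_i(checker, nbrs))    # negative: neighbours tend to differ
```

A positive value, as in the clustered example, is the pattern the team reported for the number of crimes in San Jose.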

Another team calculated their safety risk by simply looking at the crime data per district of San Jose. They studied the Costa Rica Penal Code to find the number of sentence years for each crime, using it to quantify and differentiate the severity of crimes. Additionally, if the crime was committed against a minor or against a woman, its severity was multiplied by 2 - this methodology was based on the Pinkerton Crime Index [3]. Finally, they summed the severity of all incidents in each district for each quarter and divided this number by the population of the district at that point in time, so that the index captured the frequency and severity of incidents as well as the size of the population in that area. For prediction, using pre-2019 data to train and 2019 data to test, this team trained a Linear Regression model, which reached a Mean Squared Error of 0.48 for yearly predictions, and an XGBoost model, which reached a Mean Squared Error of 0.61 for yearly predictions.
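The index construction described above can be sketched as follows. This is an illustration rather than the team's implementation: the sentence lengths in `SENTENCE_YEARS`, the record layout, and the function names are all hypothetical, and a single doubling is assumed when the victim is both a woman and a minor, since the article does not cover that case.

```python
from collections import defaultdict

# Hypothetical sentence lengths in years; real values would come from the Penal Code.
SENTENCE_YEARS = {"theft": 1.0, "assault": 2.0, "robbery": 5.0}

def severity(crime, victim_is_woman, victim_is_minor):
    """Sentence years, doubled when the victim is a woman or a minor."""
    base = SENTENCE_YEARS[crime]
    # Assumption: one doubling at most, even if both conditions hold.
    return base * 2 if (victim_is_woman or victim_is_minor) else base

def crime_index(reports, population):
    """Sum severities per (district, quarter), then divide by population."""
    totals = defaultdict(float)
    for report in reports:
        key = (report["district"], report["quarter"])
        totals[key] += severity(report["crime"], report["woman"], report["minor"])
    return {key: total / population[key] for key, total in totals.items()}

# Made-up reports and population for a fictional district "A".
reports = [
    {"district": "A", "quarter": "2018Q1", "crime": "theft", "woman": False, "minor": False},
    {"district": "A", "quarter": "2018Q1", "crime": "robbery", "woman": True, "minor": False},
]
population = {("A", "2018Q1"): 1000}

index = crime_index(reports, population)
print(index)
```

Here the theft contributes 1 and the robbery against a woman contributes 10, so the district's severity total is 11 for a population of 1,000.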

Figure 2 - Quarterly prediction of crime index on the top-20 worst districts for each quarter, using the Linear Regression model.

Another team, while not approaching the problem in a radically different way, proposed a different product, which led them to work on route optimization. This team used Dijkstra's shortest path algorithm to find zone-based paths between two points that minimized a cost combining the distance travelled and the safety index of the zones crossed. The rationale behind the algorithm was to create a cost matrix holding the costs between adjacent zones: distance and score.
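A minimal sketch of this idea is shown below. The article does not say how distance and score were combined, so the weighted sum controlled by `alpha` is an assumption, as are the function name `safest_path`, the graph layout, and the toy values.

```python
import heapq

def safest_path(graph, start, goal, alpha=0.5):
    """Dijkstra over zones with edge cost = alpha * distance + (1 - alpha) * risk.

    graph maps a zone to {neighbour: (distance, risk)}. alpha=1 gives the
    shortest path and alpha=0 the lowest-risk path (assumed combination).
    """
    best = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        cost, zone = heapq.heappop(heap)
        if zone == goal:
            break
        if cost > best.get(zone, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        for nxt, (distance, risk) in graph.get(zone, {}).items():
            new_cost = cost + alpha * distance + (1 - alpha) * risk
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = zone
                heapq.heappush(heap, (new_cost, nxt))
    if goal not in best:
        return None, float("inf")
    # Walk the predecessor links back from the goal to rebuild the route.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], best[goal]

# Toy zone graph: the route via B is short but risky, the route via C is longer but safer.
graph = {
    "A": {"B": (1.0, 9.0), "C": (2.0, 1.0)},
    "B": {"D": (1.0, 9.0)},
    "C": {"D": (2.0, 1.0)},
}

print(safest_path(graph, "A", "D", alpha=1.0))  # distance only
print(safest_path(graph, "A", "D", alpha=0.0))  # risk only
```

Varying `alpha` is one simple way to expose several routes to the user, from the shortest to the safest.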

Main Insights from Data

Several teams found that the most prevalent crimes were theft, assault, and robbery, with homicides being extremely rare. They found no difference across months or days of the week, but did find that most crimes occur at night. In terms of gender-based crimes, the most prevalent crimes targeting women more than men were outbursts (64%), femicide (96%), and domestic violence (59%).

One team found a strong correlation between the presence of road infrastructure, properties, and heritage buildings and a higher number of crimes. On the other hand, recreational areas, institutional, commercial, and mixed land use, and population were associated with increased safety. They also found that the surrounding areas have an important influence on the number of crimes in a given area, meaning crime has to be tackled holistically and not only street by street or block by block.

Another team created a very simple and easily interpretable safety index based solely on crime data - explained in the section above. With this index, they found that although women were victims in only 35% of crimes, once crimes were weighted by the severity captured in the index, that share climbed to 50%. The same thing happened in the distribution by type of crime - for example, while assault represented 38% of crimes and robbery only 14%, after adjusting for severity, those proportions became 21% and 35%, respectively. Another interesting finding is the variance across districts and years. For instance, this team found that in the district of Carmen, in 2010, the crime index was 16.1 - this means that Carmen was 16 times more dangerous than the San Jose average in the same year. However, Carmen ranked only 5th among districts by number of crime reports in 2010, which means that although other districts recorded more crimes, the crimes in Carmen were considerably more serious once population and severity were taken into account.


Most teams suggested productizing their algorithms by creating an application or website that displayed maps, statistics, and forecasts about gender-based crimes in Costa Rica, allowing future crime reporting and forecasting. The users would mostly be female inhabitants and authorities.

Figure 3 - An example of a crime index map, color-coded by crime and safety index, together with contributing factors to explain that result.

Figure 4 - An example of a crime index map, where different areas are represented by hexagons and the safety index is represented by a color.

One team proposed a navigation system that suggests routes based on the crime and safety index of the areas they cross, which could be used by women and tourists. This system would compute the shortest, safest, and "danger threshold" paths between different zones and enable users to choose the route they feel most comfortable with.

Figure 5 - An example of a navigation system that suggests routes based on the crime and safety index.

Social Impact

The main outcome identified by all teams was the decrease of gender-based violence by reducing the number of street crimes, including sexual harassment. A secondary outcome would be increasing the amount and reliability of gender-based violence data by making it easier and more comfortable for women to report sexual harassment incidents.

Looking at the primary outcome, the main metrics for assessing it would be the number of reports of gender-based crimes, the number of individuals prosecuted for committing gender-based crimes, and the number of areas with a decrease in gender-based crimes.

Based on model and survey predictions, one team estimated an increase in people's safety of 9.09% by having the option to choose safer paths.

Another team pointed out that several publications have found evidence that urban planning and design improve the liveability of cities and towns. For instance, a randomized experiment in New York City showed a 35% reduction in outdoor nighttime index crimes [4]. In the case of Costa Rica, according to the available data, more than 57% of crimes occurred at night. Although the data does not say whether they occurred outdoors, even if outdoor nighttime crimes were only 30% of the total, this product would, assuming a comparable effect, reduce the crime index by about 10%.
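The 10% estimate above appears to follow from a simple product, sketched here on the assumption that the effect measured in New York City carries over unchanged and that all crimes weigh equally in the index:

```latex
\[
\underbrace{0.30}_{\text{assumed share of outdoor nighttime crimes}}
\times
\underbrace{0.35}_{\text{reduction reported in [4]}}
= 0.105 \approx 10\%
\]
```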


[1] UN Women. "Safe Cities and Safe Public Spaces: Global results report". Available at: https://www.unwomen.org/en/digital-library/publications/2017/10/safe-cities-and-safe-public-spaces-global-results-report

[2] Observatory for Gender-based Violence Against Women of the Costa Rica Government. Available at: https://observatoriodegenero.poder-judicial.go.cr/

[3] Pinkerton Consulting & Investigations. "Pinkerton Crime Index Methodology". Available at: https://pinkerton.com/products/pinkerton-crime-index/methodology

[4] Chalfin, A., Hansen, B., Lerner, J. et al. "Reducing Crime Through Environmental Design: Evidence from a Randomized Experiment of Street Lighting in New York City".

World Data League - a competition for data scientists
World Data League @Copyright 2022