Public Transportation
Challenge provided by PSE

Churn model for public transportation

Studying and predicting the churn rate for public transportation can be a good indicator of the quality of service.

The public transport system is crucial to support mobility inside a city. However, the system is only optimal if it can serve the population. Network optimization regarding route stops, interfaces, frequency, and commodities, amongst other issues, is key to achieving this.

A common measure of the proportion of customers or subscribers who leave a supplier during a given period is the “churn rate”. A high churn rate is an indicator of customer dissatisfaction. Studying, and especially predicting, the churn rate for public transportation can therefore be a good indicator of the quality of service.
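As a minimal sketch of this definition, the churn rate for a period can be computed as the share of customers active at the start of the period who are no longer active at its end. The function name and the passenger IDs below are hypothetical and only illustrate the arithmetic; the challenge data itself is aggregated per semester and parish rather than per passenger.

```python
def churn_rate(active_start, active_end):
    """Share of customers active at the start of a period who are gone by its end."""
    start = set(active_start)
    if not start:
        raise ValueError("no customers at the start of the period")
    lost = start - set(active_end)
    return len(lost) / len(start)


# Hypothetical passenger IDs, for illustration only.
first_semester = ["p1", "p2", "p3", "p4", "p5"]
second_semester = ["p1", "p3", "p5", "p6"]  # p6 is new, so it does not count as churn

rate = churn_rate(first_semester, second_semester)
print(f"Churn rate: {rate:.0%}")
```

New customers who join during the period are deliberately ignored here, since churn only concerns those who were already customers at the start.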


The goal of the challenge was to identify churn profiles and their key driving factors, and to propose measures to win back lost segments, together with the expected impact of those measures.

United Nations SDG 

GOAL 11: Sustainable Cities and Communities

  • Target 11.2: Provide access to safe, affordable, accessible, and sustainable transport systems for all.


The following datasets were provided to the participants:

  • Demand for public transportation on a semestral basis in each parish of origin and its respective destination parish in several cities in Portugal, provided by PSE.
  • Socio-demographic (age and gender) information of bus users, provided by PSE.


Besides the data provided, teams used unemployment data, parish population data, Google mobility data, and points of interest (extracted from OpenStreetMap). Most teams agreed that more fine-grained data (ideally daily or even hourly) would be useful in solving this challenge and that semestral data is too coarse to make a good prediction. Finer-grained data would also enable the teams to use weather and air quality data.

Besides this, it would be helpful to have more segmentation in the socio-demographic data, data from ticket validation, car traffic, parking, and mobility data (e.g., from mobile providers). One team suggested using CCTV footage to count the number of people present at a metro station, bus stop, or even inside a bus.

Methods and Techniques

As the data was not very granular, many teams focused on data analysis rather than on building predictive models. In most cases, this analysis was already enough to identify changes in public transportation usage across different demographics. One team proposed using Principal Component Analysis (PCA) to find the main driving factors behind churn. Some teams built predictive models using either Decision Trees or Gradient Boosting algorithms.

One team used K-Means to cluster the churning segments and then to identify which churning profiles affect which routes the most.
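To illustrate the idea behind K-Means, the sketch below implements the standard Lloyd iteration in plain Python and applies it to a handful of made-up segments. The two features (share of users under 25, relative change in usage) and all values are assumptions for illustration; the source does not state which features the team used, and a real analysis would more likely rely on an existing library and scale the features first.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical segments: (share of users under 25, relative change in usage).
segments = [
    (0.10, -0.40), (0.12, -0.35), (0.08, -0.45),  # strongly declining usage
    (0.60, 0.05), (0.65, 0.10), (0.62, 0.00),     # roughly stable usage
]

labels, centroids = kmeans(segments, k=2)
print(labels)
```

Cluster indices are arbitrary, so only the grouping matters: segments with the same label belong to the same churn profile.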

Main Insights from Data

With the limited data available, it was already possible to demonstrate the usefulness of calculating and predicting the churn rate. The teams demonstrated that the demographic distribution of the bus users is different from the population distribution, particularly concerning the younger population. It was shown that the most significant factors for churning were the population density in the district, relative change in unemployment, and age groups.
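One simple way to compare the two distributions is a representation ratio per age group: the group's share among bus users divided by its share in the population. The sketch below assumes this approach; the function names, age brackets, and counts are invented for illustration and do not reflect the challenge data or the direction of the difference the teams found.

```python
def shares(counts):
    """Convert raw counts per group into shares that sum to 1."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_ratio(user_counts, population_counts):
    """Ratio above 1: over-represented among users; below 1: under-represented."""
    user_share = shares(user_counts)
    pop_share = shares(population_counts)
    return {g: user_share.get(g, 0.0) / pop_share[g] for g in population_counts}


# Hypothetical counts per age group, for illustration only.
bus_users = {"0-24": 100, "25-64": 500, "65+": 400}
population = {"0-24": 250, "25-64": 500, "65+": 250}

ratios = representation_ratio(bus_users, population)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```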


The map below is a proof of concept of a solution developed by one of the teams. It shows the variation in public transportation usage between several locations in the city, comparing the pre-COVID and post-COVID periods. Although this map covers a long period, with access to suitable data it could be generated for shorter periods to assist public transportation companies with usage information.

Figure 1 - Map representing the connectivity between different nodes in the city - red means a decrease in transportation usage, and blue means an increase. 
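The quantity behind such a map can be sketched as the relative change in trips for each origin-destination pair between two periods, with the sign of the change deciding the colour. The code below is an assumption about how this could be computed, not the team's implementation; the parish names and trip counts are hypothetical.

```python
def usage_change(before, after):
    """Relative change in trips for each origin-destination pair present in both periods."""
    changes = {}
    for pair, trips_before in before.items():
        if pair in after and trips_before > 0:
            changes[pair] = (after[pair] - trips_before) / trips_before
    return changes


def colour(change):
    """Map a relative change to the colour convention of Figure 1."""
    if change < 0:
        return "red"   # decrease in usage
    if change > 0:
        return "blue"  # increase in usage
    return "grey"      # assumption: the figure does not specify a colour for no change


# Hypothetical trip counts per (origin parish, destination parish).
pre_covid = {("A", "B"): 1000, ("A", "C"): 400, ("B", "C"): 200}
post_covid = {("A", "B"): 700, ("A", "C"): 500, ("B", "C"): 200}

changes = usage_change(pre_covid, post_covid)
for pair, change in changes.items():
    print(pair, f"{change:+.0%}", colour(change))
```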

Social Impact

Many teams pointed out that these outcomes are useful for planning campaigns aimed at churning groups and for evaluating current routes. Further improvements could be achieved by improving the quality and quantity of data. With better data, for example, it would be possible to use traffic data to compare the cost and effort of using a car with those of using public transportation.

World Data League - a competition for data scientists
World Data League @Copyright 2022