HDSC Winter ’23 Capstone Project by Team Theano

6 min read · Sep 15



Crime is a form of violence or illegal activity perpetrated by individuals to cause harm to people or property, punishable by authorities. It directly or indirectly impacts lives and significantly influences a country’s development. In African nations, crime rates are on the rise, encompassing a range of actions involving political agents, from governments to various groups and civilians. The increase in crime rates and resulting fatalities has led to the emergence of Crime Analysis.

Crime analysis is a systematic approach that identifies and examines patterns and trends in criminal activities. This process involves exploring and detecting crimes and understanding the relationships between these crimes and their perpetrators. The ultimate goal is to extract valuable insights from extensive datasets and support law enforcement agencies in their efforts to tackle and reduce crime.


1. To analyze crime rates and fatalities across various African countries, examining differences among different actors and actor groups and their contributions to recorded total crimes and fatalities.

2. To identify patterns in crime rate trends spanning from 1997 to 2023 (March 31st) and determine the most influential factors that distinguish the top 3 and bottom 3 countries in terms of crime rates.

3. To develop a crime analytics and forecasting tool capable of assessing fatalities and threat levels based on disorder type, actor involvement, and potential crime events. This tool leverages insights from historical data to inform proactive policy decisions.


The Armed Conflict Location and Event Data Project (ACLED) is a resource tailored for in-depth conflict analysis and crisis mapping. This dataset meticulously records the dates and locations of all documented instances of political violence and protests across numerous developing African countries.

This comprehensive dataset encompasses a broad spectrum of events, such as those occurring within civil wars, periods of instability, public protests, and regime breakdowns. It provides coverage for all African nations, spanning from 1997 to March 31st, 2023.

Key features of this crime data include:

1. Dates and precise locations of conflict events.

2. Detailed categorization of event types, encompassing battles, civilian killings, riots, protests, and recruitment activities.

3. Identification of various actors involved, including rebels, governments, militias, armed groups, protesters, and civilians.

4. Tracking changes in territorial control.

5. Documentation of reported fatalities.



The dataset was obtained directly from ACLED. It is an encoded dataset covering 1997 to March 31st, 2023, and details more than 25 years of conflict in African nations. It contains 315940 rows and 31 columns.


On exploratory data analysis, the following insights were generated:

  1. Battles (an event type) contributed the most to fatalities.
  2. Events in which State Forces were the main actor account for 39.4% of the fatalities recorded in Africa between 1997 and March 31st, 2023.
  3. The number of crimes recorded per year varied little up to 2009. From 2009 onwards, there is a steady upward trend.
  4. The top 3 countries in terms of crime rate were Somalia, Nigeria and the Democratic Republic of Congo, each with a record of 25000+ crimes.
  5. Botswana and Comoros are among the bottom countries in terms of crime rate, each recording under 120 crimes.
  6. Actor type 1, which represents State Forces, contributes the most to the number of fatalities in both the top and the bottom countries.
  7. Actor types 2, 3 and 4, which are rare in the countries with the fewest recorded fatalities, are common in the high-fatality countries.
  8. We can therefore assume that State Forces, Rebel Groups, Political Militias and Identity Militias are responsible for the crimes that incur a high number of fatalities.
  9. Angola had a relatively low number of recorded crimes but the highest number of fatalities. The majority of the crimes that occurred there were battles, which in this dataset generally resulted in a high number of fatalities.

Record of crime per year
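The yearly trend and the fatality totals behind these insights come down to group-by aggregations. The following is a minimal sketch, assuming the data sits in a pandas DataFrame; the column names (YEAR, COUNTRY, EVENT_TYPE, FATALITIES) and the toy rows are assumptions made for illustration, not values from the actual dataset.

```python
import pandas as pd

# Toy stand-in for the ACLED extract; rows and values are invented for illustration.
# Column names (YEAR, COUNTRY, EVENT_TYPE, FATALITIES) are assumed.
df = pd.DataFrame({
    "YEAR": [2008, 2008, 2015, 2015, 2015, 2022],
    "COUNTRY": ["Somalia", "Nigeria", "Somalia", "Angola", "Nigeria", "Somalia"],
    "EVENT_TYPE": ["Battles", "Protests", "Battles", "Battles", "Riots", "Protests"],
    "FATALITIES": [12, 0, 30, 45, 2, 0],
})

# Number of recorded events per year (the series behind a yearly trend plot).
events_per_year = df.groupby("YEAR").size()

# Total fatalities by event type, largest first.
fatalities_by_event = (
    df.groupby("EVENT_TYPE")["FATALITIES"].sum().sort_values(ascending=False)
)

# Event counts per country, largest first, for top/bottom comparisons.
events_per_country = df["COUNTRY"].value_counts()

print(events_per_year)
print(fatalities_by_event)
print(events_per_country.head(3))
```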


We used functions such as shape, head, describe, info, and columns to get an overview of the core features and how they relate to one another. The dataset has 315940 records, with over 60% null values. To make sure the missing data did not bias our analysis, we took a closer look at the missing values per column:

Missing Values on some columns

  • ASSOC_ACTOR_1: 231997
  • ASSOC_ACTOR_2: 253699
  • TAGS: 255561
  • ACTOR2: 86136
  • ADMIN3: 161453
  • ADMIN2: 2451
  • ADMIN1: 2 (corresponding locations: Gulf of Guinea & Coast of Benin)
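Counts like these can be produced with a one-line aggregation. A minimal sketch, assuming a pandas DataFrame; the toy rows below are invented, and only the column names come from the list above.

```python
import pandas as pd

# Toy frame with a few of the columns listed above; values are invented.
df = pd.DataFrame({
    "ACTOR1": ["Military Forces", "Protesters", "Rebels", "Militia"],
    "ACTOR2": ["Rebels", None, None, "Civilians"],
    "TAGS": [None, None, "crowd size=100", None],
    "ADMIN1": ["Lagos", "Banaadir", None, "Luanda"],
})

# Missing values per column, largest first.
missing = df.isna().sum().sort_values(ascending=False)

# The same counts as a share of all rows, to judge which columns are usable.
missing_share = (missing / len(df)).round(2)

report = pd.DataFrame({"missing": missing, "share": missing_share})
print(report[report["missing"] > 0])
```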

Dropping Values

Given the amount of missing values, dropping rows was not an option. Instead, we chose to drop the columns with large numbers of null and missing values. They are: ASSOC_ACTOR_1, ASSOC_ACTOR_2, etc.

For the two (2) null values in ADMIN1 (Gulf of Guinea & Coast of Benin), we filled in the most closely related ADMIN1 values (Bayelsa & Oueme respectively).
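The two cleaning steps can be sketched as follows. The 50% cutoff, the LOCATION column name, and the toy rows are assumptions made for illustration; the text does not state the threshold the team used or which column held the location names.

```python
import pandas as pd

# Toy frame; values are invented. LOCATION is an assumed column name.
df = pd.DataFrame({
    "LOCATION": ["Gulf of Guinea", "Coast of Benin", "Ikeja", "Luanda"],
    "ADMIN1": [None, None, "Lagos", "Luanda"],
    "ASSOC_ACTOR_1": [None, None, None, "Students"],
    "FATALITIES": [0, 1, 0, 3],
})

# Hypothetical cutoff: drop any column that is more than half empty.
MAX_MISSING_SHARE = 0.5
sparse_cols = df.columns[df.isna().mean() > MAX_MISSING_SHARE]
df = df.drop(columns=sparse_cols)

# Manual fill for the two ADMIN1 gaps, keyed by location as described in the text.
admin1_fixes = {"Gulf of Guinea": "Bayelsa", "Coast of Benin": "Oueme"}
df["ADMIN1"] = df["ADMIN1"].fillna(df["LOCATION"].map(admin1_fixes))

print(df)
```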


For feature selection, columns with no missing data were chosen except for ADMIN1 which was manually filled after searching for the right value based on other columns. Selected features included event date, event type, sub event type, and actor1.

One-hot encoding was performed on categorical features with few categories, while label encoding was used for columns with a large number of categories. The target variable (Fatalities) was binned into seven categories/levels. The training set was then scaled using a standard scaler, and dimensionality was reduced from 121 to 98 components using Principal Component Analysis.
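A compact sketch of this preprocessing chain is shown here, using pandas and scikit-learn. The column names, the toy rows, and the choice of 3 components are assumptions for illustration; the real pipeline reduced 121 features to 98 components, and how that number was chosen is not stated.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy frame; values and column names are invented for illustration.
df = pd.DataFrame({
    "EVENT_TYPE": ["Battles", "Protests", "Riots", "Battles", "Protests", "Riots"],
    "ACTOR1": ["Actor A", "Actor B", "Actor C", "Actor D", "Actor E", "Actor A"],
    "YEAR": [1999, 2005, 2011, 2016, 2020, 2023],
})

# One-hot encode the low-cardinality column.
features = pd.get_dummies(df, columns=["EVENT_TYPE"], dtype=float)

# Label-encode the high-cardinality column (one integer per distinct actor).
features["ACTOR1"] = LabelEncoder().fit_transform(features["ACTOR1"])

# Standardise, then project onto fewer components.
# In a real pipeline both steps would be fitted on the training split only.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=3)
reduced = pca.fit_transform(scaled)

print(features.shape, "->", reduced.shape)
```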

Fatalities was binned into seven levels, which were label-encoded as follows in order to train the model:

  • ‘NO_FATALITY’: 6,
  • ‘1_FATALITY’: 2,
  • ‘2_TO_10’: 3,
  • ‘11_TO_50’: 1,
  • ‘51_TO_100’: 5,
  • ‘101_TO_500’: 0,
  • ‘501_TO_1350’: 4
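The binning and the integer codes above can be reproduced with a few lines of standard-library Python. The bin edges are inferred from the level names, and the helper name fatality_level is hypothetical. The codes match an alphabetical sort of the level names, which is consistent with an encoder such as scikit-learn's LabelEncoder; that the team used that exact encoder is an assumption.

```python
from bisect import bisect_left

# Upper edge (inclusive) of each bin, inferred from the level names above.
UPPER_EDGES = [0, 1, 10, 50, 100, 500, 1350]
LEVELS = [
    "NO_FATALITY", "1_FATALITY", "2_TO_10", "11_TO_50",
    "51_TO_100", "101_TO_500", "501_TO_1350",
]


def fatality_level(count: int) -> str:
    """Return the level name for a fatality count (hypothetical helper)."""
    if not 0 <= count <= UPPER_EDGES[-1]:
        raise ValueError(f"count outside binned range: {count}")
    return LEVELS[bisect_left(UPPER_EDGES, count)]


# Sorting the names alphabetically reproduces the integer codes listed above.
LEVEL_CODES = {name: code for code, name in enumerate(sorted(LEVELS))}

for count in (0, 1, 7, 50, 600):
    level = fatality_level(count)
    print(count, level, LEVEL_CODES[level])
```

Note that the integer codes follow alphabetical order, not severity, so they should be treated as class identifiers and not as an ordinal scale.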


The XGBClassifier algorithm from XGBoost was used to train the model on the preprocessed dataset. The target variable was Fatalities, binned into seven categories/levels.

To address the class imbalance in the target variable, we used the ‘class_weight’ utilities in the ‘utils’ module of the Scikit-learn library to compute sample weights based on the target variable’s distribution. The generated sample weights were then passed to the XGBoost model during fitting to improve its ability to handle the imbalance. This resulted in a trained model that predicts the target variable effectively.
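To make the weighting concrete, here is a minimal standard-library sketch of the "balanced" heuristic, in which each sample is weighted by n_samples / (n_classes * count of its class). That the team used the balanced mode is an assumption, and the helper name and toy labels are invented for illustration.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


# Toy encoded targets: class 6 dominates, class 3 is rare.
y_train = [6, 6, 6, 6, 6, 6, 2, 2, 3]
weights = balanced_sample_weights(y_train)
print(weights)

# With scikit-learn, sklearn.utils.class_weight.compute_sample_weight("balanced", y_train)
# applies the same formula, and the result can be passed to the classifier via
# the sample_weight argument of XGBClassifier.fit.
```

Rare classes receive larger weights, so each class contributes the same total weight to the training loss.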

Evaluating the model, we got an F1-score of 0.70 and a precision of 0.77, which we consider good for a dataset of this size. The model could be improved further in the future by leveraging deep learning algorithms. Among the wrong predictions, the majority fell into the adjacent level in the range of fatalities, which suggests that the model generalised well on the dataset.
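The observation about near misses can be checked with a small helper that measures how many wrong predictions land exactly one level away from the true level. This is a sketch on invented labels, not the team's evaluation code; the helper name is hypothetical. Because the integer codes are not ordered by severity, the comparison uses an explicit severity ranking.

```python
# Fatality levels in increasing order of severity.
ORDERED_LEVELS = [
    "NO_FATALITY", "1_FATALITY", "2_TO_10", "11_TO_50",
    "51_TO_100", "101_TO_500", "501_TO_1350",
]
RANK = {name: i for i, name in enumerate(ORDERED_LEVELS)}


def adjacent_miss_share(y_true, y_pred):
    """Share of wrong predictions that land exactly one level away."""
    misses = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not misses:
        return 0.0
    adjacent = sum(abs(RANK[t] - RANK[p]) == 1 for t, p in misses)
    return adjacent / len(misses)


# Invented labels for illustration only.
y_true = ["NO_FATALITY", "1_FATALITY", "2_TO_10", "11_TO_50", "NO_FATALITY"]
y_pred = ["NO_FATALITY", "2_TO_10", "11_TO_50", "101_TO_500", "1_FATALITY"]
print(adjacent_miss_share(y_true, y_pred))
```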


We built a crime analytics and forecasting tool that assesses the threat level of fatalities based on disorder types, actors, and possible crime events. The tool leverages insights from historical data to inform preemptive policy decisions.

This tool was built as a web app using Streamlit and deployed using Streamlit Cloud. The dataset referenced by the web app was hosted on the cloud using Azure Blob Storage. All components of the web app can be found in the GitHub repo.

Link to Web App —

Link to GitHub repo — Team-Theano-Capstone-Project


We successfully provided a solution to our defined problem statement.

A pre-processed dataset was used to train an XGBoost model that effectively predicted the target variable (fatalities).

A crime analytics and forecasting tool was built to assess the threat level of fatalities based on disorder types, actors, and possible crime events. The tool leverages insights from historical data to inform preemptive policy decisions.

Our solution provided a comprehensive analysis of crime rates and fatalities in various African countries, with insights into the contributions of different actors and actor groups. Our analysis also highlighted the patterns in the trend of crime rate over the years, and the factors that are most dominant in the top and bottom countries in terms of crime rate. Our findings underscored the importance of considering location, time, and actors when analyzing crime to make informed policy decisions.




