Predicting School Completion Rate in Developing Countries

Sep 25, 2023

HDSC Spring ’23 Capstone Project by Team Kubeflow.


Low school completion rates stem from diverse factors across different situations:

1. Economic hardships, such as poverty, may compel children to quit school to support their families.

2. Violence and insecurity can disrupt regular school attendance.

3. Insufficient resources can affect motivation and result in poor education outcomes.

4. Gender inequalities and cultural norms may limit access to education for specific groups, while poor health and nutrition can hinder learning.

5. Inadequate support for children with special needs can also contribute to low completion rates.

Problem Statement

Our goal was to utilize machine learning algorithms to create predictive models for forecasting school completion rates in developing nations. We focused on educational and socio-economic factors as the key predictive variables associated with students.

Aim and Objectives

Our project’s primary objective was to use educational and socio-economic indicators to forecast and reduce school dropout rates, especially among underprivileged students in developing nations. Here are our specific goals:

1. Analyzing Educational and Socio-economic Factors: We aimed to study the diverse educational and socio-economic factors that contribute to school dropout rates in developing countries.

2. Developing Predictive Models: Our focus was on building advanced machine/deep learning models capable of accurately predicting school completion rates among students in these specific regions.

3. Model Evaluation and Implementation: We committed to rigorously evaluating the effectiveness of the predictive models we developed and putting them into practical use within the educational systems of the respective developing countries included in our dataset.

Data Collection:

We sourced our information from the UNESCO Institute for Statistics, particularly through the Sustainable Development Goal 4 (SDG 4) data viewer. This tool allows users to access data and metadata from 2000 to 2022 via simple dashboards. Users can also copy, print, or download data in various formats like CSV, Excel, and PDF. Additionally, the tool enables keyword searches within the filtered dataset using the browser.

Data Preparation and Cleaning:

Data preparation, also known as data preprocessing, involves cleaning, organizing, and transforming data into a suitable format for analysis and modeling. It is crucial for ensuring the correctness, consistency, and reliability of results in data analysis. In our project, we focused primarily on data reduction and cleaning. The main objective of data cleaning was to address missing and null values, which sometimes required dimensionality reduction by eliminating columns with insufficient data.
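The cleaning step described above can be sketched with pandas. This is a minimal illustration, not the team's actual code: the column names, the 50% missing-value threshold, and the toy values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_education_data(df, target="completion_rate", max_missing_share=0.5):
    """Drop sparse columns, then drop rows that lack the target value."""
    # Share of missing values in each column (0.0 to 1.0).
    missing_share = df.isna().mean()
    # Keep columns at or below the (assumed) threshold; always keep the target.
    keep = [
        col for col in df.columns
        if missing_share[col] <= max_missing_share or col == target
    ]
    reduced = df[keep]
    # Rows without a target value cannot be used for supervised training.
    return reduced.dropna(subset=[target]).reset_index(drop=True)


# Made-up table for illustration only (not real UNESCO data).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "completion_rate": [55.0, np.nan, 80.0, 91.0],
    "literacy_rate": [np.nan, np.nan, np.nan, 88.0],   # 75% missing
    "gov_expenditure_gdp": [3.1, 4.0, np.nan, 5.2],    # 25% missing
})

cleaned = clean_education_data(toy)
print(cleaned)
```

In this sketch the sparse `literacy_rate` column is dropped, and so is the row with no completion rate. Any remaining gaps in the retained columns would still need to be imputed or removed before modeling.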

For our analysis of school completion rate prediction, we categorized the datasets into three specific groups:

1. Primary Education

2. Lower Secondary Education

3. Upper Secondary Education

The dataset included a diverse set of variables capturing various aspects of education and development, such as ‘Region,’ ‘Country,’ ‘Year,’ ‘Gender,’ ‘School Completion Rate,’ ‘Childhood Education Gross Enrolment Ratio,’ ‘Gross enrolment ratio for early childhood educational development programs,’ ‘Gross intake ratio,’ ‘Literacy rate for 25–64 years old,’ ‘Expenditure on education as a percentage of total government expenditure (%),’ and ‘Government expenditure on education as a percentage of GDP (%).’

Key Features:

  • School Completion Rate: Represents the percentage of students successfully completing education at specific levels: primary, lower secondary, or upper secondary.
  • Gross Intake Ratio: Indicates the percentage of new entrants at a particular education level.
  • Government Expenditure on Education as a Percentage of GDP (%): Measures the proportion of a country’s GDP spent on education, highlighting the relative economic investment in education.

Exploratory Data Analysis

During this phase, our objective was to uncover patterns, relationships, and anomalies in our data, and to test our assumptions about it, using graphical representations and simple statistical summaries.

To conduct Exploratory Data Analysis (EDA), we formulated research questions aimed at enhancing the previous team’s model. These questions included:

1. Identifying outliers or anomalies in completion rate data that warrant further investigation.

2. Determining the regions and countries with the highest and lowest completion rates.

3. Identifying the years with the highest and lowest completion rates.

4. Analyzing completion rates by gender.

5. Exploring the number of literate individuals aged 25–64 by gender.

6. Investigating the impact of GDP and total government expenditure on completion rates.

7. Calculating the percentage difference in completion rates by year for each country.

8. Understanding how completion rates vary across different education levels (e.g., primary, lower secondary, upper secondary).

9. Comparing Gross Enrolment Ratio and Gross Intake Ratio.

10. Identifying features with a strong correlation to completion rates.

Through our EDA, we obtained insights that helped us answer these questions and gain a deeper understanding of the data.
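A few of the questions above (completion rate by country, year-over-year percentage change, and correlation with the target) can be answered with short pandas expressions. The sketch below uses made-up numbers and assumed column names; it is not drawn from the project notebook.

```python
import pandas as pd

# Made-up values for illustration only; column names are assumptions.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2018, 2019, 2020, 2018, 2019, 2020],
    "completion_rate": [50.0, 55.0, 66.0, 80.0, 84.0, 84.0],
    "gov_expenditure_gdp": [3.0, 3.2, 3.6, 4.5, 4.8, 4.9],
})

# Countries with the highest and lowest mean completion rate.
mean_by_country = toy.groupby("country")["completion_rate"].mean()
highest_country = mean_by_country.idxmax()
lowest_country = mean_by_country.idxmin()

# Year-over-year percentage change in completion rate within each country.
toy = toy.sort_values(["country", "year"])
previous = toy.groupby("country")["completion_rate"].shift(1)
toy["completion_pct_change"] = (toy["completion_rate"] - previous) / previous * 100

# Pearson correlation of each numeric feature with the completion rate.
correlations = (
    toy[["completion_rate", "gov_expenditure_gdp"]]
    .corr()["completion_rate"]
    .drop("completion_rate")
)

print(highest_country, lowest_country)
print(toy[["country", "year", "completion_pct_change"]])
print(correlations)
```

The first year of each country has no previous value, so its percentage change is left as missing rather than being filled in.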

Modeling

The problem at hand is a regression task, and as a result, we trained several regression models to address it. These models include:

1. Linear Regression Model

2. Decision Tree Regressor

3. Random Forest Regressor

4. K-Nearest Neighbors Regressor

5. Gradient Boosting Regressor

Each model was trained to predict the target variable, the school completion rate.

Here are the steps we followed during the modeling process:

1. Data Normalization: We normalized the data using MinMaxScaler from the sklearn.preprocessing module. This step put all features on a consistent scale, which matters for scale-sensitive models such as K-Nearest Neighbors.

2. Train-Test Split: We divided the dataset into two sets: a training set and a testing set. The training set contained 80% of the data, while the testing set held the remaining 20%. This split was performed using the train_test_split function from the sklearn.model_selection module.
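The two steps above can be sketched as follows. The data here is synthetic, and the `random_state` value is an assumption added only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and target (not real data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 3))
y = rng.uniform(0, 100, size=50)

# Step 1: rescale every feature to the [0, 1] range.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

This follows the order described in the steps above. A common alternative is to split first and fit the scaler on the training rows only, so that no information from the test set influences the scaling.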

Model Building:

We constructed five machine learning models: Linear Regression, Decision Tree Regressor, Random Forest Regressor, K-Nearest Neighbors Regressor, and Gradient Boosting Regressor.
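A minimal sketch of training these five scikit-learn regressors side by side is shown here. The data is synthetic, and the hyperparameters (`n_estimators`, `n_neighbors`, `random_state`) are assumptions for the example rather than the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic, already-scaled features and a noisy linear target (not real data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.02, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=42),
    "K-Nearest Neighbors Regressor": KNeighborsRegressor(n_neighbors=5),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
mae_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae_by_model[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in sorted(mae_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {mae:.4f}")
```

Keeping the models in a dictionary makes it easy to repeat the same loop for each of the three education levels.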

Model Performance Evaluation:

To select the optimal model for each section, we assessed model performance using key metrics such as Mean Absolute Error, Mean Squared Error, Root Mean Squared Error (RMSE), Residual Sum of Squares (RSS), and r-squared (r2_score). After thorough evaluation, the Random Forest Regressor emerged as the top-performing model, as indicated in the table below:

The model was saved as a “pickle” file, serving as the foundation for deploying our solution.
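The evaluation metrics named above can be computed with scikit-learn and NumPy. The helper below is a hypothetical convenience function written for this illustration, and the input values are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE, RSS and R-squared for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),                      # square root of MSE
        "rss": float(np.sum((y_true - y_pred) ** 2)),     # sum of squared residuals
        "r2": r2_score(y_true, y_pred),
    }


# Made-up true values and predictions, for illustration only.
report = regression_report([0.5, 0.7, 0.9], [0.4, 0.7, 1.0])
for metric, value in report.items():
    print(f"{metric}: {value:.4f}")
```

Lower values are better for the four error metrics, while a higher R-squared indicates that the model explains more of the variance in the target.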

Modeling Results

The tables below report the performance metrics of each model for each section.

Results for the primary section:

Based on the results presented above, the Ridge regression model performs best for the lower secondary section, achieving a mean_absolute_error of 0.1263.

Results for the lower secondary section:

Results for the upper secondary section:

Based on the results presented above, the Ridge regression model performs best for the upper secondary section. It achieves a mean_absolute_error of 0.1532, slightly lower than the 0.1539 obtained by linear regression.

Model Deployment:

  • The best-performing model was saved as a ‘pickle’ file (.pkl) using the Pickle library, and we utilized this top-performing model for deployment purposes.
  • This model was deployed on the Streamlit platform to estimate school completion rates for different education levels.
  • The system incorporates a trained predictive model that takes user inputs to provide quick estimates of completion rates in primary, lower secondary, and upper secondary education.
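Saving and reloading a model with the pickle module, as described in the list above, might look like the sketch below. The file name, the feature order, and the `predict_completion_rate` helper are hypothetical, and the model is trained on synthetic data.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train a small stand-in model on synthetic data (not the project's model).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))
y = X.mean(axis=1)
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

# Save the fitted model to a .pkl file (hypothetical file name).
model_path = Path(tempfile.gettempdir()) / "completion_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back, as a deployed app would do at start-up.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)


def predict_completion_rate(features):
    """Return one estimate for a single row of already-scaled inputs."""
    row = np.asarray([features], dtype=float)
    return float(loaded.predict(row)[0])


estimate = predict_completion_rate([0.4, 0.6, 0.5])
print(f"Estimated completion rate (scaled): {estimate:.3f}")
```

In a Streamlit app, the feature values would typically be collected through input widgets such as `st.number_input` and the estimate displayed with `st.write`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.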


In the Sub-Saharan region, the completion rate is notably low, and government spending on education is also inadequate. To address this issue and improve completion rates, several strategies can be employed:

1. Increase Government Spending: One crucial step is to boost government spending on education in the region. Adequate funding can lead to improved educational infrastructure, better teacher training, and enhanced access to quality education.

2. Individualized Support: Tailoring educational plans and support services to meet the specific needs of each student is essential. This may involve providing additional tutoring, educational materials, and counseling to students who require extra assistance.

3. Parental Engagement: Encouraging closer collaboration between parents or guardians and schools is vital. Regular communication about students’ progress, attendance, and behavior can foster parental involvement and support in their children’s education.

4. Creating a Better Learning Atmosphere: Establishing a welcoming and inclusive learning environment that promotes student engagement and motivation is crucial. Addressing issues like bullying and prejudice can create a safer and more conducive space for learning.

5. Promoting Literacy Skills: Emphasize the importance of education to parents who may lack literacy skills themselves. Provide special incentives to motivate and support their children in completing their education, highlighting the long-term benefits of education for individuals and communities.

By implementing these strategies and involving government, communities, and other stakeholders, it’s possible to make significant improvements in education and completion rates in the Sub-Saharan region.


The main goal of this study was to identify the critical factors affecting school completion rates and to predict those rates. We achieved this objective by employing a data-centric approach and leveraging machine learning techniques. Our analysis revealed a noteworthy finding: the literacy rate among individuals aged 25–64, a demographic often associated with parenting, plays a pivotal role in influencing school completion rates.

Specifically, our research indicates that countries with higher literacy rates in this age group tend to have higher rates of student completion across different education levels, including primary, lower secondary, and upper secondary. This insight sheds light on the complex interplay between literacy, education, and completion rates in developing countries.

Furthermore, our research underscored the substantial variation in school completion rates across different regions. For instance, we found that regions like Northern Africa and Western Asia exhibit notably lower school completion rates, whereas Europe and Northern America demonstrate comparatively higher rates.

To summarize, enhancing school completion rates in developing countries requires governments not only to increase education expenditure as a percentage of total government spending but also to prioritize the education of illiterate parents. When parents understand the significance of education, students are more likely to approach their schooling with a positive mindset, leading to increased completion rates. Ultimately, this approach contributes to both individual development and overall economic growth, as education plays an indispensable role in human progress.



