The Impact of Education and Infrastructure on Literacy Rate

Feb 6, 2024


Team Feature Forge (HDSC Fall 2023 Premiere Project Documentation)

The CCAPS dataset, offering subnational information on literacy rates, education, water access, electricity, and media ownership, is a crucial resource for assessing regional socio-economic development. Focused on key indicators, it provides essential insights into regional quality of life and development, aiding policymakers, researchers, and analysts. The dataset’s granularity facilitates the identification of disparities, guiding targeted development efforts. Its availability and transparency support informed decision-making, policy analysis, and research on development trends and challenges.

Aim and Objectives

This project aims to examine the relationship between literacy rates across different age groups and key socio-economic factors, including household electricity, radio, television, net primary and secondary attendance rates, access to improved water, and access to improved sanitation. The goal is to understand how these factors influence literacy rates and deploy a machine learning model to predict mean literacy rates based on provided indicators. By exploring the interplay between these variables, the project offers insights for educational and development initiatives, providing a data-driven approach to address literacy challenges and enhance overall well-being.

Data Gathering

The dataset was derived from raw survey data from three sources: the Demographic and Health Surveys (DHS), supported by the U.S. Agency for International Development (USAID); the Multiple Indicator Cluster Surveys (MICS), supported by UNICEF; and the General Household Surveys conducted by Statistics South Africa. The underlying survey data are freely available for download from the websites of these agencies.

Data Preparation

The data preparation process comprised the following steps:

  • Data Collection: The acquired data was structured and encompassed 471 rows and 68 columns.
  • Data Visualization: Scatter plots, bar plots, box plots, heatmaps, and other visual aids were used to present the information clearly and concisely.
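
As a rough illustration of the structural check in the first step, the sketch below loads a small inline CSV with Python's standard library and reports its shape and the number of missing values per column. The three sample rows and the country labels are made-up stand-ins, not rows from the CCAPS dataset, and treating an empty cell as missing is an assumption.

```python
import csv
import io

# Hypothetical three-row sample; the real dataset has 471 rows and 68 columns.
SAMPLE_CSV = """Country name,Literacy rate (15 & over),Electricity in household (% of population)
Country A,81.5,40.2
Country B,,12.9
Country C,55.0,
"""


def summarize(csv_text):
    """Return (n_rows, n_columns, missing_per_column) for CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    # Assumption: an empty string marks a missing value.
    missing = {col: sum(1 for row in rows if row[col] == "") for col in columns}
    return len(rows), len(columns), missing


n_rows, n_cols, missing = summarize(SAMPLE_CSV)
print(f"{n_rows} rows x {n_cols} columns")
for column, count in missing.items():
    print(f"{column}: {count} missing")
```

The same function could be pointed at the text of the full file to confirm the reported 471 by 68 shape before any plotting.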

Exploratory Data Analysis (EDA)

This step was taken to better understand the gathered data, build a fuller picture of it, and uncover patterns that could explain unexpected results.

Figure 1: Scatter plots

For literacy rates among individuals aged 15 and over:

  • No discernible correlation was observed with “Access to improved water (% of population)”.
  • No significant correlation was identified with “Improved Sanitation (% of population)”.
  • A marginal positive correlation was noted with “Radio in household (% of population)”.
  • A slight positive correlation was observed with “Electricity in household (% of population)”.
  • A moderate positive correlation was found with “Radio and/or Television in households (% of population)”.
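
The strength of relationships like those listed above can be quantified with the Pearson correlation coefficient. The following is a minimal sketch in plain Python; the function name pearson_r and the sample values are illustrative assumptions and are not taken from the dataset or from the project's code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only (not taken from the dataset).
literacy_15_plus = [35.0, 48.0, 52.0, 67.0, 80.0]
radio_or_tv = [30.0, 55.0, 41.0, 60.0, 72.0]

r = pearson_r(literacy_15_plus, radio_or_tv)
print(f"Pearson r = {r:.2f}")
```

A value near 0 corresponds to the "no discernible correlation" cases, while values closer to 1 correspond to the stronger positive relationships.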

Figure 2: Scatter plot

In the case of literacy rates within the age group of 25 to 49:

  • A robust positive correlation is evident with the “Net primary attendance rate (%)”.
  • A moderate positive correlation is observed with the “Gross primary attendance rate (%)”.
  • A mild positive correlation is identified with the “Net secondary attendance rate (%)”.
  • A slight positive correlation is noted with the “Gross secondary attendance rate (%)”.

Figure 3: Bar plot


Zimbabwe, Namibia, and Kenya stand out as countries with notably high literacy rates within the age group of 15 and over.

Figure 4: Boxplot

  • Regarding the “Net primary attendance rate (%),” Sao Tome and Principe, Cameroon, and Rwanda demonstrate the highest rates of primary attendance.
  • In terms of the “Gross primary attendance rate (%),” Sao Tome and Principe, Rwanda, and Mozambique exhibit the highest rates of primary attendance.
  • When considering the “Net secondary attendance rate (%),” Nigeria, Cameroon, and Namibia display the highest rates of secondary attendance.
  • In relation to the “Gross secondary attendance rate (%),” Nigeria, Cameroon, and Namibia reveal the highest rates of secondary attendance.
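
Because the dataset is subnational, a country-level ranking like the one above requires aggregating regional values first. The sketch below ranks countries by the median of an indicator across their regions, which is an assumption chosen to match the box plot's center line; the article does not state which aggregate was used. The records and the function name top_countries are hypothetical.

```python
from collections import defaultdict
from statistics import median


def top_countries(records, indicator, n=3):
    """Rank countries by the median of `indicator` across their regions."""
    by_country = defaultdict(list)
    for record in records:
        value = record.get(indicator)
        if value is not None:  # skip regions with no reported value
            by_country[record["Country name"]].append(value)
    medians = {country: median(values) for country, values in by_country.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical regional records, for illustration only.
INDICATOR = "Net primary attendance rate (%)"
records = [
    {"Country name": "Country A", INDICATOR: 91.0},
    {"Country name": "Country A", INDICATOR: 85.0},
    {"Country name": "Country B", INDICATOR: 62.0},
    {"Country name": "Country B", INDICATOR: None},
    {"Country name": "Country C", INDICATOR: 74.0},
    {"Country name": "Country D", INDICATOR: 40.0},
]

for country, value in top_countries(records, INDICATOR):
    print(f"{country}: {value:.1f}")
```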

Model Training

To streamline the model creation process, a subset of the dataset was selected, consisting of the following features:

‘Country name’, ‘Literacy rate (15–19)’, ‘Literacy rate (25–49)’, ‘Literacy rate (15 & over)’, ‘Literacy rate (15–24)’, ‘Electricity in household (% of population)’, ‘Radio in household (% of population)’, ‘Television in household (% of population)’, ‘Radio and/or Television in household (% of population)’, ‘Net primary attendance rate (%)’, ‘Net secondary attendance rate (%)’, ‘Access to improved water (% of population)’, ‘Access to improved sanitation (% of population)’.

This refined selection of variables was employed to build the predictive model, focusing on pertinent factors for the analysis.
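
A minimal sketch of how the selected columns might be split into model inputs and a target follows. It assumes that the "mean literacy rate" is the row-wise average of the four literacy columns, that ‘Country name’ is kept only as a label rather than a numeric feature, and that incomplete rows are dropped; none of these choices is confirmed by the article, and the helper name build_xy and the sample rows are hypothetical.

```python
LITERACY_COLUMNS = [
    "Literacy rate (15–19)",
    "Literacy rate (25–49)",
    "Literacy rate (15 & over)",
    "Literacy rate (15–24)",
]
FEATURE_COLUMNS = [
    "Electricity in household (% of population)",
    "Radio in household (% of population)",
    "Television in household (% of population)",
    "Radio and/or Television in household (% of population)",
    "Net primary attendance rate (%)",
    "Net secondary attendance rate (%)",
    "Access to improved water (% of population)",
    "Access to improved sanitation (% of population)",
]


def build_xy(rows):
    """Split rows (dicts keyed by column name) into features X and target y."""
    X, y = [], []
    needed = FEATURE_COLUMNS + LITERACY_COLUMNS
    for row in rows:
        if any(row.get(col) is None for col in needed):
            continue  # assumption: incomplete rows are dropped, not imputed
        X.append([row[col] for col in FEATURE_COLUMNS])
        # Assumed target: average of the four literacy rates in the row.
        y.append(sum(row[col] for col in LITERACY_COLUMNS) / len(LITERACY_COLUMNS))
    return X, y


# Hypothetical rows, for illustration only.
rows = [
    {**{col: 50.0 for col in FEATURE_COLUMNS},
     **dict(zip(LITERACY_COLUMNS, [60.0, 70.0, 80.0, 90.0]))},
    # This row lacks a sanitation value, so it is skipped.
    {**{col: 40.0 for col in FEATURE_COLUMNS[:-1]},
     "Access to improved sanitation (% of population)": None,
     **dict(zip(LITERACY_COLUMNS, [30.0, 40.0, 50.0, 60.0]))},
]

X, y = build_xy(rows)
print(len(X), "usable rows; first target =", y[0])
```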

Model Evaluation

Several supervised machine learning algorithms were evaluated, including the Random Forest Regressor, Linear Regression, Ridge, and Lasso. Ridge emerged as the top-performing model, with the highest coefficient of determination (R² score) and the lowest mean absolute error (MAE) and mean squared error (MSE).
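
For reference, the three metrics named above can be computed as in the sketch below. It uses plain Python and made-up predictions from two hypothetical models ("model_a" and "model_b"), so the numbers say nothing about the project's actual results; in practice a library such as scikit-learn provides equivalent functions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up held-out targets and predictions, for illustration only.
y_true = [60.0, 70.0, 80.0, 90.0]
predictions = {
    "model_a": [62.0, 69.0, 79.0, 92.0],
    "model_b": [55.0, 75.0, 85.0, 80.0],
}

scores = {
    name: {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mean_squared_error(y_true, y_pred),
    }
    for name, y_pred in predictions.items()
}

# Pick the model with the highest R² score.
best = max(scores, key=lambda name: scores[name]["r2"])
print("best model by R²:", best)
```

A model that is best on all three metrics at once, as Ridge was here, has the highest R² together with the lowest MAE and MSE on the same held-out data.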

Model Deployment

The model was deployed with Streamlit, which also provides the user interface (UI) frontend. The deployed model and the corresponding code are available through the links below:

Streamlit : GitHub:
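
To give a sense of what such a Streamlit front end might look like, here is a minimal sketch. It is not the project's deployed code: the two placeholder weights and the intercept are invented for illustration (a real app would load the trained Ridge model instead), and the function names predict_mean_literacy and render_app are hypothetical.

```python
# Placeholder values for illustration only; NOT the trained model's coefficients.
PLACEHOLDER_INTERCEPT = 20.0
PLACEHOLDER_WEIGHTS = {
    "Electricity in household (% of population)": 0.10,
    "Net primary attendance rate (%)": 0.50,
}


def predict_mean_literacy(inputs, weights=PLACEHOLDER_WEIGHTS,
                          intercept=PLACEHOLDER_INTERCEPT):
    """Linear prediction clipped to the valid percentage range [0, 100]."""
    for name in weights:
        if not 0.0 <= inputs[name] <= 100.0:
            raise ValueError(f"{name} must be between 0 and 100")
    raw = intercept + sum(weights[name] * inputs[name] for name in weights)
    return max(0.0, min(100.0, raw))


def render_app():
    """Streamlit front end: one slider per indicator plus the prediction."""
    import streamlit as st  # imported here so the helper above works without Streamlit

    st.title("Mean literacy rate predictor (sketch)")
    inputs = {
        name: st.slider(name, min_value=0.0, max_value=100.0, value=50.0)
        for name in PLACEHOLDER_WEIGHTS
    }
    prediction = predict_mean_literacy(inputs)
    st.metric("Predicted mean literacy rate (%)", f"{prediction:.1f}")
```

Calling render_app() at the end of a script and launching that script with the streamlit run command would display the sliders and update the prediction whenever an input changes.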


The data reveals key insights into the relationship between literacy rates and various socio-economic factors in the surveyed countries. Zimbabwe stands out with the highest literacy rate, reflecting substantial investments in education and stable socio-political conditions. Conversely, Niger, Mali, and Guinea exhibit lower literacy rates, indicating potential challenges in these regions.

The analysis highlights the complex relationship between literacy, access to electricity, media, and education. For example, Zimbabwe combines high literacy rates with extensive access to amenities, while Liberia, with limited access to electricity and media, maintains high attendance rates and improved water access. These findings emphasize the influence of socio-economic factors on literacy and education outcomes.

Subnational analysis within Zimbabwe reveals disparities, with Harare and Bulawayo showing higher literacy rates and access to amenities, and Matabeleland North facing potential educational challenges. Regional variations, like Midlands with high literacy but limited electricity access, and Masvingo with lower literacy but higher primary attendance, stem from unique historical, cultural, and economic factors.

Examining literacy rates and development indicators in regions like Harare, Bulawayo, Khomas, and Nairobi highlights the need for context-specific policies. The correlation matrix indicates a strong positive relationship between literacy rates and primary and secondary attendance rates, emphasizing the importance of education continuity.

In summary, the analysis offers insight into the intricate relationship between literacy rates, education, and socio-economic factors. Context-specific interventions are needed to address disparities and improve educational outcomes at both national and subnational levels.


The machine learning model developed for predicting literacy rates based on various socio-economic factors has provided valuable insights into the dynamics influencing literacy in different African countries and subnational regions. The findings underscore the significance of data-driven approaches in understanding and addressing literacy challenges.

High-performing regions like Zimbabwe, which exhibit substantial literacy rates, also tend to have better access to amenities such as electricity, media, and improved education indicators. Conversely, regions with lower literacy rates, like Niger and Mali, face greater socio-economic challenges.

Recommendations

  1. Targeted Interventions: The machine learning model suggests that interventions should be specifically tailored to regions with lower literacy rates, such as Niger and Mali. Policies and programs designed to improve education quality, access to electricity, and media exposure can help elevate literacy levels.
  2. Invest in Infrastructure: Given the positive correlations between literacy, access to electricity, and media, governments should invest in infrastructure development to ensure electricity availability and media access, especially in underserved regions.
  3. Continued Data Collection: The accuracy and currency of data are crucial for model effectiveness. Governments and organizations should continue to collect and update data regularly to ensure the model remains relevant and reliable.
  4. Quality of Education: While primary and secondary attendance rates correlate with literacy, the focus should extend beyond enrollment. Improving the quality of education, including teacher training and curriculum enhancements, is essential to boost literacy outcomes.
  5. Cross-Sector Collaboration: Collaboration between the education sector and other relevant government departments, such as health and infrastructure, is pivotal in addressing the multifaceted nature of literacy challenges.
  6. Regular Model Refinement: The machine learning model should be continuously refined and updated as new data becomes available. Regular evaluation of the model’s performance can help enhance its predictive accuracy.
  7. Policy Impact Assessment: Implementing policies and programs informed by the model’s insights should be followed by rigorous impact assessments to evaluate their effectiveness in improving literacy rates.
  8. Public Awareness: Raising public awareness about the importance of literacy and education can foster community engagement and support for initiatives aimed at improving literacy.

By aligning policies and interventions with the machine learning model’s predictions and recommendations, governments and organizations can work toward more targeted, effective, and data-informed efforts to elevate literacy rates and promote overall socio-economic development.



