HDSC WINTER ’23 CAPSTONE PROJECT — TEAM FLASK
Malaria, a disease caused by the Plasmodium parasite, is transmitted to humans through the bites of infected Anopheles mosquitoes. Africa bears a disproportionately high global malaria burden, with approximately 93% of all malaria cases and 94% of malaria-related deaths occurring on the continent, as reported by the World Health Organization (WHO). While there was a decline in malaria incidence in Africa from 2007 to 2017, the disease remained a significant public health concern, with an estimated 200 million cases reported in 2017.
This decrease in malaria incidence can be attributed to various factors, such as increased funding for malaria control programs, widespread use of insecticide-treated bed nets, and improved access to diagnostic testing and effective treatment. Artificial intelligence (AI) and machine learning (ML) algorithms offer an opportunity to address the challenges in malaria monitoring and control that persisted across Africa during this period.
Our project’s primary focus is on establishing a data-driven approach to predicting and monitoring malaria incidence and supporting efforts to control its spread in Africa. We aim to develop a robust machine learning model that incorporates various dataset features, including malaria incidence, reported cases, drinking water quality, sanitation levels, and the distribution of insecticide-treated mosquito nets. This model will enable us to predict malaria cases and assess how improvements in sanitation and water quality affect malaria incidence.
The project’s ultimate goal is to create a comprehensive tool for malaria monitoring and control across the entire African continent, with the overarching objective of reducing malaria incidence and improving public health outcomes.
Malaria presents a significant public health concern in Africa, with multiple contributing factors like limited access to clean water and sanitation, inadequate distribution of treated mosquito nets, and the emergence of drug-resistant malaria strains. This persistent problem is exacerbated by the absence of precise, up-to-date data on malaria incidence, hindering effective disease control efforts. To tackle this issue, the project aims to create a machine learning model capable of accurately predicting and monitoring malaria incidence in Africa, leveraging pertinent dataset features.
AIMS & OBJECTIVES
- To develop a machine learning model which can accurately predict malaria cases and assess the impact of improving sanitation and water quality.
- To propel a reduction in malaria incidence with the help of the concerned stakeholders.
- To contribute to the efforts in place that seek to control the spread of malaria in Africa.
Efforts to reduce malaria transmission and mortality in Africa have relied on traditional statistical solutions such as regression analysis and time series analysis. Researchers have also developed cost-effective and feasible AI models that use supervised machine learning to predict one-month-ahead prevalence of malaria in Nigeria. Additionally, machine learning algorithms have been applied to low-cost, portable optical microscopes for the early diagnosis of malaria, and preventive practices such as the use of insecticide-treated nets and indoor residual spraying have been studied. Malaria diagnosis via Rapid Diagnostic Tests has also been compared to microscopy, and a robotic automated computer-expert system has been developed to provide access to effective malaria diagnosis in developing countries where malaria is endemic.
This project will be useful for clinicians and public health personnel as a reliable assisting tool to rapidly identify potential malaria cases and take proactive steps.
The project could identify patterns and trends in malaria incidence rates and risk levels, which could lead to more effective and targeted interventions. For instance, it could help identify regions that require more resources to control malaria and where preventative measures such as bed nets, insecticides, and vaccines are most urgently needed.
The project could help predict the future malaria incidence risk levels for different regions, which could assist policymakers and public health officials in developing more effective strategies for malaria prevention and control. For example, if a particular region is predicted to experience a high incidence of malaria in the coming years, authorities could take proactive measures to mitigate the impact of the disease by increasing public awareness campaigns and providing more resources for malaria control.
This can enable public health officials to allocate resources, such as mosquito nets, insecticides, and antimalarial drugs, to the areas most at risk of malaria transmission. Additionally, the project can identify factors that contribute to the spread of malaria, such as environmental and socioeconomic factors, which can inform policy decisions aimed at addressing these underlying causes.
Figure 1: Outline of Project Methodology
The project utilized a dataset obtained from Kaggle, which included essential malaria-related information from 54 African nations. This dataset comprised 27 data columns and 594 data rows, with a focus on key features such as statistics related to mosquito nets, the quality of drinking water, sanitation services, and geographical coordinates of the respective African countries.
The Kaggle dataset files were downloaded manually and uploaded to the main project GitHub repository and to Google Colab, so that the team could process them collaboratively.
Figure 2: List of Columns
This step involved preserving and improving the quality and tidiness of the dataset provided. The data was assessed both visually and programmatically to find the issues that required cleaning. The dataset contained missing values, as shown in Figure 3. To this end, the following steps were taken:
- The data columns were renamed.
- All null values were filled with zeros.
- The ‘Year’ and ‘Malaria’ data columns had their data types converted from ‘int64’ and ‘float64’ to ‘datetime’ and ‘int64’ respectively.
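The cleaning steps listed above could be sketched in pandas roughly as follows. This is a minimal illustration rather than the team's notebook: the toy rows and the column names (`Country Name`, `Malaria cases reported`, and their renamed versions) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle data; rows and column names are hypothetical.
raw = pd.DataFrame({
    "Country Name": ["Kenya", "Ghana", "Mali"],
    "Year": [2007, 2008, 2009],
    "Malaria cases reported": [1200.0, np.nan, 830.0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, fill nulls with zeros, and convert data types."""
    out = df.rename(columns={
        "Country Name": "country",
        "Year": "year",
        "Malaria cases reported": "malaria_cases",
    })
    # Fill missing numeric values with zeros, as described in the text.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(0)
    # int64 year -> datetime (1 January of that year).
    out["year"] = pd.to_datetime(out["year"].astype(str), format="%Y")
    # float64 case counts -> int64.
    out["malaria_cases"] = out["malaria_cases"].astype("int64")
    return out


cleaned = clean(raw)
print(cleaned.dtypes)
```

Filling with zeros keeps every row, but it also makes a missing value indistinguishable from a true zero, which is worth keeping in mind when reading the later charts.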
Figure 3: Distribution of Missing Values in the dataset
Five machine learning algorithms were then evaluated on the cleaned dataset using the R-squared metric.
Exploratory Data Analysis (EDA)
Figure 4: A pie chart showing the spread of malaria cases by region.
Between 2007 and 2017, countries in the southern and northern regions of Africa reported minimal, almost negligible, malaria cases. In contrast, the eastern countries were the most susceptible, recording the highest number of malaria cases. The East’s vulnerability is attributed to its water-rich environment and elevated tropical temperatures, which provide ideal breeding conditions for mosquitoes. Conversely, the northern regions feature desert-like conditions with exceptionally high temperatures that exceed the optimal range for mosquito and parasite development. The western countries ranked second in malaria cases, followed by the central regions (Figure 4).
Figure 5: A scatterplot showing the correlation between malaria incidence and cases reported.
Malaria incidence rose together with the number of malaria cases reported throughout the decade (Figure 5), and the two variables show a strong positive correlation. A general increase in the incidence of malaria is therefore accompanied by a general increase in the malaria cases reported to the nearest health center for any given region within Africa.
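The strength of such a relationship is commonly summarized with the Pearson correlation coefficient. The sketch below computes it in plain Python; the numbers are made up for illustration and are not taken from the project dataset, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values, for illustration only (not from the dataset).
incidence = [10, 50, 120, 200, 310]
cases_reported = [1000, 5200, 11800, 20500, 30000]

r = pearson_r(incidence, cases_reported)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 indicates a strong positive linear association; it does not by itself establish that one variable drives the other.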
Figure 6: Bar Chart showing the 10 countries with the highest incidence of malaria in Africa.
The 3 countries with the highest malaria cases in Africa were the Democratic Republic of Congo, Mozambique and Burkina Faso, as seen in Figure 6.
Model Training and Evaluation
The dataset was scaled using a standard scaler, after which the explanatory variables (X) and the response variable (y) were specified. The dataset was then split into training and testing sets in a 70:30 ratio. Five regression models, namely Linear Regression, Support Vector Regression, Lasso Regression, Ridge Regression and Random Forest Regressor, were trained on the training set. Each model was evaluated using the Root Mean Square Error (RMSE) and R-squared (R2) scores.
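A workflow of this kind might look like the following scikit-learn sketch. It runs on synthetic data from `make_regression` because the project data is not reproduced here, and it assumes default hyperparameters throughout. One deliberate difference from the order described above: the scaler is fitted on the training split only, a common way to keep test-set statistics out of training.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the malaria features (X) and target (y).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Standardize features using statistics from the training split.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regression": SVR(),
    "Lasso Regression": Lasso(),
    "Ridge Regression": Ridge(),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    scores[name] = {"rmse": rmse, "r2": float(r2_score(y_test, pred))}

for name, s in scores.items():
    print(f"{name}: RMSE={s['rmse']:.2f}, R2={s['r2']:.3f}")
```

The scores printed here describe the synthetic data only and say nothing about the results reported for the malaria dataset.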
Figure 7: Chart showing the R-squared values of each trained regression model.
The first four models were built using default parameters, and their evaluation is shown in Figure 7. Hyperparameter tuning was carried out on the Random Forest Regressor model using RandomizedSearchCV. R-squared is a good measure of model fit. Our preferred model is the Random Forest Regressor, which achieved an R2 score of 0.85 (85%).
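As a rough illustration of randomized hyperparameter search for a random forest, the sketch below uses scikit-learn's `RandomizedSearchCV` on synthetic data. The search space, `n_iter`, and the number of folds are assumptions chosen for the example; the values actually searched in the project are not stated.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data, split 70:30.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Hypothetical search space; not the one used in the project.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", 1.0],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of random parameter combinations to try
    scoring="r2",      # select the combination with the best R-squared
    cv=3,              # 3-fold cross-validation on the training split
    random_state=0,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated R2:", round(search.best_score_, 3))
print("Test R2:", round(search.best_estimator_.score(X_test, y_test), 3))
```

Randomized search samples a fixed number of combinations instead of trying every one, which keeps tuning time manageable when the grid is large.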
Using the model, we sought to identify which factors contributed most to the incidence of malaria. Basic Sanitation and Drinking Water had the highest impact in predicting malaria case occurrence in Africa. One of the key reasons for the decline in malaria incidence has always been the use of insecticide-treated nets, so it was not a surprise that this feature ranked low in the chart (Figure 8).
Figure 8: Feature importance as described by the model.
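A ranking like the one in Figure 8 can be read from a fitted random forest through its `feature_importances_` attribute. The sketch below is illustrative only: the feature names are hypothetical, and the synthetic target is deliberately constructed to depend mostly on the first feature, so the resulting order is true by construction and is not evidence about malaria.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature names for illustration.
feature_names = ["basic_sanitation", "drinking_water", "treated_nets"]

# Synthetic data: the target depends strongly on column 0, weakly on
# column 1, and not at all on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances describe how much each feature helped the model split the data, so a low score means low predictive contribution in this model, not necessarily low real-world effect.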
The Random Forest model was saved using pickle and deployed on Streamlit [2] to make live predictions.
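Saving and reloading a model with pickle could look like the sketch below. The file name `malaria_rf.pkl`, the three-feature toy data, and the `predict_cases` helper are all assumptions made for the example; the Streamlit front end is left out so the code runs without that dependency.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train a small stand-in model on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, 1.0, 0.5])
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Save the fitted model to disk (hypothetical file name).
model_path = Path(tempfile.mkdtemp()) / "malaria_rf.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back, as a deployed app would do at start-up.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)


def predict_cases(features):
    """Return one prediction for a single row of feature values."""
    row = np.asarray(features, dtype=float).reshape(1, -1)
    return float(loaded.predict(row)[0])


sample = [0.2, -1.0, 0.5]
print(predict_cases(sample))
```

In a Streamlit script, the feature values could be collected with widgets such as `st.number_input` and the prediction displayed with `st.write`. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.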
From the exploratory data analysis and model training and evaluation carried out, here is a summary of our insights:
- Countries in the south and north of Africa had very few reported malaria cases, almost negligible over the decade from 2007–2017.
- The 3 countries with the highest malaria cases in Africa were the Democratic Republic of Congo, Mozambique and Burkina Faso.
- Those to the East were the most susceptible and recorded the highest number of malaria cases.
- A general increase in the incidence of malaria is accompanied by a general increase in the malaria cases reported to the nearest health center for any given region within Africa.
- Basic Sanitation and Drinking Water have the highest impact in predicting malaria case occurrence in Africa.
Based on our results and an exhaustive literature review, we were able to proffer key recommendations:
- Governments and health organizations should increase the distribution of treated mosquito nets to communities at risk. This will help to reduce the number of malaria cases and also prevent the spread of the disease.
- Governments and health organizations should focus on providing communities with safe drinking water sources to help reduce the incidence of malaria. This can include improving infrastructure such as wells, boreholes, and piped water systems.
- The relevant authorities should focus on improving sanitation services in communities at risk. This can include constructing latrines, improving sewage systems, and promoting good hygiene practices.
- The relevant authorities should increase the use of intermittent preventive treatment, an effective strategy for preventing malaria in high-risk populations such as pregnant women and infants.
- Based on the trend analysis that shows a general increase in malaria cases despite the overall decrease in incidence, governments and health organizations should develop targeted interventions to address the specific factors driving the increase. This may involve conducting further research to identify the drivers of the trend and developing specific interventions to address those drivers. For example, if the increase in malaria cases is linked to a particular region or population group, interventions could be targeted to those areas to address factors such as access to healthcare, mosquito control measures, or education on prevention methods.
As our methodology and recommendations show, AI does have a place in tackling the malaria scourge in Africa. We do hope it is ready to take that place, because we are.
2. Deploying Machine Learning Models with Python & Streamlit | 365 Data Science, https://365datascience.com/tutorials/machine-learning-tutorials/how-to-deploy-machine-learning-models-with-python-and-streamlit/ (accessed 1 May 2023).