Building a Malaria Incidence Rate Predictive Model to Enhance the Understanding of Malaria Control Strategies
HDSC SPRING ’23 CAPSTONE PROJECT BY TEAM GITLAB
WHO’s Global Technical Strategy for malaria emphasizes malaria surveillance as a crucial component of achieving malaria elimination. Effective surveillance data is vital for monitoring progress and directing interventions to high-risk areas. A review of Team Flask’s malaria incidence prediction model, developed during the Hamoye Winter Cohort 2023, revealed shortcomings in data handling, modeling techniques, and the inclusion of critical factors. Team Gitlab addressed these issues by introducing new measurement indices and perspectives and by exploring the benefits of interactions among existing preventive measures. This study aims to improve future targeting of interventions at the areas and populations most affected by malaria.
2. AIM AND OBJECTIVES
This research aims to address critical gaps in the malaria research done by Team Flask. Our research objectives are:
● To assess the relationship between population dynamics and malaria incidence rates, and to determine whether incorporating population data as a predictive feature improves model generalization and predictive accuracy.
● To study the interactions among malaria preventive measures and identify complementary and region-specific measures.
● To create an improved malaria prediction model by engineering new features from our findings, introducing incidence rate thresholds, and considering other modeling options based on research findings.
3. FLOW PROCESS
Data on confirmed malaria incidence and annual population for the period 2007–2017 in 36 selected countries was sourced from the World Bank’s data repository. This dataset includes normalized values of annual confirmed malaria incidence per 1,000 population, calculated by dividing confirmed malaria cases by the population size and multiplying by 1,000. Information on the use of malaria control measures during the same period was obtained from both the UNICEF child health coverage and the World Health Organization data repositories.
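As a minimal sketch of the normalization described above, a per-1,000 rate is the number of confirmed cases divided by the population and scaled by 1,000. The function name and the figures below are hypothetical and are not taken from the project's dataset.

```python
def incidence_per_1000(confirmed_cases: float, population: float) -> float:
    """Return confirmed malaria cases per 1,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return confirmed_cases / population * 1000


# Hypothetical figures for illustration only (not taken from the dataset).
rate = incidence_per_1000(confirmed_cases=250_000, population=2_000_000)
print(rate)  # 125.0
```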
Data cleaning and pre-processing:
Good predictive accuracy in machine learning depends on careful data preprocessing. In our analysis of the original dataset, we found that a significant portion of the values were missing. To address this, we filled the missing values using additional datasets from various sources. These datasets were merged on common columns such as country name and year to avoid introducing further data gaps.
The added datasets included estimated malaria deaths, confirmed cases, total population, and rural and urban population figures. Each dataset underwent its own preprocessing to ensure data quality. We also converted the data frame into a geo-dataframe with GeoPandas to make use of spatial information, including geometry, longitude, and latitude.
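The merge-and-fill step can be sketched with pandas as follows. This is an illustration under stated assumptions, not the team's code: the column names and all values are hypothetical, and filling a missing rate from cases and population is only one possible way to use an auxiliary dataset.

```python
import pandas as pd

# Hypothetical toy tables; the column names are assumptions, not the project's schema.
incidence = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B"],
    "year": [2016, 2017, 2017],
    "incidence_per_1000": [270.0, None, 70.0],  # one missing value
})
auxiliary = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B"],
    "year": [2016, 2017, 2017],
    "confirmed_cases": [27_000, 26_000, 7_000],
    "total_population": [100_000, 100_000, 100_000],
})

# Merge on the shared keys. validate= guards against duplicated keys,
# and indicator= records which rows found a match in the auxiliary table.
merged = incidence.merge(
    auxiliary,
    on=["country", "year"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Fill missing rates from the auxiliary columns where possible.
computed = merged["confirmed_cases"] * 1000 / merged["total_population"]
merged["incidence_per_1000"] = merged["incidence_per_1000"].fillna(computed)

# Rows without a match would still have gaps and are worth inspecting.
unmatched = merged[merged["_merge"] != "both"]
print(merged[["country", "year", "incidence_per_1000"]])
print("unmatched rows:", len(unmatched))
```

Merging on both keys with `validate="one_to_one"` fails loudly if a country-year pair is duplicated, which is one way to keep a merge from silently multiplying rows.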
Data Analysis and Feature Engineering:
Extensive data analysis was carried out alongside feature engineering to draw further insights from the data. During feature engineering, we combined different aspects of the datasets to create new features, including total malaria cases, standardized incidence rate, mortality rate, prevalence rate, and case fatality rate. The analysis conducted is outlined below:
• A factor analysis of the malaria indices (as variables) and reported malaria incidence was carried out to examine the correlations among these variables and their possible relationship with malaria incidence. The combined impact of these variables on malaria incidence was also observed. Bartlett’s test of sphericity was used to obtain the chi-square statistic and p-value for our data. The factor model was then fitted, and a scree plot of the resulting eigenvalues was created.
• Individual engineered features were analysed over time and space to draw further insights (please refer to the research article).
• A counterfactual scenario of the preventive measures was modelled to predict the incidence rates.
• Spatial analysis was also performed using techniques such as clustering analysis, spatial autocorrelation, and spatial regression using the spreg module. Two methods were used for spatial autocorrelation:
o Global Moran’s I statistic
o Hotspot / cold spot analysis using the local Getis-Ord Gi* statistic
Due to the limitations of the first method, the second approach was carried out to investigate the spatial pattern of malaria incidence rates.
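For the factor analysis step in the list above, Bartlett's test of sphericity and the eigenvalues behind a scree plot can be sketched from the correlation matrix. This is a from-scratch illustration on synthetic data using NumPy and SciPy. It is not the team's code, and the function name and data are assumptions.

```python
import numpy as np
from scipy.stats import chi2


def bartlett_sphericity(data):
    """Return (chi-square statistic, p-value, eigenvalues sorted descending).

    Tests the null hypothesis that the variables are uncorrelated, i.e. that
    the correlation matrix is an identity matrix.
    """
    x = np.asarray(data, dtype=float)
    n, p = x.shape
    corr = np.corrcoef(x, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)  # log-determinant of the correlation matrix
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    p_value = chi2.sf(statistic, dof)
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # values for a scree plot
    return statistic, p_value, eigenvalues


# Synthetic data: four variables that share one common factor (illustration only).
rng = np.random.default_rng(0)
common = rng.normal(size=200)
data = np.column_stack([common + 0.5 * rng.normal(size=200) for _ in range(4)])

statistic, p_value, eigenvalues = bartlett_sphericity(data)
print(f"chi-square = {statistic:.1f}, p-value = {p_value:.3g}")
print("eigenvalues:", np.round(eigenvalues, 3))
```

A small p-value indicates that the variables are correlated enough for factor analysis to be meaningful. Plotting the sorted eigenvalues against their rank gives the scree plot.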
Several corresponding data visualizations were produced. These included interactive maps, built with folium, that show malaria incidence rates by country and year, as well as geographical visualizations of the malaria hotspot areas identified in the hotspot analysis described above. Temporal trends of malaria metrics (incidence rate, prevalence rate, mortality rate, case fatality rate, etc.) were also plotted. Most of these visualizations are included in the research article and presentation slides, depending on how we intend to use them.
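The global Moran's I statistic used in the spatial autocorrelation analysis can be sketched from its definition. The version below is a plain NumPy illustration with a hypothetical four-region neighbour matrix. It is not the team's implementation, which the text says relied on spatial analysis libraries. Because the statistic summarizes the whole study area in a single number, it cannot show where clusters are located, which is presumably why a local statistic such as Gi* was also used.

```python
import numpy as np


def morans_i(values, weights):
    """Global Moran's I for values observed at n locations.

    weights is an n x n spatial weights matrix with a zero diagonal.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size
    if w.shape != (n, n):
        raise ValueError("weights must be an n x n matrix")
    z = x - x.mean()                          # deviations from the mean
    s0 = w.sum()                              # sum of all weights
    cross_products = (w * np.outer(z, z)).sum()
    return (n / s0) * cross_products / (z ** 2).sum()


# Hypothetical example: four regions in a row, each adjacent to its neighbours.
w = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
clustered = [10, 10, 1, 1]     # similar values sit next to each other
alternating = [10, 1, 10, 1]   # neighbours always differ

print(morans_i(clustered, w))    # positive: spatial clustering
print(morans_i(alternating, w))  # negative: spatial dispersion
```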
Permutation feature importance, computed with Random Forest and XGBoost, guided the selection of features for modeling. Using RMSE as the error metric, this technique evaluates the importance of each feature by permuting its values while keeping the other features unchanged and then measuring the resulting drop in model performance (here, the increase in RMSE). Overall, only 5 of the candidate features were retained for easy interpretation and deployment. The selected features included ‘Malaria cases reported’, ‘Malaria death’, ‘Total Malaria Cases’, and ‘Total Population’.
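A minimal sketch of permutation importance with RMSE scoring, using scikit-learn's `permutation_importance` on synthetic data, is shown below. The feature names and data are hypothetical. The example uses only a Random Forest and makes no claim about the team's exact settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (illustration only): one informative feature and one pure-noise feature.
rng = np.random.default_rng(42)
n = 300
informative = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])
y = 5.0 * informative + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# With negative-RMSE scoring, each importance is the average increase in RMSE
# observed when that feature's values are shuffled.
result = permutation_importance(
    model,
    X_test,
    y_test,
    scoring="neg_root_mean_squared_error",
    n_repeats=10,
    random_state=0,
)
for name, mean, std in zip(
    ["informative", "noise"], result.importances_mean, result.importances_std
):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Computing the importances on held-out data, as here, measures how much the model relies on each feature for generalization, not how well it has memorized the training set.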
At this stage of the project, four (4) models were built:
• Random Forest
• XGBoost
• Stacking regressor using Random Forest as the final estimator
• Voting regressor
Model training followed the conventional steps: standardizing the input variables using StandardScaler from the sklearn library, splitting the data into training and test sets, initializing and fitting the model, making predictions on the test set, and evaluating them using mean absolute error, mean squared error, root mean squared error, and R2 score.
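The training steps above can be sketched with scikit-learn as follows. Several choices here are assumptions made for illustration: the data is synthetic, `GradientBoostingRegressor` stands in for the boosted model so that the example needs only scikit-learn, and the scaler is fitted on the training split only (a common way to avoid leaking test-set statistics, which may differ from the order used in the project).

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
    VotingRegressor,
)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: four features on different scales, two of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) * np.array([1.0, 50.0, 1000.0, 0.1])
y = 10.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)


def base_estimators():
    """Fresh base learners; GradientBoostingRegressor is a stand-in for XGBoost."""
    return [
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ]


models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
    "stacking": StackingRegressor(
        estimators=base_estimators(),
        final_estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    ),
    "voting": VotingRegressor(estimators=base_estimators()),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": r2_score(y_test, pred),
    }
    print(f"{name}: RMSE={results[name]['rmse']:.3f}, R2={results[name]['r2']:.4f}")
```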
Model Validation and Hyperparameter Tuning
A 10-fold cross-validation was performed on the training set for each baseline model, and the mean and standard deviation of the scores were computed. For hyperparameter tuning, a random search with cross-validation was run over the hyperparameters of each model.
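The validation and tuning steps might look like the sketch below, which uses scikit-learn's `cross_val_score` and `RandomizedSearchCV` on synthetic data. The search space, the number of sampled configurations, and the data are hypothetical and are not the values used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score

# Synthetic training data (illustration only).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 4))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(scale=0.3, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# 10-fold cross-validated RMSE for a baseline model, reported as mean +/- std.
scores = -cross_val_score(
    RandomForestRegressor(n_estimators=30, random_state=0),
    X_train,
    y_train,
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
print(f"CV RMSE: {scores.mean():.4f} +/- {scores.std():.4f}")

# Randomized search over a small, hypothetical hyperparameter space.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 3, 6],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=cv,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print(f"best CV RMSE: {-search.best_score_:.4f}")
```

A randomized search samples a fixed number of configurations instead of trying every combination, which keeps tuning affordable when the search space is large.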
During evaluation, several metrics were considered, including the RMSE and R2 score. The model with the lowest RMSE and a high R2 score was chosen: the Random Forest Regressor, with an RMSE of about 48 and an R2 score of 0.9918.
Interpretability was handled using LIME, and a baseline classification threshold was then applied to classify the predicted incidence rates as low, medium, or high.
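The threshold step can be sketched as a simple mapping from a predicted rate to a category. The cut-off values below are hypothetical placeholders chosen only for illustration, since the write-up does not state the thresholds that were used.

```python
def classify_incidence(rate_per_1000, low_upper=100.0, high_lower=250.0):
    """Map a predicted incidence rate (per 1,000 population) to a category.

    low_upper and high_lower are hypothetical cut-offs, not the project's values.
    """
    if rate_per_1000 < 0:
        raise ValueError("incidence rate cannot be negative")
    if rate_per_1000 < low_upper:
        return "low"
    if rate_per_1000 < high_lower:
        return "medium"
    return "high"


# Hypothetical predicted rates.
predictions = [12.5, 100.0, 180.0, 320.0]
labels = [classify_incidence(rate) for rate in predictions]
print(labels)  # ['low', 'medium', 'medium', 'high']
```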
The best-performing model, i.e. the Random Forest, was selected and deployed on Streamlit.
Below, we present the predictive performance of the models. The tables provide an overview of the RMSE and R2 scores achieved during the training and testing phases.
Table 1: Baseline model performance on both train and test sets
METRIC | RANDOM FOREST | XGBOOST | STACKING REGRESSOR | VOTING REGRESSOR
EVALUATION ON TRAINING SET
RMSE | 43.4975 | 0.0999 | 12.8840 | 21.7508
R2 SCORE | 0.9939 | 0.999 | 0.9936 | 0.9985
EVALUATION ON TEST SET
RMSE | 44.4683 | 54.8749 | 42.8703 | 45.6289
R2 SCORE | 0.9931 | 0.9895 | 0.9342 | 0.9927

Table 2: Model performance after cross-validation and hyperparameter tuning
METRIC | RANDOM FOREST | XGBOOST | STACKING REGRESSOR | VOTING REGRESSOR
CROSS-VALIDATION EVALUATION ON TRAIN SETS
RMSE | 96.8500 ± 72.665 | 80.2247 ± 51.539 | 90.2765 ± 72.718 | 80.3541 ± 59.627
R2 SCORE | 0.9144 ± 0.113 | 0.9574 ± 0.00345 | 0.9332 ± 0.0765 | 0.9502 ± 0.0604
HYPERPARAMETER TUNING: EVALUATION ON TEST SETS
RMSE | 48.3735 | 83.0140 | 82.6712 | 58.3468
R2 SCORE | 0.9918 | 0.9759 | 0.9761 | 0.9881
Performance comparison between different models
A bar graph of total malaria incidence
A bar graph of malaria incidence rates per 1000 population
Graph showing annual average of malaria prevalence in Africa between 2007 and 2017
This project used real-world data to classify malaria incidence in Africa, analyzed the same data for insights into the effectiveness of malaria control measures, and used the selected features to build an improved malaria incidence prediction model. The results suggest that the principal variable influencing malaria incidence differs from one country to another. Our dataset covered malaria incidence in only thirty-six African countries between 2007 and 2017. Future work can replicate and extend our work to other countries in Africa, as well as to other countries where malaria is prevalent, and can make use of more recent data. The main objective of this study was not only to locate African countries at higher risk of malaria, but also to build a more comprehensive computational model capable of predicting malaria incidence in Africa.
For this project, we considered only variables with a factor loading of at least 0.5. Consequently, the relationship between other variables and malaria incidence was not evaluated. Beyond the variables considered here, factors such as seasonal changes like rainfall, the distribution of different types of malaria parasites, and other disease control programs, such as those for COVID-19, affect malaria control and resource management. Future work might add these factors to the model as features and apply other factor analysis methods to get a fuller picture.