HDSC Spring ’23 Capstone Project
Global household electrification refers to the percentage of households worldwide with access to electricity, which is essential for reducing poverty, fostering economic growth, and raising living standards. Although there is no universally agreed definition of access, it generally covers a reliable power supply, cooking facilities, and a minimum level of energy consumption. Progress has been made: global electricity access rose from 71% in 1990 to 87% in 2016, thanks to advances in infrastructure and policy. However, disparities remain. In 2016, 13% of the world’s population still lacked electricity, with the shortfall concentrated in underdeveloped areas. Closing this gap is a shared responsibility of governments, international bodies, and other stakeholders, and doing so promotes sustainable development and inclusivity (Ritchie et al., 2022).
While extensive research has examined the challenges of household electrification in Sub-Saharan Africa, disparities in access to electricity within OECD (Organisation for Economic Co-operation and Development) nations remain less explored. Despite the high overall level of economic development in these countries, certain segments of their populations still lack sufficient access to electricity, which holds back their socio-economic advancement and overall well-being.
The existing electrification datasets pose a substantial challenge due to their inconsistencies, inaccuracies, and limited accessibility. These deficiencies have the potential to hinder informed and effective decision-making processes.
Aim and Objectives
The primary aim of this project is to develop a predictive model of household electrification focused solely on OECD countries. The objectives are:
- to incorporate advanced machine learning techniques and additional datasets to refine the model’s accuracy.
- to analyze the dataset for patterns, correlations, and trends that yield useful insights.
The project follows the workflow described in the sections below.
Data Gathering & Cleaning
To capture the dynamics of household electrification globally, multiple datasets were sourced from reputable organizations.
- International Energy Agency (IEA): The IEA dataset is a pivotal source of information on electricity generation. It spans the period from 2016 to 2023 and gives a detailed breakdown of the electricity produced by each country.
- Data World: The dataset obtained from Data World offers a broader perspective, covering the period from 1960 to 2017.
- WHO: The WHO dataset contributes a distinct dimension by providing information on the distribution of people in rural and urban areas.
A thorough data cleaning process was carried out on three datasets: the initial dataset, a second sourced dataset, and rural population data. A new feature representing the urban population was derived from the rural dataset. The datasets were then merged in two steps: the initial dataset was first combined with the second sourced dataset, and the population data was then added. In the final merged dataset, electric rural and electric urban rates were calculated; these are pivotal indicators of household electrification progress. This data cleaning phase served as the project’s cornerstone.
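The merge order described above can be sketched with pandas. Everything specific here is an assumption made for illustration: the column names (`Location`, `Year`, `Value`, `access_pct`, `rural_pct`), the toy numbers, the idea that the urban share is the complement of the rural share, and the formula for the electric rural and urban rates, which the write-up does not state.

```python
import pandas as pd

# Toy stand-ins for the three cleaned datasets (all values are made up).
initial = pd.DataFrame({
    "Location": ["AAA", "AAA", "BBB"],
    "Year": [2016, 2017, 2016],
    "Value": [100.0, 110.0, 50.0],
})
second = pd.DataFrame({
    "Location": ["AAA", "AAA", "BBB"],
    "Year": [2016, 2017, 2016],
    "access_pct": [100.0, 100.0, 99.5],  # hypothetical column
})
rural = pd.DataFrame({
    "Location": ["AAA", "AAA", "BBB"],
    "Year": [2016, 2017, 2016],
    "rural_pct": [20.0, 19.0, 40.0],
})

# Assumption: the urban share is the complement of the rural share.
rural["urban_pct"] = 100.0 - rural["rural_pct"]

# Merge in the order described: initial + second, then the population data.
merged = initial.merge(second, on=["Location", "Year"], how="inner")
merged = merged.merge(rural, on=["Location", "Year"], how="inner")

# Assumption: the rates apportion Value by population share.
merged["electric_rural"] = merged["Value"] * merged["rural_pct"] / 100.0
merged["electric_urban"] = merged["Value"] * merged["urban_pct"] / 100.0

print(merged)
```

An inner join keeps only the country-year pairs present in every dataset; a left join would keep all rows of the initial dataset and leave gaps to be imputed later.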
To enhance data quality and streamline analysis, specific adjustments were applied to the datasets. Notably, data for the year 2023 was excluded from the IEA dataset due to its incomplete nature. Furthermore, a selective feature curation approach was employed, retaining only those relevant to OECD countries.
To address missing data, the KNNImputer method was used, which imputes each missing value from neighboring data points. In addition, to limit the impact of zero values on predictive accuracy, instances with multiple zero values were converted to ‘NaN’ so that they would be treated as missing. The KNNImputer was then applied to both the original missing values and the ‘NaN’ entries created by the zero replacement.
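A minimal sketch of the zero-to-NaN replacement followed by KNN imputation is shown below. The column names, the toy values, and `n_neighbors=2` are assumptions; the write-up does not say which columns were affected or which settings were used, and for simplicity every zero is treated as missing here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric frame (made-up values); zeros stand in for unrecorded entries.
df = pd.DataFrame({
    "Value": [10.0, 0.0, 12.0, 11.0, np.nan],
    "electric_rural": [2.0, 2.2, 0.0, 2.1, 2.3],
    "electric_urban": [8.0, 8.5, 9.0, 0.0, 8.8],
})

# Treat zeros as missing so that they are imputed rather than modeled.
df = df.replace(0, np.nan)

# Each gap is filled with the mean of that feature over the k nearest rows,
# with distance measured on the features that both rows have observed.
imputer = KNNImputer(n_neighbors=2)  # n_neighbors is an assumed setting
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

Observed values pass through unchanged; only the NaN cells are replaced.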
Exploratory Data Analysis
EDA is an important step in the data analysis process because it allows us to understand the data before applying any statistical models or making any decisions. The following observations were made:
- The geographical distribution of average total electricity values across countries was effectively visualized through a map. This visualization (Figure 1) highlighted the United States (USA) as possessing the highest values among the OECD countries studied.
- One of the key trends that emerged was revealed by a line chart (Figure 2) showcasing the variation in total electric values over time. Notably, the year 2014 stood out as the period with the highest electric value, indicating a potential significant event or trend that impacted electricity consumption during that year.
- Delving deeper into the distribution of Total Electricity Value, a bar chart (Figure 3) illustrated a skewed pattern. This asymmetry in distribution has implications for understanding the disparities in electricity consumption across OECD countries.
- Examining rural populations in various countries, a bar chart (Figure 4) identified Slovenia, Portugal, and Slovakia as the top three nations with the highest rural population figures.
- Shifting focus to urban populations, another bar chart (Figure 5) pinpointed Belgium, Iceland, and Israel as countries with the highest urban populations.
- A separate analysis of electric values across countries, depicted in a bar chart (Figure 6), highlighted the United States (USA), Japan, and France as leading in terms of high electric values.
- Finally, a heatmap visualization (Figure 7) revealed strong correlations between the ‘Value’ variable and both ‘electric_rural’ and ‘electric_urban’. Moreover, the high degree of correlation between ‘electric_rural’ and ‘electric_urban’ suggests a potential interdependence between rural and urban electricity consumption patterns.
Figure 1: Geographical distribution of average total electricity values across countries
Figure 2: Trend of electric values over time
Figure 3: Distribution of Total Electricity Value
Figure 4: Countries with the highest rural population
Figure 5: Countries with the highest urban population
Figure 6: Countries with the highest Electric Values
Figure 7: Correlation between the features
The data preprocessing phase involves several essential steps to ensure the quality and suitability of the dataset for analysis and modeling. The following procedures were executed in this regard:
- Categorization of ‘Value’ Variable: The ‘Value’ variable was transformed from a continuous variable into a categorical one with three classes (‘Low’, ‘Medium’, and ‘High’) using the qcut method. This categorization makes electricity consumption levels easier to interpret.
- Class Distribution Analysis: The class distribution of the newly categorized ‘Value’ variable was examined. The number of unique classes, their labels, and their proportions within the dataset were calculated and visualized in a bar plot (Figure 8). This showed that the target variable, ‘Value’, is imbalanced.
- Train-Test Split: Features were separated from the target variable, and the dataset was split into training and testing sets using the train_test_split function from the scikit-learn library, with 80% of the data used for training and 20% for testing.
- Label Encoding: Categorical columns such as ‘Location’ were encoded using the LabelEncoder to convert their categorical values into numerical values. This encoding step is necessary for most machine learning algorithms. Both the training and testing sets were encoded accordingly.
- Data Balancing: The training set was balanced using the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance issues. SMOTE generates synthetic samples of the minority class to match the distribution of the majority class.
- Standard Scaling: Numerical features were standardized using the StandardScaler, which scales features to have zero mean and unit variance. This preprocessing step ensures that features are on a similar scale, preventing dominance by features with larger magnitudes.
Figure 8: Imbalanced Target variable (Value)
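The preprocessing steps listed above can be sketched as follows. The data is synthetic, and the column names, `random_state`, and the use of stratification are assumptions rather than details from the project. On this synthetic data the three quantile bins come out roughly equal in size, unlike the imbalance reported for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the merged dataset (random values, not real data).
data = pd.DataFrame({
    "Location": rng.choice(["AAA", "BBB", "CCC", "DDD"], size=n),
    "Year": rng.integers(2016, 2023, size=n),
    "electric_rural": rng.gamma(2.0, 10.0, size=n),
    "electric_urban": rng.gamma(2.0, 40.0, size=n),
})
data["Value"] = data["electric_rural"] + data["electric_urban"]

# Bin the continuous target into three quantile-based classes.
data["Value_class"] = pd.qcut(
    data["Value"], q=3, labels=["Low", "Medium", "High"]
)

# Share of each class, as would be plotted in a class-distribution bar chart.
class_share = data["Value_class"].value_counts(normalize=True)

# Separate features from the target, then split 80/20.
X = data.drop(columns=["Value", "Value_class"])
y = data["Value_class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the encoder on every known location so the test set cannot
# contain a label the encoder has never seen.
encoder = LabelEncoder().fit(data["Location"])
X_train["Location"] = encoder.transform(X_train["Location"])
X_test["Location"] = encoder.transform(X_test["Location"])

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(class_share)
```

SMOTE is left out of the sketch because it lives in the separate imbalanced-learn package. With that package, `SMOTE().fit_resample(X_train, y_train)` would be applied to the encoded training set only, so that no synthetic samples leak into the test set.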
During the model development phase, a series of steps were taken to build and evaluate predictive models for the dataset. First, a baseline accuracy was determined as the proportion of the most frequent class (the maximum of the normalized value counts), which gave 0.5638. Various machine learning algorithms were then compared for their performance. The models considered were DecisionTreeClassifier (DTC), RandomForestClassifier (RFC), GradientBoostingClassifier (GBC), AdaBoostClassifier (ADBC), XGBClassifier (XGB), LGBMClassifier (LGB), CatBoostClassifier (CBC), Support Vector Classifier (SVC), Gaussian Naive Bayes (GNB), Logistic Regression (LGR), and KNeighborsClassifier (KNC).
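The baseline calculation amounts to one line of pandas. The labels below are made up, since the real class counts are not given; the point is only that the baseline is the accuracy of always predicting the most frequent class.

```python
import pandas as pd

# Made-up training labels; the real class counts are not given in the text.
y_train = pd.Series(["High"] * 5 + ["Medium"] * 3 + ["Low"] * 2)

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = y_train.value_counts(normalize=True).max()

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should score clearly above this figure.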
Through cross-validation with a 5-fold split, the accuracy scores for each model were computed. The resulting scores revealed the strengths and weaknesses of different algorithms. Additionally, a boxplot (Figure 9) visualization was employed to facilitate a comprehensive comparison of these models based on their accuracy scores.
Figure 9: Comparison of the accuracy results of the models
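A 5-fold comparison of this kind can be sketched with scikit-learn’s `cross_val_score`. The data is synthetic, and only five scikit-learn models are included to keep the example self-contained; the boosting libraries used in the project (XGBoost, LightGBM, CatBoost) would be added to the same dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the preprocessed training set.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

models = {
    "DTC": DecisionTreeClassifier(random_state=0),
    "RFC": RandomForestClassifier(n_estimators=50, random_state=0),
    "GNB": GaussianNB(),
    "LGR": LogisticRegression(max_iter=1000),
    "KNC": KNeighborsClassifier(),
}

# Five accuracy scores per model, one per fold.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

The per-fold arrays in `scores` are the values a boxplot would display, one box per model.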
Furthermore, a specific model, the LGBMClassifier, was selected for in-depth analysis due to its promising performance. This model was trained using the preprocessed and scaled training data. Subsequently, an evaluation of training and test accuracies was conducted, resulting in a training accuracy of 1.0 and a test accuracy of 0.9942.
To gain a more comprehensive understanding of the model’s performance, a confusion matrix (Figure 11) was generated, illustrating the model’s classification outcomes. Additionally, a classification report (Figure 10) was generated, providing insights into precision, recall, and F1-score for each class.
Finally, feature importances were derived from the LGBMClassifier model and visualized in a bar chart (Figure 12) to highlight the significance of different features in predicting the target variable.
Figure 10: Classification Report
Figure 11: Confusion matrix
Figure 12: Feature Importance
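The evaluation steps (train and test accuracy, confusion matrix, classification report, and feature importances) can be sketched as below. Note the substitution: scikit-learn’s `GradientBoostingClassifier` stands in for the LGBMClassifier so that the example needs no extra package, and the data is synthetic, so the numbers it prints are unrelated to the project’s results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the preprocessed dataset.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stand-in for the LGBMClassifier used in the project.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
y_pred = model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f"train accuracy={train_acc:.4f} test accuracy={test_acc:.4f}")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1-score for each class.
print(classification_report(y_test, y_pred))

# Feature indices ranked by importance, largest first.
ranked = sorted(
    enumerate(model.feature_importances_), key=lambda p: p[1], reverse=True
)
for index, importance in ranked:
    print(f"feature_{index}: {importance:.3f}")
```

A large gap between training and test accuracy would point to overfitting.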
The deployment of the developed machine learning model was achieved using Streamlit, a platform for interactive data applications. This involved saving the trained LGBMClassifier model in a Streamlit-compatible format (pickle file), creating a Python script to construct the application’s layout and functionality, and storing the code in a GitHub repository for version control. By connecting the GitHub repository to a Streamlit account, the application was deployed, allowing users to interact with the model through a web browser.
Link to the web app:
Figure 13: The deployed model web app
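Saving the model as a pickle file and loading it back, which is the step a Streamlit script would perform at startup, can be sketched as follows. The file name `model.pkl` is hypothetical, the data is synthetic, and `GradientBoostingClassifier` again stands in for the LGBMClassifier.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data and a stand-in model (the project used LGBMClassifier).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"  # hypothetical file name

    # Serialize the trained model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as the web app would before serving predictions.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions.
predictions_match = bool((loaded_model.predict(X) == model.predict(X)).all())
print(predictions_match)
```

In a deployed app the fitted encoder and scaler would need to be saved alongside the model, so that user input is transformed the same way as the training data. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.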
In conclusion, this research aimed to employ various machine learning algorithms to predict electricity values within OECD countries, ultimately contributing to the improvement of household electricity access. Based on the findings, it is evident that the Light Gradient Boosting Machine outperformed other machine learning algorithms used in this study for predicting household electrification.
As a recommendation, following the focus on Sub-Saharan countries in the previous study and the shift to OECD nations in this project, it is advisable to adopt a more expansive approach for future research, encompassing a global perspective. Exploring electricity access and related metrics on a global scale through collaborative efforts could lead to a more comprehensive and impactful project spanning diverse regions and economies.
Hannah Ritchie, Max Roser and Pablo Rosado (2022). “Energy”. Published online at OurWorldInData.org. Retrieved from: https://ourworldindata.org/energy [Online Resource]