HDSC Fall ’22 Premiere Project
Introduction
Machine learning and artificial intelligence have transformed various fields by extracting knowledge from raw data. In this project, Team Scrapy used the “Adult Income Census” dataset from Kaggle, which is based on 1994 United States Census Bureau data. The team aimed to predict whether an individual earns more or less than $50,000 per year, using variables such as age, capital gains and losses, native country, and education level. The project offers insight into how well income level can be predicted from census data.
Data Preparation
In the data preparation phase, the data was explored to understand its structure, then cleaned, transformed, and integrated. Below is a snippet of the first 5 observations in the data:
The dataset contained 14 variables that were used to predict whether an individual’s income would be less or more than 50K per year. The columns in the dataset are as follows:
After removing rows with missing values (represented by “?”) and restricting the column names to alphanumeric characters, the dataset was further manipulated and cleaned. The final dataset consisted of 30,139 rows and 15 columns.
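The cleaning step described above can be sketched in pandas. This is a minimal illustration, not the team's actual code: the function name `clean_census`, the toy column names, and the choice to replace non-alphanumeric characters with an underscore are all assumptions.

```python
import re

import pandas as pd


def clean_census(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows containing the "?" placeholder and sanitise column names."""
    # Flag every row in which at least one cell equals the "?" placeholder.
    has_placeholder = (df == "?").any(axis=1)
    cleaned = df.loc[~has_placeholder].reset_index(drop=True)
    # Assumption: characters other than letters and digits become "_".
    cleaned.columns = [re.sub(r"[^0-9a-zA-Z]", "_", name) for name in cleaned.columns]
    return cleaned


# Toy frame standing in for the real data (values are made up).
raw = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "work-class": ["Private", "?", "State-gov"],
        "native.country": ["United-States", "Cuba", "?"],
    }
)
tidy = clean_census(raw)
print(tidy)
```

Only the first toy row survives, because each of the other two contains a "?" in some column.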
Exploratory data analysis (EDA)
Exploratory data analysis (EDA) is a crucial step in data science that involves analyzing and summarizing data sets to understand their key characteristics. It helps data scientists uncover patterns, identify anomalies, test hypotheses, and validate assumptions. To conduct EDA on the dataset, the team utilized the Pandas Profiling library, which generated a report with descriptive statistics. Here are some of the findings from the report:
- The youngest and oldest ages in the dataset were 17 and 90 years respectively, with a mean age of 36.44 years.
- The work class column contains 7 unique work classes, with individuals working in the private sector representing a significant portion of the dataset.
- 85% of the observations were from white individuals.
- The dataset had twice as many observations of males as of females.
Correlations in the dataset are shown below:
Feature engineering
No new features were added. The education column was dropped because it is redundant with the education number column. Categorical variables were encoded using one-hot encoding for compatibility with the models.
Numeric columns were scaled using min-max scaling to bring them onto a common range. The dataset showed mild imbalance, with the positive class representing approximately 24% of the observations, so SMOTE was used to oversample the minority class.
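As a rough illustration of the encoding and scaling step, the sketch below one-hot encodes the non-numeric columns and min-max scales the numeric ones with plain pandas. The helper name `encode_and_scale` and the toy columns are assumptions; the team may equally have used scikit-learn's preprocessing classes.

```python
import pandas as pd


def encode_and_scale(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """One-hot encode non-numeric columns and min-max scale the numeric ones."""
    out = df.copy()
    for col in numeric_cols:
        lo, hi = out[col].min(), out[col].max()
        # Map [lo, hi] onto [0, 1]; a constant column becomes all zeros.
        out[col] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0
    categorical_cols = [c for c in out.columns if c not in numeric_cols]
    # Each category becomes its own 0/1 indicator column.
    return pd.get_dummies(out, columns=categorical_cols, dtype=int)


# Toy frame (made-up values) with one numeric and one categorical column.
toy = pd.DataFrame({"age": [17, 90, 36], "sex": ["Male", "Female", "Male"]})
features = encode_and_scale(toy, numeric_cols=["age"])
print(features)
```

In a real pipeline the minimum and maximum would normally be computed on the training split only and then reused for the other splits, so that no information leaks from held-out data.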
SMOTE (Synthetic Minority Oversampling Technique) addresses class imbalance by generating synthetic samples for the minority class. Rather than duplicating existing rows, which is what makes random oversampling prone to overfitting, it creates new instances in feature space by interpolating between a minority-class instance and one of its nearest minority-class neighbors.
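The interpolation idea can be shown in a few lines of NumPy. This is a simplified sketch of the core step only, and `smote_sketch` with its parameters is a hypothetical name for illustration. Projects typically rely on a library implementation such as the `SMOTE` class in imbalanced-learn; the source does not say which one was used here.

```python
import numpy as np


def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic rows by interpolating between minority neighbours.

    X_min holds only minority-class rows and needs at least two of them.
    """
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its k nearest neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between them.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Four made-up minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority points, so it stays inside the region the minority class already occupies.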
Model training and evaluation
The dataset was split into training, test, and evaluation sets using the train_test_split() function of sklearn, and the following classifiers were trained:
- Logistic regression
- Decision tree classifier
- Random forest classifier
- K-nearest neighbors classifier
- Support vector machine
- Naïve Bayes classifier
- LightGBM classifier
- XGBoost classifier
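A three-way split can be obtained with two calls to train_test_split(). The sketch below uses toy data, and the split proportions, stratification, and random seed are assumptions rather than the values the team used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the encoded features and the income label.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 76 + [1] * 24)  # roughly the 24% positive rate reported above

# First hold out 20% as the final, unseen evaluation set.
X_rest, X_eval, y_rest, y_eval = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
# Then split the remainder into training and test sets (25% of 80 rows = 20).
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)
print(len(X_train), len(X_test), len(X_eval))
```

When oversampling is part of the pipeline, it is usually applied to the training portion only, so that the test and evaluation sets keep the original class distribution.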
The performance of the models was examined using accuracy, precision, recall and F1 scores. The results are shown below:
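For reference, the four metrics can be computed from the confusion-matrix counts as in the sketch below. The labels are made up purely to illustrate the formulas and are not the project's results; `classification_scores` is a hypothetical helper, and scikit-learn's `sklearn.metrics` module provides equivalent functions.

```python
def classification_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only: 1 means income above 50K.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = classification_scores(y_true, y_pred)
print(scores)
```

On an imbalanced dataset like this one, accuracy alone can look good for a model that rarely predicts the positive class, which is why precision, recall, and F1 are reported alongside it.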
Observation
The LightGBM and XGBoost models outperformed the other classifiers, so they were selected for evaluation on the unseen data, as shown below:
The LightGBM model outperformed the XGBoost model and was therefore chosen for further development and use.
Conclusion
The objective of current research in data science is to develop systems and algorithms that extract knowledge from data. The findings obtained in this project serve as a benchmark for future endeavors in predicting values from census data. This project can also serve as a foundation for enhancing existing classifiers and techniques, leading to improved technologies for accurately predicting an individual’s income level.