HDSC Winter’23 Premiere Project
A Project by Team ARIMA
Introduction
A high-risk pregnancy poses a serious threat to the health of the mother and her unborn child, often necessitating specialized medical attention. Some pregnancies become high-risk as they progress, while others involve increased risk factors even before conception. Early and consistent prenatal care can greatly improve the chances of a healthy pregnancy and delivery. However, a shortage of medical experts, exacerbated by population growth, particularly in developing countries like India, leaves many lower- and middle-income women, especially in rural areas, without proper healthcare or awareness of potential pregnancy complications. The fear of unnecessary and costly medical tests prescribed by doctors further compounds these issues.
Aims and Objectives
The project’s goal is to create a machine learning model and a web application for predicting the risk level in pregnant patients. The dataset includes various factors impacting patient health, and the objective is to classify patients into low, medium, or high-risk categories for pregnancy complications. This will be accomplished by training the machine learning model using historical patient records that incorporate diverse features.
Methodology
This flowchart provides a comprehensive overview of the entire machine learning project lifecycle. It begins with data collection and cleaning, progresses through model training and evaluation, and concludes with model deployment.
Figure: Flowchart showing the whole process of the ML project.
Step1: Data Description
We used an open-source dataset from Kaggle, which can be accessed via the link below:
https://www.kaggle.com/datasets/csafrit2/maternal-health-risk-data?resource=download
The data was collected from different hospitals, community clinics, and maternal health care centers through an IoT-based risk monitoring system.
- Age: Age of the pregnant woman in years.
- SystolicBP: Upper value of blood pressure in mmHg, a significant attribute during pregnancy.
- DiastolicBP: Lower value of blood pressure in mmHg, another significant attribute during pregnancy.
- BS: Blood glucose level as a molar concentration, in mmol/L.
- BodyTemp: Body temperature in degrees Fahrenheit.
- HeartRate: Resting heart rate in beats per minute.
- RiskLevel: Predicted risk intensity level during pregnancy, based on the preceding attributes.
Step2: Data Pre-processing
- The dataset has 1014 rows and 7 columns.
- There are no missing values in the data.
- The number of duplicate rows is 562.
- Most of the columns are numeric.
- We converted the RiskLevel column from a string data type to a numeric data type to make it fit for analysis and modelling.
- To get familiar with the dataset, we used the describe() function and gathered the following information:
- The youngest reported patient in the dataset is 10 years old.
- The oldest patient in the dataset is 70 years old.
- The highest reported blood sugar level in the dataset is 19 mmol/L.
- The lowest diastolic blood pressure in the dataset is 49 mmHg.
- The lowest reported heart rate in the dataset is 7 beats/min.
- The average systolic blood pressure in the dataset is about 113 mmHg.
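The checks listed above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual notebook: the five rows are made up for the example, and the label strings and the 0/1/2 encoding are assumptions.

```python
import pandas as pd

# Made-up sample rows using the dataset's column names (the real data has 1014 rows).
df = pd.DataFrame({
    "Age": [25, 35, 25, 29, 50],
    "SystolicBP": [130, 140, 130, 90, 140],
    "DiastolicBP": [80, 90, 80, 70, 90],
    "BS": [15.0, 13.0, 15.0, 8.0, 15.0],
    "BodyTemp": [98.0, 98.0, 98.0, 100.0, 98.0],
    "HeartRate": [86, 70, 86, 80, 90],
    "RiskLevel": ["high risk", "high risk", "high risk", "mid risk", "low risk"],
})

print(df.shape)                            # (rows, columns)
n_missing = int(df.isnull().sum().sum())   # total number of missing cells
n_duplicates = int(df.duplicated().sum())  # rows that repeat an earlier row

# Assumed label strings and encoding: low = 0, mid = 1, high = 2.
risk_map = {"low risk": 0, "mid risk": 1, "high risk": 2}
df["RiskLevel"] = df["RiskLevel"].map(risk_map)

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = df.describe()
print(summary.loc[["min", "max", "mean"]])
```

Reading the minimum, maximum, and mean rows of the summary is how values such as the youngest patient or the lowest heart rate can be spotted.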
Step3: Exploratory Data Analysis
We performed univariate and bivariate analysis to understand our data better. The results are shown below:
- Distribution of Target Variable (Risk Levels)
0.0 → Low Risk Level, 1.0 → Medium Risk Level, 2.0 → High Risk Level
- Distribution of Age VS Risk Levels
This plot shows that the risk level increases with age.
- Distribution of Heart Rate with Risk Levels
This plot shows the distribution of heart rate with risk level.
- Correlation Plot between features
We created a correlation plot to examine the relationships between the features. The analysis revealed that the body temperature feature exhibits strong negative correlations with all other dataset features. Systolic BP and diastolic BP have a weak negative correlation only with heart rate, while the remaining features are positively correlated. Notably, there is a significant positive correlation between systolic BP and diastolic BP, which is relevant to the classification problem at hand.
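As a rough sketch of how such a correlation matrix can be computed, the snippet below uses pandas on synthetic data. The values are randomly generated for illustration only (diastolic BP is deliberately constructed to depend on systolic BP), so the numbers it prints are not the project's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for three of the dataset's columns.
systolic = rng.normal(113, 18, n)
diastolic = 0.6 * systolic + rng.normal(8, 6, n)  # built to correlate with systolic
heart_rate = rng.normal(74, 8, n)                 # independent of the other two

df = pd.DataFrame({
    "SystolicBP": systolic,
    "DiastolicBP": diastolic,
    "HeartRate": heart_rate,
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

The resulting matrix is symmetric with ones on the diagonal, and it can be passed to any heatmap plotting function to produce a correlation plot.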
Step4: Feature Engineering
The dataset requires minimal preprocessing since it has no missing values or categorical features. Feature scaling is also unnecessary, as many classification algorithms, such as random forests and decision trees, do not depend on scaling. However, the data does contain outliers, and certain features exhibit very low correlations with our target variable.
- Removing Outliers
A normal resting heart rate is between 60 and 100 bpm, so 7 bpm is clearly an outlier. We remove this entire row before training.
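A minimal sketch of this filtering step is shown below. The three rows are made up, and the cutoff `MIN_PLAUSIBLE_HR` is a hypothetical threshold chosen for the example, not a value taken from the project.

```python
import pandas as pd

# Made-up rows; the first one carries the implausible 7 bpm heart rate.
df = pd.DataFrame({
    "Age": [16, 25, 32],
    "HeartRate": [7, 76, 88],
    "RiskLevel": [0, 1, 2],
})

# Hypothetical lower bound for a physiologically plausible resting heart rate.
MIN_PLAUSIBLE_HR = 30

# Keep only rows at or above the bound, then renumber the index.
df_clean = df[df["HeartRate"] >= MIN_PLAUSIBLE_HR].reset_index(drop=True)
print(df_clean)
```

Filtering by a threshold rather than by a fixed row position keeps the step valid even if the row order changes or the value appears more than once.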
- Feature Selection
We use the scikit-learn ExtraTreeClassifier model to assess feature importance. Our findings indicate that the heart rate feature has a notably low importance score of only 0.018, so we exclude the entire HeartRate column from our training process.
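The snippet below sketches how tree-based feature importance can be read from scikit-learn. It assumes the ensemble `ExtraTreesClassifier` from `sklearn.ensemble` (scikit-learn also has a single-tree `ExtraTreeClassifier` in `sklearn.tree`, and the text does not say which was used). The data and labels are synthetic, with HeartRate deliberately unrelated to the label, so the scores are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(42)
n = 300

# Synthetic features named after three of the dataset's columns.
X = pd.DataFrame({
    "BS": rng.normal(8, 3, n),
    "SystolicBP": rng.normal(113, 18, n),
    "HeartRate": rng.normal(74, 8, n),  # plays no role in the label below
})

# Synthetic 3-class label (0, 1, 2) driven only by BS and SystolicBP.
y = (X["BS"] > 8).astype(int) + (X["SystolicBP"] > 120).astype(int)

model = ExtraTreesClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Drop a feature judged unimportant before training the final model.
X_selected = X.drop(columns=["HeartRate"])
```

Note that these scores measure how much each feature contributes to the trees' splits, which is related to, but not the same as, a correlation with the target.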
Step5: Model Training And Evaluation
Since our problem involves classification, we conducted experiments with various classification algorithms to identify the model that achieves the highest test accuracy.
We began by training 8 different classification models from the scikit-learn library using their default parameters. Among these, BaggingClassifier, DecisionTreeClassifier, and RandomForestClassifier outperformed the others, so we focused our hyperparameter tuning on these three models. We first used RandomizedSearchCV, and among the three, RandomForestClassifier yielded the best results. To further improve accuracy, we then applied GridSearchCV, achieving a final accuracy of 90%. This represents a 10% improvement over the base model. Consequently, we selected RandomForestClassifier as the classification model for our project.
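The two-stage tuning described above can be sketched as follows: a coarse RandomizedSearchCV followed by a narrower GridSearchCV around the best values found. The data comes from `make_classification` and the parameter ranges are assumptions made for this example, so neither the search space nor the resulting accuracy reflects the project's actual runs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic 3-class problem standing in for the maternal health data.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stage 1: sample a few combinations from a coarse, assumed search space.
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5, cv=3, scoring="accuracy", random_state=0,
)
random_search.fit(X_train, y_train)
best = random_search.best_params_

# Stage 2: exhaustive search over a small grid around the stage-1 winner.
split = best["min_samples_split"]
param_grid = {
    "n_estimators": [best["n_estimators"]],
    "max_depth": [best["max_depth"]],
    "min_samples_split": sorted({max(2, split - 1), split, split + 1}),
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# Accuracy of the refitted best model on the held-out test split.
test_accuracy = grid_search.score(X_test, y_test)
print(grid_search.best_params_, round(test_accuracy, 3))
```

Running the random search first keeps the expensive exhaustive search small, since the grid only needs to cover the neighbourhood of an already promising configuration.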
Step6: Model Deployment
We deployed our project using the open-source Streamlit framework and hosted it on Streamlit as well. Streamlit is designed for building data science and machine learning web apps quickly. The web application uses our trained model to generate its predictions. You can access the project through the links below.
Streamlit link: Maternal health Risk Prediction App
Link to repo: Github link
Conclusion
In this project, we aimed to address perinatal complications through a range of machine learning algorithms, ultimately selecting the RandomForestClassifier as our preferred model. It is crucial to advance research in this domain and leverage machine learning to promote safer pregnancies. We see this project as a contribution to both artificial intelligence and women’s health.
The model we developed can support maternity care decision-making by identifying and alerting pregnant women who are at elevated risk. This proactive approach can help prevent potential complications, lower diagnostic costs, and ultimately reduce the risk of adverse outcomes such as preterm birth (PTB).
Recommendations
This project takes a novel approach to analyzing and predicting the intensity of risk factors in maternal and fetal health. A current limitation is the relatively small dataset available for assessing risk in pregnant patients. To improve accuracy, future research can expand the dataset and explore deep learning-based neural networks. Even with a better model, expert clinical judgment remains crucial for individual cases.
References
- Using Machine Learning to Predict Complications in Pregnancy: A Systematic Review, https://www.frontiersin.org/articles/10.3389/fbioe.2021.780389/full
- Pregnancy Outcome Prediction study, https://en.wikipedia.org/wiki/Pregnancy_Outcome_Prediction_study
- Model for Predicting Risk Levels in Maternal Healthcare, https://ijariie.com/AdminUploadPdf/MODEL_FOR_PREDICTING_RISK_LEVELS_IN_MATERNAL_HEALTHCARE_ijariie18831.pdf