Team ML-Matrix Premiere Project of the HDSC Fall ’23 Cohort
This project focuses on developing a machine learning model to assess the nutrient profile of various cereals and making dietary recommendations based on ratings. The goal is to enable consumers to make informed and balanced dietary choices to prevent malnutrition by leveraging data-driven insights.
The primary objective of this project is to create a regression model that predicts the rating of cereals, considering various nutritional attributes, and provides dietary recommendations based on ratings.
- Data Collection: Gather data on cereal products, including their nutrient content and consumer ratings.
- Data Preprocessing: Clean and prepare the data for machine learning.
- Exploratory Data Analysis: Visualize the attributes to understand their distributions and relationships.
- Model Selection: Choose a suitable regression algorithm for the task.
- Model Training: Train the regression model on the dataset.
- Evaluation: Assess the model’s performance using appropriate metrics.
- Dietary Recommendations: Develop a recommendation system based on the model’s predictions.
The dataset was obtained from the Kaggle dataset page. It contains:
- Nutritional attributes: Calories, protein, fat, sodium, carbohydrates, fiber, sugars, vitamins, etc.
- Other Attributes: Name, manufacturer, weight, consumer ratings, etc.
- Data Cleaning: Handle inconsistencies. The features potass, carbo, and sugars contain -1 values, which may distort the dataset's statistics and hurt the model. Those -1 values were replaced with zeros.
- High Cardinality Features: The categorical features name and mfr were dropped because their high cardinality would produce high-dimensional encoded features.
- Exploratory Data Analysis: None of the numeric variables in our dataset is normally distributed, and some of them have outliers. Had model training given poor performance, we would have applied a log transformation.
Figure 1: Correlation Heatmap
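The cleaning steps above can be sketched in pandas. This is a minimal illustration, not the project's actual code: the rows are made up, the column names follow the features mentioned in the text, and `clean_cereals` is a hypothetical helper.

```python
import pandas as pd


def clean_cereals(df: pd.DataFrame) -> pd.DataFrame:
    """Replace -1 placeholders with 0 and drop high-cardinality columns."""
    cleaned = df.copy()
    # Columns the write-up reports as containing -1 placeholder values.
    placeholder_cols = ["potass", "carbo", "sugars"]
    cleaned[placeholder_cols] = cleaned[placeholder_cols].mask(
        cleaned[placeholder_cols] == -1, 0
    )
    # Drop the high-cardinality categorical columns named in the text.
    return cleaned.drop(columns=["name", "mfr"])


# Made-up rows for illustration only (not taken from the real dataset).
sample = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C"],
    "mfr": ["X", "Y", "X"],
    "potass": [280, -1, 95],
    "carbo": [5.0, 14.0, -1.0],
    "sugars": [6, -1, 3],
    "rating": [68.4, 33.9, 59.4],
})

cleaned = clean_cereals(sample)
print(cleaned)
# Pairwise correlations of the numeric columns, the basis of a heatmap.
print(cleaned.corr())
```

Replacing -1 with zero mirrors what the text describes; imputing a median or dropping the affected rows would be alternative choices.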
- One Hot Encoding: The categorical feature (type) was one-hot encoded to convert it to numerical values.
- Feature Selection: A high-cardinality categorical feature (name) and two irrelevant features (mfr and shelf) were dropped. Although L1 regularization did not reduce any feature coefficient to exactly zero, it did highlight fiber, protein, fat, carbo (carbohydrates), sugars, and calories as the top six most important features.
- Split Data: The dataset was divided into training and testing sets at a 75–25 split ratio.
- Data Transformation: Normalize and scale features as needed.
Figure 2: Feature importance bar chart with L1 regularization
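The feature engineering steps above might look roughly like the following with pandas and scikit-learn. Everything here is an assumption made for illustration: the data is synthetic, `StandardScaler` stands in for the unspecified scaling method, and `alpha=0.1` is an arbitrary regularization strength.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real project used the Kaggle cereal dataset.
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "type": rng.choice(["C", "H"], size=n),
    "fiber": rng.uniform(0, 10, n),
    "sugars": rng.uniform(0, 15, n),
    "calories": rng.uniform(50, 160, n),
})
# Made-up target so the example runs end to end.
df["rating"] = 50 + 3 * df["fiber"] - 2 * df["sugars"] + rng.normal(0, 1, n)

# One-hot encode the categorical `type` column.
X = pd.get_dummies(df.drop(columns="rating"), columns=["type"], dtype=float)
y = df["rating"]

# 75-25 train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training set only to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# L1-regularized regression; absolute coefficients give a rough importance ranking.
lasso = Lasso(alpha=0.1).fit(X_train_s, y_train)
importance = pd.Series(np.abs(lasso.coef_), index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Scaling before fitting matters here because L1 coefficients are only comparable across features when the features share a common scale.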
We built baseline models using Linear Regression, Ridge Regression, Lasso Regression, and Elastic Net. The Linear Regression model performed best, with a Root Mean Squared Error (RMSE) of 0.024 on our test set. The worst-performing model was Elastic Net, with an RMSE of 5.79 on our test set.
Figure 3: Box Plot showing a range of 10-fold Cross-Validation R-squared scores of regression models.
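A comparison like the one in Figure 3 can be sketched with scikit-learn's `cross_val_score`. The data below is synthetic and the models use default hyperparameters, both of which are assumptions, so the scores will not match the project's results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the cereal features and ratings.
X, y = make_regression(n_samples=80, n_features=8, noise=1.0, random_state=0)

# The four baseline models named in the text, with default settings.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Elastic Net": ElasticNet(),
}

# 10-fold cross-validation, scored with R-squared.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean R2 = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```

Each entry of `scores` holds ten per-fold values, which is the kind of input a box plot of cross-validation scores would need.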
We also applied hyperparameter tuning to some of the other models to see whether their performance metrics could be improved. Ridge regression showed the greatest improvement, reaching an RMSE of 0.024.
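One common way to tune Ridge regression is a grid search over the regularization strength `alpha`. The sketch below assumes `GridSearchCV` with 5-fold cross-validation and an illustrative grid; the write-up does not state which search method or values were actually used, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the cereal dataset.
X, y = make_regression(n_samples=80, n_features=8, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Candidate regularization strengths (illustrative values only).
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
best_alpha = search.best_params_["alpha"]
test_rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
print(f"best alpha = {best_alpha}, test RMSE = {test_rmse:.3f}")
```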
The model’s performance was assessed using various metrics such as Root Mean Squared Error (RMSE) and R-squared.
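For reference, both metrics can be computed directly from their definitions. The helper names and the sample numbers below are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up ratings and predictions.
actual = [30.0, 40.0, 50.0, 60.0]
predicted = [32.0, 38.0, 51.0, 59.0]
print(f"RMSE = {rmse(actual, predicted):.4f}")       # RMSE = 1.5811
print(f"R2   = {r_squared(actual, predicted):.2f}")  # R2   = 0.98
```

RMSE is expressed in the same units as the rating, while R-squared is unitless, which is why the two are usually reported together.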
The code can be found at the link below:
Develop a recommendation system that utilizes the trained model to suggest balanced dietary choices based on consumer preferences and nutrient profiles.
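One simple way such a system could work is to filter cereals by a user's nutrient limits and then rank the remainder by predicted rating. The sketch below is hypothetical: the `Cereal` class, the `recommend` function, the thresholds, and the catalog entries are all invented for illustration and do not describe the project's implementation.

```python
from dataclasses import dataclass


@dataclass
class Cereal:
    name: str
    sugars: float            # grams (assumed unit)
    fiber: float             # grams (assumed unit)
    predicted_rating: float  # output of a trained regression model


def recommend(cereals, max_sugars, min_fiber=0.0, top_k=3):
    """Filter by the user's nutrient limits, then rank by predicted rating."""
    eligible = [
        c for c in cereals if c.sugars <= max_sugars and c.fiber >= min_fiber
    ]
    ranked = sorted(eligible, key=lambda c: c.predicted_rating, reverse=True)
    return ranked[:top_k]


# Made-up catalog; the predicted ratings would come from the model.
catalog = [
    Cereal("Cereal A", sugars=6, fiber=10, predicted_rating=68.4),
    Cereal("Cereal B", sugars=12, fiber=1, predicted_rating=33.9),
    Cereal("Cereal C", sugars=3, fiber=4, predicted_rating=59.4),
    Cereal("Cereal D", sugars=5, fiber=2, predicted_rating=45.1),
]

picks = recommend(catalog, max_sugars=6, min_fiber=2, top_k=2)
print([c.name for c in picks])  # ['Cereal A', 'Cereal C']
```

Filtering before ranking keeps hard dietary limits separate from the model's score, so a high predicted rating can never override a user's restriction.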
This project aims to provide consumers with a data-driven approach to making balanced dietary choices when selecting cereals. By creating a Ridge regression model that assesses nutrient profiles and incorporates ratings, we can help individuals make informed decisions for healthier living.
- Expand the recommendation system to include other food categories.
- Incorporate user-specific preferences and dietary restrictions.
- Collect data continuously to keep the model up to date.