Nutrient Profiling: Cereal Rating Prediction

Feb 6, 2024

Team ML-Matrix Premiere Project of the HDSC Fall ’23 Cohort

Introduction

This project focuses on developing a machine learning model to assess the nutrient profile of various cereals and making dietary recommendations based on ratings. The goal is to enable consumers to make informed and balanced dietary choices to prevent malnutrition by leveraging data-driven insights.

Project Overview

Objective

The primary objective of this project is to create a regression model that predicts the rating of cereals, considering various nutritional attributes, and provides dietary recommendations based on ratings.

Key Tasks

Data Collection: Gather data on cereal products, including their nutrient content and consumer ratings.
Data Preprocessing: Clean and prepare the data for machine learning.
Exploratory Data Analysis: Visualize the attributes to gain deeper insight into the data.
Model Selection: Choose a suitable regression algorithm for the task.
Model Training: Train the regression model on the dataset.
Evaluation: Assess the model’s performance using appropriate metrics.
Dietary Recommendations: Develop a recommendation system based on the model’s predictions.

Data Collection

Data Source

The dataset was obtained from the Kaggle dataset page -

https://www.kaggle.com/datasets/crawford/80-cereals

Data Fields

Nutritional attributes: Calories, protein, fat, sodium, carbohydrates, fiber, sugars, vitamins, etc.
Other Attributes: Name, manufacturer, weight, consumer ratings, etc.

Data Preprocessing

Data Cleaning: Handle inconsistencies. The potass, carbo, and sugars features contain -1 values, which are not valid nutrient amounts and may negatively affect the dataset’s statistics and the model. These values were replaced with zeros.
High Cardinality Features: The high-cardinality categorical features (name and mfr) were dropped to avoid a high-dimensional feature space.
Exploratory Data Analysis: None of the numeric variables in our dataset is normally distributed, and some of them have outliers. Had model training performed poorly, we would have applied a log transformation.
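
The cleaning steps above can be sketched with pandas. This is a minimal illustration, not the team's notebook code: the three rows are made-up stand-ins for the Kaggle data, and only the column names (name, mfr, potass, carbo, sugars) come from the post.

```python
import pandas as pd

# Hypothetical rows mimicking the cereal data; -1 marks an invalid entry.
df = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C"],
    "mfr": ["K", "G", "Q"],
    "carbo": [14.0, -1.0, 21.0],
    "sugars": [6, -1, 3],
    "potass": [280, 95, -1],
    "rating": [68.4, 33.9, 59.4],
})

# Replace the -1 placeholders with zero, as described in the post.
placeholder_cols = ["potass", "carbo", "sugars"]
df[placeholder_cols] = df[placeholder_cols].replace(-1, 0)

# Drop the identifier-like categorical columns.
cleaned = df.drop(columns=["name", "mfr"])
print(cleaned)
```

Replacing -1 with zero keeps every row, at the cost of treating an unknown amount as "none". Imputing the column median would be an alternative worth comparing.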

Figure 1: Correlation Heatmap

Feature Engineering

One Hot Encoding: The categorical feature (type) was one-hot encoded to convert it to numerical values.
Feature Selection: A high-cardinality categorical feature (name) and two irrelevant features (mfr and shelf) were dropped. Although L1 regularization did not reduce any feature coefficient to exactly zero, it did highlight fiber, protein, fat, carbo (carbohydrates), sugars, and calories as the top six features by importance.
Split Data: The dataset was divided into training and testing sets at a 75–25 split ratio.
Data Transformation: Normalize and scale features as needed.
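
The encoding, splitting, and scaling steps above might look like the sketch below. The toy data frame, the random seed, and the choice of StandardScaler are assumptions made for illustration. The post only states that the type feature was one-hot encoded, that a 75–25 split was used, and that features were scaled as needed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; "type" is the categorical column to encode.
df = pd.DataFrame({
    "type": ["C", "H", "C", "C", "H", "C", "C", "H"],
    "calories": [70, 120, 110, 90, 100, 130, 50, 110],
    "fiber": [10.0, 2.0, 1.0, 5.0, 3.0, 0.0, 9.0, 1.5],
    "rating": [68.0, 34.0, 30.0, 55.0, 45.0, 25.0, 70.0, 38.0],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["type"], dtype=int)

X = encoded.drop(columns=["rating"])
y = encoded["rating"]

# 75-25 train/test split; random_state is an arbitrary seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets,
# so that no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```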

Figure 2: Feature importance bar chart with L1 regularization
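
One common way to obtain a ranking like this is to fit a Lasso (L1-regularized) model on standardized features and sort the absolute coefficients. The sketch below uses synthetic data with made-up coefficients and a hypothetical alpha of 0.1, so it shows only the mechanics, not the project's actual results.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the last feature has no effect on the target.
rng = np.random.default_rng(0)
feature_names = ["fiber", "protein", "fat", "sugars", "sodium"]
X = rng.normal(size=(80, 5))
true_coef = np.array([3.0, 2.0, -1.5, -4.0, 0.0])  # made-up values
y = X @ true_coef + 40 + rng.normal(scale=0.1, size=80)

# Standardize so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

# alpha is a hypothetical regularization strength chosen for illustration.
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Rank features by the absolute value of their coefficients.
importance = sorted(
    zip(feature_names, np.abs(lasso.coef_)),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, weight in importance:
    print(f"{name}: {weight:.3f}")
```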

Model Selection

We built baseline models using Linear Regression, Ridge Regression, Lasso Regression, and Elastic Net. The Linear Regression model performed best, with a Root Mean Squared Error (RMSE) of 0.024 on our test set. The worst-performing model was Elastic Net, with an RMSE of 5.79 on our test set.

Figure 3: Box Plot showing a range of 10-fold Cross-Validation R-squared scores of regression models.
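
A baseline comparison of this kind can be sketched as follows. The data here is synthetic and the models use scikit-learn's default settings, both of which are assumptions. The scores it prints are therefore not the figures reported above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data: a nearly linear target with small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 4.0]) + 40 + rng.normal(scale=0.1, size=80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Elastic Net": ElasticNet(),
}

rmse = {}
cv_r2 = {}
for name, model in models.items():
    # Test-set RMSE after fitting on the training split.
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))
    # 10-fold cross-validated R-squared scores (one value per fold).
    cv_r2[name] = cross_val_score(model, X, y, cv=10, scoring="r2")

for name in sorted(rmse, key=rmse.get):
    print(f"{name}: RMSE = {rmse[name]:.3f}, mean CV R^2 = {cv_r2[name].mean():.3f}")
```

The per-fold arrays in cv_r2 are the kind of values a box plot such as Figure 3 summarizes.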

Model Training

We also applied hyperparameter tuning to some of the other models to see whether their performance metrics could be improved. Ridge regression showed the greatest improvement, reaching an RMSE of 0.024.
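
The post does not say which tuning method was used. As one possibility, a grid search over Ridge's alpha parameter with cross-validation could look like this sketch. The synthetic data, the alpha grid, and the 5-fold setting are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: a nearly linear target with small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 4.0]) + 40 + rng.normal(scale=0.1, size=80)

# Hypothetical grid of regularization strengths to try.
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

# scikit-learn maximizes scores, so RMSE is exposed in negated form.
search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(f"best alpha = {best_alpha}, cross-validated RMSE = {best_rmse:.3f}")
```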

Evaluation

The model’s performance was assessed using various metrics such as Root Mean Squared Error (RMSE) and R-squared.
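
For reference, both metrics can be computed directly from their standard definitions. The sketch below uses only the Python standard library and a small made-up set of true and predicted values.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(f"RMSE = {rmse(y_true, y_pred):.4f}")            # RMSE = 0.7071
print(f"R-squared = {r_squared(y_true, y_pred):.4f}")  # R-squared = 0.9000
```

RMSE is expressed in the same units as the rating, so lower is better, while R-squared measures the share of variance explained, so values closer to 1 are better.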

The code can be found at the link below:

https://github.com/ml-matrix/premier-project/blob/3c7b1cc67fbb5f3a63db22296ed916b0e4040e7d/Ml_Matrix_Premiere_Project_1_.ipynb

Dietary Recommendations

Develop a recommendation system that utilizes the trained model to suggest balanced dietary choices based on consumer preferences and nutrient profiles.
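
As a rough sketch of what such a system could do, the function below ranks cereals by predicted rating and optionally filters on a sugar limit. The recommend function, its parameters, and the sample records are all hypothetical. The post does not describe the recommendation logic in detail.

```python
def recommend(cereals, top_n=3, max_sugars=None):
    """Return the names of the top_n cereals by predicted rating.

    cereals: list of dicts with "name", "sugars", and "predicted_rating" keys.
    max_sugars: optional upper bound on sugars, as a simple user preference.
    """
    candidates = [
        c for c in cereals
        if max_sugars is None or c["sugars"] <= max_sugars
    ]
    ranked = sorted(candidates, key=lambda c: c["predicted_rating"], reverse=True)
    return [c["name"] for c in ranked[:top_n]]


# Made-up records; in practice predicted_rating would come from the trained model.
cereals = [
    {"name": "Cereal A", "sugars": 6, "predicted_rating": 68.4},
    {"name": "Cereal B", "sugars": 12, "predicted_rating": 33.9},
    {"name": "Cereal C", "sugars": 3, "predicted_rating": 59.4},
    {"name": "Cereal D", "sugars": 0, "predicted_rating": 72.1},
]

print(recommend(cereals, top_n=2))                # ['Cereal D', 'Cereal A']
print(recommend(cereals, top_n=2, max_sugars=5))  # ['Cereal D', 'Cereal C']
```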

Conclusion

This project aims to provide consumers with a data-driven approach to making balanced dietary choices when selecting cereals. By creating a Ridge regression model that assesses nutrient profiles and incorporates ratings, we can help individuals make informed decisions for healthier living.

Future Enhancements

Expand the recommendation system to include other food categories.
Incorporate user-specific preferences and dietary restrictions.
Collect data continuously to keep the model up to date.