Understanding the Relationship Between Nutritional Content and Cereal Ratings: a Data-driven Approach

8 min readFeb 6, 2024

Premiere Project Documentation of Team InsightHub for HDSC Fall 2023


According to the US Food Drug Administration (FDA)

In an era where quick meals are paramount, breakfast cereals stand out for their convenience. However, with this convenience comes a question of nutritional value, which varies greatly among different cereals. The US Food and Drug Administration (FDA) is amplifying efforts to educate consumers about nutritionally deficient options. The label “healthy” is not just a marketing term; it’s a regulated claim that can only be used if the product meets specific nutritional criteria. Against the backdrop of heightened health consciousness and the crucial role of breakfast cereals in dietary patterns — particularly in underprivileged homes where food diversity is scant — our project seeks to navigate the cereal aisle effectively, ensuring food security and combating malnutrition (https://www.fda.gov/food/food-labeling-nutrition/use-term-healthy-food-labeling).

Problem Statement

Many households rely on breakfast cereals, but knowing which options are truly healthy is challenging, especially in households with limited food choices.


To simplify cereal selection by analyzing nutritional content, categorize cereals as ‘Healthy’ or ‘Unhealthy’ and predict consumer ratings.


  • Analysis of nutritional content provided.
  • Predict ratings of cereals based on their nutritional makeup, providing a reliable guide for consumers.
  • Systematically categorize cereals into ‘Healthy’ and ‘Unhealthy’ groups based on established nutritional criteria.
  • Share insights from the analysis to enhance consumer understanding of cereal nutrition.

Data Description

80 cereal data set contains the nutrition information from 77 unique cereals from 7 different manufacturers. The data also provides information about the serving sizes, that is, how many cups make a serving per cereal and the measurements of several nutritional items present in it.

The data set was taken from the Kaggle competition page — https://www.kaggle.com/datasets/crawford/80-cereals/data

The dataset contains a list of 77 different cereals, their manufacturer, the measurement of food nutrients present, the display shelf, the weight of each serving, the number of cups per serving, and the ratings of each cereal. The description of the fields is as follows:

  • name: Name of cereal
  • mfr: Manufacturer of cereal
  • A = American Home Food Products
  • G = General Mills
  • K = Kelloggs
  • N = Nabisco
  • P = Post
  • Q = Quaker oats
  • R = Ralston Purina
  • type:
  • C = Cold
  • H = Hot
  • calories: calories per serving
  • protein: grams of protein
  • fat: grams of fat
  • sodium: milligrams of sodium
  • fiber: grams of dietary fiber
  • carbo: grams of complex carbohydrates
  • sugars: grams of sugars
  • potass: milligrams of potassium
  • vitamins: vitamins and minerals — 0, 25, 100, indicating the typical percentage of FDA recommended
  • shelf: display shelf (1, 2, or 3, counting from the floor)
  • weight: weight in ounces of one serving
  • cups: number of cups in one serving
  • rating: a rating of the cereals (Possibly from Consumer Reports)

Below is a preview of the dataset for the study.


  • Data Cleaning
  • Exploratory Data Analysis
  • Data Preprocessing
  • Model Selection & Training
  • Z-Score Normalization
  • Evaluation
  • Deployment


Two important steps taken include:

  • Renaming manufacturer abbreviations for clarity
  • Rectifying negative nutrient values by replacing them with the mean.


In the Exploratory Data Analysis phase, we took a comprehensive dive into the dataset. Here, the aim was to get a better understanding, uncover the underlying and irregular patterns and gain robust insights from the data.

Univariate Analysis

We examined each nutritional feature individually to establish a baseline understanding. This provided a foundation for understanding each feature on its own before considering interactions and relationships with other features.

Fig. 1

From this, the following was observed:

  • More than 30 of the products are ranging from 100–120 calories
  • Over 50 products have a protein content ranging from 2–3g
  • Over 55 products have a fat content ranging from 0–2g
  • More than 44 products have a sodium content ranging from 150–200mg
  • About 35 products have a fibre content of 0–2g
  • More than 50 products contain carbohydrate of 12–18g
  • More than 15 products have a sugar content of 2–3g, and about 12 products have a sugar content of 12–13g
  • About 60 products have a vitamin content of 25%

Heat Map

The heat map illustrates feature correlations, highlighting the relationships between various nutrients and ratings.

Fig. 2

This shows that the highest correlation seen is 0.9 which is between fiber and potassium.

Other correlations ranging from 0.5–0.7 are seen in calories and weight, rating and fiber, calories and fat, calories and sugar, protein and fiber, and protein and potassium.

Fig. 3

This shows that rating has a positive moderate relationship with fiber, protein and potassium and a high negative relationship with sugars and calories. Carbohydrates, shelf, cups and vitamins have a very weak relationship with the customer ratings.

Fig. 4

From our nutrient analysis, we inferred that 46.8% of the cereals in this dataset have low Fiber(Health Risk) while 26% of the cereals in the dataset are both low in fiber and high in sugar which shows half of the cereals have low fiber content. 9.1% of the cereals in the dataset have high sugar(Diabetes Risk) which is the lowest relatively.

An article stated that less than 3g of fiber is not healthy enough while another article stated that a serving size of healthy cereal should not contain more than 10g of sugar.

Fig. 5

From the above visualization, it can be observed that Post cereal manufacturers have the highest sugar content while Nabisco has the lowest sugar content when compared to other products.

Fig. 6

Concurrently, it can be observed from the visualization in Fig.6 that Nabisco has the highest customer rating while General Mills has the lowest customer rating.

A rating category was created by mapping the ratings. Pandas cut() function, which takes in the ‘rating’ column as an array to be binned, was used.


With further analysis, we can see that only Kelloggs cereal was considered worthy of the title, ‘Excellent’, though it was only one Cereal. However, Nabisco and Quaker Oat were not far behind, with Nabisco having five(5) out of six(6) of their cereals considered good. Over half of Kelloggs cereals are considered average while the rest except one(1), were rated poor. The majority of General Mills cereals are in the Poor category. The rest of the manufacturer’s cereals;

Ralston Purina’s, Quaker Oats’, Post’s, and American Home Food Products’ were mostly found to be poor or average or both in the rating category.

The analysis was further narrowed down to the name of the cereals rated “Excellent” and “Very Poor” along with their food nutrients.

The cereal rated as Excellent is All-Bran with Extra Fiber manufactured by Kelloggs.

The cereals rated as the poorest are Cap’n’Crunch manufactured by Quaker Oats and Cinnamon Toast Crunch manufactured by General Mills.


After the EDA, the next phase is model building and implementation. First, the data was separated into features and target, the categorical variable was encoded using OneHotEncoder, multicollinearity was removed, and the data was split into train and test.


Different machine learning models were deployed. The models include linear regression, ridge regression and lasso regression. These three models performed well, but ridge regression had the best outcome with R2score of 0.995, Mean ABsolute error of 0.75 and Root Mean Square Error of 1.022.

Z-Score Normalization

Attempting to enhance the model performance, the Z-score normalization was applied but observed a negligible impact on the outcomes.

The metric scores of the models are shown below.


The best performing model was deployed on Streamlit, due to its ease of use in deploying machine learning models. This allowed us to interactively present our results through a user-friendly interface.

Here is the link to the deployed model: https://cerealratings.streamlit.app/?utm_medium=social


This project evaluated the nutritional value of popular breakfast cereals by conducting an extensive Exploratory Data Analysis to scrutinize nutrient content.

We subsequently developed a machine learning model that forecasts consumer ratings.


  • Getting standard cereal requirements for all nutrients was difficult as this information is not readily available on the FDA website. Resorted to using articles based on other research.
  • With a relatively small dataset, the risk of overfitting increases. We may require more data to build a more robust model.
  • The information about how the cereal ratings are determined is not readily available. Hence, the variability in cereal ratings may not be consistent across different levels of predictor variables, leading to heteroscedasticity.
  • Some predictor variables in the dataset display multicollinearity and can confound the individual impact they have on the dependent variable, in this case, cereal ratings. For example, potassium and fiber have a strong correlation coefficient of 0.9. Due to this high correlation and the common nutritional emphasis on fiber, potassium was excluded from the model to help isolate the unique contribution of each variable to the cereal ratings.


However, to increase the predictive power and generalizability of the model, we recommend an expanded data collection encompassing a wider variety of cereal brands. Also, as new cereal products are introduced, and nutritional contents are regulated, the model must be regularly updated with new data and adjusted to maintain its relevance and accuracy.




Our mission is to develop an army of creative problem solvers using an innovative approach to internships.