recipes-data-analysis
What Do We Think of the Recipes?
By
Tawseef Rahman
tawseefr@umich.edu
This is my work for the Final Project in the EECS 398: Practical Data Science course at the University of Michigan - Ann Arbor.
In this project, I take a look at a recipes dataset. I will…
- propose a research question
- clean the dataset
- display plots of univariate analysis
- display plots of bivariate analysis
- propose a prediction problem
- train a baseline model
- tune my baseline model to create a final model
Introduction
Introduction and Question Identification
Question and Reasoning
The question that I am investigating in my project is:
Do recipes with more reviews (review count) have a higher average rating (avg_rating)?
Understanding this relationship could help users make more informed decisions when selecting recipes. If there is a strong correlation between the number of reviews and the average rating, it may suggest that popular recipes tend to be better received - or that crowd consensus leads to more accurate ratings over time. This insight could also benefit food bloggers, content creators, and developers of food recommendation systems.
About the Dataset
The dataset I’m using contains information on 234,429 recipes and their reviews, compiled into a single DataFrame named merged_recipes_df. This DataFrame includes a wide range of features related to recipe content, nutrition, user information, and review data.
The DataFrame merged_recipes_df contains the following columns:
Variable Name | Variable Type | Variable Purpose |
---|---|---|
name | string | The name of the recipe |
id | int | The ID number corresponding to the recipe |
minutes | int | The length of time (in minutes) it takes to cook the recipe |
contributor_id | int | The ID number for the author of the recipe post |
submitted | YYYY-MM-DD | The date when the recipe post was published |
tags | list of strings | Relevant tags related to the recipe post |
n_steps | int | The number of steps in the recipe |
steps | list of strings | The steps used to cook the recipe |
description | string | The description of the recipe |
ingredients | list of strings | All the ingredients for the recipe |
n_ingredients | int | The number of ingredients for the recipe |
calories | float | The number of calories for the recipe |
total_fat_PDV | float | The percentage of daily value of total fat in the recipe |
sugar_PDV | float | The percentage of daily value of sugar in the recipe |
sodium_PDV | float | The percentage of daily value of sodium in the recipe |
protein_PDV | float | The percentage of daily value of protein in the recipe |
saturated_fat_PDV | float | The percentage of daily value of saturated fat in the recipe |
carbohydrates_PDV | float | The percentage of daily value of carbohydrates in the recipe |
user_id | int | The ID number for the author of the review post |
recipe_id | int | The ID number for the recipe for the corresponding review post |
date | YYYY-MM-DD | The date of the review post |
rating | float | The rating of the review (1 to 5, inclusive) |
review | string | The actual review |
avg_rating | float | The average value of ratings for the recipe |
year | int | The year the review was posted (extracted from date) |
Data Cleaning and Exploratory Data Analysis
Data Cleaning
Data Cleaning and Preparation
Before analyzing the relationship between review count and average rating, it was essential for me to clean and preprocess the raw data. The original RAW_recipes.csv dataset contains metadata and nutritional information for each recipe. The original RAW_interactions.csv dataset contains individual user interactions, including ratings and written reviews. These raw datasets contained a variety of formats and inconsistencies that needed to be addressed before performing any meaningful analysis.
1. Loading the Datasets
   a. I started by loading both datasets using pandas.
2. Converting and Expanding the Nutrition Data
   a. In RAW_recipes.csv, the nutrition column is stored as a string representation of a list. To analyze nutritional content, I needed to convert that string into an actual list and separate the list into meaningful components.
   b. I applied the eval() function to transform the string into a Python list.
   c. I then expanded this list into individual nutrition columns: calories, total_fat_PDV, sugar_PDV, sodium_PDV, protein_PDV, saturated_fat_PDV, and carbohydrates_PDV.
   d. Step 2c enabled me to treat nutritional data as structured numerical values, making it possible to include them in future quantitative analyses or visualizations.
3. Merging Recipe and Interaction Data
   a. I then joined the RAW_recipes.csv dataset and the RAW_interactions.csv dataset using a left join on id from RAW_recipes.csv and recipe_id from RAW_interactions.csv. This merge preserved all recipe entries, even those without user interactions.
   b. Step 3a connected user-generated ratings and reviews to specific recipes, allowing me to measure popularity (review count) and perceived quality (avg_rating) for each recipe.
4. Handling Invalid Ratings
   a. Some ratings in the merged dataset had a value of 0, which is not a valid rating on a typical 1-5 scale. These entries were likely due to user error or placeholder values; I replaced the 0 values with NaN to prevent skewing the results.
   b. Step 4a ensured that the average rating calculations were accurate and not artificially deflated by invalid scores.
5. Calculating Average Rating Per Recipe
   a. I then calculated the mean rating for each recipe using the cleaned ratings and merged this value back into the main dataset as a new column, avg_rating.
   b. The avg_rating column is central to my project question. By aggregating this value for each recipe, I can compare it meaningfully to the number of reviews the recipe received.
6. Extracting the Year of Submission
   a. The date column in the RAW_interactions.csv dataset records the timestamp of each interaction. I converted the date into a year column to make it easier for me to analyze trends over time.
   b. Step 6a allowed for temporal analyses, like examining whether newer recipes tend to receive higher ratings or more reviews.
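The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the two small frames use made-up values in place of RAW_recipes.csv and RAW_interactions.csv, the helper name clean is hypothetical, ast.literal_eval is swapped in for eval() as a safer way to parse the stringified list, and groupby(...).transform("mean") is used as a shortcut for computing the per-recipe mean and merging it back.

```python
import ast

import numpy as np
import pandas as pd

# Made-up stand-ins for the two raw CSV files (assumption: same column names).
recipes = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "nutrition": [
        "[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",
        "[595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0]",
        "[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]",
    ],
})
interactions = pd.DataFrame({
    "user_id": [10, 11, 12],
    "recipe_id": [1, 1, 2],
    "date": ["2008-11-19", "2009-01-05", "2012-01-26"],
    "rating": [4, 0, 5],
    "review": ["good", "no rating given", "great"],
})

NUTRITION_COLS = [
    "calories", "total_fat_PDV", "sugar_PDV", "sodium_PDV",
    "protein_PDV", "saturated_fat_PDV", "carbohydrates_PDV",
]


def clean(recipes, interactions):
    # Step 2: parse the stringified list, then expand it into one column per nutrient.
    parsed = recipes["nutrition"].apply(ast.literal_eval)
    nutrition = pd.DataFrame(parsed.tolist(), columns=NUTRITION_COLS, index=recipes.index)
    recipes = pd.concat([recipes.drop(columns="nutrition"), nutrition], axis=1)

    # Step 3: the left join keeps recipes that have no interactions.
    merged = recipes.merge(interactions, how="left", left_on="id", right_on="recipe_id")

    # Step 4: a rating of 0 is treated as missing rather than as a real score.
    merged["rating"] = merged["rating"].replace(0, np.nan)

    # Step 5: per-recipe mean rating (NaN is skipped), repeated on every row of that recipe.
    merged["avg_rating"] = merged.groupby("id")["rating"].transform("mean")

    # Step 6: year of the interaction date (NaN for recipes with no interactions).
    merged["year"] = pd.to_datetime(merged["date"]).dt.year
    return merged


merged_recipes_df = clean(recipes, interactions)
```

In this toy data, recipe 3 has no reviews, so it survives the left join with missing review columns, and the 0 rating on recipe 1 no longer drags that recipe's average down.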
head of the Cleaned merged_recipes_df DataFrame
Here’s the head of the cleaned merged_recipes_df DataFrame:
name | id | minutes | contributor_id | submitted | tags | n_steps | steps | description | ingredients | n_ingredients | calories | total_fat_PDV | sugar_PDV | sodium_PDV | protein_PDV | saturated_fat_PDV | carbohydrates_PDV | user_id | recipe_id | date | rating | review | avg_rating | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 brownies in the world best ever | 333281 | 40 | 985201 | 2008-10-27 | [‘60-minutes-or-less’, ‘time-to-make’, ‘course’, ‘main-ingredient’, ‘preparation’, ‘for-large-groups’, ‘desserts’, ‘lunch’, ‘snacks’, ‘cookies-and-brownies’, ‘chocolate’, ‘bar-cookies’, ‘brownies’, ‘number-of-servings’] | 10 | [‘heat the oven to 350f and arrange the rack in the middle’, ‘line an 8-by-8-inch glass baking dish with aluminum foil’, ‘combine chocolate and butter in a medium saucepan and cook over medium-low heat , stirring frequently , until evenly melted’, ‘remove from heat and let cool to room temperature’, ‘combine eggs , sugar , cocoa powder , vanilla extract , espresso , and salt in a large bowl and briefly stir until just evenly incorporated’, ‘add cooled chocolate and mix until uniform in color’, ‘add flour and stir until just incorporated’, ‘transfer batter to the prepared baking dish’, ‘bake until a tester inserted in the center of the brownies comes out clean , about 25 to 30 minutes’, ‘remove from the oven and cool completely before cutting’] | these are the most; chocolatey, moist, rich, dense, fudgy, delicious brownies that you’ll ever make…..sereiously! there’s no doubt that these will be your fav brownies ever for you can add things to them or make them plain…..either way they’re pure heaven! | [‘bittersweet chocolate’, ‘unsalted butter’, ‘eggs’, ‘granulated sugar’, ‘unsweetened cocoa powder’, ‘vanilla extract’, ‘brewed espresso’, ‘kosher salt’, ‘all-purpose flour’] | 9 | 138.4 | 10 | 50 | 3 | 3 | 19 | 6 | 386585 | 333281 | 2008-11-19 | 4 | These were pretty good, but took forever to bake. I would send it ended up being almost an hour! Even then, the brownies stuck to the foil, and were on the overly moist side and not easy to cut. They did taste quite rich, though! Made for My 3 Chefs. | 4 | 2008 |
1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | 2011-04-11 | [‘60-minutes-or-less’, ‘time-to-make’, ‘cuisine’, ‘preparation’, ‘north-american’, ‘for-large-groups’, ‘canadian’, ‘british-columbian’, ‘number-of-servings’] | 12 | [‘pre-heat oven the 350 degrees f’, ‘in a mixing bowl , sift together the flours and baking powder’, ‘set aside’, ‘in another mixing bowl , blend together the sugars , margarine , and salt until light and fluffy’, ‘add the eggs , water , and vanilla to the margarine / sugar mixture and mix together until well combined’, ‘add in the flour mixture to the wet ingredients and blend until combined’, ‘scrape down the sides of the bowl and add the chocolate chips’, ‘mix until combined’, ‘scrape down the sides to the bowl again’, ‘using an ice cream scoop , scoop evenly rounded balls of dough and place of cookie sheet about 1 - 2 inches apart to allow for spreading during baking’, ‘bake for 10 - 15 minutes or until golden brown on the outside and soft & chewy in the center’, ‘serve hot and enjoy !’] | this is the recipe that we use at my school cafeteria for chocolate chip cookies. they must be the best chocolate chip cookies i have ever had! if you don’t have margarine or don’t like it, then just use butter (softened) instead. | [‘white sugar’, ‘brown sugar’, ‘salt’, ‘margarine’, ‘eggs’, ‘vanilla’, ‘water’, ‘all-purpose flour’, ‘whole wheat flour’, ‘baking soda’, ‘chocolate chips’] | 11 | 595.1 | 46 | 211 | 22 | 13 | 51 | 26 | 424680 | 453467 | 2012-01-26 | 5 | Originally I was gonna cut the recipe in half (just the 2 of us here), but then we had a park-wide yard sale, & I made the whole batch & used them as enticements for potential buyers ~ what the hey, a free cookie as delicious as these are, definitely works its magic! Will be making these again, for sure! Thanks for posting the recipe! | 5 | 2012 |
412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | [‘60-minutes-or-less’, ‘time-to-make’, ‘course’, ‘main-ingredient’, ‘preparation’, ‘side-dishes’, ‘vegetables’, ‘easy’, ‘beginner-cook’, ‘broccoli’] | 6 | [‘preheat oven to 350 degrees’, ‘spray a 2 quart baking dish with cooking spray , set aside’, ‘in a large bowl mix together broccoli , soup , one cup of cheese , garlic powder , pepper , salt , milk , 1 cup of french onions , and soy sauce’, ‘pour into baking dish , sprinkle remaining cheese over top’, ‘bake for 25 minutes or until cheese is lightly browned’, ‘sprinkle with rest of french fried onions and bake until onions are browned and cheese is bubbly , about 10 more minutes’] | since there are already 411 recipes for broccoli casserole posted to “zaar” ,i decided to call this one #412 broccoli casserole.i don’t think there are any like this one in the database. i based this one on the famous “green bean casserole” from campbell’s soup. but i think mine is better since i don’t like cream of mushroom soup.submitted to “zaar” on may 28th,2008 | [‘frozen broccoli cuts’, ‘cream of chicken soup’, ‘sharp cheddar cheese’, ‘garlic powder’, ‘ground black pepper’, ‘salt’, ‘milk’, ‘soy sauce’, ‘french-fried onions’] | 9 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 | 29782 | 306168 | 2008-12-31 | 5 | This was one of the best broccoli casseroles that I have ever made. I made my own chicken soup for this recipe. I was a bit worried about the tsp of soy sauce but it gave the casserole the best flavor. YUM! The photos you took (shapeweaver) inspired me to make this recipe and it actually does look just like them when it comes out of the oven. Thanks so much for sharing your recipe shapeweaver. It was wonderful! Going into my family’s favorite Zaar cookbook :) | 5 | 2008 |
412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | [‘60-minutes-or-less’, ‘time-to-make’, ‘course’, ‘main-ingredient’, ‘preparation’, ‘side-dishes’, ‘vegetables’, ‘easy’, ‘beginner-cook’, ‘broccoli’] | 6 | [‘preheat oven to 350 degrees’, ‘spray a 2 quart baking dish with cooking spray , set aside’, ‘in a large bowl mix together broccoli , soup , one cup of cheese , garlic powder , pepper , salt , milk , 1 cup of french onions , and soy sauce’, ‘pour into baking dish , sprinkle remaining cheese over top’, ‘bake for 25 minutes or until cheese is lightly browned’, ‘sprinkle with rest of french fried onions and bake until onions are browned and cheese is bubbly , about 10 more minutes’] | since there are already 411 recipes for broccoli casserole posted to “zaar” ,i decided to call this one #412 broccoli casserole.i don’t think there are any like this one in the database. i based this one on the famous “green bean casserole” from campbell’s soup. but i think mine is better since i don’t like cream of mushroom soup.submitted to “zaar” on may 28th,2008 | [‘frozen broccoli cuts’, ‘cream of chicken soup’, ‘sharp cheddar cheese’, ‘garlic powder’, ‘ground black pepper’, ‘salt’, ‘milk’, ‘soy sauce’, ‘french-fried onions’] | 9 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 | 1.19628e+06 | 306168 | 2009-04-13 | 5 | I made this for my son’s first birthday party this weekend. Our guests INHALED it! Everyone kept saying how delicious it was. I was I could have gotten to try it. | 5 | 2009 |
412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | [‘60-minutes-or-less’, ‘time-to-make’, ‘course’, ‘main-ingredient’, ‘preparation’, ‘side-dishes’, ‘vegetables’, ‘easy’, ‘beginner-cook’, ‘broccoli’] | 6 | [‘preheat oven to 350 degrees’, ‘spray a 2 quart baking dish with cooking spray , set aside’, ‘in a large bowl mix together broccoli , soup , one cup of cheese , garlic powder , pepper , salt , milk , 1 cup of french onions , and soy sauce’, ‘pour into baking dish , sprinkle remaining cheese over top’, ‘bake for 25 minutes or until cheese is lightly browned’, ‘sprinkle with rest of french fried onions and bake until onions are browned and cheese is bubbly , about 10 more minutes’] | since there are already 411 recipes for broccoli casserole posted to “zaar” ,i decided to call this one #412 broccoli casserole.i don’t think there are any like this one in the database. i based this one on the famous “green bean casserole” from campbell’s soup. but i think mine is better since i don’t like cream of mushroom soup.submitted to “zaar” on may 28th,2008 | [‘frozen broccoli cuts’, ‘cream of chicken soup’, ‘sharp cheddar cheese’, ‘garlic powder’, ‘ground black pepper’, ‘salt’, ‘milk’, ‘soy sauce’, ‘french-fried onions’] | 9 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 | 768828 | 306168 | 2013-08-02 | 5 | Loved this. Be sure to completely thaw the broccoli. I didn't and it didn't get done in time specified. Just cooked it a little longer though and it was perfect. Thanks Chef. | 5 | 2013 |
Univariate Analysis
plotly Plot 1: Histogram
Distribution of average ratings (avg_rating):
The histogram shows that the distribution of average recipe ratings is left-skewed, with most recipes receiving high ratings. However, there is a noticeable drop between the number of recipes with an average rating of 4.5 and those with a perfect 5.0, suggesting that while many recipes are well-received, far fewer consistently earn top marks.
plotly Plot 2: Histogram
Distribution of the number of steps (n_steps):
The histogram of recipe steps is right-skewed, indicating that while some recipes are quite complex, the majority involve a manageable 5 to 10 steps. This suggests that most users tend to submit relatively simple recipes, which may attract more reviews and higher ratings due to their accessibility. That is a factor worth considering when exploring the relationship between review count and average rating.
Bivariate Analysis
plotly Plot 3: Scatter Plot
Relationship between the number of reviews (review count) and average rating (avg_rating)
The scatter plot shows that most recipes cluster around 20 to 40 reviews with average ratings between 4 and 5. While this clustering suggests that popular recipes tend to be highly rated, the relatively flat trend line indicates no strong linear relationship between review count and average rating. Simply having more reviews does not necessarily mean a recipe is rated higher.
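One way to put a number on the "flat trend line" is to aggregate to one row per recipe and compute the Pearson correlation between the two quantities. The sketch below is illustrative only: the rows are made up, and it assumes the review count is the number of non-missing review entries per recipe, which the write-up does not state explicitly.

```python
import pandas as pd

# Made-up rows, one per (recipe, review) pair as in merged_recipes_df.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 4],
    "review": ["ok", "good", "great", "fine", "nice", "wow", None],
    "rating": [4.0, 5.0, 5.0, 3.0, 5.0, 5.0, None],
})

# Collapse to one row per recipe.
per_recipe = df.groupby("id").agg(
    review_count=("review", "count"),  # count() ignores missing reviews
    avg_rating=("rating", "mean"),     # mean() ignores missing ratings
).reset_index()

# Pearson correlation; pairs with a missing avg_rating are dropped.
# A value near 0 indicates no strong linear relationship.
r = per_recipe["review_count"].corr(per_recipe["avg_rating"])
```

A correlation close to zero on the real data would agree with the flat trend line, although it only rules out a linear relationship.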
plotly Plot 4: Bar Chart
Average percentage of daily value of sugar (sugar_PDV) by year (year), for recipes at or below 2,000 calories
The bar chart reveals that the average sugar percentage of daily value in recipes remained relatively stable from 2008, then shows a noticeable increase beginning around 2015, with peaks in 2015 and 2018. This suggests that in recent years, users may have submitted sweeter or less health-conscious recipes. To ensure a more accurate representation of recipes, this analysis only includes recipes with 2,000 calories or fewer, as those above that threshold contained extreme outliers in sugar content that skewed the data.
Interesting Aggregates
Here’s a pivot table that summarizes the year trends (year) in recipe reviews (review count) and average ratings (avg_rating):
year | avg_rating | review |
---|---|---|
2008 | 4.62743 | 31593 |
2009 | 4.67466 | 48420 |
2010 | 4.70223 | 36970 |
2011 | 4.70723 | 29187 |
2012 | 4.73037 | 23827 |
2013 | 4.70944 | 21849 |
2014 | 4.68466 | 12580 |
2015 | 4.6112 | 8206 |
2016 | 4.59224 | 5999 |
2017 | 4.59481 | 8825 |
2018 | 4.59397 | 6915 |
The pivot table summarizes yearly trends in both average recipe ratings and total review counts, providing a broader view of how user engagement and user perceptions have evolved over time. By tracking these metrics from year to year, I can identify periods of increased activity or shifting user preferences - which helps contextualize my main question about the relationship between review count and average rating. For example, a rise in review volume without a corresponding increase in average rating could suggest that more reviews don’t always translate to higher recipe approval.
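A table of this shape could be produced with pandas' pivot_table. The following is a sketch on made-up rows; it assumes that avg_rating is averaged and review is counted within each year, which matches the column meanings above but is not confirmed by the write-up.

```python
import pandas as pd

# Made-up rows standing in for merged_recipes_df.
df = pd.DataFrame({
    "year": [2008, 2008, 2009, 2009, 2009],
    "avg_rating": [4.0, 5.0, 4.5, 4.5, 3.0],
    "review": ["a", "b", "c", None, "e"],
})

# One row per year: mean of avg_rating and number of non-missing reviews.
yearly = pd.pivot_table(
    df,
    index="year",
    values=["avg_rating", "review"],
    aggfunc={"avg_rating": "mean", "review": "count"},
)
```

Passing a dictionary to aggfunc lets each column use its own aggregation, so the rating column is averaged while the review column is counted.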
Imputation
Only the missing values in the rating column needed to be imputed with a value of NaN because the rating column was the only column from the RAW_interactions.csv dataset that contained missing values for some of the rows.
Framing a Prediction Problem
Problem Identification
The goal of my prediction task is to predict the average rating (avg_rating) that a recipe receives based on various features such as the number of reviews, ingredients, nutritional values, and more. Because avg_rating is a continuous numerical variable, this is a regression problem.
- Response Variable: The response variable I am predicting is avg_rating. I chose this variable because it reflects how positively a recipe is perceived by users, and predicting the average rating can help identify what kinds of recipes are most likely to be rated highly.
- Evaluation Metric: I am using the coefficient of determination (R²) to evaluate my model’s performance. The R² value measures the proportion of variance in the response variable that can be explained by the features used in the model. I chose R² over other regression metrics (like Mean Absolute Error (MAE) or Root Mean Square Error (RMSE)) because it provides an interpretable measure of how well the model captures the variability in recipe ratings, which is especially useful when comparing different models or feature sets.
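For reference, the coefficient of determination can be computed directly from its definition: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up ratings, and the function name r_squared is purely illustrative.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


y = [4.0, 5.0, 3.0, 4.0]           # made-up average ratings
perfect = r_squared(y, y)          # predictions equal the truth
mean_only = r_squared(y, [4.0] * 4)  # always predict the mean rating
```

Perfect predictions score 1, while a model that always predicts the mean scores 0, so a score near zero means the model does little better than guessing the average rating.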
Baseline Model
For my baseline model, I used a linear regression approach to predict the avg_rating of a recipe based on two features:
- review_count: The number of user-submitted reviews per recipe
- calories: The total calorie content of the recipe
Both features are quantitative variables. Because the model uses only numeric data, no encoding (such as one-hot encoding) was necessary.
After splitting the data into training and testing sets (80% train, 20% test), I evaluated model performance using the R² metric. This metric tells me how well the model explains variability in the response variable.
The R² score for the baseline model is 0.0003. This value is extremely low, close to zero, which means that the model explains virtually none of the variance in average recipe ratings. In other words, simply knowing how many reviews a recipe has and its calorie content is not sufficient to accurately predict the recipe’s rating. Therefore, I do not consider this baseline model to be good. It serves primarily as a starting point for comparison as I explore more complex models and richer sets of features.
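A baseline of this kind might look like the sketch below. It runs on synthetic random data rather than the real recipes, so the score it produces says nothing about the 0.0003 reported above. The random_state values and the one-row-per-recipe frame named recipes are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: one row per recipe).
rng = np.random.default_rng(0)
n = 200
recipes = pd.DataFrame({
    "review_count": rng.integers(1, 50, size=n),
    "calories": rng.uniform(50, 2000, size=n),
    "avg_rating": rng.uniform(1, 5, size=n),
})

X = recipes[["review_count", "calories"]]
y = recipes["avg_rating"]

# 80% train, 20% test; the seed is fixed only to make the sketch repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Both features are numeric, so no encoding step is needed.
baseline = LinearRegression().fit(X_train, y_train)
baseline_r2 = r2_score(y_test, baseline.predict(X_test))
```

Scoring on the held-out 20% rather than on the training rows is what makes the R² value an estimate of how the model generalizes.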
Final Model
For my final model, I introduced two additional features: protein_PDV (percentage of daily value of protein) and n_ingredients (number of ingredients) in a recipe. I chose these features because they provide nutritional and complexity-related information about each recipe. A recipe's protein content and the number of ingredients it requires may influence user satisfaction and perceived quality, which in turn could affect its average rating. Including these variables helps capture underlying patterns in the data that go beyond the calorie count and review count used in the baseline model.
To model the prediction task, I used a Random Forest Regressor, a non-linear ensemble method that is well-suited for capturing complex relationships in the data. I made this choice because user ratings can be influenced by interactions between features that a linear model might not be able to capture.
I applied preprocessing using a ColumnTransformer:
- I standardized numeric features (calories, protein_PDV, and n_ingredients) using StandardScaler.
- I normalized the highly skewed review_count using QuantileTransformer with a normal output distribution.
I then used GridSearchCV to tune the hyperparameters of the Random Forest model. The best hyperparameters from the grid search were:
- n_estimators: 100
- max_depth: None
- min_samples_split: 2
I selected the best model by evaluating its performance using 5-fold cross-validation, with R² as the evaluation metric. R² is an appropriate metric for regression tasks because it quantifies how much of the variance in the response variable is explained by the model.
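The preprocessing and tuning setup described above could be wired together roughly as follows. This is a sketch on synthetic random data, not the project's code. The candidate values in param_grid, the n_quantiles setting, the step names, and the random seeds are all assumptions; only the transformers, the model type, the tuned parameter names, and the 5-fold R² scoring come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Synthetic stand-in data (assumption: one row per recipe).
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "review_count": rng.integers(1, 50, size=n),
    "calories": rng.uniform(50, 2000, size=n),
    "protein_PDV": rng.uniform(0, 100, size=n),
    "n_ingredients": rng.integers(2, 20, size=n),
})
y = pd.Series(rng.uniform(1, 5, size=n), name="avg_rating")

# Standardize three numeric features; map the skewed review_count to a normal shape.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["calories", "protein_PDV", "n_ingredients"]),
    ("quantile",
     QuantileTransformer(output_distribution="normal", n_quantiles=50),
     ["review_count"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(random_state=42)),
])

# Hypothetical candidate values; the grid actually searched is not given in the text.
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [None, 5],
    "forest__min_samples_split": [2, 5],
}

# 5-fold cross-validation, scored by R².
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X, y)
best_params = search.best_params_
```

Putting the transformers inside the pipeline means they are refit on the training portion of each fold, so the cross-validated scores are not influenced by the held-out rows.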
The final model achieved an R² score of 0.4502, a substantial improvement over the baseline model’s R² score of 0.0003. This improvement suggests that incorporating domain knowledge through new features and using a more expressive model significantly improved the model’s ability to predict average recipe ratings.