This week, I’m proud of myself for getting outside my comfort zone to learn web scraping. For a pretty long time, I viewed this skill as difficult to learn and practice. That belief took root back when I attended my first Data Hackathon in 2024. I was placed in a squad of individuals who were pretty well-versed in what we were tasked with: scrape house prices and their related data from a home properties website, get it into the right shape, conduct good ol’ EDA and finally develop an ML model to (you guessed it) predict house prices. Nothing fancy. But scraping? Heck was that? CSS selectors? Beautiful Soup? Dawg, what? Fast forward, I did my research and developed a decent understanding. Oh, I also fell in love with R!
Anyway, enough yap, let’s get cooking. In this blog post, I will show you how to scrape recipes from Allrecipes.com in R using the RSelenium and rvest packages. This tutorial assumes you’re already set up. If not, Samer Hijjazi has you covered with this useful guide. Otherwise, let’s help alleviate world hunger.

To get started, let’s load all our necessary libraries:
library(RSelenium)
library(polite)
library(purrr)
library(tidyverse)
library(mongolite)
library(wdman)
library(netstat)
library(rvest)
library(progressr)
Now that we have the tools we need at the ready, let’s set up our session using Firefox (or Chrome; refer to the docs). The code chunk below will start a Firefox session and set up your client.
rD <- rsDriver(browser = "firefox",
               chromever = NULL,
               phantomver = NULL,
               port = free_port())
remDr <- rD$client
Great! Now let’s navigate to one of the recipes on the site, which shows you how to go about making some simple and easy stuffed peppers.
remDr$navigate("https://www.allrecipes.com/recipe/105016/simple-and-easy-stuffed-peppers/")
page <- read_html(remDr$getPageSource()[[1]])
This reads the HTML source of the current browser page controlled by Selenium into an rvest-compatible object so we can parse and extract data from it using CSS or XPath selectors. Since we are interested in targeting the elements containing the ingredients, the author’s name, when the recipe was updated/published, the prep, cook and other times, nutrition facts, and the ratings and reviews data, this is the way to go:
# Ingredients
ingredients <- page |>
  html_nodes("ul.mm-recipes-structured-ingredients__list li") |>
  html_text(trim = TRUE)
# Author name
author <- page |>
  html_nodes(".mntl-attribution__item-name, span.comp.mntl-bylines__item.mntl-attribution__item.mntl-attribution__item-name") |>
  html_text(trim = TRUE)
# Published date
date <- page |>
  html_node("div.mntl-attribution__item-date") |>
  html_text(trim = TRUE)
# Recipe metadata labels + values (e.g. servings, prep time)
labels <- page |> html_nodes("div.mm-recipes-details__label") |> html_text(trim = TRUE)
values <- page |> html_nodes("div.mm-recipes-details__value") |> html_text(trim = TRUE)
details <- tibble::tibble(label = labels, value = values)
# Nutrition facts
nutrition_names <- page |> html_nodes("td.mm-recipes-nutrition-facts-summary__table-cell.text-body-100") |> html_text(trim = TRUE)
nutrition_values <- page |> html_nodes("td.mm-recipes-nutrition-facts-summary__table-cell.text-body-100-prominent") |> html_text(trim = TRUE)
nutrition <- tibble::tibble(fact = nutrition_names, amount = nutrition_values)
# Ratings and Reviews
average_rating <- page |> html_node("#mm-recipes-review-bar__rating_1-0") |> html_text(trim = TRUE)
total_ratings <- page |> html_node("#mm-recipes-review-bar__rating-count_1-0") |> html_text(trim = TRUE)
review_count <- page |> html_node("#mm-recipes-review-bar__comment-count_1-0") |> html_text(trim = TRUE)
And to compile all this information into some nice tables:
meta <- tibble::tibble(
  author = author,
  date_published = date,
  average_rating = average_rating,
  total_ratings = total_ratings,
  review_count = review_count
)
ingredients_tbl <- tibble::tibble(ingredient = ingredients)
nutrition_tbl <- tibble::tibble(
  fact = nutrition_names,
  amount = nutrition_values
)
details_tbl <- tibble::tibble(label = labels, value = values)
# Optionally: pivot wider
details_wide <- tidyr::pivot_wider(details_tbl, names_from = label, values_from = value)
recipe_data <- list(
  url = "https://www.allrecipes.com/recipe/105016/simple-and-easy-stuffed-peppers/",
  author = author,
  date_published = date,
  ratings = list(
    average = average_rating,
    total = total_ratings,
    reviews = review_count
  ),
  details = details_wide,
  ingredients = ingredients_tbl,
  nutrition = nutrition_tbl
)
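Just to eyeball what we have assembled (purely optional), here's a quick peek at the structure of that list:
# Top-level structure of the recipe_data list we just built
str(recipe_data, max.level = 1)
# The ingredients tibble on its own
recipe_data$ingredients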
So far so hoot! We have successfully scraped a single recipe, but we want to scrape a whole lot more. Let’s navigate to the page that will lead us to thousands more, for mankind must feast.
# This will lead to 14k recipes
recipes_a_to_z <- "https://www.allrecipes.com/recipes-a-z-6735880#alphabetical-list-z"
remDr$navigate(recipes_a_to_z)
Phase 1: Building the Recipe Category Index
The page above lists a few hundred recipe categories, each of which leads to the actual recipes. Therefore, we need to grab these category links and store them. You know the drill by now: fetch the page source, read the HTML, grab the elements of interest (the recipe category links) and go brrrr. And yes, you need to assign the page variable again since we are on a new page.
page_source <- remDr$getPageSource()[[1]]
page <- read_html(page_source)
links <- page |>
  html_node("#mntl-alphabetical-list_1-0") |>
  html_nodes("a")
The lines above target the alphabetical recipe index block and extract all the anchor tags (<a>), which contain the category names and links. Then we build our recipe index from the links variable, giving us a tibble of category names and their URLs for use in the next phase.
recipe_index <- tibble::tibble(
  name = html_text(links, trim = TRUE),
  link = html_attr(links, "href")
)
category_urls <- recipe_index$link
Phase 2: Scrape Recipes from Each Category
Once we have the category URLs in hand, we will dive into each one and extract the actual recipe links. That’s where the scrape_category_recipes() function comes in. Its purpose is to navigate to a given category page and parse out two distinct sets of recipe links: those in a kind of “Featured” section and those in the main recipe list. Here’s the function’s logic:
scrape_category_recipes <- function(url) {
  remDr$navigate(url)
  Sys.sleep(1.5) # Allow page to fully render
  page <- read_html(remDr$getPageSource()[[1]])

  # "Featured" section
  featured_anchors <- page |>
    html_node("#mntl-three-post__inner_1-0") |>
    html_nodes("a")
  featured <- tibble::tibble(
    name = featured_anchors |> html_nodes("span.card__title-text") |> html_text(trim = TRUE),
    link = featured_anchors |> html_attr("href")
  ) |>
    filter(str_detect(link, "/recipe/|recipe-[0-9]+")) |>
    mutate(type = "Featured", source_url = url)

  # "Main" recipe list across multiple divs
  main_anchors <- page |>
    html_nodes("div.comp.tax-sc__recirc-list.card-list.mntl-universal-card-list.mntl-document-card-list.mntl-card-list.mntl-block") |>
    html_nodes("a")
  main <- tibble::tibble(
    name = main_anchors |> html_nodes("span.card__title-text") |> html_text(trim = TRUE),
    link = main_anchors |> html_attr("href")
  ) |>
    filter(str_detect(link, "/recipe/|recipe-[0-9]+")) |>
    mutate(type = "List", source_url = url)

  # Unified result
  bind_rows(featured, main)
}
Once both sets are scraped, the function binds them together into a single tibble. It gets called repeatedly for every category URL using purrr::map_dfr(), which stitches all the results into one big dataframe. But of course, scraping isn’t always clean. I quickly noticed duplicates: the same recipe showing up in multiple categories. So I counted how many times each link appeared, filtered for those with more than one occurrence, and then deduplicated the dataset using distinct().
# Loop through all category URLs and bind results
results <- purrr::map_dfr(category_urls, scrape_category_recipes)
# I spy duplicates
dupes <- results |> dplyr::count(link) |> dplyr::filter(n > 1)
# So I do the deduping
results <- results |> dplyr::distinct(link, .keep_all = TRUE)
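If you want a quick sanity check before moving on (entirely optional), something like this does the trick:
# How many unique recipe links survived, and how do they split by section?
nrow(results)
dplyr::count(results, type)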
At this point, we should have a clean, unified list of recipe URLs, each one pointing to a page rich with ingredients, nutrition facts, ratings, and more. Now we finna dig in.
Phase 3: Now, we feast.
Using the logic we developed earlier with the simple and easy stuffed peppers, we get this handy function.
scrape_recipe_details <- function(recipe_url) {
  remDr$navigate(recipe_url)
  Sys.sleep(0.5)
  page <- read_html(remDr$getPageSource()[[1]])

  # Author & date
  author <- page |>
    html_nodes(".mntl-attribution__item-name, span.comp.mntl-bylines__item.mntl-attribution__item.mntl-attribution__item-name") |>
    html_text(trim = TRUE)
  date <- page |> html_node("div.mntl-attribution__item-date") |> html_text(trim = TRUE)

  # Ratings
  average_rating <- page |> html_node("#mm-recipes-review-bar__rating_1-0") |> html_text(trim = TRUE)
  total_ratings <- page |> html_node("#mm-recipes-review-bar__rating-count_1-0") |> html_text(trim = TRUE)
  review_count <- page |> html_node("#mm-recipes-review-bar__comment-count_1-0") |> html_text(trim = TRUE)

  # Metadata
  labels <- page |> html_nodes("div.mm-recipes-details__label") |> html_text(trim = TRUE)
  values <- page |> html_nodes("div.mm-recipes-details__value") |> html_text(trim = TRUE)
  details <- tibble::tibble(label = labels, value = values) |>
    tidyr::pivot_wider(names_from = label, values_from = value)

  # Ingredients
  ingredients <- page |> html_nodes("ul.mm-recipes-structured-ingredients__list li") |> html_text(trim = TRUE)

  # Nutrition
  nutrition_names <- page |> html_nodes("td.mm-recipes-nutrition-facts-summary__table-cell.text-body-100") |> html_text(trim = TRUE)
  nutrition_values <- page |> html_nodes("td.mm-recipes-nutrition-facts-summary__table-cell.text-body-100-prominent") |> html_text(trim = TRUE)
  nutrition <- tibble::tibble(fact = nutrition_names, amount = nutrition_values)

  # Package as a single tibble row
  tibble::tibble(
    url = recipe_url,
    author = author,
    date_published = date,
    ratings = list(tibble::tibble(avg = average_rating, total = total_ratings, reviews = review_count)),
    details = list(details),
    ingredients = list(ingredients),
    nutrition = list(nutrition)
  )
}
This function will collect our data of interest from each of the 14.4K recipes in the results tibble. To store these recipes, I chose MongoDB. It’s flexible, schema-less, and perfect for storing nested data like ingredients, nutrition facts, and ratings. Using the mongolite package, I connected to my database with:
mongo_conn <- mongo(collection = "Recipes",
                    db = "Cook",
                    url = "my_mongo_conn_string")
This establishes a connection to a remote MongoDB cluster, targeting the "Recipes" collection inside the "Cook" database, which I had created beforehand. Every time a recipe is scraped, it is inserted directly into this collection as a document. MongoDB handles the nested structure beautifully: ingredients become arrays, nutrition facts become embedded objects, and metadata like prep time or servings slot in without needing a rigid schema.
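As a quick smoke test (optional, and assuming your connection string is valid), you can insert the single stuffed-peppers recipe we scraped earlier and confirm it landed:
# Insert the recipe_data list from earlier as one document...
mongo_conn$insert(recipe_data)
# ...and confirm the collection now contains something
mongo_conn$count()
# (mongo_conn$drop() clears the collection again if you'd rather start fresh)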
To wrap up the scraping and storage workflow, the final code block orchestrates the entire process of iterating through recipe URLs, scraping each one, and inserting the structured data into MongoDB while providing us with a lil progress bar that tells us how far we are with the process.
with_progress({
  p <- progressor(steps = length(results$link))
  purrr::walk(results$link, function(link) {
    try({
      recipe_df <- scrape_recipe_details(link)
      mongo_conn$insert(recipe_df)
      p()
    }, silent = TRUE)
  })
})
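Once the loop finishes (it will take a good while for 14.4K pages), it's worth shutting everything down cleanly. A minimal sketch:
# Close the browser session and stop the Selenium server
remDr$close()
rD$server$stop()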
And that’s about it! Next steps: export the data from the database, either through MongoDB Compass’ GUI or with mongolite, clean it, and develop whatever data product you like out of it. Until next time :))
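P.S. If you take the mongolite route for the export, a minimal sketch (reusing the same mongo_conn from above) might look like this:
# Pull every stored recipe back into R as a data frame
recipes <- mongo_conn$find('{}')
# ...or dump the raw documents to a newline-delimited JSON file
mongo_conn$export(file("recipes.json"))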
