Web Shop Data Scraping with Cypress

Experimenting with web scraping is fun. Cypress is built for end-to-end testing, but its features handle this kind of task well. I decided to try it on the website of Gomex, a retail supermarket I like to visit from time to time.

How do you choose a shop website to scrape?

We must be aware of some things in order to scrape product data from a retail shop website. Not every supermarket has a website or an online shop where the products are listed in HTML. Many of them upload PDF versions of the paper catalogs. Scraping data from those sources is a totally different story.

Our example is a supermarket website that lists the products sold in its retail shops along with their current prices. I decided to stick to the website's basic features and pull some data from it. I also noticed that Gomex regularly updates the component I'm interacting with here, but only the product data (name, measure and price) changes, while the rest of the component stays the same. That matters because I don't want to update my script with new class names every time they publish an update. Websites that do change their markup often make the process time-consuming if we want to scrape them again and again over a longer period.

You may find the scraping script I’ve come up with here: Web-Shop-Scraping-with-Cypress/gomex_scraper.spec.js at main · NoToolsNoCraft/Web-Shop-Scraping-with-Cypress (github.com)

Step by step: what the script does

First, Cypress opens the Gomex website.
On the same page, it navigates to the “Nedeljna ponuda” tab inside the carousel component.
The tab mentioned above contains multiple sub-tabs.
Each sub-tab contains multiple products.
Cypress opens each sub-tab and scrapes data (title and price) for each product listed.
This repeats until it interacts with all sub-tabs and products.
Once it's done, Cypress writes the scraped data to a JSON file.
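
The steps above can be sketched roughly as follows. This is a minimal sketch, not the actual gomex_scraper.spec.js: the URL and every CSS selector are hypothetical placeholders, and the helper names (clean, toRecord) are made up for illustration. Only the tab label "Nedeljna ponuda" and the output file name gomex_products.json come from the text.

```javascript
// Hypothetical placeholders: swap in the real address and class names.
const SHOP_URL = 'https://shop.example/';
const TAB_SELECTOR = '.carousel-tab';
const SUB_TAB_SELECTOR = '.carousel-sub-tab';
const PRODUCT_SELECTOR = '.product-card';
const TITLE_SELECTOR = '.product-title';
const PRICE_SELECTOR = '.product-price';

// Collapse runs of whitespace so scraped text is stored cleanly.
function clean(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Build one record in the same shape as the JSON output.
function toRecord(title, price) {
  return { title: clean(title), price: clean(price) };
}

// Register the test only inside the Cypress runner, so the helpers
// above can also be loaded and checked with plain Node.
if (typeof Cypress !== 'undefined') {
  describe('Gomex weekly offer scraper (sketch)', () => {
    it('collects title and price for every product', () => {
      const products = [];

      cy.visit(SHOP_URL);
      cy.contains(TAB_SELECTOR, 'Nedeljna ponuda').click();

      cy.get(SUB_TAB_SELECTOR)
        .each(($subTab) => {
          // Open the sub-tab, then read every product it currently shows.
          cy.wrap($subTab).click();
          cy.get(PRODUCT_SELECTOR)
            .filter(':visible')
            .each(($product) => {
              products.push(
                toRecord(
                  $product.find(TITLE_SELECTOR).text(),
                  $product.find(PRICE_SELECTOR).text()
                )
              );
            });
        })
        .then(() => {
          // Runs after all sub-tabs were visited; arrays are saved as JSON.
          cy.writeFile('gomex_products.json', products);
        });
    });
  });
}
```

Collecting into a plain array and writing once at the end keeps the file write out of the loop; the trade-off is that nothing is saved if the run fails halfway.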

Scraping scripts are best run headless, meaning the browser runs without a visible window (unlike the run in the video above) and everything is driven from the terminal. The cypress run command works this way by default. To run the test headless, I used this terminal command:

npx cypress run --spec "cypress/integration/gomex_scraper.spec.js"

What does the scraped data look like?

Here is an excerpt of the scraped data in JSON format:

  {
    "title": "ČIPS CHIPS WAY ČAČANSKI REBRASTI 150G",
    "price": "184.99 din"
  },
  {
    "title": "ČIPS CHIPS WAY ČAČANSKI SLANI 150G",
    "price": "184.99 din"
  },
  {
    "title": "ČOKOLADA SCHOGETTEN LEŠNIK 100G",
    "price": "199.99 din"
  },
  {
    "title": "ČOKOLADA SCHOGETTEN MLEČNA 100G",
    "price": "179.99 din"
  },
  {
    "title": "GEL ZA TUŠIRANJE MALIZIA SORT 1L",
    "price": "299.99 din"
  },
  {
    "title": "HLEB PEKARA AS TOST 500GR",
    "price": "179.99 din"
  },

The full results, including the additional data taken from the source, are here: Web-Shop-Scraping-with-Cypress/gomex_products.json at main · NoToolsNoCraft/Web-Shop-Scraping-with-Cypress (github.com)
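
Since prices are stored as strings such as "184.99 din", they need to be turned into numbers before any arithmetic. A minimal sketch of that step, assuming every price follows the "amount currency" pattern seen in the excerpt; parsePrice is a hypothetical helper, not part of the original script:

```javascript
// Split a scraped price string into a numeric amount and a currency label.
// Returns null when the text does not look like "<number> <currency>".
function parsePrice(text) {
  const match = /^\s*(\d+(?:[.,]\d+)?)\s*(\S+)?\s*$/.exec(text);
  if (!match) return null;
  return {
    amount: Number(match[1].replace(',', '.')), // accept a decimal comma too
    currency: match[2] || null,
  };
}

// Two records copied from the excerpt above.
const sample = [
  { title: 'ČOKOLADA SCHOGETTEN LEŠNIK 100G', price: '199.99 din' },
  { title: 'ČOKOLADA SCHOGETTEN MLEČNA 100G', price: '179.99 din' },
];

// Keep the title and replace the price string with its parsed parts.
const parsed = sample.map((p) => ({ title: p.title, ...parsePrice(p.price) }));
console.log(parsed);
```

Returning null for unexpected text, rather than NaN, makes it easier to spot records where the shop changed its price format.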

What do I plan to do with this data?

I've had an idea to create an app that compares product prices across multiple supermarkets. The first and essential step in that adventure is to get data from relevant sources. Once I have data from several sources, I can build a way to compare them. My current idea is to do that by keyword. For example, the app could show the user chocolate products from multiple shops so the user can compare the prices instantly.
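
The keyword idea could look something like the sketch below. It is only an illustration under stated assumptions: the Gomex records are taken from the excerpt above, while "Shop B (hypothetical)" and its product are invented purely to have a second source, and fold and compareByKeyword are made-up names.

```javascript
// Lower-case the text and strip combining diacritics (Č -> c, Š -> s),
// so that "cokolada" also matches "ČOKOLADA".
function fold(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
}

// Return every product whose title contains the keyword, cheapest first.
function compareByKeyword(shops, keyword) {
  const needle = fold(keyword);
  const hits = [];
  for (const [shop, products] of Object.entries(shops)) {
    for (const { title, price } of products) {
      if (fold(title).includes(needle)) {
        // parseFloat reads the leading number of "179.99 din".
        hits.push({ shop, title, price: parseFloat(price) });
      }
    }
  }
  return hits.sort((a, b) => a.price - b.price);
}

const shops = {
  Gomex: [
    { title: 'ČOKOLADA SCHOGETTEN LEŠNIK 100G', price: '199.99 din' },
    { title: 'ČOKOLADA SCHOGETTEN MLEČNA 100G', price: '179.99 din' },
    { title: 'HLEB PEKARA AS TOST 500GR', price: '179.99 din' },
  ],
  // Invented entry, for illustration only.
  'Shop B (hypothetical)': [
    { title: 'COKOLADA PRIMER 100G', price: '189.99 din' },
  ],
};

const cheapestFirst = compareByKeyword(shops, 'cokolada');
console.log(cheapestFirst);
```

Plain substring matching is the simplest option; it will miss products that a shop names differently, so a real app would likely need a richer matching step.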
