The Ultimate Guide to Web Scraping Walmart

Raluca Penciuc on Feb 23 2023

Web scraping Walmart is a popular topic among data enthusiasts and businesses alike. Walmart is one of the largest retail companies in the world, with a vast amount of data available on its website. By scraping this data, you can gain valuable insights into consumer behavior, market trends, and much more.

In this article, we will explore the process of web scraping Walmart using TypeScript and Puppeteer. We will also go over the environment setup, the identification of the data, and extract it for use in your own projects. In the end, we will discuss how using a professional scraper may be a more effective and reliable solution.

By the end of this guide, you'll have a solid understanding of the process and be able to use it to improve your business or research. Whether you are a data scientist, a marketer, or a business owner, this guide will help you harness the power of Walmart's data to drive your success.

Pré-requisitos

Antes de começarmos, vamos certificar-nos de que dispomos das ferramentas necessárias.

Primeiro, baixe e instale o Node.js do site oficial, certificando-se de usar a versão Long-Term Support (LTS). Isso também instalará automaticamente o Node Package Manager (NPM), que usaremos para instalar outras dependências.

Para este tutorial, usaremos o Visual Studio Code como nosso ambiente de desenvolvimento integrado (IDE), mas você pode usar qualquer outro IDE de sua escolha. Crie uma nova pasta para seu projeto, abra o terminal e execute o seguinte comando para configurar um novo projeto Node.js:

npm init -y

Isto criará um ficheiro package.json no seu diretório de projeto, que armazenará informações sobre o seu projeto e as suas dependências.

Em seguida, precisamos instalar o TypeScript e as definições de tipo para o Node.js. O TypeScript oferece tipagem estática opcional que ajuda a evitar erros no código. Para fazer isso, execute no terminal:

npm install typescript @types/node --save-dev

Pode verificar a instalação executando:

npx tsc --versão

O TypeScript usa um arquivo de configuração chamado tsconfig.json para armazenar opções do compilador e outras configurações. Para criar esse arquivo em seu projeto, execute o seguinte comando:

npx tsc -init

Certifique-se de que o valor de "outDir" esteja definido como "dist". Desta forma, iremos separar os ficheiros TypeScript dos ficheiros compilados. Pode encontrar mais informações sobre este ficheiro e as suas propriedades na documentação oficial do TypeScript.

Agora, crie um diretório "src" no seu projeto e um novo ficheiro "index.ts". É aqui que vamos manter o código de raspagem. Para executar o código TypeScript, é necessário compilá-lo primeiro, portanto, para garantir que não nos esqueçamos dessa etapa extra, podemos usar um comando definido de forma personalizada.

Vá ao ficheiro "package. json" e edite a secção "scripts" desta forma:

"scripts": {

    "test": "npx tsc && node dist/index.js"

}

Desta forma, quando for executar o script, basta digitar "npm run test" no seu terminal.

Por fim, para extrair os dados do site, usaremos o Puppeteer, uma biblioteca de navegador sem cabeça para Node.js que permite controlar um navegador da Web e interagir com sites de forma programática. Para instalá-la, execute este comando no terminal:

npm install puppeteer

É altamente recomendado quando se pretende garantir a integridade dos dados, uma vez que muitos sítios Web actuais contêm conteúdo gerado dinamicamente. Se estiver curioso, pode consultar a documentação do Puppeteer antes de continuar para ver o que ele é capaz de fazer.

Localização dos dados

Now that you have your environment set up, we can start looking at extracting the data. For this article, I chose to scrape data from this product page: https://www.walmart.com/ip/Keter-Adirondack-Chair-Resin-Outdoor-Furniture-Teal/673656371.

Vamos extrair os seguintes dados:

the product name;
the product rating number;
the product reviews count;
the product price;
the product images;
the product details.

Pode ver todas estas informações destacadas na imagem de ecrã abaixo:

Ao abrir as Ferramentas do desenvolvedor em cada um desses elementos, você poderá observar os seletores CSS que usaremos para localizar os elementos HTML. Se não sabe muito bem como funcionam os selectores CSS, pode consultar este guia para principiantes.

Extração de dados

Antes de escrever o nosso script, vamos verificar se a instalação do Puppeteer correu bem:

import puppeteer from 'puppeteer';

async function scrapeWalmartData(walmart_url: string): Promise<void> {

    // Launch Puppeteer

    const browser = await puppeteer.launch({

        headless: false,

    	  args: ['--start-maximized'],

    	  defaultViewport: null

    })

    // Create a new page

    const page = await browser.newPage()

    // Navigate to the target URL

    await page.goto(walmart_url)

    // Close the browser

    await browser.close()

}

scrapeWalmartData("https://www.walmart.com/ip/Keter-Adirondack-Chair-Resin-Outdoor-Furniture-Teal/673656371")

Here we open a browser window, create a new page, navigate to our target URL, and then close the browser. For the sake of simplicity and visual debugging, I open the browser window maximized in non-headless mode.

Agora, vamos dar uma vista de olhos à estrutura do sítio Web:

To get the product name, we target the “itemprop” attribute of the “h1” element. The result we’re looking for is its text content.

// Extract product name

const product_name = await page.evaluate(() => {

    const name = document.querySelector('h1[itemprop="name"]')

    return name ? name.textContent : ''

})

console.log(product_name)

For the rating number, we identified as reliable the “span” elements whose class name ends with “rating-number”.

// Extract product rating number

const product_rating = await page.evaluate(() => {

    const rating = document.querySelector('span[class$="rating-number"]')

    return rating ? rating.textContent : ''

})

console.log(product_rating)

And finally (for the highlighted section), for the number of reviews and for the product price we rely on the “itemprop” attribute, just like above.

// Extract product reviews count

const product_reviews = await page.evaluate(() => {

    const reviews = document.querySelector('a[itemprop="ratingCount"]')

    return reviews ? reviews.textContent : ''

})

console.log(product_reviews)

// Extract product price

const product_price = await page.evaluate(() => {

    const price = document.querySelector('span[itemprop="price"]')

    return price ? price.textContent : ''

})

console.log(product_price)

Moving on to the product images, we navigate further in the HTML document:

Slightly more tricky, but not impossible. We cannot uniquely identify the images by themselves, so this time we will target their parent elements. Therefore, we extract the “div” elements that have the attribute “data-testid” set to “media-thumbnail”.

Then, we convert the result to a Javascript array, so we can map each element to its “src” attribute.

// Extract product images

const product_images = await page.evaluate(() => {

    const images = document.querySelectorAll('div[data-testid="media-thumbnail"] > img')

    const images_array = Array.from(images)

    return images ? images_array.map(a => a.getAttribute("src")) : []

})

console.log(product_images)

And last but not least, we scroll down the page to inspect the product details:

We apply the same logic from extracting the images, and this time we simply make use of the class name “dangerous-html”.

// Extract product details

const product_details = await page.evaluate(() => {

    const details = document.querySelectorAll('div.dangerous-html')

    const details_array = Array.from(details)

    return details ? details_array.map(d => d.textContent) : []

})

console.log(product_details)

O resultado final deve ser o seguinte:

Keter Adirondack Chair, Resin Outdoor Furniture, Teal

(4.1)

269 reviews

Now $59.99

[

'https://i5.walmartimages.com/asr/51fc64d9-6f1f-46b7-9b41-8880763f6845.483f270a12a6f1cbc9db5a37ae7c86f0.jpeg?odnHeight=80&odnWidth=80&odnBg=FFFFFF',  'https://i5.walmartimages.com/asr/80977b5b-15c5-435e-a7d6-65f14b2ee9c9.d1deed7ca4216d8251b55aa45eb47a8f.jpeg?odnHeight=80&odnWidth=80&odnBg=FFFFFF',

'https://i5.walmartimages.com/asr/80c1f563-91a9-4bff-bda5-387de56bd8f5.5844e885d77ece99713d9b72b0f0d539.jpeg?odnHeight=80&odnWidth=80&odnBg=FFFFFF',  'https://i5.walmartimages.com/asr/fd73d8f2-7073-4650-86a3-4e809d09286e.b9b1277761dec07caf0e7354abb301fc.jpeg?odnHeight=80&odnWidth=80&odnBg=FFFFFF',

'https://i5.walmartimages.com/asr/103f1a31-fbc5-4ad6-9b9a-a298ff67f90f.dd3d0b75b3c42edc01d44bc9910d22d5.jpeg?odnHeight=80&odnWidth=80&odnBg=FFFFFF',  'https://i5.walmartimages.com/asr/120121cd-a80a-4586-9ffb-dfe386545332.a90f37e11f600f88128938be3c68dca5.jpeg?odnHeight=80&odnWidth=80&odnBg=FFFFFF',

'https://i5.walmartimages.com/asr/47b8397f-f011-4782-bbb7-44bfac6f3fcf.bb12c15a0146107aa2dcd4cefba48c38.jpeg?odnHeight=80&odnWidth=80&odnBg=FFFFFF'

]

[

  'The Keter Adirondack chair lets you experience the easy-living comfort of the popular chair but with none of the worries of wood. Combining traditional styling and the look and feel of wood with durable and maintenance-free materials, this chair will find',

  'Keter Adirondack Chair, Resin Outdoor Furniture, Gray:   Made from an all-weather resistant resin for ultimate durability  Weather-resistant polypropylene construction prevents fading, rusting, peeling, and denting - unlike real wood  Quick and easy assembly  Rotating cup holder  Classic comfort redefined  Ergonomic design  Durable and weather-resistant  Worry-free relaxation  Dimensions: 31.9" L x 31.5" W x 38" H  Seat height is 15.4 in. for a deep bucket seat and tall backrest  Chair Weighs 22 lbs. - heavy enough to not blow over in the wind, yet light enough to easily rearrange your patio space  350 lbs. capacity '

]

Contornar a deteção de bots

While scraping Walmart may seem easy at first, the process can become more complex and challenging as you scale up your project. The retail website implements various techniques to detect and prevent automated traffic, so your scaled-up scraper starts getting blocked.

Walmart uses the “Press & Hold” model of CAPTCHA, offered by PerimeterX, which is known to be almost impossible to solve from your code. Besides this, the website also uses protections offered by Akamai and ThreatMetrix and collects multiple browser data to generate and associate you with a unique fingerprint.

Entre os dados do navegador recolhidos, encontramos:

propriedades do objeto Navigator (deviceMemory, hardwareConcurrency, idiomas, plataforma, userAgent, webdriver, etc.)
canvas fingerprinting
controlos de prazos e de desempenho
plugin and voice enumeration
web workers
controlo das dimensões do ecrã
e muitos mais

One way to overcome these challenges and continue scraping at a large scale is to use a scraping API. These kinds of services provide a simple and reliable way to access data from websites like walmart.com, without the need to build and maintain your own scraper.

O WebScrapingAPI é um exemplo de um produto deste género. O seu mecanismo de rotação de proxy evita completamente os CAPTCHAs, e a sua base de conhecimentos alargada permite aleatorizar os dados do browser para que se pareça com um utilizador real.

A configuração é rápida e fácil. Basta registar uma conta para receber a sua chave de API. Esta pode ser acedida a partir do seu painel de controlo e é utilizada para autenticar os pedidos que envia.

Como já configurou o seu ambiente Node.js, podemos utilizar o SDK correspondente. Execute o seguinte comando para o adicionar às dependências do seu projeto:

npm install webscrapingapi

Agora só falta ajustar os selectores CSS anteriores à API. A poderosa caraterística das regras de extração torna possível analisar dados sem modificações significativas.

import webScrapingApiClient from 'webscrapingapi';

const client = new webScrapingApiClient("YOUR_API_KEY");

async function exampleUsage() {

    const api_params = {

        'render_js': 1,

    	  'proxy_type': 'residential',

    	  'timeout': 60000,

    	  'extract_rules': JSON.stringify({

            name: {

                selector: 'h1[itemprop="name"]',

                output: 'text',

        	},

        	rating: {

                selector: 'span[class$="rating-number"]',

                output: 'text',

        	},

        	reviews: {

                selector: 'a[itemprop="ratingCount"]',

                output: 'text',

        	},

        	price: {

                selector: 'span[itemprop="price"]',

                output: 'text',

        	},

        	images: {

                selector: 'div[data-testid="media-thumbnail"] > img',

                output: '@src',

                all: '1'

        	},

        	details: {

                selector: 'div.dangerous-html',

                output: 'text',

                all: '1'

        	}

        })

    }

    const URL = "https://www.walmart.com/ip/Keter-Adirondack-Chair-Resin-Outdoor-Furniture-Teal/673656371"

    const response = await client.get(URL, api_params)

    if (response.success) {

        console.log(response.response.data)

    } else {

        console.log(response.error.response.data)

    }

}

exampleUsage();

Conclusão

This article has presented you with an overview of web scraping Walmart using TypeScript along with Puppeteer. We have discussed the process of setting up the necessary environment, identifying and extracting the data, and provided code snippets and examples to help guide you through the process.

The advantages of scraping Walmart's data include gaining valuable insights into consumer behavior, market trends, price monitoring, and much more.

Furthermore, opting for a professional scraping service can be a more efficient solution, as it ensures that the process is completely automated and handles the possible bot detection techniques encountered.

By harnessing the power of Walmart's data, you can drive your business to success and stay ahead of the competition. Remember to always respect the website's terms of service and not to scrape too aggressively in order to avoid being blocked.

Notícias e actualizações

Mantenha-se atualizado com os mais recentes guias e notícias sobre raspagem da Web, subscrevendo a nossa newsletter.

Preocupamo-nos com a proteção dos seus dados. Leia a nossa Política de Privacidade.