The Ultimate XPath Cheat Sheet. How to Easily Write Powerful Selectors.

Mihai Maxim on December 16, 2022

An XPath cheat sheet?

Have you ever needed to write a CSS selector that doesn't depend on class names? If the answer is no, well, consider yourself lucky. If the answer is yes, then our XPath cheat sheet is what you need. The web is full of data. Entire businesses depend on gathering some of it to offer new services to the world. APIs are of great use, but not all websites have open APIs. Sometimes you will have to get what you need the old-fashioned way: you will have to build a scraper for the website. Modern websites counter scraping by renaming their CSS classes. As a result, it is better to write selectors that rely on something more stable. In this article, you will learn how to write selectors based on the layout of the page's DOM nodes.

What is XPath and how can I try it out?

XPath stands for XML Path Language. It uses a path notation (as in URLs) to provide a flexible way of pointing to any part of an XML document.

"XPath is mainly used in XSLT, but can also be used as a much more powerful way of navigating through the DOM of any XML-like language document using XPathExpression, such as HTML and SVG, instead of relying on the Document.getElementById() or Document.querySelectorAll() methods, the Node.childNodes properties, and other DOM Core features." (Source: XPath | MDN, mozilla.org)

A path notation?

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>

There are two types of paths: relative and absolute.

The unique path (or absolute path) to My third paragraph. is /html/body/div/div/p

A relative path to My third paragraph. is //body/div/div/p
For My Second Heading => //body/div/h2
For My first paragraph. => //body/p

Notice that I'm using //body. Relative paths use // to jump straight to the desired element. (Strictly speaking, the XPath specification treats any path that starts with / as absolute, including //; this article uses "relative" for paths that start with //.)

The usage of //<path> also implies that it should look for all occurrences of <path> in the document, regardless of what came before <path>.

For example, //div/p returns both My second paragraph. and My third paragraph.
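
The same lookups can be reproduced outside the browser. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports only a limited subset of XPath and evaluates paths relative to an element. For that reason ./body/... stands in for /html/body/... and .// stands in for // here. The markup is the sample document above without its DOCTYPE line, and the texts() helper is a hypothetical name used only for this illustration.

```python
import xml.etree.ElementTree as ET

# The sample page from above (DOCTYPE omitted; the markup is well-formed XML).
HTML = """
<html lang="en">
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>
"""

root = ET.fromstring(HTML.strip())  # root is the <html> element


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [element.text for element in root.findall(path)]


# Stand-in for /html/body/div/div/p (paths are relative to <html>).
third_paragraph = texts("./body/div/div/p")

# Stand-in for //body/div/h2.
second_heading = texts(".//body/div/h2")

# Stand-in for //div/p: every <p> that is a direct child of any <div>.
div_paragraphs = texts(".//div/p")

print(third_paragraph)
print(second_heading)
print(div_paragraphs)
```

Full XPath engines accept the article's expressions verbatim; the rewritten paths are only needed because of ElementTree's restrictions.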

You can test this example in your browser to get a better overview!

Paste the code into an .html file and open it with your browser. Open the developer tools and, in the Elements tab, press Ctrl + F. Paste the XPath locator into the small input bar and press Enter.

You can also get the XPath of any tag by right-clicking it in the Elements tab and selecting "Copy XPath".

Notice how I'm cycling between "My second paragraph." and "My third paragraph.".

Also, another important thing to know is that it is not necessary for a path to contain // in order to return multiple elements. Let's see what happens when I add another <p> in the last <div>.

/html/body/div/div/p no longer points to a single element: it now matches every <p> in the last <div>.

If you've followed me this far, congratulations, you're on the right track to mastering XPath. You are now ready to dive into the fun stuff.

Square brackets

You can use square brackets to select specific elements.

 In this case, //body/div/div[2]/p[3] only selects the last <p> tag.
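
As a rough illustration of positional brackets, here is a minimal sketch with Python's standard-library xml.etree.ElementTree (a limited XPath subset, with paths evaluated relative to the root element). The document below is hypothetical, since the page used in the article's screenshots is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this illustration.
DOC = """
<html>
<body>
    <div>
        <div><p>skipped</p></div>
        <div>
            <p>I am the first child</p>
            <p>I am the second child</p>
            <p>I am the third child</p>
        </div>
    </div>
</body>
</html>
"""

root = ET.fromstring(DOC.strip())  # root is the <html> element

# No // anywhere, yet the path matches several elements.
all_paragraphs = [p.text for p in root.findall("./body/div/div/p")]

# Stand-in for //body/div/div[2]/p[3]: second inner <div>, third <p>.
third_only = [p.text for p in root.findall("./body/div/div[2]/p[3]")]

print(all_paragraphs)
print(third_only)
```

Indexes inside the brackets start at 1, not 0.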

Attributes

You can also use attributes to select your elements.

//body//p[@class="not-important"] => select all the <p> tags that are inside a <body> tag and have the "not-important" class.

//div[@id] => select all the <div> tags that have an id attribute.

//div[@class="p-children"][@id="important"]/p[3] => select the third <p> that is within a <div> tag that has both class="p-children" and id="important"

//div[@class="p-children" and @id="important"]/p[3] => same as above

//div[@class="p-children" or @id="important"]/p[3] => select the third <p> that is within a <div> that has class="p-children" or id="important"

Notice that @ marks the start of an attribute.

Functions

XPath provides a set of useful functions that you can use inside the square brackets.
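
The attribute selectors above can be tried with the following minimal sketch, again using Python's standard-library xml.etree.ElementTree. The document is hypothetical (it only borrows the class and id names mentioned in this article), and ElementTree's XPath subset does not support "and" / "or" inside a predicate, so only the chained-bracket form is shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical document reusing the class and id names from the article.
DOC = """
<html>
<body>
    <div class="no-content"></div>
    <div class="content">
        <p class="not-important">Nothing to see here</p>
        <div class="p-children" id="important">
            <p>I am the first child</p>
            <p>I am the second child</p>
            <p>I am the third child</p>
        </div>
    </div>
</body>
</html>
"""

root = ET.fromstring(DOC.strip())  # root is the <html> element

# Stand-in for //body//p[@class="not-important"].
not_important = [p.text for p in root.findall(".//body//p[@class='not-important']")]

# Stand-in for //div[@id]: every <div> that has an id attribute.
divs_with_id = [d.get("id") for d in root.findall(".//div[@id]")]

# Stand-in for //div[@class="p-children"][@id="important"]/p[3].
third_child = [
    p.text
    for p in root.findall(".//div[@class='p-children'][@id='important']/p[3]")
]

print(not_important, divs_with_id, third_child)
```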

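position() => returns the position (1-based index) of the current node within its context
Ex: //body/div[position()=1] selects the first <div> in the <body>

last() => returns the position of the last node in the current context
Ex: //div/p[last()] selects the last <p> child of each <div> tag

count(node-set) => returns the number of nodes in the node-set
Ex: count(//body/div) returns the number of child <div> tags inside the <body>. The form //body/count(div) is only valid in XPath 2.0 and later; browsers implement XPath 1.0.

node() => matches any node, including text nodes and comments; * => matches any element
Ex: //div/* selects all the element children of all the <div> tags, while //div/node() also returns their text and comment children

text() => selects the text nodes of the element
Ex: //p/text() returns the text of all the <p> elements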

concat(string1, string2) => joins string1 and string2

contains(@attribute, "value") => returns true if @attribute contains "value"
Ex:
//p[contains(text(),"I am the third child")] selects all the <p> tags whose text contains "I am the third child".

starts-with(@attribute, "value") => returns true if @attribute starts with "value"
ends-with(@attribute, "value") => returns true if @attribute ends with "value" (XPath 2.0 and later; not available in XPath 1.0, which is what browsers implement)

substring(@attribute, start_index, length) => returns the substring of the attribute value that starts at start_index (counting from 1) and is length characters long
Ex:
//p[substring(text(),3,12)="am the third"] => returns true if text() = "I am the third child"
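
Because substring() counts characters from 1 and takes a length as its third argument, it is easy to get off-by-one results. The sketch below mimics that behaviour in plain Python for integer arguments only; xpath_substring is a hypothetical helper written for this illustration, not part of any XPath library.

```python
def xpath_substring(text: str, start: int, length: int) -> str:
    """Mimic XPath substring(text, start, length) for integer arguments.

    XPath counts characters from 1, and the third argument is a length,
    not an end index.
    """
    first = max(start, 1)   # positions below 1 do not exist
    end = start + length    # first position that is NOT included
    return text[first - 1:max(end, first) - 1]


sample = "I am the third child"

middle = xpath_substring(sample, 3, 12)  # characters 3 through 14
tail = xpath_substring(sample, 16, 20)   # from position 16 to the end

print(middle)
print(tail)
```

The second call asks for 20 characters but only 5 remain after position 15, so it simply returns what is left.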

normalize-space() => returns the text with leading and trailing whitespace removed and inner runs of whitespace collapsed into a single space
Ex: normalize-space(" example ") = "example"

string-length() => returns the length of the text
Ex: //p[string-length()=20] returns all the <p> tags that have the text length of 20

Functions can be a bit tricky to remember. Luckily, The Ultimate XPath Cheat Sheet provides helpful examples:

//p[text()=concat(substring(//p[@class="not-important"]/text(),1,15), substring(text(),16,20))]

//p[text()=<expression_return_value>] will select all the <p> elements that have the text value equal to the return value of the condition.

//p[@class="not-important"]/text() returns the text values of all the <p> tags that have class="not-important".

If there is only one <p> tag that satisfies this condition, then we can pass the return_value to the substring function.

substring(return_value,1,15) will return the first 15 characters of the return_value string.

substring(text(),16,20) will return up to 20 characters starting at position 16 of the same text() value that we used in //p[text()=<expression_return_value>]. For a 20-character text, those are its last 5 characters.

Finally, concat() will merge the two substrings and create the return value of <expression_return_value>.
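
To make the walkthrough concrete, here is a small Python sketch that performs the same comparison by hand using string slicing. The text values are hypothetical (the article's sample page is not reproduced here), and expected_text is a name invented for this illustration.

```python
# Hypothetical text values, made up for this illustration.
not_important_text = "I am the third sibling"  # text of <p class="not-important">
candidates = [
    "I am the first child",
    "I am the third child",
    not_important_text,
]


def expected_text(own_text: str) -> str:
    """Build concat(substring(<not-important text>,1,15), substring(own_text,16,20))."""
    head = not_important_text[:15]  # substring(..., 1, 15): characters 1 to 15
    tail = own_text[15:35]          # substring(text(), 16, 20): up to 20 characters from position 16
    return head + tail


# A <p> is selected when its own text equals the rebuilt string.
selected = [text for text in candidates if text == expected_text(text)]

print(selected)
```

In other words, the expression keeps every <p> whose first 15 characters match the first 15 characters of the "not-important" paragraph.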

Path nesting

XPath supports path nesting. That's cool, but what exactly do I mean by path nesting?

Let's try something new: /html/body/div[./div[./p]]

You can read it as "Select all the <div> children of the <body> that have a <div> child. Those children must also be parents of a <p> element."

If you don't care about the parent of the <p> element, you can write: /html/body/div[.//p]

This now translates to "Select all the div children of the body that have a <p> descendant"

In this particular example, /html/body/div[./div[./p]] and /html/body/div[.//p] produce the same result.

By now, I'm sure you're wondering what's up with those dots in ./ and .//

The dot represents the self element. When used inside a pair of brackets, it refers to the specific tag that opened them. Let's dig a little deeper.

In our example, /html/body/div returns two divs:
<div class="no-content"> and <div class="content">

/html/body/div[.//p] translates to:

   /html/body/div[1][/html/body/div[1]//p]
and /html/body/div[2][/html/body/div[2]//p]

/html/body/div[2][/html/body/div[2]//p] is true, so it returns /html/body/div[2]

In our case, the dot ensures that /html/body/div and /html/body/div//p refer to the same <div>

Now let's see what would have happened without the dot.

/html/body/div[/html/body/div//p] would return both 
<div class="no-content">  and <div class="content">

Why? Because /html/body/div//p is evaluated against the whole document, so it is true for both /html/body/div[1] and /html/body/div[2].

/html/body/div[/html/body/div//p] actually translates to "Select all the div children of the <body> if /html/body/div//p is true."

/html/body/div//p is true if the body has a <div> child, and that child has a <p> descendant. In our case, this statement is always true.
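
The difference the dot makes can be imitated in Python. The sketch below uses the standard-library xml.etree.ElementTree, whose XPath subset does not support nested paths inside brackets, so the bracket test is written as an ordinary Python condition instead. The document is hypothetical and only borrows the class names used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with one empty <div> and one <div> that contains <p> tags.
DOC = """
<html>
<body>
    <div class="no-content"></div>
    <div class="content">
        <p>Some text</p>
        <div><p>Nested text</p></div>
    </div>
</body>
</html>
"""

root = ET.fromstring(DOC.strip())  # root is the <html> element
body_divs = root.findall("./body/div")

# With the dot: the inner test runs relative to each candidate <div>.
with_dot = [d.get("class") for d in body_divs if d.find(".//p") is not None]

# Without the dot: the inner test ignores the candidate and asks the same
# document-wide question every time.
any_div_has_p = root.find("./body/div//p") is not None
without_dot = [d.get("class") for d in body_divs if any_div_has_p]

print(with_dot)
print(without_dot)
```

The first list contains only the <div> that actually has a <p> descendant, while the second contains both, because the document-wide test is true no matter which <div> is being considered.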

It's a shame that other XPath cheat sheets don't mention anything about nesting. I think it's fantastic. It lets you scan the document for different patterns and then go back to return something else. The only downside is that queries written this way can become hard to follow. The good news is that there are other ways to do it.

Axes

You can use axes to locate nodes relative to other context nodes.

Let's explore some of them.

The four main axes

//p/ancestor::div => selects all the divs that are ancestors of <p>

How I read it: Get all the <p> tags, for each <p> look through its ancestors. If you find <div> tags, select them.

//p/parent::div => selects all the <div> tags that are parents of <p>

How I read it: Get all the <p> tags and of all their parents, if the parent is a <div>, select it.

//div/child::p=> selects all the <p> tags that are children of <div> tags.

How I read it: Get all the <div> tags and their children, if the child is a <p>, select it.

//div/descendant::p => selects all the <p> tags that are descendants of <div> tags.

How I read it: Get all the <div> tags and their descendants, if the descendant is a <p>, select it.

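Now it's time to rewrite the earlier expression:

/html/body/div[./div[./p]] is equivalent to /html/body/div/div/p/parent::div/parent::div

But /html/body/div[.//p] is NOT equivalent to /html/body/div//p/ancestor::div, because ancestor::div also returns any nested <div> that sits between the <body> child and the <p>.

The good news is that we can tweak it a little.

/html/body/div//p/ancestor::div[last()] is equivalent to /html/body/div[.//p]. The ancestor axis counts from the nearest ancestor outwards, so [last()] keeps only the outermost <div>.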
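
The following sketch walks the ancestor chain by hand to show why [last()] fixes the expression. It uses Python's standard-library xml.etree.ElementTree, which has no ancestor axis, so the walk is done with a child-to-parent map. The document and the div_ancestors helper are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document reusing the class names from the article.
DOC = """
<html>
<body>
    <div class="no-content"></div>
    <div class="content">
        <p class="not-important">Nothing to see here</p>
        <div class="p-children">
            <p>I am the first child</p>
            <p>I am the second child</p>
        </div>
    </div>
</body>
</html>
"""

root = ET.fromstring(DOC.strip())  # root is the <html> element

# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def div_ancestors(element):
    """Return the <div> ancestors of element, nearest first (reverse-axis order)."""
    found = []
    node = parent_of.get(element)
    while node is not None:
        if node.tag == "div":
            found.append(node)
        node = parent_of.get(node)
    return found


paragraphs = root.findall("./body/div//p")  # stand-in for /html/body/div//p

# ancestor::div: every <div> above any of the paragraphs.
all_ancestors = {d.get("class") for p in paragraphs for d in div_ancestors(p)}

# ancestor::div[last()]: only the outermost <div> above each paragraph.
outermost = {div_ancestors(p)[-1].get("class") for p in paragraphs}

# /html/body/div[.//p], written as a Python filter.
direct = {d.get("class") for d in root.findall("./body/div") if d.find(".//p") is not None}

print(all_ancestors, outermost, direct)
```

Only the outermost set matches the result of the nested expression; the plain ancestor walk also picks up the inner <div>.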

Other important axes

//p/following-sibling::span => for each <p> tag, select its following <span> siblings.

//p/preceding-sibling::span => for each <p> tag, select its preceding <span> siblings.

//title/following::span => selects all the <span> tags that appear in the DOM after the <title>.

In our example, //title/following::span selects all the <span> tags in the document.

//p/preceding::div => selects all the <div> tags that appear in the DOM before any <p> tag. But it ignores ancestors, attribute nodes and namespace nodes.

In our case, //p/preceding::div only selects <div class="p-children"> and <div class="no-content">.

Most of the <p> tags are in <div class="content">, but this <div> is not selected because it is a common ancestor for them. As I mentioned, the preceding axis ignores ancestors.

<div class="p-children"> is selected because it is not an ancestor for the <p> tags inside <div class="p-children" id="important">

Summary

Congratulations, you made it. You've added a new tool to your selector toolbox! If you're building a web scraper or automating web tests, this XPath cheat sheet will come in handy! If you're looking for a smoother way to traverse the DOM, you're in the right place. Either way, XPath is worth a try. Who knows, maybe you'll discover even more use cases for it.
Does the concept of web scraping sound interesting to you? You can contact us here: WebScrapingAPI - Contact. If you want to scrape the web, we will be happy to help. In the meantime, consider trying WebScrapingAPI - Product for free.