Appendix S: CSS & XPath Selectors

When extracting information from a website, you have the option to use CSS or XPath selectors.

This page is meant only as an overview of the selector syntax itself. See 7 Web Scraping for more details on how and where you’d apply these.

CSS

What is CSS?

If you’ve written HTML you are probably somewhat familiar with CSS. CSS, or Cascading Style Sheets, is a language for describing the presentation of HTML.

A very simple style sheet might look something like:

body {
  background-color: white;
  color: black;
}
a.link {
  color: blue;
}

This would give the <body> element a white background with black text, and make every <a> element with the class “link” blue.

Each statement above has two parts, the selector and the properties. The selector is the part that determines which elements the style will be applied to (e.g. the <body> element), and then the part within the curly braces determines the properties that will be applied to those elements (e.g. background-color: white;).

CSS provides a powerful syntax for selecting elements to be styled. Websites can select elements that match very specific criteria so that (for instance) every third paragraph in an article is styled differently, or all links in a particular section have a distinct style.

When we use CSS Selectors for scraping, we are leveraging this powerful syntax to pick one or more elements out of our parsed HTML tree.

Here is an example HTML document:

<html>
  <body>
    <div id="first" class="block">
        <ul>
            <li>One</li>
            <li>Two</li>
            <li>Three</li>
        </ul>
        <p class="inner">After the list</p>
    </div>
    <div id="second" class="block">
        <div id="inner">
            <p>Some text</p>
        </div>
    </div>
  </body>
</html>

Basic Selectors

By Tag

Let’s say someone wanted to get a list of all of the <li> elements in the document.

The CSS selector for this would be li. A bare tag name will match all elements of that type.

By Class

If we wanted to get a list of all of the elements with the class block, we could use the selector .block. The leading `.` indicates that we are selecting by class. (I remember this by reminding myself that in most programming languages we access class attributes with a `.` as well.)

If the class is used on multiple elements, as block is in the example above, then the selector will match all of those elements.

You can combine selection by tag and by class with a selector like p.inner. This will match all <p> elements with the class inner, but not elements of any other tag that carry the same class. (Note that the <div> in the example has the id inner, not the class inner, so neither .inner nor p.inner matches it.)

By ID

To select by the id attribute, you use a `#` instead of a `.`. For example, to get the div with the id first, you would use the selector #first.

IDs are meant to be unique within HTML documents, so you typically do not need to combine this with a tag, but it is possible to do so if you need to (e.g. div#first).

By Other Attribute

While id and class are the most common attributes and are treated specially by CSS, you can select by any attribute using special attribute selectors.

Attribute Selector	Description
`[attr]`	Selects all elements that have the attribute `attr`, whatever its value.
`[attr=val]`	Selects all elements where `attr` is exactly `val`.
`[attr~=val]`	Selects all elements where `attr` is a space-separated list of values and one of them is `val`.
`[attr^=val]`	Selects all elements where `attr` starts with `val`.
`[attr$=val]`	Selects all elements where `attr` ends with `val`.
`[attr*=val]`	Selects all elements where `attr` contains `val`.

Combinators

You can combine selectors with combinators to select elements based on how they relate to other elements in the tree.

Combinator	Description
`A B`	Selects all B elements that are descendants of A.
`A > B`	Selects all B elements that are direct children of A.
`A + B`	Selects all B elements that immediately follow an A sibling (same parent, no elements in between).
`A ~ B`	Selects all B elements that come anywhere after an A sibling (same parent).

A and B can be any other CSS selector, for example:

.block #inner will select all elements with the id inner that are descendants of elements with the class block.
ul > li will select all <li> elements that are children of a <ul>.

Pseudo-Classes

Pseudo-classes are special selectors that select elements based on their state or their position in the document. The ones most useful for scraping are positional:

Pseudo-Class	Description
`:first-child`	Selects all elements that are the first child of their parent.
`:last-child`	Selects all elements that are the last child of their parent.
`:nth-child(n)`	Selects all elements that are the nth child of their parent.
`:only-child`	Selects all elements that are the only child of their parent.

Others may be available as well, depending on the CSS selector engine you are using.

For cssselect, which powers the CSS selector support in both lxml and parsel, you can visit https://cssselect.readthedocs.io/en/latest/#supported-selectors for more details.

XPath

XPath is a language designed for selecting elements in XML documents.

Since HTML is a close cousin to XML, it is possible to use XPath syntax against an HTML document.

XPath describes a means of navigating from a starting point in the document to the desired element(s).

XPath Selectors

When you use an XPath selector, you are starting from a particular node in the document.

When using lxml.html or parsel, for example, you typically parse the entire HTML document, so you are starting from the root node, which is the <html> element.

If you have that element in a node named root, you can use root.xpath() to evaluate XPath expressions using that as the starting node.

As you navigate the tree, you might use other nodes as a starting point. For instance, you might find a <div> element that contains the content you need, and then want to select all of the <a> elements that are children of that <div>.

Here are some examples:

XPath	Description
`//a`	Selects all `<a>` elements anywhere in the document.
`.//a`	Selects all `<a>` elements anywhere within the current node.
`./a`	Selects all `<a>` elements that are immediate children of the current node.
`../a`	Selects all `<a>` elements that are children of the parent of the current node (the current node's `<a>` siblings, plus the current node itself if it is an `<a>`).

These XPath expressions will return different results depending on the starting point.
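The following sketch shows how the starting node changes the result. It assumes lxml is installed; the markup, the link targets, and the texts() helper are made up for the illustration.

```python
import lxml.html  # third-party

# Hypothetical markup: one link directly in a <div>, one nested deeper,
# and one in a second <div>.
DOC = """
<html>
  <body>
    <div id="nav">
      <a href="/home">Home</a>
      <p><a href="/deep">Deep</a></p>
    </div>
    <div id="main">
      <a href="/next">Next</a>
    </div>
  </body>
</html>
"""

root = lxml.html.fromstring(DOC)   # the <html> element
nav = root.xpath("//div")[0]       # first <div> in document order (id="nav")
para = nav.xpath("./p")[0]         # the <p> inside that <div>


def texts(nodes):
    """Return the text of each element in a list of nodes."""
    return [n.text for n in nodes]


print(texts(root.xpath("//a")))   # ['Home', 'Deep', 'Next']
print(texts(nav.xpath("//a")))    # ['Home', 'Deep', 'Next']  (// ignores the start node)
print(texts(nav.xpath(".//a")))   # ['Home', 'Deep']          (anywhere below nav)
print(texts(nav.xpath("./a")))    # ['Home']                  (direct children only)
print(texts(para.xpath("../a")))  # ['Home']                  (children of para's parent)
```

Note that a path starting with // always searches the whole document, even when it is evaluated from a node deep in the tree; the leading . is what ties the search to the current node.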

Location Steps

XPath makes it possible to do a fairly complex navigation of the parse tree using a syntax called location steps.

An XPath like //div/p/a will select all <a> elements that are children of a <p> element that is itself a child of a <div> element. A single / moves exactly one level down the tree, while // matches descendants at any depth. Each piece between the slashes is known as a “location step”.

A location step takes the form axis::node_type[predicate]; only node_type is required. When the axis is omitted, child is assumed.

The examples above just use the node type portion. A node type is either the name of a tag (e.g. div, a, tr) or * to match all elements.
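As a sketch of how individual location steps behave, the following compares / and // on the example document and shows the * wildcard and the explicit child axis. It assumes lxml is installed; the variable names are illustrative.

```python
import lxml.html  # third-party

# The example document from earlier in this appendix (body element closed).
DOC = """
<html>
  <body>
    <div id="first" class="block">
      <ul>
        <li>One</li>
        <li>Two</li>
        <li>Three</li>
      </ul>
      <p class="inner">After the list</p>
    </div>
    <div id="second" class="block">
      <div id="inner">
        <p>Some text</p>
      </div>
    </div>
  </body>
</html>
"""

root = lxml.html.fromstring(DOC)

# Each / step moves down exactly one level.
child_steps = [p.text for p in root.xpath("//body/div/p")]
# A // step matches at any depth below the previous step.
any_depth = [p.text for p in root.xpath("//body/div//p")]
# * matches any element, here every child of the <ul>.
wildcard = [el.tag for el in root.xpath("//ul/*")]
# Writing the child axis explicitly gives the same result as leaving it out.
implicit = [li.text for li in root.xpath("//ul/li")]
explicit = [li.text for li in root.xpath("//ul/child::li")]

print(child_steps)  # ['After the list']
print(any_depth)    # ['After the list', 'Some text']
print(wildcard)     # ['li', 'li', 'li']
print(implicit == explicit)  # True
```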

Predicates

The predicate portion of a location step allows filtering of the elements that match the node type.

Selecting by Attribute

You can select elements by attribute using syntax like //div[@id="first"]. This will select all <div> elements with the id attribute set to first.

Similarly, //div[@class="block"] will select all <div> elements with the class attribute set to block.

Unlike CSS, there is no special syntax for id and class; all attributes are selected in the same manner. //div[@attr="val"] will select all <div> elements with the attribute attr set to val. Note that the value must be quoted.

Useful predicates

Not all predicates are attribute selectors.

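[1] selects the first element matched by the location step it is attached to. The position is counted among the children of each parent, so //div[1] selects every <div> that is the first <div> child of its parent. To get the first <div> in the whole document, wrap the path in parentheses: (//div)[1]. (Note: XPath is 1-indexed, not 0-indexed.)

[last()] selects the last element matched by the location step in the same way: //div[last()] selects every <div> that is the last <div> child of its parent, while (//div)[last()] selects the last <div> in the document.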

./li[position() < 4] selects the first three elements that match ./li.

//a[contains(@href, "pdf")] matches all <a> tags where the href attribute contains “pdf”.

text() selects the text nodes that are direct children of the current node. //a[text()="Next Page"] matches all <a> tags whose text is exactly “Next Page”.
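The predicates above can be checked against the example document with a short sketch like this one. It assumes lxml is installed; the ids() helper is hypothetical. Note how a positional predicate is counted per parent unless the path is wrapped in parentheses.

```python
import lxml.html  # third-party

# The example document from earlier in this appendix (body element closed).
DOC = """
<html>
  <body>
    <div id="first" class="block">
      <ul>
        <li>One</li>
        <li>Two</li>
        <li>Three</li>
      </ul>
      <p class="inner">After the list</p>
    </div>
    <div id="second" class="block">
      <div id="inner">
        <p>Some text</p>
      </div>
    </div>
  </body>
</html>
"""

root = lxml.html.fromstring(DOC)


def ids(path):
    """Return the id attribute of every element matched by an XPath."""
    return [el.get("id") for el in root.xpath(path)]


print(ids('//div[@class="block"]'))  # ['first', 'second']
print(ids("//div[1]"))               # ['first', 'inner']  (first <div> child of each parent)
print(ids("(//div)[1]"))             # ['first']           (first <div> in the document)
print(ids("//div[last()]"))          # ['second', 'inner'] (last <div> child of each parent)
print(ids("(//div)[last()]"))        # ['inner']           (last <div> in the document)

first_two = [li.text for li in root.xpath("//ul/li[position() < 3]")]
with_t = [li.text for li in root.xpath('//li[contains(text(), "T")]')]
exact = [p.text for p in root.xpath('//p[text()="Some text"]')]

print(first_two)  # ['One', 'Two']
print(with_t)     # ['Two', 'Three']
print(exact)      # ['Some text']
```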

Axes

Axes in XPath allow for selection on relationships to the current node.

To use an axis, you precede the node portion of the XPath with the axis name, followed by ::.

For example, //p/ancestor::div will select all <div> elements that are ancestors of a <p> element.

Some of the most useful axes:

ancestor selects all ancestors of the current node.
ancestor-or-self selects all ancestors of the current node, and the current node itself.
child selects all children of the current node.
descendant selects all descendants of the current node.
following-sibling selects all siblings of the current node that come after it.
preceding-sibling selects all siblings of the current node that come before it.
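Here is a brief sketch of these axes applied to the example document, assuming lxml is installed. The variable names are illustrative, and the ancestor result is collected into a set so the example does not depend on result ordering.

```python
import lxml.html  # third-party

# The example document from earlier in this appendix (body element closed).
DOC = """
<html>
  <body>
    <div id="first" class="block">
      <ul>
        <li>One</li>
        <li>Two</li>
        <li>Three</li>
      </ul>
      <p class="inner">After the list</p>
    </div>
    <div id="second" class="block">
      <div id="inner">
        <p>Some text</p>
      </div>
    </div>
  </body>
</html>
"""

root = lxml.html.fromstring(DOC)

# Every <div> that is an ancestor of some <p>.
p_ancestors = {d.get("id") for d in root.xpath("//p/ancestor::div")}

# Use the middle list item as the starting node for the sibling axes.
two = root.xpath('//li[text()="Two"]')[0]
after = [li.text for li in two.xpath("following-sibling::li")]
before = [li.text for li in two.xpath("preceding-sibling::li")]

# child and descendant, written out explicitly.
first_children = [el.tag for el in root.xpath('//div[@id="first"]/child::*')]
second_ps = [p.text for p in root.xpath('//div[@id="second"]/descendant::p')]

print(sorted(p_ancestors))  # ['first', 'inner', 'second']
print(after)                # ['Three']
print(before)               # ['One']
print(first_children)       # ['ul', 'p']
print(second_ps)            # ['Some text']
```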

Caveat: Class Selectors in XPath

There’s a common gotcha that arises when using XPath to select elements by class.

CSS treats an element like <div class="abc xyz"> as having two classes, abc and xyz.

One might think then that the equivalent of the CSS selector div.abc.xyz would look like: //div[@class="abc" and @class="xyz"]. This will not work: in XPath every attribute value is a single string, so @class here is the string "abc xyz", which equals neither "abc" nor "xyz". You could match using //div[@class="abc xyz"], but that would only match if the classes appear in exactly that order with nothing else in the attribute, whereas CSS selectors are not order-dependent.

If you are doing a lot of matching on classes, CSS selectors are probably the more robust choice.
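If you do need to match classes in XPath, a commonly used workaround is to pad the class attribute with spaces and test for a space-delimited token. The sketch below compares that idiom with the naive approaches; it assumes lxml is installed, and the markup and the has_class() helper are hypothetical.

```python
import lxml.html  # third-party

# Hypothetical markup: the same two classes in both orders, plus a
# look-alike class name ("abcdef") that should not count as "abc".
SNIPPET = """
<div>
  <div id="a" class="abc xyz">1</div>
  <div id="b" class="xyz abc">2</div>
  <div id="c" class="abcdef xyz">3</div>
</div>
"""

root = lxml.html.fromstring(SNIPPET)


def ids(path):
    """Return the id attribute of every element matched by an XPath."""
    return [el.get("id") for el in root.xpath(path)]


def has_class(name):
    """Build an XPath test for one whole, space-delimited class token."""
    return f'contains(concat(" ", normalize-space(@class), " "), " {name} ")'


# Comparing against both values never matches: @class is one string.
print(ids('//div[@class="abc" and @class="xyz"]'))  # []
# An exact string match depends on the order of the classes.
print(ids('//div[@class="abc xyz"]'))               # ['a']
# A plain substring test also matches "abcdef".
print(ids('//div[contains(@class, "abc") and contains(@class, "xyz")]'))  # ['a', 'b', 'c']
# The padded-token test behaves like the CSS selector div.abc.xyz.
print(ids(f'//div[{has_class("abc")} and {has_class("xyz")}]'))           # ['a', 'b']
```

The padded-token expression is verbose, which is one more reason to prefer CSS selectors when matching on classes.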

Quick Reference

CSS Selector	XPath Selector	Description
`div`	`//div`	Selects all div elements.
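`#xyz`	`//*[@id="xyz"]`	Selects the element with the id ‘xyz’.
`.xyz`	`//*[@class="xyz"]`	Selects all elements with the class ‘xyz’. (The XPath only matches when ‘xyz’ is the element’s sole class; see the caveat above.)
`div.xyz`	`//div[@class="xyz"]`	Selects all div elements with the class ‘xyz’. (Same caveat for the XPath.)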
`div > p`	`//div/p`	Selects all p elements that are children of a div element.
`div p`	`//div//p`	Selects all p elements that are descendants of a div element.
`div + p`	`//div/following-sibling::*[1][self::p]`	Selects all p elements that immediately follow a div sibling.
`div ~ p`	`//div/following-sibling::p`	Selects all p elements that come after a div sibling.