<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Todd Schiller - scraping</title><link href="https://toddschiller.com/" rel="alternate"></link><link href="https://toddschiller.com/feeds/tag/scraping.atom.xml" rel="self"></link><id>https://toddschiller.com/</id><updated>2020-12-03T00:00:00-05:00</updated><subtitle>Human ✘ Artificial Intelligence</subtitle><entry><title>7 bite-sized tips for reliable web automation and scraping selectors</title><link href="https://toddschiller.com/blog/reliable-web-scrapers.html" rel="alternate"></link><published>2020-12-03T00:00:00-05:00</published><updated>2020-12-03T00:00:00-05:00</updated><author><name>Todd Schiller</name></author><id>tag:toddschiller.com,2020-12-03:/blog/reliable-web-scrapers.html</id><summary type="html">&lt;p&gt;If you're like most developers, you've probably
encountered &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors"&gt;Cascading Style Sheet (CSS) selectors&lt;/a&gt;
for styling webpages.&lt;/p&gt;
&lt;p&gt;For example, the following CSS rule combines a paragraph element selector &lt;code&gt;p&lt;/code&gt;
with a class name selector &lt;code&gt;.lead&lt;/code&gt; to set the font size:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;lead&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We can mix-and-match CSS selectors to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;If you're like most developers, you've probably
encountered &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors"&gt;Cascading Style Sheet (CSS) selectors&lt;/a&gt;
for styling webpages.&lt;/p&gt;
&lt;p&gt;For example, the following CSS rule combines a paragraph element selector &lt;code&gt;p&lt;/code&gt;
with a class name selector &lt;code&gt;.lead&lt;/code&gt; to set the font size:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;lead&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We can mix-and-match CSS selectors to describe any subset of elements on a page.
There are CSS selectors for HTML tag types, ids, classes, attributes, page
structure, and even UX interactions.&lt;/p&gt;
&lt;p&gt;Because of their expressiveness, CSS selectors are used everywhere in the web
ecosystem:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Web
Applications (&lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector"&gt;Web API&lt;/a&gt;,
&lt;a href="https://api.jquery.com/category/selectors/"&gt;JQuery&lt;/a&gt;):
modifying the DOM, attaching event handlers, etc.&lt;/li&gt;
&lt;li&gt;Automated Testing: finding elements to interact with, checking assertions,
etc.&lt;/li&gt;
&lt;li&gt;Web Scraping: selecting data to extract, finding links to traverse, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Over at PixieBrix, selectors are an integral part
of &lt;a href="https://www.pixiebrix.com/"&gt;our web customization engine&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Unfortunately, if we don't control a web page, writing reliable selectors can be
an unpleasant experience. Some old-school WYSIWYG-built sites (I'm looking at
you, &lt;a href="https://en.wikipedia.org/wiki/Microsoft_FrontPage"&gt;Microsoft FrontPage&lt;/a&gt;)
and modern Single Page Applications (SPAs) can be downright hostile to work
with.&lt;/p&gt;
&lt;p&gt;To help folks out, I've compiled a set of tips I find useful when writing
selectors for enhancing, automating, and scraping 3rd-party sites.&lt;/p&gt;
&lt;h2&gt;Tip #0: Use JQuery extensions&lt;/h2&gt;
&lt;p&gt;When working in a browser context (normal or headless),
use &lt;a href="https://api.jquery.com/category/selectors/jquery-selector-extensions/"&gt;JQuery's selector extensions&lt;/a&gt;.
Selectors such as &lt;code&gt;:eq&lt;/code&gt; , &lt;code&gt;:header&lt;/code&gt; , and &lt;code&gt;:contains&lt;/code&gt; make many selections
significantly more straightforward.&lt;/p&gt;
&lt;p&gt;A downside is that JQuery extensions are slower because they can't leverage the
browser's native CSS evaluation. However, for automation and scraping, you
probably won't ever notice. If it does become a problem, there are
simple &lt;a href="https://learn.jquery.com/performance/optimize-selectors/"&gt;techniques to optimize JQuery selectors&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In addition to CSS+JQuery, it's also helpful to learn a bit
of &lt;a href="https://developer.mozilla.org/en-US/docs/Web/XPath"&gt;XPath&lt;/a&gt;, an XML/HTML
traversal language that browsers support natively.&lt;/p&gt;
&lt;h2&gt;Tip #1: Avoid structure-based selectors&lt;/h2&gt;
&lt;p&gt;Browsers and some &lt;a href="https://github.com/fczbkk/css-selector-generator"&gt;libraries&lt;/a&gt;
are able to automatically generate unique CSS selectors for elements.&lt;/p&gt;
&lt;p&gt;&lt;img src="/assets/images/web-selectors/chrome-css-selector.webp" alt="Generating a unique CSS selector for an element in Chrome" loading="lazy" decoding="async" /&gt;&lt;/p&gt;
&lt;p&gt;Automatic generators are great because they can take advantage of the structure
of the Document Object Model (DOM). Automatic generators are also terrible
because they sometimes rely too much on the structure of the DOM.&lt;/p&gt;
&lt;p&gt;Recently, a tool gave me the following selector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;nth-child&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;nth-child&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;nth-child&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;nth-child&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;nth-child&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;3&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Structural selectors like these are extremely sensitive to small changes on the
page, including non-visible changes. (For example: an element with
&lt;code&gt;display:none&lt;/code&gt; could be inserted before an element.)&lt;/p&gt;
&lt;p&gt;These selectors will silently return the wrong element, and we'll wind up
debugging a failed automated test, or ingesting bad extracted data.&lt;/p&gt;
&lt;p&gt;Therefore, whenever possible, I eschew overly-specific structural selectors in
favor of ids, class names, attributes, and textual content.&lt;/p&gt;
&lt;h2&gt;Tip #2: Avoid dynamically generated attributes&lt;/h2&gt;
&lt;p&gt;In modern Single Page Applications (SPAs), element ids and class names are
commonly dynamically generated.&lt;/p&gt;
&lt;p&gt;Dynamic class names are computed when the code is compiled/bundled. That's
because the class names in the HTML file must match the names in separate CSS
stylesheets. Dynamic class names will change whenever the site is updated. (
Unless they are computed deterministically, e.g., using a hash. Then they'll
change whenever the style is changed).&lt;/p&gt;
&lt;p&gt;For example, take the &amp;quot;Google Search&amp;quot; button on
the &lt;a href="https://www.google.com"&gt;Google homepage&lt;/a&gt;. It's built
with &lt;a href="https://developers.google.com/closure"&gt;Google's Closure Framework&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;input&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gNO89b&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Google Search&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;btnK&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;submit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;At first glance, the &lt;code&gt;name=&amp;quot;btnK&amp;quot;&lt;/code&gt; also looks random, but it's not. The &amp;quot;I'm
Feeling Lucky&amp;quot; button button on the page has the name &lt;code&gt;btnI&lt;/code&gt;, suggesting they're
human-picked. As a rule of thumb, &lt;code&gt;input&lt;/code&gt; name attributes can't be dynamic
because other parts of the application (either front-end, or backend) depend on
them.&lt;/p&gt;
&lt;p&gt;A common source of dynamic class names in applications is the use
of &lt;a href="https://css-tricks.com/css-modules-part-1-need/"&gt;CSS Modules&lt;/a&gt;. CSS Modules
automatically encapsulating styles to avoid accidental styling collisions.
Here's an example header:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;h1&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;_styles__title_309571057&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;An example heading&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;h1&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;While we can't match the exact class name, it will follow a consistent naming
pattern across updates of the site. Therefore, we can match on the pattern
using &lt;a href="https://www.w3schools.com/css/css_attribute_selectors.asp"&gt;CSS's starts-with attribute selector&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;h1&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;class&lt;/span&gt;&lt;span class="o"&gt;^=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_styles__title_&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Dynamic ids, on the other hand, are managed by the SPA framework at runtime.
Therefore, they'll change for each run of the application. Take, for example,
the profile header from LinkedIn, which uses
the &lt;a href="https://emberjs.com/"&gt;Ember.js&lt;/a&gt; framework:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;ember1398&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;pv-top-card artdeco-card ember-view&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="cm"&gt;&amp;lt;!-- more elements --&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here, the ember1398 id isn't random — the framework sequentially generates a new
id for each element it renders. In practice, the id of an element will differ
between page loads because elements don't always load in the exact same order.&lt;/p&gt;
&lt;h2&gt;Tip #3: Search text with JQuery's &lt;code&gt;:contains&lt;/code&gt; selector&lt;/h2&gt;
&lt;p&gt;JQuery's &lt;a href="https://api.jquery.com/contains-selector/"&gt;contains selector&lt;/a&gt; selects
elements that contain the given text anywhere within them. The selector is not
available in plain CSS, but it is available as
an &lt;a href="https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/contains"&gt;XPath function&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For example, consider the following HTML:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Lorem ipsum dolor sit amet, consectetur ...&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We could select this element with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Lorem ipsum&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If there are other tags, we need to add additional selectors to clarify which
element we want. For example, the above selector would select the &lt;code&gt;div&lt;/code&gt;, &lt;code&gt;p&lt;/code&gt;,
and &lt;code&gt;b&lt;/code&gt; elements in this document:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;b&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Lorem ipsum&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;b&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; dolor sit amet, consectetur...&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To select just the &lt;code&gt;div&lt;/code&gt;, we'd need to specify the tag in the selector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Lorem ipsum&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In some cases, the text might contain HTML entities, e.g., a non-breaking space:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Lorem&lt;span class="ni"&gt;&amp;amp;nsbp;&lt;/span&gt;ipsum&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Trying to select &amp;quot;Lorem ipsum&amp;quot; won't match here. In these cases, we provide the
contains selector multiple times to match only elements that contain both words:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Lorem&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ipsum&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, a caveat/warning: &lt;em&gt;when writing text selectors, be aware of which text
is translated when the page is viewed in a different language.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Tip #4: Target selection with &lt;code&gt;:has&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Sometimes we need more precision in targeting a text search. In these cases we
can combine a &lt;code&gt;:contains&lt;/code&gt; selector within a CSS &lt;code&gt;:has&lt;/code&gt; selector. Suppose we
wanted to get the year a property was built
from &lt;a href="https://www.apartments.com/the-max-new-york-ny/24f34qb/"&gt;Apartments.com&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;specList&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;h3&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Property Information&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;h3&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;ul&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
         &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bullet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;•&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Built in 2018&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
         &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bullet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;•&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;1016 Units/44 Stories&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;ul&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We can find the div with the Property Information header, and then grab the list
item corresponding to the year the property was built:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;specList&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;has&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;h3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Property Information&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;ul&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Built&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;:has&lt;/code&gt; selector can also be used to find the containing element. The
following selector matches a list containing an item with &amp;quot;Built&amp;quot;, rather than
the individual list item:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;ul&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;has&lt;/span&gt;&lt;span class="o"&gt;(&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Built&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;Tip #5: Use ARIA attributes for more readable selectors&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA"&gt;Accessible Rich Internet Applications (ARIA)&lt;/a&gt;
attributes support the accessibility of a site, e.g., for providing context to
screen readers. Using them for selectors also makes selectors more readable.&lt;/p&gt;
&lt;p&gt;For example, the &lt;code&gt;aria-label&lt;/code&gt; attribute labels elements that otherwise would
depend on visual cues:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt; &lt;span class="na"&gt;aria-label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Close&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;onclick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;myDialog.close()&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;X&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These can be selected against using standard
CSS &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors"&gt;attribute selectors&lt;/a&gt;,
e.g.:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;button&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;aria-label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Close&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;One thing to watch out for with the &lt;code&gt;aria-label&lt;/code&gt; attribute is that it's always
internationalized/translated alongside other UX text.&lt;/p&gt;
&lt;p&gt;ARIA attributes also provide information about the role of elements on the
page (e.g.,
the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA/Roles/button_role"&gt;role attribute&lt;/a&gt;)
and even the relationship between elements on the page (
e.g., &lt;a href="https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA/ARIA_Techniques/Using_the_aria-labelledby_attribute"&gt;aria-labelledby&lt;/a&gt;).&lt;/p&gt;
&lt;h2&gt;Tip #6: Be on the lookout for automated testing attributes&lt;/h2&gt;
&lt;p&gt;Developers using automated testing frameworks, e.g.
Cypress, &lt;a href="https://docs.cypress.io/guides/references/best-practices.html"&gt;add data attributes to make their life easier when writing automated tests&lt;/a&gt;.
Sometimes these attributes will find their way to the production site:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;data-test-id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;content&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="cm"&gt;&amp;lt;!-- some content --&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;data-cy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;content&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="cm"&gt;&amp;lt;!-- some content --&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Testing attributes can be selected just like any other attribute, and are often
unique (just like an id):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;data-cy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;content&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src="/assets/images/web-selectors/together.webp" alt="We're all in this together" loading="lazy" decoding="async" /&gt;&lt;/p&gt;
&lt;h2&gt;Bonus Tip: handle flat key/value pairs with the adjacent sibling combinator&lt;/h2&gt;
&lt;p&gt;The &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/Adjacent_sibling_combinator"&gt;adjacent sibling combinator (+)&lt;/a&gt;
matches the element immediately following another element.&lt;/p&gt;
&lt;p&gt;Supposed we have the following HTML, structured as multiple key-value pairs per
row/list item:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;row&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;k&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Name:&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;v&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Laura Smith&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;k&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Year:&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;v&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;2013&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We can look up the value for the year by finding the &amp;quot;Year:&amp;quot; label and then
selecting the next element, which will be the value:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year:&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Also see the
&lt;a href="https://news.ycombinator.com/item?id=25993258"&gt;Hacker News discussion&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content><category term="Browser Extensions"></category><category term="web browsers"></category><category term="automation"></category><category term="scraping"></category></entry></feed>