Depth Limiting and Path Filtering in Lighthouse Parade
April 19, 2024
In case you missed it, last month we released Lighthouse Parade, a CLI tool to automatically run and aggregate Lighthouse performance reports across an entire site. One of the most requested features has been the ability to limit which pages are crawled. We’re excited to release Lighthouse Parade 1.1, which introduces three new flags to accommodate these use cases.
We can install and run lighthouse-parade using npx, and we'll use cloudfour.com as our example site:
npx lighthouse-parade https://cloudfour.com
At a glance this doesn’t look like a large site, but when you consider all the blog posts and index pages, there are a lot of pages to run Lighthouse on, so it will take a while. We can reduce the number of pages that are crawled by limiting the crawl depth using the new --max-crawl-depth flag. Depth limiting controls how far the crawler traverses: how many “clicks” away from the starting page it will go. We’ll set it to two so that it crawls the home page and only the pages linked directly from it:
npx lighthouse-parade https://cloudfour.com --max-crawl-depth 2
This speeds up the crawl considerably (only twelve pages get crawled). But maybe we want to crawl more pages than that. Let’s bump the crawl depth up to three and filter out blog posts (which have URLs like https://cloudfour.com/thinks/*). The new --exclude-path-glob flag lets us do that. Keep in mind that the glob must be wrapped in quotes; otherwise, your shell may try to expand it before it reaches lighthouse-parade:
npx lighthouse-parade https://cloudfour.com --max-crawl-depth 3 --exclude-path-glob "/thinks/*"
This works pretty well. It provides a broader picture of the site’s performance than limiting the depth to two (specifically, it covers more kinds of pages) without being slowed down by running Lighthouse on every single blog post.
This option is especially useful on e-commerce sites where you wouldn’t want Lighthouse to run on every single product page.
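For instance, on a hypothetical online store (the store.example domain and /products/ URL structure below are made up for illustration), you could skip individual product pages like this:
# store.example and the /products/* path are hypothetical examples
npx lighthouse-parade https://store.example --exclude-path-glob "/products/*"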
Going back to the cloudfour.com example, maybe we don’t want to limit the depth at all, but we still want to exclude blog posts. If we tried that, we’d see the crawler start to pick up sitemap pages like https://cloudfour.com/sitemap-pt-post-2020-12.html, as well as paginated links, so we’ll exclude those too by passing the --exclude-path-glob flag two more times:
npx lighthouse-parade https://cloudfour.com --exclude-path-glob "/thinks/*" --exclude-path-glob "/sitemap-*" --exclude-path-glob "**/page/*"
We’ll look at one more example to show off the last new flag, --include-path-glob. Maybe we want to run Lighthouse only on blog posts, so we can see which posts might have unoptimized images or other resources that slow them down. The --include-path-glob flag tells the crawler to ignore any URL that doesn’t match the specified glob:
npx lighthouse-parade https://cloudfour.com --include-path-glob "/thinks/*"
Another use case is internationalized sites with URL prefixes like /en/. The --include-path-glob flag can limit the crawl so that Lighthouse only runs on one language’s version of the translated pages.
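For instance, assuming a site that prefixes its English pages with /en/ (a hypothetical URL structure), you could limit the crawl to just those pages:
# the example.com domain and /en/ prefix are hypothetical examples
npx lighthouse-parade https://example.com --include-path-glob "/en/*"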
Combining these new flags gives you fine-grained control over which pages are crawled. We hope you find them helpful! Feel free to leave feedback on GitHub.