Simon Willison’s Weblog

Fix @covidsewage bot to handle a change to the underlying website. I've been running @covidsewage on Mastodon since February last year, posting a daily screenshot of the Santa Clara County charts showing Covid levels in wastewater.

A few days ago the county changed their website, breaking the bot. The chart now lives on their new COVID in wastewater page.

It's still a Microsoft Power BI dashboard in an <iframe>, but my initial attempts to scrape it didn't quite work. Eventually I realized that Cloudflare protection was blocking my attempts to access the page, but thankfully sending a Firefox user-agent fixed that problem.

The new recipe I'm using to screenshot the chart involves a delightfully messy nested set of calls to shot-scraper. The inner call uses shot-scraper javascript to extract the src attribute of that <iframe>, and the outer call takes that URL and generates the screenshot:

shot-scraper -o /tmp/covid.png $(
  shot-scraper javascript \
    'https://publichealth.santaclaracounty.gov/health-information/health-data/disease-data/covid-19/covid-19-wastewater' \
    'document.querySelector("iframe").src' \
    -b firefox \
    --user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0' \
    --raw
) --wait 5000 -b firefox --retina
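
If the county changes the page again, the nested form can fail in confusing ways, because whatever the inner call prints is passed straight to the outer one. Here is a rough sketch of the same recipe split into two steps with a sanity check in between. The function names (is_http_url, extract_iframe_url, take_screenshot) are hypothetical, and the sketch assumes shot-scraper javascript exits with a non-zero status when the page has no <iframe>; the flags are the same ones used above.

```shell
# Sketch only: the function names are made up for illustration.
PAGE_URL='https://publichealth.santaclaracounty.gov/health-information/health-data/disease-data/covid-19/covid-19-wastewater'
USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0'

# Succeed only if the argument looks like an absolute http(s) URL.
is_http_url() {
  case "$1" in
    http://?*|https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Step 1: print the src of the first <iframe> on the county page.
extract_iframe_url() {
  shot-scraper javascript "$PAGE_URL" \
    'document.querySelector("iframe").src' \
    -b firefox \
    --user-agent "$USER_AGENT" \
    --raw
}

# Step 2: screenshot the dashboard, but only if step 1 produced a usable URL.
take_screenshot() {
  # Assumption: a failed extraction shows up as a non-zero exit status.
  iframe_url="$(extract_iframe_url)" || return 1
  if ! is_http_url "$iframe_url"; then
    echo "Unexpected iframe src: $iframe_url" >&2
    return 1
  fi
  shot-scraper "$iframe_url" -o /tmp/covid.png --wait 5000 -b firefox --retina
}
```

Calling take_screenshot then behaves like the one-liner, except that an empty or non-HTTP value stops the run with a message on stderr instead of being handed to the second shot-scraper call.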