Simon Willison’s Weblog

Subscribe
Atom feed for crawling

11 posts tagged “crawling”

2025

If you're a startup running your own crawlers to gather data for whatever purpose, you should try really hard not to make the world a worse place by driving up costs for the sites you are scraping.

There's really no excuse for crawling Wikipedia ("65% of our most expensive traffic comes from bots") when they offer a comprehensive collection of bulk download options.

Do better!

# 7th April 2025, 7:06 pm / ai-ethics, jeremy-keith, crawling, wikipedia, ai

2024

AI crawlers need to be more respectful (via) Eric Holscher:

At Read the Docs, we host documentation for many projects and are generally bot friendly, but the behavior of AI crawlers is currently causing us problems. We have noticed AI crawlers aggressively pulling content, seemingly without basic checks against abuse.

One crawler downloaded 73 TB of zipped HTML files just in Month, racking up $5,000 in bandwidth charges!

# 25th July 2024, 8:02 pm / eric-holscher, ai, ethics, read-the-docs, ai-ethics, crawling

More than an OpenAI Wrapper: Perplexity Pivots to Open Source. I’m increasingly impressed with Perplexity.ai—I’m using it on a daily basis now. It’s by far the best implementation I’ve seen of LLM-assisted search—beating Microsoft Bing and Google Bard at their own game.

A year ago it was implemented as a GPT 3.5 powered wrapper around Microsoft Bing. To my surprise they’ve now evolved way beyond that: Perplexity has their own search index now and is running their own crawlers, and they’re using variants of Mistral 7B and Llama 70B as their models rather than continuing to depend on OpenAI.

# 13th January 2024, 6:12 am / perplexity, generative-ai, search, ai, llms, crawling

2023

Google was accidentally leaking its Bard AI chats into public search results. I’m quoted in this piece about yesterday’s Bard privacy bug: it turned out the share URL and “Let anyone with the link see what you’ve selected” feature wasn’t correctly setting a noindex parameter, and so some shared conversations were being swept up by the Google search crawlers. Thankfully this was a mistake, not a deliberate design decision, and it should be fixed by now.

# 27th September 2023, 7:35 pm / bard, privacy, google, llms, crawling

2022

curl-impersonate (via) “A special build of curl that can impersonate the four major browsers: Chrome, Edge, Safari & Firefox. curl-impersonate is able to perform TLS and HTTP handshakes that are identical to that of a real browser.”

I hadn’t realized that it’s become increasingly common for sites to use fingerprinting of TLS and HTTP handshakes to block crawlers. curl-impersonate attempts to impersonate browsers much more accurately, using tricks like compiling with Firefox’s nss TLS library and Chrome’s BoringSSL.

# 10th August 2022, 3:34 pm / scraping, curl, crawling

2020

datasette-block-robots. Another little Datasette plugin: this one adds a /robots.txt page with Disallow: / to block all indexing of a Datasette instance from respectable search engine crawlers. I built this in less than ten minutes from idea to deploy to PyPI thanks to the datasette-plugin cookiecutter template.

# 23rd June 2020, 3:28 am / projects, robots-txt, plugins, seo, datasette, crawling

2018

Googlebot’s Javascript random() function is deterministic. random() as executed by Googlebot returns the same predicable sequence. More interestingly, Googlebot runs a much faster timer for setTimeout and setInterval—as Tom Anthony points out, “Why actually wait 5 seconds when you are a bot?”

# 7th February 2018, 2:41 am / seo, crawling, google

2012

Why does Google use “Allow” in robots.txt, when the standard seems to be “Disallow?”

The Disallow command prevents search engines from crawling your site.

[... 59 words]

2009

Official Google Webmaster Blog: A proposal for making AJAX crawlable. It's horrible! The Google crawler would map url#!state to url?_escaped_fragment_=state, then expect your site to provide rendered HTML that reflects that state (they even go as far as to suggest running a headless browser within your web server to do this). Just stick to progressive enhancement instead, it's far less hideous. It looks like the proposal may have originated with the GWT team.

# 8th October 2009, 5:52 pm / javascript, progressive-enhancement, search-engines, google, crawling, ajax, seo, gwt

2007

robots.txt Adventure. Interesting notes from crawling 4.6 million robots.txt, including 69 different ways in which the word “disallow” can be mis-spelled.

# 22nd September 2007, 12:36 am / robots-txt, crawling, andrew-wooster

2003

Calendars and crawlers

Douglas Bowman has been having some amusing problems with robots and his calendar. The calendar, visible on every page of the site, automatically adds a “next month” and “previous month” link to allow surfers to browser through the archive in both directions. Unfortunately, Doug ommitted the logic to stop showing a “previous month” link when there were no earlier entries. An enterprising crawler started following the links, and didn’t stop until it had reached 1542!

[... 113 words]