Simon Willison’s Weblog

Subscribe

5 items tagged “robotstxt”

2024

People share a lot of sensitive material on Quora—controversial political views, workplace gossip and compensation, and negative opinions held of companies. Over many years, as they change jobs or change their views, it is important that they can delete or anonymize their previously-written answers.

We opt out of the wayback machine because inclusion would allow people to discover the identity of authors who had written sensitive answers publicly and later had made them anonymous, and because it would prevent authors from being able to remove their content from the internet if they change their mind about publishing it.

quora.com/robots.txt # 19th March 2024, 11:09 pm

2020

Weeknotes: cookiecutter templates, better plugin documentation, sqlite-generate

I spent this week spreading myself between a bunch of smaller projects, and finally getting familiar with cookiecutter. I wrote about my datasette-plugin cookiecutter template earlier in the week; here’s what else I’ve been working on.

[... 703 words]

datasette-block-robots. Another little Datasette plugin: this one adds a /robots.txt page with “Disallow: /” to block all indexing of a Datasette instance from respectable search engine crawlers. I built this in less than ten minutes from idea to deploy to PyPI thanks to the datasette-plugin cookiecutter template. # 23rd June 2020, 3:28 am

2008

The X-Robots-Tag HTTP header. News to me, but both Google and Yahoo! have supported it since last year. You can add per-page robots exclusion rules in HTTP headers instead of using meta tags, and Google’s version supports unavailable_after which is handy for content with a known limited shelf-life. # 9th June 2008, 9:21 am

2007

robots.txt Adventure. Interesting notes from crawling 4.6 million robots.txt, including 69 different ways in which the word “disallow” can be mis-spelled. # 22nd September 2007, 12:36 am