Simon Willison’s Weblog

10 items tagged “robots-txt”

2024

Nilay Patel reports a hallucinated ChatGPT summary of his own article (via) Here's a ChatGPT bug that's a new twist on the old issue where it would hallucinate the contents of a web page based on the URL.

The Verge editor Nilay Patel asked for a summary of one of his own articles, pasting in the URL.

ChatGPT 4o replied with an entirely invented summary full of hallucinated details.

It turns out The Verge blocks ChatGPT's browse mode from accessing their site in their robots.txt:

User-agent: ChatGPT-User
Disallow: /

Clearly ChatGPT should reply that it is unable to access the provided URL, rather than inventing a response that guesses at the contents!
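
For context, that block is easy to check mechanically. Here's a minimal sketch in Python (my illustration, not OpenAI's code, and the article URL is a hypothetical placeholder) of the robots.txt check a browsing tool could make before answering:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://www.theverge.com/robots.txt")
parser.read()  # fetch and parse the robots.txt rules

url = "https://www.theverge.com/example-article"  # hypothetical placeholder URL
if parser.can_fetch("ChatGPT-User", url):
    print("Allowed to fetch and summarize:", url)
else:
    print("Blocked by robots.txt - say so instead of inventing a summary")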

# 24th May 2024, 6:38 am / openai, chatgpt, ai, llms, nilay-patel, robots-txt

People share a lot of sensitive material on Quora - controversial political views, workplace gossip and compensation, and negative opinions held of companies. Over many years, as they change jobs or change their views, it is important that they can delete or anonymize their previously-written answers.

We opt out of the wayback machine because inclusion would allow people to discover the identity of authors who had written sensitive answers publicly and later had made them anonymous, and because it would prevent authors from being able to remove their content from the internet if they change their mind about publishing it.

quora.com/robots.txt
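
For reference, a Wayback Machine opt-out is expressed as a robots.txt rule aimed at the Internet Archive's ia_archiver crawler. This is an illustrative example, not necessarily Quora's exact file:

User-agent: ia_archiver
Disallow: /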

# 19th March 2024, 11:09 pm / internet-archive, quora, robots-txt

2020

Weeknotes: cookiecutter templates, better plugin documentation, sqlite-generate

I spent this week spreading myself between a bunch of smaller projects, and finally getting familiar with cookiecutter. I wrote about my datasette-plugin cookiecutter template earlier in the week; here’s what else I’ve been working on.

[... 703 words]

datasette-block-robots. Another little Datasette plugin: this one adds a /robots.txt page with Disallow: / to block all indexing of a Datasette instance by respectable search engine crawlers. I built this in less than ten minutes from idea to deploy to PyPI thanks to the datasette-plugin cookiecutter template.
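
Here's a rough sketch of the approach, assuming Datasette's register_routes() plugin hook; it's my illustration rather than the plugin's actual source:

from datasette import hookimpl
from datasette.utils.asgi import Response

async def robots_txt(request):
    # Ask well-behaved crawlers not to index anything on this instance
    return Response.text("User-agent: *\nDisallow: /")

@hookimpl
def register_routes():
    # Serve the response above at /robots.txt
    return [(r"^/robots\.txt$", robots_txt)]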

# 23rd June 2020, 3:28 am / projects, robots-txt, plugins, seo, datasette

2010

RFC5785: Defining Well-Known Uniform Resource Identifiers (via) Sounds like a very good idea to me: defining a common prefix of /.well-known/ for well-known URLs (common metadata like robots.txt) and establishing a registry for all such files. OAuth, OpenID and other decentralised identity systems can all benefit from this.

# 11th April 2010, 7:32 pm / rfc, urls, wellknownurls, openid, oauth, robots-txt

2008

The X-Robots-Tag HTTP header. News to me, but both Google and Yahoo! have supported it since last year. You can add per-page robots exclusion rules in HTTP headers instead of using meta tags, and Google’s version supports unavailable_after which is handy for content with a known limited shelf-life.
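
For example, the header carries the same directives as a robots meta tag; the unavailable_after date format below follows Google's published examples and should be treated as approximate:

X-Robots-Tag: noindex
X-Robots-Tag: unavailable_after: 25 Jun 2008 15:00:00 PST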

# 9th June 2008, 9:21 am / google, yahoo, robots-txt, xrobotstag, http

2007

robots.txt Adventure. Interesting notes from crawling 4.6 million robots.txt files, including 69 different ways in which the word “disallow” can be mis-spelled.

# 22nd September 2007, 12:36 am / robots-txt, crawling, andrew-wooster

2003

New anti-comment-spam measure

I’ve added a new anti-comment-spam measure to this site. The majority of comment spam exists for one reason and one reason only: to increase the Google PageRank of the site linked from the spam, and specifically to increase its ranking for the term used in the link. This is why so many comment spams include links like this: Cheap Viagra.

[... 268 words]

2002

How the RIAA was hacked

The Register: Want to know how RIAA.org was hacked? They had an un-password-protected admin panel listed in their robots.txt file. Muppets.
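
The underlying problem is that robots.txt is publicly readable, so a hypothetical entry like this one does nothing but advertise where the sensitive pages live:

User-agent: *
Disallow: /admin/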

Stupid Danish newspapers

More deep linking stupidity (via Scripting News). A judge in Denmark has ruled in favour of a newspaper that took a search engine to court over “deep linking”, despite the search engine’s spider following the robots.txt standard (it seems the newspaper didn’t bother to implement a robots.txt file). Dave Winer summed things up perfectly:

[... 86 words]