<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: robots-txt</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/robots-txt.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-05-24T06:38:50+00:00</updated><author><name>Simon Willison</name></author><entry><title>Nilay Patel reports a hallucinated ChatGPT summary of his own article</title><link href="https://simonwillison.net/2024/May/24/nilay-patel-hallucinated-chatgpt/#atom-tag" rel="alternate"/><published>2024-05-24T06:38:50+00:00</published><updated>2024-05-24T06:38:50+00:00</updated><id>https://simonwillison.net/2024/May/24/nilay-patel-hallucinated-chatgpt/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.threads.net/@reckless1280/post/C7MeXn6LOt_"&gt;Nilay Patel reports a hallucinated ChatGPT summary of his own article&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's a ChatGPT bug that's a new twist on the &lt;a href="https://simonwillison.net/2023/Mar/10/chatgpt-internet-access/"&gt;old issue&lt;/a&gt; where it would hallucinate the contents of a web page based on the URL.&lt;/p&gt;
&lt;p&gt;The Verge editor Nilay Patel asked for a summary of one of his own articles, pasting in the URL.&lt;/p&gt;
&lt;p&gt;ChatGPT 4o replied with an entirely invented summary full of hallucinated details.&lt;/p&gt;
&lt;p&gt;It turns out The Verge blocks ChatGPT's browse mode from accessing their site in their &lt;a href="https://www.theverge.com/robots.txt"&gt;robots.txt&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;User-agent: ChatGPT-User
Disallow: /
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clearly ChatGPT should reply that it is unable to access the provided URL, rather than inventing a response that guesses at the contents!&lt;/p&gt;
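&lt;p&gt;You can see the effect of those two lines using Python's standard library &lt;code&gt;urllib.robotparser&lt;/code&gt; - a quick sketch that parses the same rules without fetching anything:&lt;/p&gt;

```python
from urllib import robotparser

# Parse the same rules The Verge serves, without making any HTTP request.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: ChatGPT-User",
    "Disallow: /",
])

# ChatGPT's browsing agent is blocked from every path...
print(rp.can_fetch("ChatGPT-User", "https://www.theverge.com/anything"))  # False
# ...while agents with no matching rule (and no "User-agent: *" entry) are allowed.
print(rp.can_fetch("SomeOtherBot", "https://www.theverge.com/anything"))  # True
```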

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://www.computerworld.com/article/2117752/google-gemini-ai.html"&gt;Gemini is the new Google+&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nilay-patel"&gt;nilay-patel&lt;/a&gt;&lt;/p&gt;



</summary><category term="robots-txt"/><category term="ai"/><category term="openai"/><category term="chatgpt"/><category term="llms"/><category term="nilay-patel"/></entry><entry><title>Quoting quora.com/robots.txt</title><link href="https://simonwillison.net/2024/Mar/19/quora-robots/#atom-tag" rel="alternate"/><published>2024-03-19T23:09:31+00:00</published><updated>2024-03-19T23:09:31+00:00</updated><id>https://simonwillison.net/2024/Mar/19/quora-robots/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.quora.com/robots.txt"&gt;&lt;p&gt;People share a lot of sensitive material on Quora - controversial political views, workplace gossip and compensation, and negative opinions held of companies. Over many years, as they change jobs or change their views, it is important that they can delete or anonymize their previously-written answers.&lt;/p&gt;
&lt;p&gt;We opt out of the wayback machine because inclusion would allow people to discover the identity of authors who had written sensitive answers publicly and later had made them anonymous, and because it would prevent authors from being able to remove their content from the internet if they change their mind about publishing it.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.quora.com/robots.txt"&gt;quora.com/robots.txt&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;



</summary><category term="internet-archive"/><category term="robots-txt"/><category term="quora"/></entry><entry><title>Weeknotes: cookiecutter templates, better plugin documentation, sqlite-generate</title><link href="https://simonwillison.net/2020/Jun/26/weeknotes-plugins-sqlite-generate/#atom-tag" rel="alternate"/><published>2020-06-26T01:39:50+00:00</published><updated>2020-06-26T01:39:50+00:00</updated><id>https://simonwillison.net/2020/Jun/26/weeknotes-plugins-sqlite-generate/#atom-tag</id><summary type="html">
    &lt;p&gt;I spent this week spreading myself between a bunch of smaller projects, and finally getting familiar with &lt;a href="https://cookiecutter.readthedocs.io/"&gt;cookiecutter&lt;/a&gt;. I wrote about &lt;a href="https://simonwillison.net/2020/Jun/20/cookiecutter-plugins/"&gt;my datasette-plugin cookiecutter template&lt;/a&gt; earlier in the week; here's what else I've been working on.&lt;/p&gt;

&lt;h4 id="sqlite-generate"&gt;sqlite-generate&lt;/h4&gt;

&lt;p&gt;Datasette is supposed to work against any SQLite database you throw at it, no matter how weird the schema or how unwieldy the shape and size of the database.&lt;/p&gt;

&lt;p&gt;I built a new tool called &lt;a href="https://github.com/simonw/sqlite-generate"&gt;sqlite-generate&lt;/a&gt; this week to help me create databases of different shapes. It's a Python command-line tool which uses &lt;a href="https://faker.readthedocs.io/"&gt;Faker&lt;/a&gt; to populate a new database with random data. You run it something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sqlite-generate demo.db \
    --tables=20 \
    --rows=100,500 \
    --columns=5,20 \
    --fks=0,3 \
    --pks=0,2 \
    --fts&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This command creates a database containing 20 tables, each with between 100 and 500 rows and 5-20 columns. Each table will also have between 0 and 3 foreign key columns to other tables, and will feature between 0 and 2 primary key columns. SQLite full-text search will be configured against all of the text columns in the table.&lt;/p&gt;
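&lt;p&gt;The core trick is small. Here's a simplified sketch of the idea using only the standard library, with random strings standing in for the Faker-generated values the real tool uses:&lt;/p&gt;

```python
import random
import sqlite3
import string

# Simplified sketch of what sqlite-generate does under the hood: create a
# table and fill it with random rows. The real tool uses Faker for
# realistic values; random lowercase strings stand in for that here.
rng = random.Random(42)

def random_text(k=8):
    return "".join(rng.choices(string.ascii_lowercase, k=k))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO demo (name, score) VALUES (?, ?)",
    [(random_text(), rng.randint(0, 100)) for _ in range(250)],
)
print(conn.execute("SELECT count(*) FROM demo").fetchone()[0])  # 250
```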

&lt;p&gt;I always try to include a live demo with any of my projects, and &lt;code&gt;sqlite-generate&lt;/code&gt; is no exception. &lt;a href="https://github.com/simonw/sqlite-generate/blob/main/.github/workflows/demo.yml"&gt;This GitHub Action&lt;/a&gt; runs on every push to main and deploys a demo to &lt;a href="https://sqlite-generate-demo.datasette.io/"&gt;https://sqlite-generate-demo.datasette.io/&lt;/a&gt; showing the latest version of the code in action.&lt;/p&gt;

&lt;p&gt;The demo runs my &lt;a href="https://github.com/simonw/datasette-search-all"&gt;datasette-search-all&lt;/a&gt; plugin in order to more easily demonstrate full-text search across all of the text columns in the generated tables. Try searching for &lt;a href="https://sqlite-generate-demo.datasette.io/-/search?q=newspaper"&gt;newspaper&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id="click-app"&gt;click-app cookiecutter template&lt;/h4&gt;

&lt;p&gt;I write quite a lot of &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; powered command-line tools like this one, so inspired by &lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt; I created a new &lt;a href="https://github.com/simonw/click-app"&gt;click-app&lt;/a&gt; cookiecutter template that bakes in my own preferences about how to set up a new Click project (complete with GitHub Actions). &lt;code&gt;sqlite-generate&lt;/code&gt; is the first tool I've built using that template.&lt;/p&gt;

&lt;h4 id="improved-plugin-docs"&gt;Improved Datasette plugin documentation&lt;/h4&gt;

&lt;p&gt;I've split Datasette's plugin documentation into five separate pages, and added a new page to the documentation about patterns for testing plugins.&lt;/p&gt;

&lt;p&gt;The five pages are:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/plugins.html"&gt;Plugins&lt;/a&gt; describing how to install and configure plugins&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/writing_plugins.html"&gt;Writing plugins&lt;/a&gt; showing how to write one-off plugins, how to use the &lt;code&gt;datasette-plugin&lt;/code&gt; cookiecutter template and how to package templates for release to PyPI&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html"&gt;Plugin hooks&lt;/a&gt; documenting all of the available plugin hooks&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/testing_plugins.html"&gt;Testing plugins&lt;/a&gt; describing my preferred patterns for writing tests for them (using &lt;a href="https://docs.pytest.org/"&gt;pytest&lt;/a&gt; and &lt;a href="https://www.python-httpx.org/"&gt;HTTPX&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;&lt;a href="https://datasette.readthedocs.io/en/latest/internals.html"&gt;Internals for plugins&lt;/a&gt; describing the APIs Datasette makes available for use within plugin hook implementations&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;There's also a &lt;a href="https://datasette.readthedocs.io/en/latest/ecosystem.html#datasette-plugins"&gt;list of available plugins&lt;/a&gt; on the Datasette Ecosystem page of the documentation, though I plan to move those to a separate plugin directory in the future.&lt;/p&gt;

&lt;h4 id="datasette-block-robots"&gt;datasette-block-robots&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt; template practically eliminates the friction involved in starting a new plugin.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sqlite-generate&lt;/code&gt; generates random names for people. I don't particularly want people who search for their own names stumbling across the live demo and being weirded out by their name featured there, so I decided to block it from search engine crawlers using &lt;code&gt;robots.txt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I wrote a tiny plugin to do this: &lt;a href="https://github.com/simonw/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt;, which uses the new &lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html#register-routes"&gt;register_routes() plugin hook&lt;/a&gt; to add a &lt;code&gt;/robots.txt&lt;/code&gt; page.&lt;/p&gt;

&lt;p&gt;It's also a neat example of the &lt;a href="https://github.com/simonw/datasette-block-robots/blob/main/datasette_block_robots/__init__.py"&gt;simplest possible plugin&lt;/a&gt; to use that feature - along with the &lt;a href="https://github.com/simonw/datasette-block-robots/blob/main/tests/test_block_robots.py"&gt;simplest possible unit test&lt;/a&gt; for exercising such a page.&lt;/p&gt;
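&lt;p&gt;The shape of the hook is easy to sketch: &lt;code&gt;register_routes()&lt;/code&gt; returns a list of (regex, view) pairs. In this illustrative stand-alone version, &lt;code&gt;dispatch()&lt;/code&gt; stands in for Datasette's own router and the view returns a plain string where the real plugin returns a Response object:&lt;/p&gt;

```python
import re

# Hedged sketch of the register_routes() idea: the hook returns a list of
# (regex, view) pairs and matching request paths are routed to the view.
def robots_txt():
    return "User-agent: *\nDisallow: /"

ROUTES = [(r"^/robots\.txt$", robots_txt)]

def dispatch(path):
    # Stand-in for Datasette's router: first matching pattern wins.
    for pattern, view in ROUTES:
        if re.match(pattern, path):
            return view()
    return None  # fall through to normal routing

print(dispatch("/robots.txt"))
```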

&lt;h4 id="datasette-saved-queries"&gt;datasette-saved-queries&lt;/h4&gt;

&lt;p&gt;Another new plugin, this time with a bit more substance to it. &lt;a href="https://github.com/simonw/datasette-saved-queries"&gt;datasette-saved-queries&lt;/a&gt; exercises the new &lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html#canned-queries-datasette-database-actor"&gt;canned_queries()&lt;/a&gt; hook I &lt;a href="https://simonwillison.net/2020/Jun/19/datasette-alphas/"&gt;described last week&lt;/a&gt;. It uses the new &lt;a href="https://datasette.readthedocs.io/en/latest/plugin_hooks.html#startup-datasette"&gt;startup()&lt;/a&gt; hook to create tables on startup (if they are missing), then lets users insert records into those tables to save their own queries. Queries saved in this way are then returned as canned queries for that particular database.&lt;/p&gt;
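&lt;p&gt;Here's a rough sketch of that pattern using plain &lt;code&gt;sqlite3&lt;/code&gt; - a startup step creates the table if it's missing, then the canned-queries step reads saved SQL back out (the table and column names here are illustrative, not necessarily the plugin's actual schema):&lt;/p&gt;

```python
import sqlite3

# Startup step: create the backing table if it does not exist yet.
def startup(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saved_queries (name TEXT PRIMARY KEY, sql TEXT)"
    )

# Canned-queries step: read every saved query back as a name -> SQL mapping.
def canned_queries(conn):
    return {
        name: {"sql": sql}
        for name, sql in conn.execute("SELECT name, sql FROM saved_queries")
    }

conn = sqlite3.connect(":memory:")
startup(conn)
conn.execute(
    "INSERT INTO saved_queries VALUES (?, ?)",
    ("count_rows", "select count(*) from sqlite_master"),
)
print(canned_queries(conn))  # {'count_rows': {'sql': 'select count(*) from sqlite_master'}}
```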

&lt;h4 id="main-not-master"&gt;main, not master&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;main&lt;/code&gt; is a better name for the main GitHub branch than &lt;code&gt;master&lt;/code&gt;, which has unpleasant connotations (it apparently derives from master/slave in BitKeeper). My &lt;code&gt;datasette-plugin&lt;/code&gt; and &lt;code&gt;click-app&lt;/code&gt; cookiecutter templates both include instructions for renaming &lt;code&gt;master&lt;/code&gt; to &lt;code&gt;main&lt;/code&gt; in their READMEs - it's as easy as running &lt;code&gt;git branch -m master main&lt;/code&gt; before running your first push to GitHub.&lt;/p&gt;

&lt;p&gt;I'm working towards &lt;a href="https://github.com/simonw/datasette/issues/849"&gt;making the switch&lt;/a&gt; for Datasette itself.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/git"&gt;git&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cookiecutter"&gt;cookiecutter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="git"/><category term="plugins"/><category term="projects"/><category term="robots-txt"/><category term="sqlite"/><category term="datasette"/><category term="weeknotes"/><category term="cookiecutter"/></entry><entry><title>datasette-block-robots</title><link href="https://simonwillison.net/2020/Jun/23/datasette-block-robots/#atom-tag" rel="alternate"/><published>2020-06-23T03:28:00+00:00</published><updated>2020-06-23T03:28:00+00:00</updated><id>https://simonwillison.net/2020/Jun/23/datasette-block-robots/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Another little Datasette plugin: this one adds a &lt;code&gt;/robots.txt&lt;/code&gt; page with &lt;code&gt;Disallow: /&lt;/code&gt; to block all indexing of a Datasette instance by respectable search engine crawlers. I built this in less than ten minutes from idea to deploy to PyPI thanks to the &lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt; cookiecutter template.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/crawling"&gt;crawling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="crawling"/><category term="plugins"/><category term="projects"/><category term="robots-txt"/><category term="seo"/><category term="datasette"/></entry><entry><title>RFC5785: Defining Well-Known Uniform Resource Identifiers</title><link href="https://simonwillison.net/2010/Apr/11/rfc/#atom-tag" rel="alternate"/><published>2010-04-11T19:32:28+00:00</published><updated>2010-04-11T19:32:28+00:00</updated><id>https://simonwillison.net/2010/Apr/11/rfc/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.rfc-editor.org/rfc/rfc5785.txt"&gt;RFC5785: Defining Well-Known Uniform Resource Identifiers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Sounds like a very good idea to me: defining a common prefix of /.well-known/ for well-known URLs (common metadata like robots.txt) and establishing a registry for all such files. OAuth, OpenID and other decentralised identity systems can all benefit from this.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://www.mnot.net/blog/2010/04/07/well-known"&gt;Mark Nottingham&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/oauth"&gt;oauth&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openid"&gt;openid&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rfc"&gt;rfc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/urls"&gt;urls&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/wellknownurls"&gt;wellknownurls&lt;/a&gt;&lt;/p&gt;



</summary><category term="oauth"/><category term="openid"/><category term="rfc"/><category term="robots-txt"/><category term="urls"/><category term="wellknownurls"/></entry><entry><title>The X-Robots-Tag HTTP header</title><link href="https://simonwillison.net/2008/Jun/9/official/#atom-tag" rel="alternate"/><published>2008-06-09T09:21:24+00:00</published><updated>2008-06-09T09:21:24+00:00</updated><id>https://simonwillison.net/2008/Jun/9/official/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://googleblog.blogspot.com/2007/07/robots-exclusion-protocol-now-with-even.html"&gt;The X-Robots-Tag HTTP header&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
News to me, but both Google and Yahoo! have supported it since last year. You can add per-page robots exclusion rules in HTTP headers instead of using meta tags, and Google’s version supports &lt;code&gt;unavailable_after&lt;/code&gt;, which is handy for content with a known limited shelf-life.
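&lt;p&gt;Setting it is just one extra response header - a sketch with an illustrative date (Google documents several accepted date formats for &lt;code&gt;unavailable_after&lt;/code&gt;):&lt;/p&gt;

```python
# Sketch: the same policy you'd put in a robots meta tag, expressed as an
# HTTP response header instead. The date below is illustrative.
deadline = "25 Jun 2010 15:00:00 GMT"
headers = [("X-Robots-Tag", f"unavailable_after: {deadline}")]
print(headers[0])
```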


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xrobotstag"&gt;xrobotstag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yahoo"&gt;yahoo&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="http"/><category term="robots-txt"/><category term="xrobotstag"/><category term="yahoo"/></entry><entry><title>robots.txt Adventure</title><link href="https://simonwillison.net/2007/Sep/22/nextthingorg/#atom-tag" rel="alternate"/><published>2007-09-22T00:36:17+00:00</published><updated>2007-09-22T00:36:17+00:00</updated><id>https://simonwillison.net/2007/Sep/22/nextthingorg/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.nextthing.org/archives/2007/03/12/robotstxt-adventure"&gt;robots.txt Adventure&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Interesting notes from crawling 4.6 million robots.txt files, including 69 different ways in which the word “disallow” can be misspelled.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andrew-wooster"&gt;andrew-wooster&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crawling"&gt;crawling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;&lt;/p&gt;



</summary><category term="andrew-wooster"/><category term="crawling"/><category term="robots-txt"/></entry><entry><title>New anti-comment-spam measure</title><link href="https://simonwillison.net/2003/Oct/13/linkRedirects/#atom-tag" rel="alternate"/><published>2003-10-13T08:22:09+00:00</published><updated>2003-10-13T08:22:09+00:00</updated><id>https://simonwillison.net/2003/Oct/13/linkRedirects/#atom-tag</id><summary type="html">
    &lt;p&gt;I've added a new anti-comment-spam measure to this site. The majority of comment spam exists for one reason and one reason only: to increase the Google PageRank of the site linked from the spam, and specifically to increase its ranking for the term used in the link. This is why so many comment spams include links like this: &lt;a href="http://jeremy.zawodny.com/blog/archives/001002.html"&gt;Cheap Viagra&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cut off the PageRank boost and you cut off the advantage of spamming, simple as that. I've altered my comments system to redirect ALL outgoing links through a simple redirect script, and added that script to &lt;a href="/robots.txt"&gt;my robots.txt file&lt;/a&gt;. Links still work fine (even the referral information persists across the redirect) but Google will ignore them completely when calculating PageRank.&lt;/p&gt;

&lt;p&gt;Will this reduce the floods of comment spam my site receives? Probably not; I've added a note about the restriction to my 'add comment' form, but I doubt many spammers bother to read much about the sites they are targeting. What's really needed is for this technique to become widespread by being integrated into existing blogging tools - are you listening, Movable Type hackers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Sencer has &lt;a href="http://www.sencer.de/index.php?p=81" title="Google Comment Spammers, Redirects and PR"&gt;pointed out&lt;/a&gt; in the comments that PageRank persists over redirects, and Google appears to ignore robots.txt when used to hide a redirecting page. I've updated my redirection script to use JavaScript to power the redirect (with a link for people with JavaScript disabled) and an extra meta tag to remind Google not to follow the link. This has the unfortunate side effect that referral information no longer persists across the redirect.&lt;/p&gt;
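&lt;p&gt;A rough reconstruction of what such a redirect page might look like - a JavaScript redirect, a plain link fallback, and a robots meta tag (this is illustrative markup, not the site's actual script):&lt;/p&gt;

```python
# Hedged sketch of the redirect page described above: JavaScript performs
# the redirect, a plain link serves browsers with JavaScript disabled, and
# the robots meta tag tells crawlers not to index or follow the page.
def redirect_page(url):
    return f"""<!DOCTYPE html>
<html><head>
<meta name="robots" content="noindex, nofollow">
<script>window.location.href = {url!r};</script>
</head><body>
<a href="{url}">Continue to {url}</a>
</body></html>"""

page = redirect_page("http://example.com/")
print("noindex" in page)  # True
```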
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/spam"&gt;spam&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="robots-txt"/><category term="spam"/></entry><entry><title>How the RIAA was hacked</title><link href="https://simonwillison.net/2002/Sep/23/howTheRiaaWasHacked/#atom-tag" rel="alternate"/><published>2002-09-23T19:02:56+00:00</published><updated>2002-09-23T19:02:56+00:00</updated><id>https://simonwillison.net/2002/Sep/23/howTheRiaaWasHacked/#atom-tag</id><summary type="html">
    &lt;p&gt;The Register: &lt;a href="http://www.theregister.co.uk/content/6/27230.html"&gt;Want to know how RIAA.org was hacked?&lt;/a&gt; They had an un-password-protected admin panel listed in their &lt;code&gt;robots.txt&lt;/code&gt; file. Muppets.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="robots-txt"/><category term="security"/></entry><entry><title>Stupid Danish newspapers</title><link href="https://simonwillison.net/2002/Jul/5/stupidDanishNewspapers/#atom-tag" rel="alternate"/><published>2002-07-05T17:24:24+00:00</published><updated>2002-07-05T17:24:24+00:00</updated><id>https://simonwillison.net/2002/Jul/5/stupidDanishNewspapers/#atom-tag</id><summary type="html">
    &lt;p&gt;More deep linking stupidity (via &lt;a href="http://scriptingnews.userland.com/backissues/2002/07/05#When:9:03:22AM"&gt;Scripting News&lt;/a&gt;). A judge in Denmark has &lt;a href="http://www.newsbooster.com/?pg=lost&amp;amp;lan=eng"&gt;ruled in favour&lt;/a&gt; of a newspaper that took a search engine to court over "deep linking", despite the search engine's spider following the &lt;code&gt;robots.txt&lt;/code&gt; standard (it seems the newspaper didn't bother to implement a &lt;code&gt;robots.txt&lt;/code&gt; file). Dave Winer summed things up perfectly:&lt;/p&gt;
&lt;blockquote cite="http://www.newsbooster.com/?pg=lost&amp;amp;lan=eng"&gt;&lt;p&gt;BTW, deep linking is an oxymoron. There's only one kind of linking on the Web. Why would you ever point to the home page of a news oriented site.&lt;/p&gt;&lt;/blockquote&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/dave-winer"&gt;dave-winer&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/denmark"&gt;denmark&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/linking"&gt;linking&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stupid"&gt;stupid&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="dave-winer"/><category term="denmark"/><category term="linking"/><category term="robots-txt"/><category term="stupid"/></entry></feed>