Simon Willison’s Weblog

Subscribe

Sunday, 16th June 2024

We're adding the human touch, but that often requires a deep, developmental edit on a piece of writing. The grammar and word choice just sound weird. You're always cutting out flowery words like 'therefore' and 'nevertheless' that don't fit in casual writing. Plus, you have to fact-check the whole thing because AI just makes things up, which takes forever because it's not just big ideas. AI hallucinates these flippant little things in throwaway lines that you'd never notice. [...]

It's tedious, horrible work, and they pay you next to nothing for it.

Catrina Cowart # 8:47 pm

Jina AI Reader. Jina AI provide a number of different AI-related platform products, including an excellent family of embedding models, but one of their most instantly useful is Jina Reader, an API for turning any URL into Markdown content suitable for piping into an LLM.

Add r.jina.ai to the front of a URL to get back Markdown of that page, for example https://r.jina.ai/https://simonwillison.net/2024/Jun/16/jina-ai-reader/ - in addition to converting the content to Markdown it also does a decent job of extracting just the content and ignoring the surrounding navigation.

The API is free but rate-limited (presumably by IP) to 20 requests per minute without an API key or 200 request per minute with a free API key, and you can pay to increase your allowance beyond that.

The Apache 2 licensed source code for the hosted service is on GitHub - it's written in TypeScript and uses Puppeteer to run Readabiliy.js and Turndown against the scraped page.

It can also handle PDFs, which have their contents extracted using PDF.js.

There's also a search feature, s.jina.ai/search+term+goes+here, which uses the Brave Search API. # 7:33 pm

GitHub Copilot Chat: From Prompt Injection to Data Exfiltration (via) Yet another example of the same vulnerability we see time and time again.

If you build an LLM-based chat interface that gets exposed to both private and untrusted data (in this case the code in VS Code that Copilot Chat can see) and your chat interface supports Markdown images, you have a data exfiltration prompt injection vulnerability.

The fix, applied by GitHub here, is to disable Markdown image references to untrusted domains. That way an attack can't trick your chatbot into embedding an image that leaks private data in the URL.

Previous examples: ChatGPT itself, Google Bard, Writer.com, Amazon Q, Google NotebookLM. I'm tracking them here using my new markdownexfiltration tag. # 12:35 am