April 2024
147 posts: 5 entries, 59 links, 26 quotes, 57 beats
April 15, 2024
April 16, 2024
inline-snapshot. I'm a big fan of snapshot testing, where expected values are captured the first time a test suite runs and then asserted against in future runs. It's a very productive way to build a robust test suite.
inline-snapshot by Frank Hoffmann is a particularly neat implementation of the pattern. It defines a snapshot() function which you can use in your tests:
assert 1548 * 18489 == snapshot()
When you run that test using pytest --inline-snapshot=create the snapshot() function will be replaced in your code (using AST manipulation) with itself wrapping the repr() of the expected result:
assert 1548 * 18489 == snapshot(28620972)
If you modify the code and need to update the tests you can run pytest --inline-snapshot=fix to regenerate the recorded snapshot values.
Permissions have three moving parts, who wants to do it, what do they want to do, and on what object. Any good permission system has to be able to efficiently answer any permutation of those variables. Given this person and this object, what can they do? Given this object and this action, who can do it? Given this person and this action, which objects can they act upon?
Google NotebookLM Data Exfiltration (via) NotebookLM is a Google Labs product that lets you store information as sources (mainly text files in PDF) and then ask questions against those sources—effectively an interface for building your own custom RAG (Retrieval Augmented Generation) chatbots.
Unsurprisingly for anything that allows LLMs to interact with untrusted documents, it’s susceptible to prompt injection.
Johann Rehberger found some classic prompt injection exfiltration attacks: you can create source documents with instructions that cause the chatbot to load a Markdown image that leaks other private data to an external domain as data passed in the query string.
Johann reported this privately in the December but the problem has not yet been addressed. UPDATE: The NotebookLM team deployed a fix for this on 18th April.
A good rule of thumb is that any time you let LLMs see untrusted tokens there is a risk of an attack like this, so you should be very careful to avoid exfiltration vectors like Markdown images or even outbound links.
The saddest part about it, though, is that the garbage books don’t actually make that much money either. It’s even possible to lose money generating your low-quality ebook to sell on Kindle for $0.99. The way people make money these days is by teaching students the process of making a garbage ebook. It’s grift and garbage all the way down — and the people who ultimately lose out are the readers and writers who love books.
April 17, 2024
Scammers are targeting teenage boys on social media—and driving some to suicide. (via) Horrifying in depth report describing sextortion scams: a scammer tricks a teenage boy into sending them reciprocal nude photos, then instantly starts blackmailing them by threatening to forward those photos to their friends and family members. Most online scams take weeks or even months to play out—these scams can turn to blackmail within minutes.
But the reality is that you can't build a hundred-billion-dollar industry around a technology that's kind of useful, mostly in mundane ways, and that boasts perhaps small increases in productivity if and only if the people who use it fully understand its limitations.
AI for Data Journalism: demonstrating what we can do with this stuff right now
I gave a talk last month at the Story Discovery at Scale data journalism conference hosted at Stanford by Big Local News. My brief was to go deep into the things we can use Large Language Models for right now, illustrated by a flurry of demos to help provide starting points for further conversations at the conference.
[... 6,081 words]April 18, 2024
In mid-March, we added this line to our system prompt to prevent Claude from thinking it can open URLs:
It cannot open URLs, links, or videos, so if it seems as though the interlocutor is expecting Claude to do so, it clarifies the situation and asks the human to paste the relevant text or image content directly into the conversation.
— Alex Albert, Anthropic
mistralai/mistral-common. New from Mistral: mistral-common, an open source Python library providing "a set of tools to help you work with Mistral models".
So far that means a tokenizer! This is similar to OpenAI's tiktoken library in that it lets you run tokenization in your own code, which crucially means you can count the number of tokens that you are about to use - useful for cost estimates but also for cramming the maximum allowed tokens in the context window for things like RAG.
Mistral's library is better than tiktoken though, in that it also includes logic for correctly calculating the tokens needed for conversation construction and tool definition. With OpenAI's APIs you're currently left guessing how many tokens are taken up by these advanced features.
Anthropic haven't published any form of tokenizer at all - it's the feature I'd most like to see from them next.
Here's how to explore the vocabulary of the tokenizer:
MistralTokenizer.from_model(
"open-mixtral-8x22b"
).instruct_tokenizer.tokenizer.vocab()[:12]
['<unk>', '<s>', '</s>', '[INST]', '[/INST]', '[TOOL_CALLS]', '[AVAILABLE_TOOLS]', '[/AVAILABLE_TOOLS]', '[TOOL_RESULTS]', '[/TOOL_RESULTS]']
llm-reka.
My new plugin for running LLM prompts against the Reka family of API hosted LLM models: reka-core ($10 per million input), reka-flash (80c per million) and reka-edge (40c per million).
All three of those models are trained from scratch by a team that includes several Google Brain alumni.
Reka Core is their most powerful model, released on Monday 15th April and claiming benchmark scores competitive with GPT-4 and Claude 3 Opus.
I have a child who is also 2e and has been part of the NYC G&T program. We've had a positive experience with the citywide program, specifically with the program at The Anderson School.
— Meta AI bot, answering a question on a forum
How cheap, outsourced labour in Africa is shaping AI English. The word “delve” has been getting a lot of attention recently as an example of something that might be an indicator of ChatGPT generated content.
One example: articles on medical research site PubMed now use “delve” 10 to 100 times more than a few years ago!
Nigerian Twitter took offense recently to Paul Graham’s suggestion that “delve” is a sign of bad writing. It turns out Nigerian formal writing has a subtly different vocabulary.
Alex Hern theorizes that the underlying cause may be related. Companies like OpenAI frequently outsource data annotation to countries like Nigeria that have excellent English skills and low wages. RLHF (reinforcement learning from human feedback) involves annotators comparing and voting on the “best” responses from the models.
Are they teaching models to favour Nigerian-English? It’s a pretty solid theory!
Andrej Karpathy’s Llama 3 review. The most interesting coverage I’ve seen so far of Meta’s Llama 3 models (8b and 70b so far, 400b promised later).
Andrej notes that Llama 3 trained on 15 trillion tokens—up from 2 trillion for Llama 2—and they used that many even for the smaller 8b model, 75x more than the chinchilla scaling laws would suggest.
The tokenizer has also changed—they now use 128,000 tokens, up from 32,000. This results in a 15% drop in the tokens needed to represent a string of text.
The one disappointment is the context length—just 8,192, 2x that of Llama 2 and 4x LLaMA 1 but still pretty small by today’s standards.
If early indications hold, the 400b model could be the first genuinely GPT-4 class openly licensed model. We’ll have to wait and see.
April 19, 2024
A POI Database in One Line (via) Overture maps offer an extraordinarily useful freely licensed databases of POI (point of interest) listings, principally derived from partners such as Facebook and including restaurants, shops, museums and other locations from all around the world.
Their new "overturemaps" Python CLI utility makes it easy to quickly pull subsets of their data... but requires you to provide a bounding box to do so.
Drew Breunig came up with this delightful recipe for fetching data using LLM and gpt-3.5-turbo to fill in those bounding boxes:
overturemaps download --bbox=$(llm 'Give me a bounding box for Alameda, California expressed as only four numbers delineated by commas, with no spaces, longitude preceding latitude.') -f geojsonseq --type=place | geojson-to-sqlite alameda.db places - --nl --pk=id
Ruff v0.4.0: a hand-written recursive descent parser for Python. The latest release of Ruff—a Python linter and formatter, written in Rust—includes a complete rewrite of the core parser. Previously Ruff used a parser borrowed from RustPython, generated using the LALRPOP parser generator. Victor Hugo Gomes contributed a new parser written from scratch, which provided a 2x speedup and also added error recovery, allowing parsing of invalid Python—super-useful for a linter.
I tried Ruff 0.4.0 just now against Datasette—a reasonably large Python project—and it ran in less than 1/10th of a second. This thing is Fast.
April 20, 2024
llm-gpt4all. New release of my LLM plugin which builds on Nomic's excellent gpt4all Python library. I've upgraded to their latest version which adds support for Llama 3 8B Instruct, so after a 4.4GB model download this works:
llm -m Meta-Llama-3-8B-Instruct "say hi in Spanish"
Tips on Adding JSON Output to Your CLI App
(via)
Kelly Brazil - also the author of jc, the neat CLI tool that converts the output of common Unix utilities such as dig into JSON - provides some useful do's and don'ts for adding JSON output as an option to a command-line tool.
Kelly recommends defaulting to arrays of flat objects - or newline-delimited objects - and suggests including an "unbuffer" option for streaming tools that discourages the OS from buffering output that is being sent through a pipe.
