Blogmarks
Filters: Sorted by date
Inside the secret list of websites that make AI chatbots sound smart. Washington Post story digging into the C4 dataset—Colossal Clean Crawled Corpus, a filtered version of Common Crawl that’s often used for training large language models. They include a neat interactive tool for searching a domain to see if it’s included—TIL that simonwillison.net is the 106,649th ranked site in C4 by number of tokens, 189,767 total—0.0001% of the total token volume in C4.
LLaVA: Large Language and Vision Assistant (via) Yet another multi-modal model combining a vision model (pre-trained CLIP ViT-L/14) and a LLaMA derivative model (Vicuna). The results I get from their demo are even more impressive than MiniGPT-4. Also includes a new training dataset, LLaVA-Instruct-150K, derived from GPT-4 and subject to the same warnings about the OpenAI terms of service.
RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens. With the amount of projects that have used LLaMA as a foundation model since its release two months ago—despite its non-commercial license—it’s clear that there is a strong desire for a fully openly licensed alternative.
RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute aiming to build exactly that.
Step one is gathering the training data: the LLaMA paper described a 1.2 trillion token training set gathered from sources that included Wikipedia, Common Crawl, GitHub, arXiv, Stack Exchange and more.
RedPajama-Data-1T is an attempt at recreating that training set. It’s now available to download, as 2,084 separate multi-GB jsonl files—2.67TB total.
Even without a trained model, this is a hugely influential contribution to the world of open source LLMs. Any team looking to build their own LLaMA from scratch can now jump straight to the next stage, training the model.
Latest Twitter search results for “as an AI language model” (via) Searching for “as an AI language model” on Twitter reveals hundreds of bot accounts which are clearly being driven by GPT models and have been asked to generate content which occasionally trips the ethical guidelines trained into the OpenAI models.
If Twitter still had an affordable search API someone could do some incredible disinformation research on top of this, looking at which accounts are implicated, what kinds of things they are tweeting about, who they follow and retweet and so-on.
MiniGPT-4 (via) An incredible project with a poorly chosen name. A team from King Abdullah University of Science and Technology in Saudi Arabia combined Vicuna-13B (a model fine-tuned on top of Facebook’s LLaMA) with the BLIP-2 vision-language model to create a model that can conduct ChatGPT-style conversations around an uploaded image. The demo is very impressive, and the weights are available to download—45MB for MiniGPT-4, but you’ll need the much larger Vicuna and LLaMA weights as well.
How I Used Stable Diffusion and Dreambooth to Create A Painted Portrait of My Dog (via) I like posts like this that go into detail in terms of how much work it takes to deliberately get the kind of result you really want using generative AI tools. Jake Dahn trained a Dreambooth model from 40 photos of Queso—his photogenic Golden Retriever—using Replicate, then gathered the prompts from ten images he liked on Lexica and generated over 1,000 different candidate images, picked his favourite, used Draw Things img2img resizing to expand the image beyond the initial crop, then Automatic1111 inpainting to tweak the ears, then Real-ESRGAN 4x+ to upscale for the final print.
codespaces-jupyter (via) This is really neat. Click “Use this template” -> “Open in a codespace” and you get a full in-browser VS Code interface where you can open existing notebook files (or create new ones) and start playing with them straight away.
New prompt injection attack on ChatGPT web version. Markdown images can steal your chat data. An ingenious new prompt injection / data exfiltration vector from Roman Samoilenko, based on the observation that ChatGPT can render markdown images in a way that can exfiltrate data to the image hosting server by embedding it in the image URL. Roman uses a single pixel image for that, and combines it with a trick where copy events on a website are intercepted and prompt injection instructions are appended to the copied text, in order to trick the user into pasting the injection attack directly into ChatGPT.
Update: They finally started mitigating this in December 2023.
Building LLM applications for production. Chip Huyen provides a useful, in-depth review of the challenges involved in taking an app built on top of a LLM from prototype to production, including issues such as prompt ambiguity and unpredictability, cost and latency concerns, challenges in testing and updating to new models. She also lists some promising use-cases she’s seeing for categories of application built on these tools.
The Great Flowering: Why OpenAI is the new AWS and the New Kingmakers still matter (via) James Governor discusses the potential impact of AI-assisted productivity on the wider software engineering industry, and calls me “a bellwether”!
GitHub Accelerator: our first cohort. I’m participating in the first cohort of GitHub’s new open source accelerator program, with Datasette (and related projects). It’s a 10 week program with 20 projects working together “with an end goal of building durable streams of funding for their work”.
Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM (via) Databricks released a large language model called Dolly a few weeks ago. They just released Dolly 2.0 and it is MUCH more interesting—it’s an instruction tuned 12B parameter upgrade of EleutherAI’s Pythia model. Unlike other recent instruction tuned models Databricks didn’t use a training set derived from GPT-3—instead, they recruited 5,000 employees to help put together 15,000 human-generated request/response pairs, which they have released under a Creative Commons Attribution-ShareAlike license. The model itself is a 24GB download from Hugging Face—I’ve run it slowly on a small GPU-enabled Paperspace instance, but hopefully optimized ways to run it will emerge in short order.
Replacing my best friends with an LLM trained on 500,000 group chat messages (via) Izzy Miller used a 7 year long group text conversation with five friends from college to fine-tune LLaMA, such that it could simulate ongoing conversations. They started by extracting the messages from the iMessage SQLite database on their Mac, then generated a new training set from those messages and ran it using code from the Stanford Alpaca repository. This is genuinely one of the clearest explanations of the process of fine-tuning a model like this I’ve seen anywhere.
Sheepy-T—an LLM running on an iPhone. Kevin Kwok has a video on Twitter demonstrating Sheepy-T—his iPhone app which runs a full instruction-tuned large language model, based on EleutherAI’s GPT-J, entirely on an iPhone 14. I applied for the TestFlight beta and I have this running on my phone now: it works!
How we’re building a browser when it’s supposed to be impossible (via) Andreas Kling: “The ECMAScript, HTML, and CSS specifications today are (for the most part) stellar technical documents whose algorithms can be implemented with considerably less effort and guesswork than in the past.” The Ladybird project is such an inspiration, and really demonstrates the enormous value of the work put in by web standards spec authors over the last twenty years.
The AI singularity is here. Can’t say I’m a fan of the headline, but the subhead “The time to figure out how to use generative AI and large language models in your code is now” is much more illustrative of the story. I’m referred to in this one as “One of the most outspoken advocates for LLM-enhanced development” which is a bit of a surprise!
AI is flooding the workplace, and workers love it. The microwave kiln pottery project I helped Natalie with gets a mention in this story about people who are putting AI tools to use.
Floor796 (via) “An ever-expanding animation scene showing the life of the 796th floor of the huge space station” by Russian artist 0x00, who built their own custom browser-based pixel animation tool with which they are constructing this project. Absolutely crammed with pop culture references and easter eggs. The “Changes” link at the top shows almost daily updates, with links to jump to the latest content.
On Endings: Why & How We Retired Elm at Culture Amp (via) Culture Amp made extensive use of Elm—a ML-like functional language that compiles to JavaScript—between 2016 and 2020 while building their company’s frontend. They eventually decided to move away from it, for reasons described at length in this post primarily relating to its integration with React. This piece is worth reading mainly as a thoughtful approach to engineering management challenge of deprecating a well-loved piece of technology from the recommended stack at a company.
Database “sharding” came from UO? (via) Raph Koster coined the term “shard” back in 1996 in a design document proposing a way of scaling Ultima Online: “[...] we realized we would need to run multiple whole copies of Ultima Online for users to connect to, we needed to come up with a fiction for it. [...] the evil wizard Mondain had attempted to gain control over Sosaria by trapping its essence in a crystal. When the Stranger at the end of Ultima I defeated Mondain and shattered the crystal, the crystal shards each held a refracted copy of Sosaria.”
Why ChatGPT and Bing Chat are so good at making things up. I helped review this deep dive by Benj Edwards for Ars Technica into the hallucination/confabulation problem with ChatGPT and other LLMs, which is attracting increasing attention thanks to stories like the recent defamation complaints against ChatGPT. This article explains why this is happening and talks to various experts about potential solutions.
The different uses of Python type hints (via) Luke Plant describes five different categories for how Python optional types are being used today: IDE assistants, type checking, runtime behavior changes via introspection (e.g. Pydantic), code documentation, compiler instructions (ala mypyc)—and a bonus sixth, dependency injection.
image-to-jpeg (via) I built a little JavaScript app that accepts an image, then displays that image as a JPEG with a slider to control the quality setting, plus a copy and paste textarea to copy out that image with a data-uri. I didn't actually write a single line of code for this: I got ChatGPT/GPT-4 to generate the entire thing with some prompts.
Here's the full transcript.
Blinded by Analogies (via) Ethan Mollick discusses how many of the analogies we have for AI right now are hurting rather than helping our understanding, particularly with respect to LLMs.
Eight Things to Know about Large Language Models (via) This unpublished paper by Samuel R. Bowman is succinct, readable and dense with valuable information to help understand the field of modern LLMs.
From Deep Learning Foundations to Stable Diffusion. Brand new free online video course from Jeremy Howard: 30 hours of content, covering everything you need to know to implement the Stable Diffusion image generation algorithm from scratch. I previewed parts of this course back in December and it was fascinating: this field is moving so fast that some of the lectures covered papers that had been released just a few days before.
Guess we could start calling this a ’hallucitation’? Kate Crawford coins an excellent neologism for hallucinated citations in LLMs like ChatGPT.
trurl manipulates URLs.
Brand new command-line tool from curl creator Daniel Stenberg: The tr stands for translate or transpose, and the tool provides various mechanisms for normalizing URLs, adding query strings, changing the path or hostname and other similar modifications. I’ve tried designing APis for this kind of thing in the past—Datasette includes some clumsily named functions such as path_with_removed_args()—and it’s a deceptively deep set of problems.
.
ROOTS search tool (via) BLOOM is one of the most interesting completely openly licensed language models. The ROOTS corpus is the training data that was collected for it, and this tool lets you run searches directly against that corpus. I tried searching for my own name and got an interesting insight into what it knows about me.
Closed AI Models Make Bad Baselines (via) The NLP academic research community are facing a tough challenge: the state-of-the-art in large language models, GPT-4, is entirely closed which means papers that compare it to other models lack replicability and credibility. “We make the case that as far as research and scientific publications are concerned, the “closed” models (as defined below) cannot be meaningfully studied, and they should not become a “universal baseline”, the way BERT was for some time widely considered to be.”
Anna Rogers proposes a new rule for this kind of research: “That which is not open and reasonably reproducible cannot be considered a requisite baseline.”