Simon Willison's Weblog: machine-learning

Bridging Language Gaps in Multilingual Embeddings via Contrastive Learning

2024-10-10T16:00:35+00:00

Most text embedding models suffer from a "language gap", where phrases in different languages with the same semantic meaning end up with embedding vectors that aren't clustered together.

Jina claim their new jina-embeddings-v3 (CC BY-NC 4.0, which means you need to license it for commercial use if you're not using their API) is much better on this front, thanks to a training technique called "contrastive learning".

There are 30 languages represented in our contrastive learning dataset, but 97% of pairs and triplets are in just one language, with only 3% involving cross-language pairs or triplets. But this 3% is enough to produce a dramatic result: Embeddings show very little language clustering and semantically similar texts produce close embeddings regardless of their language
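To make the idea concrete, here is a minimal sketch of one common contrastive objective (an InfoNCE-style loss) on made-up three-dimensional vectors. The vectors, the temperature value and the function names are all assumptions for illustration; the source does not say this is the exact loss Jina used. The point is only that a translation pair embedded far apart yields a high loss, which is the signal that pulls cross-language pairs together during training.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when the anchor is much closer to the
    positive than to any of the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings (invented numbers, not output from any real model)
english = [0.9, 0.1, 0.0]        # an English sentence
german_close = [0.8, 0.2, 0.1]   # its translation, embedded nearby
german_far = [0.1, 0.2, 0.9]     # its translation, stuck in a "language cluster"
unrelated = [0.0, 1.0, 0.1]      # a sentence with a different meaning

loss_close = info_nce(english, german_close, [unrelated])
loss_far = info_nce(english, german_far, [unrelated])
print(f"loss when translation is close: {loss_close:.4f}")
print(f"loss when translation is far:   {loss_far:.4f}")
```

The second loss is much larger than the first, so gradient descent on this objective would move the far-away translation towards its English counterpart.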

Via @JinaAI_

Tags: jina, ai, embeddings, machine-learning

Quoting Nicholas Carlini

2024-09-18T18:52:56+00:00

The problem that you face is that it's relatively easy to take a model and make it look like it's aligned. You ask GPT-4, “how do I end all of humans?” And the model says, “I can't possibly help you with that”. But there are a million and one ways to take the exact same question - pick your favorite - and you can make the model still answer the question even though initially it would have refused. And the question this reminds me a lot of coming from adversarial machine learning. We have a very simple objective: Classify the image correctly according to the original label. And yet, despite the fact that it was essentially trivial to find all of the bugs in principle, the community had a very hard time coming up with actually effective defenses. We wrote like over 9,000 papers in ten years, and have made very very very limited progress on this one small problem. You all have a harder problem and maybe less time.

— Nicholas Carlini

Tags: machine-learning, ai, jailbreak, security

State-of-the-art music scanning by Soundslice

2024-06-20T04:37:28+00:00

It's been a while since I checked in on Soundslice, Adrian Holovaty's beautiful web application focused on music education.

The latest feature is spectacular. The Soundslice music editor - already one of the most impressive web applications I've ever experienced - can now import notation directly from scans or photos of sheet music.

The attention to detail is immaculate. The custom machine learning model can handle a wide variety of notation details, and the system asks the user to verify or correct details that it couldn't perfectly determine using a neatly designed flow.

Free accounts can scan two single-page documents a month, and paid plans get a much higher allowance. I tried it out just now on a low-resolution image I found on Wikipedia and it did a fantastic job, even allowing me to listen to a simulated piano rendition of the music once it had finished processing.

It's worth spending some time with the release notes for the feature to appreciate how much work they've put into improving it since the initial release.

If you're new to Soundslice, here's an example of their core player interface which syncs the display of music notation to an accompanying video.

Adrian wrote up some detailed notes on the machine learning behind the feature when they first launched it in beta back in November 2022.

OMR [Optical Music Recognition] is an inherently hard problem, significantly more difficult than text OCR. For one, music symbols have complex spatial relationships, and mistakes have a tendency to cascade. A single misdetected key signature might result in multiple incorrect note pitches. And there’s a wide diversity of symbols, each with its own behavior and semantics — meaning the problems and subproblems aren’t just hard, there are many of them.

Tags: adrian-holovaty, music, machine-learning, ai, ocr

Quoting Eric Lehman

2024-02-11T22:59:38+00:00

One consideration is that such a deep ML system could well be developed outside of Google-- at Microsoft, Baidu, Yandex, Amazon, Apple, or even a startup. My impression is that the Translate team experienced this. Deep ML reset the translation game; past advantages were sort of wiped out. Fortunately, Google's huge investment in deep ML largely paid off, and we excelled in this new game. Nevertheless, our new ML-based translator was still beaten on benchmarks by a small startup. The risk that Google could similarly be beaten in relevance by another company is highlighted by a startling conclusion from BERT: huge amounts of user feedback can be largely replaced by unsupervised learning from raw text. That could have heavy implications for Google.

— Eric Lehman, internal Google email in 2018

Tags: machine-learning, translation, google, generative-ai, ai, llms

Quoting Daniel Situnayake

2024-01-16T18:49:03+00:00

You likely have a TinyML system in your pocket right now: every cellphone has a low power DSP chip running a deep learning model for keyword spotting, so you can say "Hey Google" or "Hey Siri" and have it wake up on-demand without draining your battery. It’s an increasingly pervasive technology. [...]

It’s astonishing what is possible today: real time computer vision on microcontrollers, on-device speech transcription, denoising and upscaling of digital signals. Generative AI is happening, too, assuming you can find a way to squeeze your models down to size. We are an unsexy field compared to our hype-fueled neighbors, but the entire world is already filling up with this stuff and it’s only the very beginning. Edge AI is being rapidly deployed in a ton of fields: medical sensing, wearables, manufacturing, supply chain, health and safety, wildlife conservation, sports, energy, built environment—we see new applications every day.

— Daniel Situnayake

Tags: machine-learning, ai, tinyml

Daniel Situnayake explains TinyML in a Hacker News comment

2024-01-16T18:46:02+00:00

Daniel worked on TensorFlow Lite at Google and co-wrote the TinyML O’Reilly book. He just posted a multi-paragraph comment on Hacker News explaining the term and describing some of the recent innovations in that space.

“TinyML means running machine learning on low power embedded devices, like microcontrollers, with constrained compute and memory.”

Tags: machine-learning, ai, tinyml

Observable notebook: Detect objects in images

2023-10-01T15:46:14+00:00

I built an Observable notebook that uses Transformers.js and the Xenova/detr-resnet-50 model to detect objects in images, entirely running within your browser. You can select an image using a file picker and it will show you that image with bounding boxes and labels drawn around items within it. I have a demo image showing some pelicans flying ahead, but it works with any image you give it—all without uploading that image to a server.

Via @simonw

Tags: machine-learning, javascript, observable, transformers, ai, transformers-js

All models on Hugging Face, sorted by downloads

2023-09-10T17:24:42+00:00

I realized this morning that “sort by downloads” against the list of all of the models on Hugging Face can work as a reasonably good proxy for “which of these models are easiest to get running on your own computer”.

Via @simon

Tags: machine-learning, ai, huggingface

AI photo sorter

2023-04-02T04:27:22+00:00

Really interesting implementation of machine learning photo classification by Alexander Visheratin. This tool lets you select as many photos as you like from your own machine, then provides a web interface for classifying them into labels that you provide. It loads a 102MB quantized CLIP model and executes it in the browser using WebAssembly. Once classified, a “Generate script” button produces a copyable list of shell commands for moving your images into corresponding folders on your own machine. Your photos never get uploaded to a server—everything happens directly in your browser.
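The "Generate script" step is easy to picture in code. This is a rough sketch of the idea, not the tool's actual implementation (which runs in the browser): given a mapping of filenames to predicted labels, it emits shell commands that create one folder per label and move each file into it. The function name and the example filenames are hypothetical.

```python
import shlex


def generate_move_script(assignments):
    """Build shell commands that sort files into one folder per label.

    `assignments` maps filename -> label. Names are shell-quoted so that
    spaces and other special characters are handled safely.
    """
    lines = []
    for label in sorted(set(assignments.values())):
        lines.append(f"mkdir -p {shlex.quote(label)}")
    for filename, label in sorted(assignments.items()):
        lines.append(f"mv {shlex.quote(filename)} {shlex.quote(label)}/")
    return "\n".join(lines)


# Hypothetical classifier output
assignments = {
    "IMG_001.jpg": "pelicans",
    "beach day.jpg": "landscapes",
    "IMG_002.jpg": "pelicans",
}
script = generate_move_script(assignments)
print(script)
```

Emitting a script rather than moving files directly fits the browser setting: a web page cannot rearrange files on disk, but it can hand you commands to review and run yourself.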

Via @visheratin

Tags: machine-learning, openai, webassembly, clip

Transformers.js

2023-03-16T23:41:55+00:00

Hugging Face Transformers is a library of Transformer machine learning models plus a Python package for loading and running them. Transformers.js provides a JavaScript alternative interface which runs in your browser, thanks to a set of precompiled WebAssembly binaries for a selection of models. This interactive demo is incredible: in particular, try running the Image classification with google/vit-base-patch16-224 (91MB) model against any photo to get back labels representing that photo. Dropping one of these models onto a page is as easy as linking to a hosted CDN script and running a few lines of JavaScript.

Tags: machine-learning, generative-ai, javascript, transformers, ai, llms, huggingface, transformers-js

Quoting Jeonghwan Kim

2023-03-16T05:39:58+00:00

As an NLP researcher I'm kind of worried about this field after 10-20 years. Feels like these oversized LLMs are going to eat up this field and I'm sitting in my chair thinking, "What's the point of my research when GPT-4 can do it better?"

— Jeonghwan Kim

Tags: machine-learning, generative-ai, nlp, gpt-4, ai, llms

Online gradient descent written in SQL

2023-03-07T18:56:21+00:00

Max Halford trains an online gradient descent model against two years of AAPL stock data using just a single advanced SQL query. He built this against DuckDB—I tried to replicate his query in SQLite and it almost worked, but it gave me a “recursive reference in a subquery” error that I was unable to resolve.
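For anyone unfamiliar with the technique, here is a minimal Python sketch of online gradient descent for linear regression. It is not a translation of Max's SQL query and it uses a synthetic data stream rather than AAPL prices; the function name and learning rate are my own assumptions. The defining feature is that the weights are updated after every single observation instead of after a pass over the full dataset, which is what makes it expressible as a recursive query.

```python
def online_sgd(stream, n_features, learning_rate=0.01):
    """Fit a linear model one observation at a time.

    `stream` yields (features, target) pairs. After each pair the weights
    take one gradient step on the squared error for that pair only.
    """
    weights = [0.0] * n_features
    bias = 0.0
    for features, target in stream:
        prediction = bias + sum(w * x for w, x in zip(weights, features))
        error = prediction - target
        # Gradient of 0.5 * error**2 with respect to each weight is error * x
        weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
        bias -= learning_rate * error
    return weights, bias


# Synthetic, noise-free stream following y = 2x + 1, repeated many times
points = [([x / 10], 2 * (x / 10) + 1) for x in range(10)]
stream = points * 500
weights, bias = online_sgd(stream, n_features=1, learning_rate=0.1)
print(weights, bias)
```

On this toy stream the learned slope and intercept end up very close to 2 and 1.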

Via Hacker News

Tags: machine-learning, sql, sqlite, duckdb, ai

Quoting The GLM-130B License

2023-01-10T22:45:21+00:00

You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.

— The GLM-130B License

Tags: machine-learning, licenses, ai, generative-ai, llms

Quoting Jack Clark

2022-11-16T23:04:50+00:00

These kinds of biases aren’t so much a technical problem as a sociotechnical one; ML models try to approximate biases in their underlying datasets and, for some groups of people, some of these biases are offensive or harmful. That means in the coming years there will be endless political battles about what the ‘correct’ biases are for different models to display (or not display), and we can ultimately expect there to be as many approaches as there are distinct ideologies on the planet. I expect to move into a fractal ecosystem of models, and I expect model providers will ‘shapeshift’ a single model to display different biases depending on the market it is being deployed into. This will be extraordinarily messy.

— Jack Clark

Tags: machine-learning, ai, jack-clark, generative-ai, llms

Semantic text search using embeddings

2022-11-09T19:57:42+00:00

Example Python notebook from OpenAI demonstrating how to build a search engine using embeddings rather than straight up token matching. This is a fascinating way of implementing search, providing results that match the intent of the search (“delicious beans” for example) even if none of the keywords are actually present in the text.
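The core of the approach fits in a few lines. This is a sketch under heavy assumptions: the vectors below are invented three-dimensional stand-ins, whereas a real system would get high-dimensional vectors from an embedding model, and the function names are mine rather than anything from the OpenAI notebook. Search becomes "rank documents by cosine similarity to the query vector".

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, documents, top_k=2):
    """Return the top_k document texts most similar to the query vector."""
    scored = [(cosine(query_vector, vec), text) for text, vec in documents.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]


# Invented embeddings: the first dimension loosely stands for "tastes good"
documents = {
    "These lentils were tasty and filling": [0.9, 0.1, 0.2],
    "The can arrived dented": [0.1, 0.9, 0.1],
    "Great flavour, would buy again": [0.8, 0.2, 0.3],
}
query = [0.85, 0.1, 0.25]  # stand-in for the embedding of "delicious beans"

results = search(query, documents)
print(results)
```

Neither of the two top results contains the words "delicious" or "beans", which is the whole appeal compared to keyword matching.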

Tags: machine-learning, openai, search, embeddings

Is the AI spell-casting metaphor harmful or helpful?

2022-10-05T20:40:16+00:00

For a few weeks now I've been promoting spell-casting as a metaphor for prompt design against generative AI systems such as GPT-3 and Stable Diffusion.

Here's an example, in this snippet from my recent Changelog podcast episode.

Relevant section towards the end (transcription assisted by Whisper):

When you're working with these, you're not a programmer anymore. You're a wizard, right? I always wanted to be a wizard. We get to be wizards now. And we're learning these spells. We don't know why they work. Why does Neuromancer work? Who knows? Nobody knows. But you add it to your spell book and then you combine it with other spells. And if you're unlucky and combine them in the wrong way, you might get demons coming out at you.

I had an interesting debate on Twitter this morning about whether or not this metaphor is harmful or helpful. There are some very interesting points to discuss!

The short version: I'm now convinced that the value of this metaphor changes based on the audience.

The key challenge here is to avoid implying that these systems are "magical" in that they are incomprehensible and mysterious. As such, I believe the metaphor is only appropriate when you're talking to people who are working with these systems from a firm technical perspective.

Expanding the spell-casting metaphor

When I compare prompts to spells and I'm talking to another software engineer, here's the message I am trying to convey:

Writing prompts is not like writing regular code. There is no API reference or programming language specification that will let you predict exactly what will happen.

Instead, you have to experiment: try different fragments of prompts and see what works. As you get a feel for these fragments you can then start exploring what happens when you combine them together.

Over time you will start to develop an intuition for what works. You'll build your own collection of fragments and patterns, and exchange those with other people.

The weird thing about this process is that no-one can truly understand exactly how each fragment works - not even the creators of the models. We've learned that "Trending on artstation" produces better images with Stable Diffusion - but we can only ever develop a vague intuition for why.

It honestly feels more like fictional spell-casting than programming. Each fragment is a new spell that you have learned and can add to your spell book.

It's confusing, and surprising, and a great deal of fun.

For me, this captures my experience working with prompts pretty accurately. My hope is that this is a useful way to tempt other programmers into exploring this fascinating new area.

The other thing I like about this metaphor is that, to my mind, it touches on some of the risks of generative AI as well.

Fiction is full of tales of magic gone wrong: of wizards who lost control of forces that they did not fully understand.

When I think about prompt injection attacks I imagine good wizards and evil wizards casting spells and counter-spells at each other! Software vulnerabilities in plain English totally fit my mental model of casting spells.

But in debating this on Twitter I realized that whether this metaphor makes sense to you relies pretty heavily on which specific magic system comes to mind for you.

I was raised on Terry Pratchett's Discworld, which has a fantastically rich and deeply satirical magic system. Incorrect incantations frequently produce demons! Discworld wizards are mostly academics who spend more time thinking about lunch than practicing magic. The most interesting practitioners are the witches, for whom the most useful magic is more like applied psychology ("headology" in the books).

If your mental model of "magic" is unexplained supernatural phenomena and fairies granting wishes then my analogy doesn't really fit.

Magic as a harmful metaphor for AI

The argument for this metaphor causing harm is tied to the larger challenge of helping members of the public understand what is happening in this field.

Look behind the curtain: Don’t be dazzled by claims of ‘artificial intelligence’ by Emily M. Bender is a useful summary of some of these challenges.

In Technology Is Magic, Just Ask The Washington Post from 2015 Jon Evans makes the case that treating technology as "magic" runs a risk of people demanding solutions to societal problems that cannot be delivered.

Understanding exactly what these systems are capable of and how they work is hard enough for people with twenty years of software engineering experience, let alone everyone else.

The last thing people need is to be told that these systems are "magic" - something that is permanently beyond their understanding and control.

These systems are not magic. They're mathematics. It turns out that if you throw enough matrix multiplication and example data (literally terabytes of it) at a problem, you can get a system that can appear to do impossible things.

But implying that they are magic - or even that they are "intelligent" - does not give people a useful mental model. GPT-3 is not a wizard, and it's not intelligent: it's a stochastic parrot, capable of nothing more than predicting which word should come next to form a sentence that best matches the corpus it has been trained on.

This matters to me a great deal. In conversations I have had around AI ethics the only universal answer I've found is that it is ethical to help people understand what these systems can do and how they work.

So I plan to be more intentional with my metaphors. I'll continue to enthuse about spell-casting with fellow nerds who aren't at risk of assuming these systems are incomprehensible magic, but I'll keep searching for better ways to help explain these systems to everyone else.

Tags: ethics, machine-learning, ai, gpt-3, openai, prompt-engineering, prompt-injection, generative-ai, llms, terry-pratchett

konstantint/SKompiler

2022-10-02T23:56:54+00:00

A tool for compiling trained SKLearn models into other representations, including SQL queries and Excel formulas. I’ve been pondering the most light-weight way to package a simple machine learning model as part of a larger application without needing to bundle heavy dependencies, and this set of techniques looks ideal!
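To illustrate the general idea (this is a hand-rolled sketch, not SKompiler's API, and the coefficients are made up), a trained linear regression is just an intercept plus a weighted sum, so it can be rendered as a SQL expression and evaluated by the database with no ML library installed:

```python
import sqlite3


def linear_model_to_sql(coefficients, intercept, table):
    """Render a linear model as a SELECT statement.

    `coefficients` maps column name -> weight. Column and table names are
    interpolated directly, so they must come from trusted code, not users.
    """
    terms = [f"({weight!r} * {column})" for column, weight in coefficients.items()]
    expression = " + ".join([repr(intercept)] + terms)
    return f"SELECT {expression} AS prediction FROM {table}"


# Hypothetical coefficients, as if read from a fitted model
sql = linear_model_to_sql({"sqft": 150.0, "bedrooms": 10000.0}, 50000.0, "houses")
print(sql)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (sqft REAL, bedrooms REAL)")
conn.execute("INSERT INTO houses VALUES (1000, 2)")
prediction = conn.execute(sql).fetchone()[0]
print(prediction)
```

Tree-based models follow the same pattern with nested CASE expressions instead of a weighted sum.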

Via @tsuname

Tags: machine-learning, sql

Exploring 10m scraped Shutterstock videos used to train Meta's Make-A-Video text-to-video model

2022-09-29T19:31:24+00:00

Make-A-Video is a new "state-of-the-art AI system that generates videos from text" from Meta AI. It looks incredible - it really is DALL-E / Stable Diffusion for video. And it appears to have been trained on 10m video preview clips scraped from Shutterstock.

I built a new search engine to explore those ten million clips:

https://webvid.datasette.io/webvid/videos

This is similar to the system I built with Andy Baio a few weeks ago to explore the LAION data used to train Stable Diffusion.

Make-A-Video training data

Meta AI's paper describing the model includes this section about the training data:

Datasets. To train the image models, we use a 2.3B subset of the dataset from (Schuhmann et al.) where the text is English. We filter out sample pairs with NSFW images, toxic words in the text, or images with a watermark probability larger than 0.5.

We use WebVid-10M (Bain et al., 2021) and a 10M subset from HD-VILA-100M (Xue et al., 2022) to train our video generation models. Note that only the videos (no aligned text) are used.

The decoder D^t and the interpolation model is trained on WebVid-10M. SR^t_l is trained on both WebVid-10M and HD-VILA-10M. While prior work (Hong et al., 2022; Ho et al., 2022) have collected private text-video pairs for T2V generation, we use only public datasets (and no paired text for videos). We conduct automatic evaluation on UCF-101 (Soomro et al., 2012) and MSR-VTT (Xu et al., 2016) in a zero-shot setting.

That 2.3B subset of images is the same LAION data I explored previously.

HD-VILA-100M was collected by Microsoft Research Asia - Andy Baio notes that these were scraped from YouTube.

I decided to take a look at the WebVid-10M data.

WebVid-10M

The WebVid-10M site describes the data like this:

WebVid-10M is a large-scale dataset of short videos with textual descriptions sourced from the web. The videos are diverse and rich in their content.

The accompanying paper provides a little bit more detail:

We scrape the web for a new dataset of videos with textual description annotations, called WebVid-2M. Our dataset consists of 2.5M video-text pairs, which is an order of magnitude larger than existing video captioning datasets (see Table 1).

The data was scraped from the web following a similar procedure to Google Conceptual Captions [55] (CC3M). We note that more than 10% of CC3M images are in fact thumbnails from videos, which motivates us to use such video sources to scrape a total of 2.5M text-video pairs. The use of data collected for this study is authorised via the Intellectual Property Office’s Exceptions to Copyright for Non-Commercial Research and Private Study.

I'm presuming that WebVid-10M is a larger version of the WebVid-2M dataset described in the paper.

Most importantly though, the website includes a link to a 2.7GB CSV file - results_10M_train.csv - containing the full WebVid-10M dataset. The CSV file looks like this:

videoid,contentUrl,duration,page_dir,name
21179416,https://ak.picdn.net/shutterstock/videos/21179416/preview/stock-footage-aerial-shot-winter-forest.mp4,PT00H00M11S,006001_006050,Aerial shot winter forest
5629184,https://ak.picdn.net/shutterstock/videos/5629184/preview/stock-footage-senior-couple-looking-through-binoculars-on-sailboat-together-shot-on-red-epic-for-high-quality-k.mp4,PT00H00M29S,071501_071550,"Senior couple looking through binoculars on sailboat together. shot on red epic for high quality 4k, uhd, ultra hd resolution."

I loaded it into SQLite and started digging around.
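Here's a minimal sketch of that loading step using only the Python standard library, run against the two sample rows shown above rather than the full 2.7GB file. The table name and column types are my assumptions for this example; sqlite-utils can do the same job in a single command.

```python
import csv
import io
import sqlite3

# The header and two sample rows from results_10M_train.csv
SAMPLE_CSV = """videoid,contentUrl,duration,page_dir,name
21179416,https://ak.picdn.net/shutterstock/videos/21179416/preview/stock-footage-aerial-shot-winter-forest.mp4,PT00H00M11S,006001_006050,Aerial shot winter forest
5629184,https://ak.picdn.net/shutterstock/videos/5629184/preview/stock-footage-senior-couple-looking-through-binoculars-on-sailboat-together-shot-on-red-epic-for-high-quality-k.mp4,PT00H00M29S,071501_071550,"Senior couple looking through binoculars on sailboat together. shot on red epic for high quality 4k, uhd, ultra hd resolution."
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE videos (
        videoid INTEGER PRIMARY KEY,
        contentUrl TEXT,
        duration TEXT,
        page_dir TEXT,
        name TEXT
    )
    """
)

# csv.DictReader handles the quoted caption that contains commas
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
conn.executemany(
    "INSERT INTO videos VALUES (:videoid, :contentUrl, :duration, :page_dir, :name)",
    reader,
)

row_count = conn.execute("SELECT count(*) FROM videos").fetchone()[0]
print(row_count)
```

For the real file you would replace the `io.StringIO` wrapper with `open("results_10M_train.csv", newline="")` and write to a database file on disk.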

It's all from Shutterstock!

The big surprise for me when I started exploring the data was this: every single one of the 10,727,582 videos linked in the Datasette started with the same URL prefix:

https://ak.picdn.net/shutterstock/videos/

They're all from Shutterstock. The paper talks about "scraping the web", but it turns out there was only one scraped website involved.
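A check like this is a one-line SQL query. The sketch below is an assumption about how you might verify it, not the exact query I ran: it counts rows whose URL does not begin with the prefix. The tiny in-memory table includes one deliberately invented outlier so the query has something to find; against the real data the count would be zero.

```python
import sqlite3

PREFIX = "https://ak.picdn.net/shutterstock/videos/"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (videoid INTEGER, contentUrl TEXT)")
conn.executemany(
    "INSERT INTO videos VALUES (?, ?)",
    [
        (21179416, PREFIX + "21179416/preview/stock-footage-aerial-shot-winter-forest.mp4"),
        (1, "https://example.com/other.mp4"),  # hypothetical outlier, not in the real data
    ],
)

# substr() avoids having to escape LIKE wildcards in the prefix
outside_prefix = conn.execute(
    "SELECT count(*) FROM videos WHERE substr(contentUrl, 1, length(?)) != ?",
    (PREFIX, PREFIX),
).fetchone()[0]
print(outside_prefix)
```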

Here's that first row from the CSV file on Shutterstock itself:

https://www.shutterstock.com/video/clip-21179416-aerial-shot-winter-forest

As far as I can tell, the training set used here isn't even full Shutterstock videos: it's the free, watermarked preview clips that Shutterstock makes available.

I guess Shutterstock have really high quality captions for their videos, perfect for training a model on.

Implementation notes

My simonw/webvid-datasette repository contains the code I used to build the Datasette instance.

I built a SQLite database with full-text search enabled using sqlite-utils. I deployed it directly to Fly by building a Docker image that bundled the 2.5G SQLite database, taking advantage of the Baked Data architectural pattern.

The most interesting custom piece of implementation is the plugin I wrote to add a video player to each result. Here's the implementation of that plugin:

from datasette import hookimpl
from markupsafe import Markup

# HTML rendered in place of each "filename" cell: a video player plus a link
# back to the clip's page on Shutterstock.
TEMPLATE = """
<video controls width="400" preload="none" poster="{poster}">
  <source src="{url}" type="video/mp4">
</video>
<p>{filename}<br>On <a href="https://www.shutterstock.com/video/clip-{id}">Shutterstock</a></p>
""".strip()
VIDEO_URL = "https://ak.picdn.net/shutterstock/videos/{id}/preview/{filename}"
POSTER_URL = "https://ak.picdn.net/shutterstock/videos/{id}/thumb/1.jpg?ip=x480"


@hookimpl
def render_cell(row, column, value):
    # Only customize the "filename" column; returning None keeps the default rendering.
    if column != "filename":
        return None
    video_id = row["id"]  # avoid shadowing the built-in id()
    url = VIDEO_URL.format(id=video_id, filename=value)
    poster = POSTER_URL.format(id=video_id)
    # Markup marks the string as safe HTML so Datasette does not escape it.
    return Markup(TEMPLATE.format(url=url, poster=poster, filename=value, id=video_id))

I'm using the new render_cell(row) argument added in Datasette 0.62.

The plugin outputs a <video> element with preload="none" to avoid the browser downloading the video until the user clicks play (see this TIL). I set the poster attribute to a thumbnail image from Shutterstock.

Tags: ethics, facebook, machine-learning, projects, ai, datasette, generative-ai, training-data

Quoting Linden Li

2022-09-24T16:03:07+00:00

Running training jobs across multiple nodes scales really well. A common assumption is that scale inevitably means slowdowns: more GPUs means more synchronization overhead, especially with multiple nodes communicating across a network. But we observed that the performance penalty isn’t as harsh as what you might think. Instead, we found near-linear strong scaling: fixing the global batch size and training on more GPUs led to proportional increases in training throughput. On a 1.3B parameter model, 4 nodes means a 3.9x gain over one node. On 16 nodes, it’s 14.4x. This is largely thanks to the super fast interconnects that major cloud providers have built in: @awscloud EC2 P4d instances provide 400 Gbps networking bandwidth, @Azure provides 1600 Gbps, and @OraclePaaS provides 800 Gbps.

— Linden Li
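As a quick worked example of what "near-linear" means here, scaling efficiency is the measured speedup divided by the ideal speedup (the node count). The function name is my own; the figures are the ones from the quote.

```python
def scaling_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / nodes


eff_4 = scaling_efficiency(3.9, 4)     # 4 nodes gave a 3.9x gain
eff_16 = scaling_efficiency(14.4, 16)  # 16 nodes gave a 14.4x gain

print(f"4 nodes:  {eff_4:.1%} of linear")
print(f"16 nodes: {eff_16:.1%} of linear")
```

That works out to 97.5% efficiency at 4 nodes and 90% at 16, so the synchronization overhead grows with node count but stays modest.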

Tags: machine-learning, ai, gpus

I Resurrected "Ugly Sonic" with Stable Diffusion Textual Inversion

2022-09-20T03:35:28+00:00

“I trained an Ugly Sonic object concept on 5 image crops from the movie trailer, with 6,000 steps [...] (on a T4 GPU, this took about 1.5 hours and cost about $0.21 on a GCP Spot instance)”

Via @minimaxir

Tags: machine-learning, stable-diffusion, ai, max-woolf, generative-ai

An introduction to XGBoost regression

2022-09-18T13:42:24+00:00

I hadn’t realized what a wealth of high quality tutorial material could be found in Kaggle notebooks. Here Carl McBride Ellis provides a very approachable and practical introduction to XGBoost, one of the leading techniques for building machine learning models against tabular data.

Tags: machine-learning, ai

Quoting roon

2022-09-12T16:57:14+00:00

In a previous iteration of the machine learning paradigm, researchers were obsessed with cleaning their datasets and ensuring that every data point seen by their models is pristine, gold-standard, and does not disturb the fragile learning process of billions of parameters finding their home in model space. Many began to realize that data scale trumps most other priorities in the deep learning world; utilizing general methods that allow models to scale in tandem with the complexity of the data is a superior approach. Now, in the era of LLMs, researchers tend to dump whole mountains of barely filtered, mostly unedited scrapes of the internet into the eager maw of a hungry model.

— roon

Tags: machine-learning

karpathy/minGPT

2022-09-06T14:52:32+00:00

A “minimal PyTorch re-implementation” of the OpenAI GPT training and inference model, by Andrej Karpathy. It’s only a few hundred lines of code and includes extensive comments, plus notebook demos.

Via Hacker News

Tags: machine-learning, gpt-3, ai, andrej-karpathy, generative-ai, llms

r/MachineLearning: What is the SOTA explanation for why deep learning works?

2022-09-05T17:46:21+00:00

The thing I find fascinating about this Reddit conversation is that it makes it clear that the machine learning research community has very little agreement on WHY the state of the art techniques that are being used today actually work as well as they do.

Tags: machine-learning, reddit, ai, generative-ai

Run Stable Diffusion on your M1 Mac’s GPU

2022-09-01T17:41:35+00:00

Ben Firshman provides detailed instructions for getting Stable Diffusion running on an M1 Mac.

Tags: stable-diffusion, ben-firshman, macosx, machine-learning, ai, generative-ai

Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator

2022-08-31T02:10:26+00:00

Andy Baio and I collaborated on an investigation into the training set used for Stable Diffusion. I built a Datasette instance with 12m image records sourced from the LAION-Aesthetics v2 6+ aesthetic score data used as part of the training process, and built a tool so people could run searches and explore the data. Andy did some extensive analysis of things like the domains scraped for the images and names of celebrities and artists represented in the data. His write-up here explains our project in detail and some of the patterns we’ve uncovered so far.

Tags: machine-learning, stable-diffusion, ai, generative-ai, laion, training-data

Stable Diffusion is a really big deal

2022-08-29T01:09:04+00:00

If you haven't been paying attention to what's going on with Stable Diffusion, you really should be.

Stable Diffusion is a new "text-to-image diffusion model" that was released to the public by Stability.ai six days ago, on August 22nd.

It's similar to models like OpenAI's DALL-E, but with one crucial difference: they released the whole thing.

You can try it out online at beta.dreamstudio.ai (currently for free). Type in a text prompt and the model will generate an image.

You can download and run the model on your own computer (if you have a powerful enough graphics card). Here's an FAQ on how to do that.

You can use it for commercial and non-commercial purposes, under the terms of the Creative ML OpenRAIL-M license - which lists some usage restrictions that include avoiding using it to break applicable laws, generate false information, discriminate against individuals or provide medical advice.

In just a few days, there has been an explosion of innovation around it. The things people are building are absolutely astonishing.

I've been tracking the r/StableDiffusion subreddit and following Stability.ai founder Emad Mostaque on Twitter.

img2img

Generating images from text is one thing, but generating images from other images is a whole new ballgame.

My favourite example so far comes from Reddit user argaman123. They created this image:

And added this prompt (or "something along those lines"):

A distant futuristic city full of tall buildings inside a huge transparent glass dome, In the middle of a barren desert full of large dunes, Sun rays, Artstation, Dark sky full of stars with a shiny sun, Massive scale, Fog, Highly detailed, Cinematic, Colorful

The model produced the following two images:

These are amazing. In my previous experiments with DALL-E I've tried to recreate photographs I have taken, but getting the exact composition I wanted has always proved impossible using just text. With this new capability I feel like I could get the AI to do pretty much exactly what I have in my mind.

Imagine having an on-demand concept artist that can generate anything you can imagine, and can iterate with you towards your ideal result. For free (or at least for very cheap).

You can run this today on your own computer, if you can figure out how to set it up. You can try it in your browser using Replicate, or Hugging Face. This capability is apparently coming to the DreamStudio interface next week.

There's so much more going on.

stable-diffusion-webui is an open source UI you can run on your own machine providing a powerful interface to the model. Here's a Twitter thread showing what it can do.

Reddit user alpacaAI shared a video demo of a Photoshop plugin they are developing which has to be seen to be believed. They have a registration form up on getalpaca.io for people who want to try it out once it's ready.

Reddit user Hoppss ran a 2D animated clip from Disney's Aladdin through img2img frame-by-frame, using the following parameters:

--prompt "3D render" --strength 0.15 --seed 82345912 --n_samples 1 --ddim_steps 100 --n_iter 1 --scale 30.0 --skip_grid

The result was a 3D animated video. Not a great quality one, but pretty stunning for a shell script and a two word prompt!

The best description I've seen so far of an iterative process to build up an image using Stable Diffusion comes from Andy Salerno: 4.2 Gigabytes, or: How to Draw Anything.

Ben Firshman has published detailed instructions on how to Run Stable Diffusion on your M1 Mac’s GPU.

And there's so much more to come

All of this happened in just six days since the model release. Emad Mostaque on Twitter:

We use as much compute as stable diffusion used every 36 hours for our upcoming open source models

This made me think of Google's Parti paper, which included a demonstration that showed that once the model was scaled up to 20bn parameters it could generate images with correctly spelled text!

Ethics: will you be an AI vegan?

I'm finding the ethics of all of this extremely difficult.

Stable Diffusion has been trained on millions of copyrighted images scraped from the web.

The Stable Diffusion v1 Model Card has the full details, but the short version is that it uses LAION-5B (5.85 billion image-text pairs) and its laion-aesthetics v2 5+ subset (which I think is ~600M pairs filtered for aesthetics). These images were scraped from the web.

I'm not qualified to speak to the legality of this. I'm personally more concerned with the morality.

The final model is, I believe, around 4.2GB of data - a binary blob of floating point numbers. The fact that it can compress such an enormous quantity of visual information into such a small space is itself a fascinating detail.

As such, each image in the training set contributes only a tiny amount of information - a few tweaks to some numeric weights spread across the entire network.
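To put a rough number on "a tiny amount", here is a back-of-the-envelope calculation dividing the model size by the dataset sizes mentioned in this post (the 2.3 billion figure comes from the update further down). It assumes 1GB means 10^9 bytes, and it is only an average: the model does not actually set aside a fixed number of bytes per training image.

```python
MODEL_BYTES = 4.2e9  # "around 4.2GB", assuming 1 GB = 10**9 bytes

# Dataset sizes as quoted in this post (approximate).
DATASET_SIZES = {
    "LAION-5B (5.85 billion pairs)": 5.85e9,
    "2.3 billion images": 2.3e9,
    "aesthetics subset (~600M pairs)": 600e6,
}


def bytes_per_image(model_bytes, image_count):
    """Average model bytes available per training image."""
    return model_bytes / image_count


for name, count in DATASET_SIZES.items():
    print(f"{name}: {bytes_per_image(MODEL_BYTES, count):.2f} bytes per image")
```

Whichever count you pick, the average comes out at a handful of bytes per image at most, far too little to hold a copy of the image itself.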

But... the people who created these images did not give their consent. And the model can be seen as a direct threat to their livelihoods. No-one expected creative AIs to come for the artist jobs first, but here we are!

I'm still thinking through this, and I'm eager to consume more commentary about it. But my current mental model is to think about this in terms of veganism, as an analogy for people making their own personal ethical decisions.

I know many vegans. They have access to the same information as I do about the treatment of animals, and they have made informed decisions about their lifestyle, which I fully respect.

I myself remain a meat-eater.

There will be many people who will decide that the AI models trained on copyrighted images are incompatible with their values. I understand and respect that decision.

But when I look at that img2img example of the futuristic city in the dome, I can't resist imagining what I could do with that capability.

If someone were to create a vegan model, trained entirely on out-of-copyright images, I would be delighted to promote it and try it out. If its results were good enough, I might even switch to it entirely.

Understanding the training data

Update: 30th August 2022. Andy Baio and I worked together on a deep dive into the training data behind Stable Diffusion. Andy wrote up some of our findings in Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator.

Indistinguishable from magic

Just a few months ago, if I'd seen someone on a fictional TV show using an interface like that Photoshop plugin I'd have grumbled about how that was a step too far even by the standards of American network TV dramas.

Science fiction is real now. Machine learning generative models are here, and the rate at which they are improving is unreal. It's worth paying real attention to what they can do and how they are developing.

I'm tweeting about this stuff a lot these days. Follow @simonw on Twitter for more.

Tags: ethics, machine-learning, ai, dalle, stable-diffusion, prompt-engineering, generative-ai, laion

Quoting Andrej Karpathy

2022-08-24T21:28:00+00:00

To make the analogy explicit, in Software 1.0, human-engineered source code (e.g. some .cpp files) is compiled into a binary that does useful work. In Software 2.0 most often the source code comprises 1) the dataset that defines the desirable behavior and 2) the neural net architecture that gives the rough skeleton of the code, but with many details (the weights) to be filled in. The process of training the neural network compiles the dataset into the binary — the final neural network. In most practical applications today, the neural net architectures and the training systems are increasingly standardized into a commodity, so most of the active “software development” takes the form of curating, growing, massaging and cleaning labeled datasets.

— Andrej Karpathy

Tags: machine-learning, ai, data, andrej-karpathy

Stable Diffusion Public Release

2022-08-22T19:12:43+00:00

Stable Diffusion Public Release

New AI just dropped. Stable Diffusion is similar to DALL-E, but completely open source and with a CC0 license applied to everything it generates. I have a Twitter thread (the via link) of comparisons I’ve made between its output and my previous DALL-E experiments. The announcement buries the lede somewhat: to try it out, visit beta.dreamstudio.ai—which you can use for free at the moment, but it’s unclear to me how billing is supposed to work.

Via @simonw

Tags: machine-learning, dalle, stable-diffusion, generative-ai

storysniffer

2022-08-01T23:40:13+00:00

storysniffer

Ben Welsh built a small Python library that guesses if a URL points to an article on a news website, or if it’s more likely to be a category page or /about page or similar. I really like this as an example of what you can do with a tiny machine learning model: the model is bundled as a ~3MB pickle file as part of the package, and the repository includes the Jupyter notebook that was used to train it.
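To illustrate the general pattern of shipping a tiny model as a pickle, here is a minimal sketch using only the standard library. It is not storysniffer's actual model, features, or API: the weights, feature names, and `guess` function are all made up for illustration, and a real package would load the pickle from a file bundled alongside the code rather than from an in-memory blob.

```python
import pickle
from urllib.parse import urlparse

# Hypothetical toy "model": hand-picked weights, standing in for trained ones.
TOY_WEIGHTS = {"has_digits": 1.5, "long_slug": 1.0, "short_path": -2.0}

# Stand-in for the pickle file that would be bundled with the package.
MODEL_BLOB = pickle.dumps(TOY_WEIGHTS)


def features(url):
    """Turn a URL into a few numeric features (illustrative only)."""
    path = urlparse(url).path.strip("/")
    last_segment = path.split("/")[-1] if path else ""
    return {
        "has_digits": float(any(c.isdigit() for c in path)),  # dates, IDs
        "long_slug": float(last_segment.count("-") >= 3),     # headline-like slug
        "short_path": float(len(path) < 10),                  # /about, /news ...
    }


def guess(model_blob, url):
    """Load the pickled weights and return True if the URL looks like a story."""
    weights = pickle.loads(model_blob)  # only unpickle data you trust
    score = sum(weights[name] * value for name, value in features(url).items())
    return score > 0


if __name__ == "__main__":
    print(guess(MODEL_BLOB, "https://example.com/2022/08/01/city-council-approves-new-budget/"))
    print(guess(MODEL_BLOB, "https://example.com/about"))
```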

Via @palewire

Tags: machine-learning, ben-welsh, python, jupyter