Simon Willison’s Weblog

193 items tagged “opensource”

2024

Codestral: Hello, World! Mistral's first code-specific model, trained to be "fluent" in 80 different programming languages.

The weights are released under a new Mistral AI Non-Production License, which is extremely restrictive:

3.2. Usage Limitation

  • You shall only use the Mistral Models and Derivatives (whether or not created by Mistral AI) for testing, research, Personal, or evaluation purposes in Non-Production Environments;
  • Subject to the foregoing, You shall not supply the Mistral Models or Derivatives in the course of a commercial activity, whether in return for payment or free of charge, in any medium or form, including but not limited to through a hosted or managed service (e.g. SaaS, cloud instances, etc.), or behind a software layer.

To Mistral's credit at least they don't misapply the term "open source" in their marketing around this model - they consistently use the term "open-weights" instead. They also state that they plan to continue using Apache 2 for other model releases.

Codestral can be used commercially when accessed via their paid API. # 30th May 2024, 7:19 am

Bullying in Open Source Software Is a Massive Security Vulnerability. The Xz story from last month, where a malicious contributor almost managed to ship a backdoor to a number of major Linux distributions, included a nasty detail where presumed collaborators with the attacker bullied the maintainer to make them more susceptible to accepting help.

Hans-Christoph Steiner from F-Droid reported a similar attempt from a few years ago:

A new contributor submitted a merge request to improve the search, which was oft requested but the maintainers hadn't found time to work on. There was also pressure from other random accounts to merge it. In the end, it became clear that it added a SQL injection vulnerability.

404 Media's Jason Koebler ties the two together here and makes the case for bullying as a genuine form of security exploit in the open source ecosystem. # 9th May 2024, 10:26 pm
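To make the F-Droid example concrete, here is a minimal sketch of how a search feature can introduce a SQL injection hole, and how a parameterized query closes it. The table, the data and both function names are hypothetical, and it uses Python's `sqlite3` module rather than anything from F-Droid's actual code.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (name TEXT, hidden INTEGER)")
conn.executemany(
    "INSERT INTO apps VALUES (?, ?)",
    [("Notes", 0), ("Maps", 0), ("Internal tool", 1)],
)


def search_unsafe(term):
    # Vulnerable: user input is pasted straight into the SQL string.
    sql = f"SELECT name FROM apps WHERE hidden = 0 AND name LIKE '%{term}%'"
    return [row[0] for row in conn.execute(sql)]


def search_safe(term):
    # Safer: the driver binds the value, so it is never parsed as SQL.
    sql = "SELECT name FROM apps WHERE hidden = 0 AND name LIKE ?"
    return [row[0] for row in conn.execute(sql, (f"%{term}%",))]


payload = "' OR 1=1 --"
print(search_unsafe(payload))  # includes the hidden row
print(search_safe(payload))    # the payload is treated as plain text
```

A change like `search_unsafe` can look like a harmless feature in review, which is part of why pressure to merge quickly is dangerous.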

in July 2023, we [Hugging Face] wanted to experiment with a custom license for this specific project [text-generation-inference] in order to protect our commercial solutions from companies with bigger means than we do, who would just host an exact copy of our cloud services.

The experiment however wasn’t successful.

It did not lead to licensing-specific incremental business opportunities by itself, while it did hamper or at least complicate the community contributions, given the legal uncertainty that arises as soon as you deviate from the standard licenses.

Julien Chaumond # 8th April 2024, 6:35 pm

Cally: Accessibility statement (via) Cally is a neat new open source date (and date range) picker Web Component by Nick Williams.

It’s framework agnostic and weighs less than 9KB gzipped, but the best feature is this detailed page of documentation covering its accessibility story, including how it was tested: in JAWS, NVDA and VoiceOver.

I’d love to see other open source JavaScript libraries follow this example. # 2nd April 2024, 7:38 pm

Merge pull request #1757 from simonw/heic-heif. I got a PR into GCHQ’s CyberChef this morning! I added support for detecting heic/heif files to the Forensics -> Detect File Type tool.

The change was landed by the delightfully mysterious a3957273. # 28th March 2024, 5:37 am

gchq.github.io/CyberChef (via) CyberChef is “the Cyber Swiss Army Knife—a web app for encryption, encoding, compression and data analysis”—entirely client-side JavaScript with dozens of useful tools for working with different formats and encodings.

It’s maintained and released by GCHQ—the UK government’s signals intelligence security agency.

I didn’t know GCHQ had a presence on GitHub, and I find the URL to this tool absolutely delightful. They first released it back in 2016 and it has over 3,700 commits.

The top maintainers also have suitably anonymous usernames—great work, n1474335, j433866, d98762625 and n1073645. # 26th March 2024, 5:08 pm

Reviving PyMiniRacer (via) PyMiniRacer is “a V8 bridge in Python”—it’s a library that lets Python code execute JavaScript code in a V8 isolate and pass values back and forth (provided they serialize to JSON) between the two environments.

It was originally released in 2016 by Sqreen, a web app security startup. They were acquired by Datadog in 2021 and the project lost its corporate sponsor, but in this post Ben Creech announces that he is revitalizing the project, with the approval of the original maintainers.

I’m always interested in new options for running untrusted code in a safe sandbox. PyMiniRacer has the three features I care most about: code can’t access the filesystem or network by default, you can limit the RAM available to it and you can have it raise an error if code execution exceeds a time limit.

The documentation includes a newly written architecture overview which is well worth a read. Rather than embed V8 directly in Python the authors chose to use ctypes—they build their own V8 with a thin additional C++ layer to expose a ctypes-friendly API, then the Python library code uses ctypes to call that.

I really like this. V8 is a notoriously fast moving and complex dependency, so reducing the interface to just a thin C++ wrapper via ctypes feels very sensible to me.
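For anyone who hasn't used ctypes, here is a minimal sketch of the calling pattern: load a shared library, declare a C function's signature, then call it from Python. It uses `strlen` from the C standard library as a stand-in, since PyMiniRacer's own exported functions aren't described here. Loading with `CDLL(None)` assumes a POSIX system such as Linux or macOS.

```python
import ctypes

# Load the symbols already linked into the current process, which on
# POSIX systems includes the C standard library. (Assumption: this is
# not Windows. A real wrapper would load its own shared library by path.)
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    # Encode to bytes so ctypes can pass a char* across the boundary.
    return libc.strlen(text.encode("utf-8"))


print(c_strlen("hello from Python"))
```

The appeal of this approach is that the Python side only needs to know a handful of plain C signatures, however complex the library behind them is.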

This blog post is fun too: it’s a good, detailed description of the process to update something like this to use modern Python and modern CI practices. The steps taken to build V8 (6.6 GB of miscellaneous source and assets!) across multiple architectures in order to create binary wheels are particularly impressive—the Linux aarch64 build takes several days to run on GitHub Actions runners (via emulation), so they use Mozilla’s Sccache to cache compilation steps so they can retry until it finally finishes.

On macOS (Apple Silicon) installing the package with “pip install mini-racer” got me a 37MB dylib and a 17KB ctypes wrapper module. # 24th March 2024, 5 pm

Redis Adopts Dual Source-Available Licensing (via) Well this sucks: after fifteen years (and contributions from more than 700 people), Redis is dropping the 3-clause BSD license going forward, instead being “dual-licensed under the Redis Source Available License (RSALv2) and Server Side Public License (SSPLv1)” from Redis 7.4 onwards. # 21st March 2024, 2:24 am

Paying people to work on open source is good actually. In which Jacob expands his widely quoted (including here) pithy toot about how quick people are to pick holes in paid open source contributor situations into a satisfyingly comprehensive rant. This is absolutely worth your time—there’s so much I could quote from here, but I’m going to go with this:

“Many, many more people should be getting paid to write free software, but for that to happen we’re going to have to be okay accepting impure or imperfect mechanisms.” # 17th February 2024, 1:42 am

Aya (via) “A global initiative led by Cohere For AI involving over 3,000 independent researchers across 119 countries. Aya is a state-of-art model and dataset, pushing the boundaries of multilingual AI for 101 languages through open science.”

Both the model and the training data are released under Apache 2. The training data looks particularly interesting: “513 million instances through templating and translating existing datasets across 114 languages”—suggesting the data is mostly automatically generated. # 13th February 2024, 5:14 pm

“We believe that open source should be sustainable and open source maintainers should get paid!”

Maintainer: *introduces commercial features*
“Not like that”

Maintainer: *works for a large tech co*
“Not like that”

Maintainer: *takes investment*
“Not like that”

Jacob Kaplan-Moss # 12th February 2024, 5:18 am

The Open Source Sustainability Crisis (via) Chad Whitacre: “What is Open Source sustainability? Why do I say it is in crisis? My answers are that sustainability is when people are getting paid without jumping through hoops, and we’re in a crisis because people aren’t and they’re burning out.”

I really like Chad’s focus on “jumping through hoops” in this piece. It’s possible to build a financially sustainable project today, but it requires picking one or more activities that aren’t directly aligned with working on the core project: raising VC and starting a company, building a hosted SaaS platform and becoming a sysadmin, publishing books and courses and becoming a content author.

The dream is that open source maintainers can invest all of their effort in their projects and make a good living from that work. # 23rd January 2024, 4:48 pm

We estimate the supply-side value of widely-used OSS is $4.15 billion, but that the demand-side value is much larger at $8.8 trillion. We find that firms would need to spend 3.5 times more on software than they currently do if OSS did not exist. [...] Further, 96% of the demand-side value is created by only 5% of OSS developers.

The Value of Open Source Software, Harvard Business School Strategy Unit # 22nd January 2024, 4:35 pm

DSF calls for applicants for a Django Fellow. The Django Software Foundation employs contractors to manage code reviews and releases, responsibly handle security issues, coach new contributors, triage tickets and more.

This is the Django Fellows program, which is now ten years old and has proven enormously impactful.

Mariusz Felisiak is moving on after five years and the DSF are calling for new applicants, open to anywhere in the world. # 20th January 2024, 8:35 am

Talking about Open Source LLMs on Oxide and Friends

I recorded an episode of the Oxide and Friends podcast on Monday, talking with Bryan Cantrill and Adam Leventhal about Open Source LLMs.

[... 1995 words]

Open Source LLMs with Simon Willison. I was invited to the Oxide and Friends weekly audio show (previously on Twitter Spaces, now broadcast using Discord) to talk about open source LLMs, and to respond to a very poorly considered op-ed calling for them to be regulated as “uniquely dangerous”. It was a really fun conversation, now available to listen to as a podcast or YouTube audio-only video. # 17th January 2024, 8:53 pm

Marimo (via) This is a really interesting new twist on Python notebooks.

The most powerful feature is that these notebooks are reactive: if you change the value or code in a cell (or change the value in an input widget) every other cell that depends on that value will update automatically. It’s the same pattern implemented by Observable JavaScript notebooks, but now it works for Python.
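The reactive pattern can be illustrated with a toy sketch. This is not Marimo's implementation: the `Notebook` class and its methods are hypothetical, dependencies are listed by hand rather than discovered automatically, and it assumes there are no cycles between cells.

```python
class Notebook:
    """Toy reactive notebook: each cell defines one name from other names."""

    def __init__(self):
        self.values = {}
        self.cells = {}  # defined name -> (function, names it reads)

    def cell(self, name, reads, fn):
        # Register a cell and run it once if its inputs already exist.
        self.cells[name] = (fn, reads)
        self._run(name)

    def set(self, name, value):
        # Changing an input re-runs every cell that depends on it.
        self.values[name] = value
        self._rerun_dependents(name)

    def _run(self, name):
        fn, reads = self.cells[name]
        if all(r in self.values for r in reads):
            self.values[name] = fn(*(self.values[r] for r in reads))
            self._rerun_dependents(name)

    def _rerun_dependents(self, changed):
        for name, (_, reads) in self.cells.items():
            if changed in reads:
                self._run(name)


nb = Notebook()
nb.set("x", 2)
nb.cell("doubled", ["x"], lambda x: x * 2)
nb.cell("label", ["doubled"], lambda d: f"doubled is {d}")

nb.set("x", 10)  # both downstream cells update without being re-run by hand
print(nb.values["label"])
```

A cell that sits downstream of two changed inputs may run more than once in this sketch. A real implementation would order the re-runs so each cell executes a single time.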

There are a bunch of other nice touches too. The notebook file format is a regular Python file, and those files can be run as “applications” in addition to being edited in the notebook interface. The interface is very nicely built, especially for such a young project—they even have GitHub Copilot integration for their CodeMirror cell editors. # 12th January 2024, 9:17 pm

Microsoft Research relicense Phi-2 as MIT (via) Phi-2 was already an interesting model—really strong results for its size—made available under a non-commercial research license. It just got significantly more interesting: Microsoft relicensed it as MIT open source. # 6th January 2024, 6:06 am

NPM: modele-social (via) This is a fascinating open source package: it’s an NPM module containing an implementation of the rules for calculating social security contributions in France, maintained by a team at Urssaf, the not-quite-government organization in France that manages the collection of social security contributions there.

The rules themselves can be found in the associated GitHub repository, encoded in a YAML-like declarative language called Publicodes that was developed by the French government for this and similar purposes. # 2nd January 2024, 5:55 pm

2023

tldraw/draw-a-ui (via) Absolutely spectacular GPT-4 Vision API demo. Sketch out a rough UI prototype using the open source tldraw drawing app, then select a set of components and click “Make Real” (after giving it an OpenAI API key). It generates a PNG snapshot of your selection and sends that to GPT-4 with instructions to turn it into a Tailwind HTML+JavaScript prototype, then adds the result as an iframe next to your mockup.

You can then make changes to your mockup, select it and the previous mockup and click “Make Real” again to ask for an updated version that takes your new changes into account.

This is such a great example of innovation at the UI layer, and everything is open source. Check app/lib/getHtmlFromOpenAI.ts for the system prompt that makes it work. # 16th November 2023, 4:42 pm

Financial sustainability for open source projects at GitHub Universe

I presented a ten minute segment at GitHub Universe on Wednesday, ambitiously titled Financial sustainability for open source projects.

[... 2485 words]

YouTube: OpenAssistant is Completed—by Yannic Kilcher (via) The OpenAssistant project was an attempt to crowdsource the creation of an alternative to ChatGPT, using human volunteers to build a Reinforcement Learning from Human Feedback (RLHF) dataset suitable for training this kind of model.

The project started in January. In this video from 24th October project founder Yannic Kilcher announces that the project is now shutting down.

They’ve declared victory in that the dataset they collected has been used by other teams as part of their training efforts, but admit that the overhead of running the infrastructure and moderation teams necessary for their project is more than they can continue to justify. # 4th November 2023, 10:14 pm

LLM now provides tools for working with embeddings

LLM is my Python library and command-line tool for working with language models. I just released LLM 0.9 with a new set of features that extend LLM to provide tools for working with embeddings.

[... 3466 words]

I like to make sure almost every line of code I write is under a commercially friendly OS license (usually Apache 2) for genuinely selfish reasons: I never want to have to solve that problem ever again, so OS licensing my code now ensures I can use it for the rest of my life no matter who I happen to be working for in the future

Me # 18th August 2023, 7:33 pm

Overnight, tens of thousands of businesses, ranging from one-person shops to the Fortune 500, woke up to a new reality where the underpinnings of their infrastructure suddenly became a potential legal risk. The BUSL and the additional use grant written by the HashiCorp team are vague, and now every company, vendor, and developer using Terraform has to wonder whether what they are doing could be construed as competitive with HashiCorp’s offerings.

The OpenTF Manifesto # 17th August 2023, 5:15 am

Databricks Signs Definitive Agreement to Acquire MosaicML, a Leading Generative AI Platform. MosaicML are the team behind MPT-7B and MPT-30B, two of the most impressive openly licensed LLMs. They just got acquired by Databricks for $1.3 billion. # 30th June 2023, 1:43 am

abacaj/mpt-30B-inference. MPT-30B, released last week, is an extremely capable Apache 2 licensed open source language model. This repo shows how it can be run on a CPU, using the ctransformers Python library based on GGML. Following the instructions in the README got me a working MPT-30B model on my M2 MacBook Pro. The model is a 19GB download and it takes a few seconds to start spitting out tokens, but it works as advertised. # 29th June 2023, 3:27 am

Thunderbird Is Thriving: Our 2022 Financial Report (via) Astonishing numbers: in 2022 the Thunderbird project received $6,442,704 in donations from 300,000 users. These donations are now supporting 24 staff members. Part of their success is credited to an “in-app donations appeal” that they launched at the end of 2022. # 10th May 2023, 12:14 am

GitHub code search is generally available. I’ve been a beta user of GitHub’s new code search for a year and a half now and I wouldn’t want to be without it. It’s spectacularly useful: it provides fast, regular-expression-capable search across every public line of code hosted by GitHub—plus code in private repos you have access to.

I mainly use it to compensate for libraries with poor documentation—I can usually find an example of exactly what I want to do somewhere on GitHub.

It’s also great for researching how people are using libraries that I’ve released myself—to figure out how much pain deprecating a method would cause, for example. # 8th May 2023, 6:52 pm

Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs (via) There’s a lot to absorb about this one. Mosaic trained this model from scratch on 1 trillion tokens, at a cost of $200,000 taking 9.5 days. It’s Apache-2.0 licensed and the model weights are available today.
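Those three headline numbers are easier to compare after a little arithmetic. This sketch derives per-unit figures from the token count, cost and duration quoted above. The function name is hypothetical, and the results are only as precise as the rounded inputs.

```python
TOKENS = 1_000_000_000_000  # 1 trillion tokens, as quoted
COST_USD = 200_000          # as quoted
DAYS = 9.5                  # as quoted


def training_stats(tokens, cost_usd, days):
    # Derive simple per-unit rates from the three headline figures.
    seconds = days * 24 * 60 * 60
    return {
        "usd_per_billion_tokens": cost_usd / (tokens / 1e9),
        "tokens_per_second": tokens / seconds,
        "usd_per_day": cost_usd / days,
    }


stats = training_stats(TOKENS, COST_USD, DAYS)
for key, value in stats.items():
    print(f"{key}: {value:,.0f}")
```

That works out to $200 per billion training tokens, at a throughput of roughly 1.2 million tokens per second.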

They’re accompanying the base model with an instruction-tuned model called MPT-7B-Instruct (licensed for commercial use) and a non-commercially licensed MPT-7B-Chat trained using OpenAI data. They also announced MPT-7B-StoryWriter-65k+—“a model designed to read and write stories with super long context lengths”—with a previously unheard of 65,000 token context length.

They’re releasing these models mainly to demonstrate how inexpensive and powerful their custom model training service is. It’s a very convincing demo! # 5th May 2023, 7:05 pm