NuExtract 1.5 |
https://numind.ai/blog/nuextract-1-5---multilingual-infinite-context-still-small-and-better-than-gpt-4o |
Structured extraction - where an LLM helps turn unstructured text (or image content) into structured data - remains one of the most directly useful applications of LLMs.
NuExtract is a family of small models directly trained for this purpose, and released under the MIT license.
It comes in a variety of shapes and sizes:
- [NuExtract-v1.5](https://huggingface.co/numind/NuExtract-1.5) is a 3.8B parameter model fine-tuned on [Phi-3.5-mini instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct). You can try this one out in [this playground](https://huggingface.co/spaces/numind/NuExtract-1.5).
- [NuExtract-tiny-v1.5](https://huggingface.co/numind/NuExtract-1.5-tiny) is 494M parameters, fine-tuned on [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B).
- [NuExtract-1.5-smol](https://huggingface.co/numind/NuExtract-1.5-smol) is 1.7B parameters, fine-tuned on [SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B).
All three models were fine-tuned on NuMind's "private high-quality dataset". It's interesting to see a model family that uses one fine-tuning set against three completely different base models.
Useful tip [from Steffen Röcker](https://twitter.com/sroecker/status/1857846899123827168):
> Make sure to use it with low temperature, I've uploaded [NuExtract-tiny-v1.5 to Ollama](https://ollama.com/sroecker/nuextract-tiny-v1.5) and set it to 0. With the Ollama default of 0.7 it started repeating the input text. It works really well despite being so smol. |
2024-11-16 16:33:17+00:00 |
Voting opens for Oxford Word of the Year 2024 |
https://corp.oup.com/news/voting-opens-for-oxford-word-of-the-year-2024/ |
One of the options is [slop](https://simonwillison.net/tags/slop/)!
> **slop (n.)**: Art, writing, or other content generated using artificial intelligence, shared and distributed online in an indiscriminate or intrusive way, and characterized as being of low quality, inauthentic, or inaccurate. |
2024-11-15 18:46:10+00:00 |
Recraft V3 |
https://www.recraft.ai/blog/recraft-introduces-a-revolutionary-ai-model-that-thinks-in-design-language |
Recraft are a generative AI design tool startup based out of London who released their v3 model a few weeks ago. It's currently sat at the top of the [Artificial Analysis Image Arena Leaderboard](https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard), beating Midjourney and Flux 1.1 pro.
The thing that impressed me is that it can generate both raster *and* vector graphics... and the vector graphics can be exported as SVG!
Here's what I got for `raccoon with a sign that says "I love trash"` - [SVG here](https://static.simonwillison.net/static/2024/racoon-trash.svg).
![Cute vector cartoon raccoon holding a sign that says I love trash - in the recraft.ai UI which is set to vector and has export options for PNG, JPEG, SVG and Lottie](https://static.simonwillison.net/static/2024/recraft-ai.jpg)
That's an editable SVG - when I open it up in Pixelmator I can select and modify the individual paths and shapes:
![Pixelmator UI showing the SVG with a sidebar showing each of the individual shapes - I have selected three hearts and they now show resize handles and the paths are highlighted in the sidebar](https://static.simonwillison.net/static/2024/recraft-pixelmator.jpg)
They also have [an API](https://www.recraft.ai/docs). I spent $1 on 1000 credits and then spent 80 credits (8 cents) making this SVG of a [pelican riding a bicycle](https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/), using my API key stored in 1Password:
    export RECRAFT_API_TOKEN="$(
      op item get recraft.ai --fields label=password \
      --format json | jq .value -r)"
    curl https://external.api.recraft.ai/v1/images/generations \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $RECRAFT_API_TOKEN" \
      -d '{
        "prompt": "california brown pelican riding a bicycle",
        "style": "vector_illustration",
        "model": "recraftv3"
      }'
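The same request can be built with Python's standard library. This is a rough sketch, not Recraft's official client: `build_request` is a made-up helper name, and the endpoint, headers and JSON fields are copied from the curl call above. It only constructs the request; sending it is left as a commented-out step.

```python
import json
import os
import urllib.request

API_URL = "https://external.api.recraft.ai/v1/images/generations"


def build_request(prompt, token, style="vector_illustration", model="recraftv3"):
    """Build (but do not send) the same POST request as the curl command."""
    body = json.dumps({"prompt": prompt, "style": style, "model": model})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Token comes from the same environment variable the shell example exports
    token = os.environ.get("RECRAFT_API_TOKEN", "")
    req = build_request("california brown pelican riding a bicycle", token)
    print(req.get_method(), req.full_url)
    # To actually spend credits, uncomment:
    # with urllib.request.urlopen(req) as response:
    #     print(json.load(response))
```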
![A really rather good SVG of a California Brown Pelican riding a bicycle](https://static.simonwillison.net/static/2024/recraft-ai-pelican.svg) |
2024-11-15 04:24:09+00:00 |
OpenAI Public Bug Bounty |
https://bugcrowd.com/engagements/openai |
Reading [this investigation](https://0din.ai/blog/prompt-injecting-your-way-to-shell-openai-s-containerized-chatgpt-environment) of the security boundaries of OpenAI's Code Interpreter environment helped me realize that the rules for OpenAI's public bug bounty inadvertently double as the missing details for a whole bunch of different aspects of their platform.
This description of Code Interpreter is significantly more useful than their official documentation!
> Code execution from within our sandboxed Python code interpreter is out of scope. (This is an intended product feature.) When the model executes Python code it does so within a sandbox. If you think you've gotten RCE *outside* the sandbox, you **must** include the output of `uname -a`. A result like the following indicates that you are inside the sandbox -- specifically note the 2016 kernel version:
>
> ```
> Linux 9d23de67-3784-48f6-b935-4d224ed8f555 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
> ```
>
> Inside the sandbox you would also see `sandbox` as the output of `whoami`, and as the only user in the output of `ps`. |
2024-11-14 23:44:00+00:00 |
PyPI now supports digital attestations |
https://blog.pypi.org/posts/2024-11-14-pypi-now-supports-digital-attestations/ |
Dustin Ingram:
> PyPI package maintainers can now publish signed digital attestations when publishing, in order to further increase trust in the supply-chain security of their projects. Additionally, a new API is available for consumers and installers to verify published attestations.
This has been in the works for a while, and is another component of PyPI's approach to supply chain security for Python packaging - see [PEP 740 – Index support for digital attestations](https://peps.python.org/pep-0740/) for all of the underlying details.
A key problem this solves is cryptographically linking packages published on PyPI to the exact source code that was used to build those packages. In the absence of this feature there are no guarantees that the `.tar.gz` or `.whl` file you download from PyPI hasn't been tampered with (to add malware, for example) in a way that's not visible in the published source code.
These new attestations provide a mechanism for proving that a known, trustworthy build system was used to generate and publish the package, starting with its source code on GitHub.
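One small piece of that guarantee is a digest comparison: an attestation identifies the file it covers by its SHA-256 hash, so a file that has been altered after the build no longer matches. A minimal sketch of just that step, using a hypothetical `matches_digest` helper. Real verification also has to check the signature and the publisher identity, which this does not attempt.

```python
import hashlib


def matches_digest(data: bytes, expected_sha256: str) -> bool:
    """Return True if the SHA-256 of data equals the expected hex digest."""
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


# Stand-in bytes for a downloaded .whl file (not a real package)
original = b"pretend this is a wheel"
digest = hashlib.sha256(original).hexdigest()

print(matches_digest(original, digest))                  # True
print(matches_digest(original + b" + malware", digest))  # False
```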
The good news is that if you're using the PyPI Trusted Publishers mechanism in GitHub Actions to publish packages, you're already using this new system. I wrote about that system in January: [Publish Python packages to PyPI with a python-lib cookiecutter template and GitHub Actions](https://simonwillison.net/2024/Jan/16/python-lib-pypi/) - and hundreds of my own PyPI packages are already using that system, thanks to my various cookiecutter templates.
Trail of Bits helped build this feature, and provide extra background about it on their own blog in [Attestations: A new generation of signatures on PyPI](https://blog.trailofbits.com/2024/11/14/attestations-a-new-generation-of-signatures-on-pypi/):
> [As of October 29](https://github.com/pypa/gh-action-pypi-publish/releases/tag/v1.11.0), attestations are the default for anyone using Trusted Publishing via the [PyPA publishing action for GitHub](https://github.com/marketplace/actions/pypi-publish). That means roughly 20,000 packages can now attest to their provenance *by default*, with no changes needed.
They also built [Are we PEP 740 yet?](https://trailofbits.github.io/are-we-pep740-yet/) ([key implementation here](https://github.com/trailofbits/are-we-pep740-yet/blob/a87a8895dd238d14af50aaa2675c81060aa52846/utils.py#L31-L72)) to track the rollout of attestations across the 360 most downloaded packages from PyPI. It works by hitting URLs such as <https://pypi.org/simple/pydantic/> with an `Accept: application/vnd.pypi.simple.v1+json` header - [here's the JSON that returns](https://gist.github.com/simonw/8cf8a850739e2865cf3b9a74e6461b28).
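A rough sketch of that kind of check, not Trail of Bits' actual code. It assumes, following PEP 740, that each entry in the response's `files` list carries a `provenance` key that is either `null` or a URL. The `fetch_index` and `files_with_provenance` names are invented here, and the sample data is hand-written.

```python
import json
import urllib.request

ACCEPT = "application/vnd.pypi.simple.v1+json"


def fetch_index(package):
    """Fetch the JSON simple index for a package from PyPI (needs network)."""
    req = urllib.request.Request(
        f"https://pypi.org/simple/{package}/", headers={"Accept": ACCEPT}
    )
    with urllib.request.urlopen(req) as response:
        return json.load(response)


def files_with_provenance(index):
    """Return filenames whose entry has a non-empty 'provenance' value."""
    return [f["filename"] for f in index.get("files", []) if f.get("provenance")]


# Hand-written sample in the assumed shape, not real PyPI output
sample = {
    "files": [
        {"filename": "demo-1.0.tar.gz", "provenance": None},
        {"filename": "demo-1.1.tar.gz", "provenance": "https://example.com/p"},
    ]
}
print(files_with_provenance(sample))  # ['demo-1.1.tar.gz']
```

Against the live index that would be `files_with_provenance(fetch_index("pydantic"))`.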
I published an alpha package using Trusted Publishers last night and the [files for that release](https://pypi.org/project/llm/0.18a0/#llm-0.18a0-py3-none-any.whl) are showing the new provenance information already:
![Provenance. The following attestation bundles were made for llm-0.18a0-py3-none-any.whl: Publisher: publish.yml on simonw/llm Attestations: Statement type: https://in-toto.io/Statement/v1 Predicate type: https://docs.pypi.org/attestations/publish/v1 Subject name: llm-0.18a0-py3-none-any.whl Subject digest: dde9899583172e6434971d8cddeb106bb535ae4ee3589cb4e2d525a4526976da Sigstore transparency entry: 148798240 Sigstore integration time: about 18 hours ago](https://static.simonwillison.net/static/2024/provenance.jpg)
Which links to [this Sigstore log entry](https://search.sigstore.dev/?logIndex=148798240) with more details, including [the Git hash](https://github.com/simonw/llm/tree/041730d8b2bc12f62cfe41c44b62a03ef4790117) that was used to build the package:
![X509v3 extensions: Key Usage (critical): - Digital Signature Extended Key Usage: - Code Signing Subject Key Identifier: - 4E:D8:B4:DB:C1:28:D5:20:1A:A0:14:41:2F:21:07:B4:4E:EF:0B:F1 Authority Key Identifier: keyid: DF:D3:E9:CF:56:24:11:96:F9:A8:D8:E9:28:55:A2:C6:2E:18:64:3F Subject Alternative Name (critical): url: - https://github.com/simonw/llm/.github/workflows/publish.yml@refs/tags/0.18a0 OIDC Issuer: https://token.actions.githubusercontent.com GitHub Workflow Trigger: release GitHub Workflow SHA: 041730d8b2bc12f62cfe41c44b62a03ef4790117 GitHub Workflow Name: Publish Python Package GitHub Workflow Repository: simonw/llm GitHub Workflow Ref: refs/tags/0.18a0 OIDC Issuer (v2): https://token.actions.githubusercontent.com Build Signer URI: https://github.com/simonw/llm/.github/workflows/publish.yml@refs/tags/0.18a0 Build Signer Digest: 041730d8b2bc12f62cfe41c44b62a03ef4790117](https://static.simonwillison.net/static/2024/sigstore.jpg)
[Sigstore](https://www.sigstore.dev/) is a transparency log maintained by the [Open Source Security Foundation (OpenSSF)](https://en.wikipedia.org/wiki/Open_Source_Security_Foundation), a sub-project of the Linux Foundation. |
2024-11-14 19:56:49+00:00 |
QuickTime video script to capture frames and bounding boxes |
https://til.simonwillison.net/macos/quicktime-capture-script#user-content-a-version-that-captures-bounding-box-regions-too |
An update to an older TIL. I'm working on the write-up for my DjangoCon US talk on plugins and I found myself wanting to capture individual frames from the video in two formats: a full frame capture, and another that captured just the portion of the screen shared from my laptop.
I have a script for the former, so I [got Claude](https://gist.github.com/simonw/799babf92e1eaf36a5336b4889f72492) to update my script to add support for one or more `--box` options, like this:
    capture-bbox.sh ../output.mp4 --box '31,17,100,87' --box '0,0,50,50'
Open `output.mp4` in QuickTime Player, run that script and then every time you hit a key in the terminal app it will capture three JPEGs from the current position in QuickTime Player - one for the whole screen and one each for the specified bounding box regions.
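Each `--box` value is four comma-separated numbers. As a sketch of how one could be turned into a pixel crop rectangle, this assumes the order is left, top, right, bottom on a 0 to 100 scale relative to the frame. The order is a guess, and `box_to_pixels` is a made-up helper, not part of the script.

```python
def box_to_pixels(box, width, height):
    """Convert 'x1,y1,x2,y2' percentages into (left, top, crop_width, crop_height)."""
    x1, y1, x2, y2 = (float(part) for part in box.split(","))
    if not (0 <= x1 < x2 <= 100 and 0 <= y1 < y2 <= 100):
        raise ValueError(f"Invalid box: {box}")
    # Scale each edge from a percentage to a pixel position
    left = round(width * x1 / 100)
    top = round(height * y1 / 100)
    right = round(width * x2 / 100)
    bottom = round(height * y2 / 100)
    return left, top, right - left, bottom - top


print(box_to_pixels("0,0,50,50", 1920, 1080))  # (0, 0, 960, 540)
```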
Those bounding box regions are percentages of the width and height of the image. I also got Claude to build me [this interactive tool](https://tools.simonwillison.net/bbox-cropper) on top of [cropperjs](https://github.com/fengyuanchen/cropperjs) to help figure out those boxes:
![Screenshot of the tool. A frame from a video of a talk I gave at DjangoCon US is shown, with a crop region on it using drag handles for the different edges of the crop. Below that is a box showing --bbox '31,17,99,86'](https://static.simonwillison.net/static/2024/bbox-tool.jpg) |
2024-11-14 19:00:54+00:00 |
Releasing the largest multilingual open pretraining dataset |
https://huggingface.co/datasets/PleIAs/common_corpus |
Common Corpus is a new "open and permissible licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens)" released by French AI lab PleIAs.
This appears to be the largest available corpus of openly licensed training data:
- 926,541,096,243 tokens of public domain books, newspapers, and Wikisource content
- 387,965,738,992 tokens of government financial and legal documents
- 334,658,896,533 tokens of open source code from GitHub
- 221,798,136,564 tokens of academic content from open science repositories
- 132,075,315,715 tokens from Wikipedia, YouTube Commons, StackExchange and other permissively licensed web sources
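Those five figures add up exactly to the quoted total. A quick check, which also prints each category's share (the shortened category labels are mine):

```python
# Token counts as listed above
sources = {
    "Public domain books, newspapers, Wikisource": 926_541_096_243,
    "Government financial and legal documents": 387_965_738_992,
    "Open source code from GitHub": 334_658_896_533,
    "Academic content": 221_798_136_564,
    "Wikipedia, YouTube Commons, StackExchange, other web": 132_075_315_715,
}

total = sum(sources.values())
print(f"{total:,}")  # 2,003,039,184,047

# Share of the corpus contributed by each category
for name, tokens in sources.items():
    print(f"{tokens / total:6.1%}  {name}")
```

The public domain category comes out at a little under half of the corpus.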
It's majority English but has significant portions in French and German, and some representation for Latin, Dutch, Italian, Polish, Greek and Portuguese.
I can't wait to try some LLMs trained exclusively on this data. Maybe we will finally get a GPT-4 class model that isn't trained on unlicensed copyrighted data. |
2024-11-14 05:44:59+00:00 |
Ollama: Llama 3.2 Vision |
https://ollama.com/blog/llama3.2-vision |
Ollama released version 0.4 [last week](https://github.com/ollama/ollama/releases/tag/v0.4.0) with support for Meta's first Llama vision model, [Llama 3.2](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
If you have Ollama installed you can fetch the 11B model (7.9GB) like this:

    ollama pull llama3.2-vision

Or the larger 90B model (55GB download, likely needs ~88GB of RAM) like this:

    ollama pull llama3.2-vision:90b
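Once a model is pulled it can also be reached through Ollama's local HTTP API. A minimal sketch, assuming the server is on its default port 11434 and using the `/api/generate` endpoint, which takes base64-encoded images in an `images` list. `build_vision_payload` and `describe` are hypothetical helpers and `pelican.jpg` is a placeholder path.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_vision_payload(prompt, image_bytes, model="llama3.2-vision"):
    """Build the JSON body for a single non-streaming prompt with one image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe(image_path, prompt="describe"):
    """Send an image to a locally running Ollama server and return its reply."""
    with open(image_path, "rb") as f:
        payload = build_vision_payload(prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as response:
        return json.load(response)["response"]


# Requires a running Ollama server and a local image file:
# print(describe("pelican.jpg"))
```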
I was delighted to learn that Sukhbinder Singh had [already contributed](https://github.com/taketwo/llm-ollama/pull/15) support for [LLM attachments](https://simonwillison.net/2024/Oct/29/llm-multi-modal/) to Sergey Alexandrov's [llm-ollama](https://github.com/taketwo/llm-ollama) plugin, which means the following works once you've pulled the models:
    llm install --upgrade llm-ollama
    llm -m llama3.2-vision:latest 'describe' \
      -a https://static.simonwillison.net/static/2024/pelican.jpg
> This image features a brown pelican standing on rocks, facing the camera and positioned to the left of center. The bird's long beak is a light brown color with a darker tip, while its white neck is adorned with gray feathers that continue down to its body. Its legs are also gray.
>
> In the background, out-of-focus boats and water are visible, providing context for the pelican's environment.
That's not a bad description [of this image](https://static.simonwillison.net/static/2024/pelican.jpg), especially for a 7.9GB model that runs happily on my MacBook Pro. |
2024-11-13 01:55:31+00:00 |
django-plugin-django-debug-toolbar |
https://github.com/tomviner/django-plugin-django-debug-toolbar |
Tom Viner built a plugin for my [DJP Django plugin system](https://djp.readthedocs.io/) that configures the excellent [django-debug-toolbar](https://django-debug-toolbar.readthedocs.io/) debugging tool.
You can see everything it sets up for you [in this Python code](https://github.com/tomviner/django-plugin-django-debug-toolbar/blob/0.3.2/django_plugin_django_debug_toolbar/__init__.py): it configures installed apps, URL patterns and middleware and sets the `INTERNAL_IPS` and `DEBUG` settings.
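To give a feel for what that involves, here is a plain-Python approximation of the settings changes, operating on a dictionary instead of going through DJP's hooks. The app and middleware names come from django-debug-toolbar's documentation. The `enable_debug_toolbar` function, the `DEBUG = True` value and the middleware position are assumptions made here, not a copy of Tom's code, and URL patterns are left out.

```python
def enable_debug_toolbar(settings):
    """Return a copy of a settings dict with django-debug-toolbar switched on."""
    updated = dict(settings)

    apps = list(updated.get("INSTALLED_APPS", []))
    if "debug_toolbar" not in apps:
        apps.append("debug_toolbar")
    updated["INSTALLED_APPS"] = apps

    toolbar = "debug_toolbar.middleware.DebugToolbarMiddleware"
    middleware = list(updated.get("MIDDLEWARE", []))
    if toolbar not in middleware:
        # Assumption: place the toolbar first so it wraps the other middleware
        middleware.insert(0, toolbar)
    updated["MIDDLEWARE"] = middleware

    # The toolbar is only displayed for requests from these addresses
    updated["INTERNAL_IPS"] = ["127.0.0.1"]
    updated["DEBUG"] = True  # assumed value
    return updated


base = {"INSTALLED_APPS": ["django.contrib.admin"], "MIDDLEWARE": []}
print(enable_debug_toolbar(base)["INSTALLED_APPS"])
# ['django.contrib.admin', 'debug_toolbar']
```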
Here are Tom's [running notes](https://github.com/tomviner/django-plugin-django-debug-toolbar/issues/1) as he created the plugin. |
2024-11-13 01:14:22+00:00 |
Ars Live: Our first encounter with manipulative AI |
https://arstechnica.com/ai/2024/11/join-ars-live-nov-19-to-dissect-microsofts-rogue-ai-experiment/ |
I'm participating in a live conversation with Benj Edwards on 19th November reminiscing over that incredible time back in February last year [when Bing went feral](https://simonwillison.net/2023/Feb/15/bing/).
![A promotional image for an Ars Technica live chat event: NOVEMBER 19TH, 4:00 PM ET / 3:00 PM CT features the orange Ars Technica logo and event title Bing Chat: Our First Encounter with Manipulative AI. Below A LIVE CHAT WITH are headshots and details for two speakers: Simon Willison (Independent Researcher, Creator of Datasette) and Benj Edwards (Senior AI Reporter, Ars Technica). The image shows STREAMING LIVE AT YOUTUBE.COM/@ARSTECHNICA at the bottom.](https://static.simonwillison.net/static/2024/ars-live.jpg) |
2024-11-12 23:58:44+00:00 |