Simon Willison’s Weblog

Subscribe

34 items tagged “c”

2024

Introducing sqlite-lembed: A SQLite extension for generating text embeddings locally (via) Alex Garcia's latest SQLite extension is a C wrapper around the llama.cpp that exposes just its embedding support, allowing you to register a GGUF file containing an embedding model:

INSERT INTO temp.lembed_models(name, model)
  select 'all-MiniLM-L6-v2',
  lembed_model_from_file('all-MiniLM-L6-v2.e4ce9877.q8_0.gguf');

And then use it to calculate embeddings as part of a SQL query:

select lembed(
  'all-MiniLM-L6-v2',
  'The United States Postal Service is an independent agency...'
); -- X'A402...09C3' (1536 bytes)

all-MiniLM-L6-v2.e4ce9877.q8_0.gguf here is a 24MB file, so this should run quite happily even on machines without much available RAM.

What if you don't want to run the models locally at all? Alex has another new extension for that, described in Introducing sqlite-rembed: A SQLite extension for generating text embeddings from remote APIs. The rembed is for remote embeddings, and this extension uses Rust to call multiple remotely-hosted embeddings APIs, registered like this:

INSERT INTO temp.rembed_clients(name, options)
  VALUES ('text-embedding-3-small', 'openai');
select rembed(
  'text-embedding-3-small',
  'The United States Postal Service is an independent agency...'
); -- X'A452...01FC', Blob<6144 bytes>

Here's the Rust code that implements Rust wrapper functions for HTTP JSON APIs from OpenAI, Nomic, Cohere, Jina, Mixedbread and localhost servers provided by Ollama and Llamafile.

Both of these extensions are designed to complement Alex's sqlite-vec extension, which is nearing a first stable release.

# 25th July 2024, 8:30 pm / c, sqlite, rust, alex-garcia, embeddings

Tagged Pointer Strings (2015) (via) Mike Ash digs into a fascinating implementation detail of macOS.

Tagged pointers provide a way to embed a literal value in a pointer reference. Objective-C pointers on macOS are 64 bit, providing plenty of space for representing entire values. If the least significant bit is 1 (the pointer is a 64 bit odd number) then the pointer is "tagged" and represents a value, not a memory reference.

Here's where things get really clever. Storing an integer value up to 60 bits is easy. But what about strings?

There's enough space for three UTF-16 characters, with 12 bits left over. But if the string fits ASCII we can store 7 characters.

Drop everything except a-z A-Z.0-9 and we need 6 bits per character, allowing 10 characters to fit in the pointer.

Apple take this a step further: if the string contains just eilotrm.apdnsIc ufkMShjTRxgC4013 ("b" is apparently uncommon enough to be ignored here) they can store 11 characters in that 60 bits!

# 8th May 2024, 2:23 pm / c, objectivec, strings

I’m writing a new vector search SQLite Extension. Alex Garcia is working on sqlite-vec, a spiritual successor to his sqlite-vss project. The new SQLite C extension will have zero other dependencies (sqlite-vss used some tricky C++ libraries) and will work using virtual tables, storing chunks of vectors in shadow tables to avoid needing to load everything into memory at once.

# 3rd May 2024, 3:16 am / c, sqlite, vectors, alex-garcia, embeddings

llm.c (via) Andrej Karpathy implements LLM training—initially for GPT-2, other architectures to follow—in just over 1,000 lines of C on top of CUDA. Includes a tutorial about implementing LayerNorm by porting an implementation from Python.

# 9th April 2024, 3:24 pm / c, ai, andrej-karpathy, generative-ai, llms

Hello World (via) Lennon McLean dives deep down the rabbit hole of what happens when you execute the binary compiled from “Hello world” in C on a Linux system, digging into the details of ELF executables, objdump disassembly, the C standard library, stack frames, null-terminated strings and taking a detour through musl because it’s easier to read than Glibc.

# 9th April 2024, 1:06 am / c, linux

Building and testing C extensions for SQLite with ChatGPT Code Interpreter

Visit Building and testing C extensions for SQLite with ChatGPT Code Interpreter

I wrote yesterday about how I used Claude and ChatGPT Code Interpreter for simple ad-hoc side quests—in that case, for converting a shapefile to GeoJSON and merging it into a single polygon.

[... 4,612 words]

Beej’s Guide to Networking Concepts (via) Beej’s Guide to Network Programming is a legendary tutorial on network programming in C, continually authored and updated by Brian “Beej” Hall since 1995.

This is NOT that. Beej’s Guide to Networking Concepts is brand new—started in March 2023—and illustrates a whole bunch of networking concepts using Python instead of C.

From the forward: “Is it Beej’s Guide to Network Programming in Python? Well, kinda, actually. The C book is more about how C’s (well, Unix’s) network API works. And this book is more about the concepts underlying it, using Python as a vehicle.”

# 30th January 2024, 10:08 pm / c, networking, python

Find a level of abstraction that works for what you need to do. When you have trouble there, look beneath that abstraction. You won’t be seeing how things really work, you’ll be seeing a lower-level abstraction that could be helpful. Sometimes what you need will be an abstraction one level up. Is your Python loop too slow? Perhaps you need a C loop. Or perhaps you need numpy array operations.

You (probably) don’t need to learn C.

Ned Batchelder

# 24th January 2024, 6:25 pm / abstractions, c, ned-batchelder, programming, python

2023

jo (via) Neat little C utility (available via brew/apt-get install etc) for conveniently outputting JSON from a shell: “jo -p name=jo n=17 parser=false” will output a JSON object with string, integer and boolean values, and you can nest it to create nested objects. Looks very handy.

# 8th October 2023, 5:20 am / c, json

TG: Polygon indexing (via) TG is a brand new geospatial library by Josh Baker, author of the Tile38 in-memory spatial server (kind of a geospatial Redis). TG is written in pure C and delivered as a single C file, reminiscent of the SQLite amalgamation.

TG looks really interesting. It implements almost the exact subset of geospatial functionality that I find most useful: point-in-polygon, intersect, WKT, WKB, and GeoJSON—all with no additional dependencies.

The most interesting thing about it is the way it handles indexing. In this documentation Josh describes two approaches he uses to speeding up point-in-polygon and intersection using a novel approach that goes beyond the usual RTree implementation.

I think this could make the basis of a really useful SQLite extension—a lighter-weight alternative to SpatiaLite.

# 23rd September 2023, 4:32 am / c, geospatial, gis, spatialite, sqlite, geojson, tg

Dynamic linker tricks: Using LD_PRELOAD to cheat, inject features and investigate programs (via) This tutorial by Rafał Cieślak from 2013 filled in a bunch of gaps in my knowledge about how C works on Linux.

# 8th September 2023, 10:05 pm / c, linux

2022

redbean (via) “redbean makes it possible to share web applications that run offline as a single-file αcτµαlly pδrταblε εxεcµταblε zip archive which contains your assets. All you need to do is download the redbean.com program below, change the filename to .zip, add your content in a zip editing tool, and then change the extension back to .com”.

redbean is implemented as a single C file with a dazzling array of clever tricks—most impressively, the single executable works on Linux, macOS, Windows and various BSDs!

It embeds Lua, and in June last year added SQLite too—so self-contained distributable web applications built with Redbean can now use Lua and SQLite for dynamic scripting. Performance sounds incredible: “redbean can serve 1 million+ gzip encoded responses per second on a cheap personal computer”.

# 17th February 2022, 6:01 am / c, lua, sqlite, redbean, cosmopolitan

Running C unit tests with pytest (via) Brilliant, detailed tutorial by Gabriele Tornetta on testing C code using pytest, which also doubles up as a ctypes tutorial. There’s a lot of depth here—in addition to exercising C code through ctypes, Gabriele shows how to run each test in a separate process so that segmentation faults don’t fail the entire suite, then adds code to run the compiler as part of the pytest run, and then shows how to use gdb trickery to generate more useful stack traces.

# 12th February 2022, 5:14 pm / c, ctypes, pytest

Mypyc (via) Spotted this in the Black release notes: “Black is now compiled with mypyc for an overall 2x speed-up”. Mypyc is a tool that compiles Python modules (written in a subset of Python) to C extensions—similar to Cython but using just Python syntax, taking advantage of type annotations to perform type checking and type inference. It’s part of the mypy type checking project, which has been using it since 2019 to gain a 4x performance improvement over regular Python.

# 30th January 2022, 1:31 am / c, performance, python, mypy

2021

How to look at the stack with gdb. Useful short tutorial on gdb from first principles.

# 24th May 2021, 6:23 pm / c, debugger, julia-evans

cosmopolitan libc (via) “Cosmopolitan makes C a build-once run-anywhere language, similar to Java, except it doesn’t require interpreters or virtual machines be installed beforehand. [...] Instead, it reconfigures stock GCC to output a POSIX-approved polyglot format that runs natively on Linux + Mac + Windows + FreeBSD + OpenBSD + BIOS with the best possible performance and the tiniest footprint imaginable.” This is a spectacular piece of engineering.

# 27th February 2021, 6:02 am / c, cosmopolitan

2020

Unravelling `not` in Python (via) Part of a series where Brett Cannon looks at how fundamental Python syntactic sugar works, including a clearly explained dive into the underlying op codes and C implementation.

# 27th November 2020, 5:59 pm / c, python

CG-SQL (via) This is the toolkit the Facebook Messenger team wrote to bring stored procedures to SQLite. It implements a custom version of the T-SQL language which it uses to generate C code that can then be compiled into a SQLite module.

# 22nd October 2020, 6:25 pm / c, facebook, sqlite

Pikchr. Interesting new project from SQLite creator D. Richard Hipp. Pikchr is a new mini language for describing visual diagrams, designed to be embedded in Markdown documentation. It’s already enabled for the SQLite forum. Implementation is a no-dependencies C library and output is SVG.

# 21st October 2020, 4:02 pm / c, sqlite, svg, markdown, d-richard-hipp

A Compiler Writing Journey (via) Warren Toomey has been writing a self-compiling compiler for a subset of C, and extensively documenting every step of the journey here on GitHub. The result is an extremely high quality free textbook on compiler construction.

# 8th January 2020, 3:33 am / c, compilers

2019

Calling C functions from BigQuery with web assembly (via) Google BigQuery lets you define custom SQL functions in JavaScript, and it turns out they expose the WebAssembly.instantiate family of APIs. Which means you can write your UDD in C or Rust, compile it to WebAssembly and run it as part of your query!

# 27th October 2019, 5:55 am / c, sql, rust, webassembly

2017

A Regular Expression Matcher: Code by Rob Pike, Exegesis by Brian Kernighan (via) Delightfully clear and succinct 30-line C implementation of a regular expression matcher that supports $, ^, . and * operations.

# 5th December 2017, 6:36 pm / c, regular-expressions, rob-pike

C is a bit like Latin these days. We no longer write everything in it, but knowing it affords deeper knowledge of more-recent languages.

Norman Wilson

# 8th October 2017, 4:03 pm / c

2012

Do people still write and compile programs from the command line, instead of an IDE? Why or why not?

Being an expert with command line tools gives you super powers.

[... 94 words]

2010

Building a GeoIP server with ZeroMQ. ZeroMQ makes it trivially easy to write a network service in raw C that makes functionality from a C library (in this case the MaxMind GeoIP library) available to clients written in many different client languages.

# 9th November 2010, 9:36 am / c, geoip, zeromq, recovered

Mongrel2 is “Self-Hosting”. Zed Shaw’s Mongrel2 is shaping up to be a really interesting project. “A web server simply written in C that loves all languages equally”, the two most interesting new ideas are the ability to handle HTTP, Flash Sockets and WebSockets all on the same port (thanks to an extension to the Mongrel HTTP parser that can identify all three protocols) and the ability to hook Mongrel2 up to the backend servers using either TCP/IP or ZeroMQ. I’m guessing this means Mongrel2 could hold an HTTP request open, fire off some messages and wait for various backends to send messages back to construct the response, making async processing just as easy as a regular blocking request/response cycle.

# 17th June 2010, 8:11 pm / async, c, http, webserver, websockets, zed-shaw, zeromq, recovered, mongrel2

Redis weekly update #3—Pub/Sub and more. Redis is now a publish/subscribe server—and it ended up only taking 150 lines of C code since Redis internals were already based on that paradigm.

# 30th March 2010, 3:15 pm / c, nosql, pubsub, redis

2009

I think that what's particularly hard with C is not the details about pointers, automatic memory management, and so forth, but the fact that C is at the same time so low level and so flexible. So basically if you want to create a large project in C you have to build a number of intermediate layers (otherwise the code will be a complete mess full of bugs and 10 times bigger than required). This continue design exercise of creating additional layers is the hard part about C. You have to get very good at understanding when to write a function or not, when to create a layer of abstraction, and when it's worth to generalize or when it is an overkill.

Salvatore Sanfilippo

# 18th December 2009, 3:50 pm / c, redis, salvatore-sanfilippo

Mac OS X 10.6 Snow Leopard: the Ars Technica review. The essential review: 23 pages of information-dense but readable goodness. Pretty much everything I know about Mac OS X internals I learnt from reading John Siracusa’s reviews—this one is particularly juice when it gets to Grand Central Dispatch and blocks (aka closures) in C and Objective-C.

# 1st September 2009, 7:05 pm / apple, blocks, c, closures, grandcentraldispatch, john-siracusa, objectivec, osx, snowleopard

The compiler only pays attention to the semicolons and braces while ignoring the line breaks and indentation, but humans usually only pay attention to the line breaks and indentation while ignoring the semicolons and braces. This gives the code the opportunity to lie about what it’s really doing. Consequently we need to take extra care when writing in C, Java, C++, C# etc.

Elliotte Rusty Harold

# 2nd January 2009, 10:26 am / c, codestyle, elliotte-rusty-harold, indentation, java, syntax