Simon Willison’s Weblog


5 items tagged “strings”

2024

Tagged Pointer Strings (2015) (via) Mike Ash digs into a fascinating implementation detail of macOS.

Tagged pointers provide a way to embed a literal value in a pointer reference. Objective-C pointers on macOS are 64-bit, providing plenty of space for representing entire values. If the least significant bit is 1 (the pointer is an odd 64-bit number) then the pointer is "tagged" and represents a value, not a memory reference.
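To make the tag-bit idea concrete, here is a minimal sketch in Python that treats a pointer as a plain 64-bit integer. The low-bit check comes from the description above; the choice to reserve the low 4 bits and keep a 60-bit payload above them is an assumption for illustration, not Apple's documented layout, and the names `is_tagged`, `make_tagged` and `payload_of` are hypothetical.

```python
TAG_BIT = 0x1        # least significant bit marks a tagged pointer
RESERVED_BITS = 4    # assumption: low 4 bits hold the tag and type info
PAYLOAD_BITS = 60    # 64 bits minus the reserved bits


def is_tagged(pointer: int) -> bool:
    """True if the pointer value is odd, i.e. its lowest bit is set."""
    return (pointer & TAG_BIT) == 1


def make_tagged(payload: int) -> int:
    """Embed a payload of up to 60 bits in a tagged pointer value."""
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload must fit in 60 bits")
    return (payload << RESERVED_BITS) | TAG_BIT


def payload_of(pointer: int) -> int:
    """Recover the embedded payload from a tagged pointer value."""
    if not is_tagged(pointer):
        raise ValueError("not a tagged pointer")
    return pointer >> RESERVED_BITS
```

An aligned heap address is always even, so under this scheme it can never be mistaken for a tagged value.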

Here's where things get really clever. Storing an integer value up to 60 bits is easy. But what about strings?

There's enough space for three UTF-16 characters, with 12 bits left over. But if the string fits ASCII we can store 7 characters.

Drop everything except a-z A-Z.0-9 and we need 6 bits per character, allowing 10 characters to fit in the pointer.

Apple take this a step further: if the string contains just eilotrm.apdnsIc ufkMShjTRxgC4013 ("b" is apparently uncommon enough to be ignored here) they can store 11 characters in those 60 bits!
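The five-bit trick can be sketched in a few lines of Python. The 32-character alphabet and the 11-character limit are taken from the text above; everything else (the bit order, passing the length separately, the names `pack` and `unpack`) is an assumption for illustration rather than Apple's actual layout.

```python
FIVE_BIT_ALPHABET = "eilotrm.apdnsIc ufkMShjTRxgC4013"  # 32 symbols
BITS_PER_CHAR = 5    # 2**5 == 32, one code per alphabet entry
MAX_CHARS = 11       # limit quoted in the post (11 * 5 = 55 bits)


def pack(text: str) -> int:
    """Pack text into an integer, 5 bits per character.

    Raises ValueError if the text is too long or uses a character
    outside the alphabet (for example "b").
    """
    if len(text) > MAX_CHARS:
        raise ValueError("too long for the five-bit encoding")
    value = 0
    for ch in text:
        # str.index raises ValueError for characters not in the alphabet
        value = (value << BITS_PER_CHAR) | FIVE_BIT_ALPHABET.index(ch)
    return value


def unpack(value: int, length: int) -> str:
    """Reverse pack(). The length is needed because 'e' encodes as 0."""
    mask = (1 << BITS_PER_CHAR) - 1
    chars = []
    for _ in range(length):
        chars.append(FIVE_BIT_ALPHABET[value & mask])
        value >>= BITS_PER_CHAR
    return "".join(reversed(chars))
```

For example, `pack("information")` yields an integer that fits in 55 bits, and `unpack(pack("information"), 11)` gives the string back. A real implementation would also have to store the length somewhere in the pointer.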

# 8th May 2024, 2:23 pm / c, objectivec, strings

2019

datasette-jellyfish. I learned about a handy Python library called Jellyfish which implements approximate and phonetic matching of strings—soundex, metaphone, porter stemming, levenshtein distance and more. I’ve built a simple Datasette plugin which wraps the library and makes each of those algorithms available as a SQL function.
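As a rough illustration of what exposing a string-matching algorithm as a SQL function looks like, here is a sketch using only the Python standard library. It does not use Jellyfish or the plugin's code: the Levenshtein implementation is a plain dynamic-programming version, and the SQL function name `levenshtein` and the helper `connect_with_levenshtein` are hypothetical.

```python
import sqlite3


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def connect_with_levenshtein() -> sqlite3.Connection:
    """Open an in-memory database with the function registered."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("levenshtein", 2, levenshtein)
    return conn
```

With that connection, `select levenshtein('kitten', 'sitting')` can be run like any other SQL expression.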

# 9th March 2019, 6:29 pm / strings, datasette

String length - Rosetta Code (via) Calculating the length of a string is surprisingly difficult once Unicode is involved. Here's a fascinating illustration of how that problem can be attacked in dozens of different programming languages. From that page: the string "J̲o̲s̲é̲" ("J\x{332}o\x{332}s\x{332}e\x{301}\x{332}") has 4 user-visible graphemes, 9 characters (code points), and 14 bytes when encoded in UTF-8.
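The three counts for that example string can be reproduced in Python with the standard library. Code points and UTF-8 bytes are exact; the grapheme count here is only an approximation that skips combining marks, which happens to work for this string but is not full Unicode grapheme segmentation. The function name `measure` is hypothetical.

```python
import unicodedata


def measure(text: str) -> tuple[int, int, int]:
    """Return (approx_graphemes, code_points, utf8_bytes) for text."""
    # Approximation: count only code points that are not combining marks.
    approx_graphemes = sum(
        1 for ch in text if unicodedata.combining(ch) == 0
    )
    code_points = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    return approx_graphemes, code_points, utf8_bytes


# The example from the Rosetta Code page, written with escapes:
# J, o, s each followed by U+0332, then e + U+0301 + U+0332.
jose = "J\u0332o\u0332s\u0332e\u0301\u0332"
print(measure(jose))
```

Emoji sequences, Hangul and other scripts need a proper grapheme cluster implementation, so treat the first number as a rough guide only.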

# 22nd February 2019, 3:27 pm / programming-languages, strings, unicode

2007

String types in Python 3. bytes are now immutable (just like the bytestrings they are replacing) and a new mutable buffer type has been introduced.
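A small sketch of the difference in practice. It assumes a released version of Python 3, where the mutable type is called `bytearray`; the helper name `try_mutate` is hypothetical.

```python
def try_mutate(data) -> bool:
    """Try to overwrite the first byte; report whether it worked."""
    try:
        data[0] = ord("J")
    except TypeError:
        # bytes objects do not support item assignment
        return False
    return True


immutable = b"hello"
mutable = bytearray(b"hello")

print(try_mutate(immutable), immutable)
print(try_mutate(mutable), bytes(mutable))
```

The `bytes` object is left unchanged, while the `bytearray` is modified in place.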

# 9th October 2007, 2:08 am / buffers, bytes, bytestrings, python, python3, strings, unicode

How should JSON strings be represented in Erlang? Erlang’s poor support for strings makes this a surprisingly tricky question.

# 14th September 2007, 8:17 am / erlang, json, strings, tonygarnockjones