Practical Unicode, please!
Joel Spolsky has joined Tim Bray in the quest to educate the masses about the importance of Unicode. Dan Sugalski kicks in as well with "What the heck is: A string", a lengthy essay about string handling and why it really is a lot more complicated than you think it is.
These should all be required reading for anyone involved in programming and web development. Unfortunately, they all lack one critical aspect: practical advice. Having read all three, I feel like I could lecture for an hour on code points, glyphs, ASCII, byte order and a whole bunch of other topics. But when it comes to updating my blogging system to support comments written in Japanese, I’m still almost as clueless as I was before I read any of them.
Enough of the theory: the web needs practical advice on developing Unicode-enabled web pages and web applications. Is it just a case of ensuring my text editor is “saving as Unicode”? What about storage: can I throw Unicode at MySQL and expect it to come out again? If I serve a page up with Japanese characters in it, what will my users have to do to be able to read them? It’s a big, confusing world out there.
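For what it's worth, the editor and storage questions both seem to reduce to the same thing: does text survive a round trip through bytes? Here is a minimal Python sketch of that round trip. It assumes UTF-8 is the encoding used at every step, which is my assumption rather than anything the three essays prescribe, and the comment text and variable names are made up for illustration.

```python
# Hypothetical comment text; any Japanese string would do.
comment = "こんにちは世界"

# Encoding turns the string into bytes, which is what actually gets
# written to a file, a database column, or an HTTP response body.
stored = comment.encode("utf-8")

# Decoding with the same encoding recovers the original text.
assert stored.decode("utf-8") == comment

# Decoding with a different encoding does not fail here, it just
# produces garbage, because Latin-1 maps every byte to some character.
garbled = stored.decode("latin-1")
assert garbled != comment

# Assumption: the page is served with a header like this one, so the
# browser decodes the bytes with the same encoding used to write them.
content_type = "text/html; charset=utf-8"

print(len(comment), "characters became", len(stored), "bytes")
```

The sketch says nothing about whether a particular editor or a particular MySQL setup preserves those bytes, which is exactly the kind of practical detail I'm still missing.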