Simon Willison’s Weblog

29 items tagged “unicode”

2021

Re-assessing the automatic charset decoding policy in HTTPX (via) Tom Christie ran an analysis of the top 1,000 most accessed websites (according to an older extract from Google’s Ad Planner service) and found that a full 5% of them both omitted a charset parameter and failed to decode as UTF-8. As a result, HTTPX will be depending on the charset-normalizer Python library to handle those cases. # 13th August 2021, 10:07 pm
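
The failure mode described here (no charset parameter, body not valid UTF-8) can be sketched with nothing but the standard library. This is an illustration, not HTTPX's code: `decode_body` is a hypothetical helper, and the final Latin-1 fallback is an assumption standing in for the detection step that a library such as charset-normalizer would perform.

```python
from typing import Optional


def decode_body(body: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode a response body: declared charset first, then UTF-8."""
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)
    candidates.append("utf-8")

    for encoding in candidates:
        try:
            return body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue

    # Assumption: a real client would run charset detection here.
    # Latin-1 maps every byte to a character, so this never raises.
    return body.decode("latin-1")
```

A body such as `b"caf\xe9"` is not valid UTF-8, so it falls through to the last branch and comes back as "café".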

2019

String length—Rosetta Code (via) Calculating the length of a string is surprisingly difficult once Unicode is involved. Here’s a fascinating illustration of how that problem can be attacked in dozens of different programming languages. From that page: the string “J̲o̲s̲é̲” (“J\x{332}o\x{332}s\x{332}e\x{301}\x{332}”) has 4 user-visible graphemes, 9 characters (code points), and 14 bytes when encoded in UTF-8. # 22nd February 2019, 3:27 pm
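
The three numbers quoted above can be reproduced in Python. The code point and byte counts are exact. The grapheme count is only an approximation I'm assuming is good enough for this string: it counts characters that are not combining marks, which is not the full Unicode segmentation algorithm.

```python
import unicodedata

# The example string: each base letter is followed by U+0332
# (combining low line); the "e" also carries U+0301 (combining acute).
s = "J\u0332o\u0332s\u0332e\u0301\u0332"

code_points = len(s)                  # Python strings count code points
utf8_bytes = len(s.encode("utf-8"))   # bytes after UTF-8 encoding

# Approximation: treat every non-combining character as starting a grapheme.
approx_graphemes = sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(code_points, utf8_bytes, approx_graphemes)  # 9 14 4
```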

2018

Big tech warns of ’Japan’s millennium bug’ ahead of Akihito’s abdication (via) Emperor Akihito’s abdication in April 2019 triggers a new era, and the Japanese calendar counts years from the coronation of the current emperor. The era hasn’t changed since 1989 and a great deal of software is unable to handle a change. To make things more complicated... the name of the new era will be announced in late February, but it needs to be represented in unicode as a single new character... and the next version of Unicode (v12) is due out in early March. There may have to be a Unicode 12.1 released shortly afterwards that includes the new codepoint. # 28th July 2018, 2:04 pm

ftfy—fix unicode that’s broken in various ways (via) I shipped a small web UI wrapper around the excellent Python FTFY library, which can take broken unicode strings and suggest a sequence of operations that can be applied to get back sensible text. # 9th January 2018, 3:22 am
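
To give a feel for the kind of operation involved, here is a sketch of one classic repair: text that was encoded as UTF-8 and then wrongly decoded as Latin-1. This is not how FTFY works internally (it considers many more cases), and `fix_utf8_read_as_latin1` is a hypothetical name.

```python
def fix_utf8_read_as_latin1(text: str) -> str:
    """Undo one layer of 'UTF-8 bytes decoded as Latin-1' mojibake."""
    try:
        # Recover the original bytes, then decode them the right way.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not this kind of mojibake: leave it alone.
        return text


print(fix_utf8_read_as_latin1("cafÃ©"))  # café
```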

2017

I’m concerned that this character will open the floodgates for an open-ended set of PILE OF POO emoji with emotions, such as CRYING PILE OF POO, PILE OF POO WITH LOOK OF TRIUMPH, PILE OF POO SCREAMING IN FEAR, etc. Is there really any need to add a range of emotions to PILE OF POO? I personally think that changing PILE OF POO to a de facto SMILING PILE OF POO was wrong, but adding FROWNING PILE OF POO as a counterpart is even worse. If this is accepted then there will be no neutral, expressionless PILE OF POO, so at least a PILE OF POO WITH NO FACE would be required to be encoded to restore some balance.

Andrew West # 2nd November 2017, 4:45 pm

The idea that our 5 committees would sanction further cute graphic characters based on this should embarrass absolutely everyone who votes yes on such an excrescence. Will we have a CRYING PILE OF POO next? PILE OF POO WITH TONGUE STICKING OUT? PILE OF POO WITH QUESTION MARKS FOR EYES? PILE OF POO WITH KARAOKE MIC? Will we have to encode a neutral FACELESS PILE OF POO?

Michael Everson # 2nd November 2017, 4:41 pm

2012

What is an intuitive explanation of Unicode and why a programmer needs to know it?

Check out “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky: http://www.joelonsoftware.com/ar...

[... 55 words]

2010

Reexamining Python 3 Text I/O. Python 3.1’s IO performance is a huge improvement over 3.0, but still considerably slower than 2.6. It turns out it’s all to do with Python 3’s unicode support: when you read a file into a string, you’re asking Python to decode the bytes from UTF-8 (the new default encoding) at the same time. If you open the file in binary mode, Python 3 will read raw bytes into a bytestring instead, avoiding the conversion overhead and performing only 4% slower than the equivalent code in Python 2.6.4. # 28th January 2010, 1:28 pm
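
The difference between the two modes is easy to see in a few lines. This sketch only shows what each mode returns, not the timings; the temporary file and its contents are made up for the example.

```python
import os
import tempfile

payload = "naïve café\n" * 3

# Write some UTF-8 bytes to a scratch file.
with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(payload.encode("utf-8"))
    path = f.name

# Binary mode: raw bytes, no decoding step at all.
with open(path, "rb") as f:
    raw = f.read()

# Text mode: bytes are decoded to str while reading.
# newline="" switches off newline translation so the comparison is exact.
with open(path, "r", encoding="utf-8", newline="") as f:
    text = f.read()

os.remove(path)

print(type(raw).__name__, type(text).__name__)  # bytes str
```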

2009

Unicode code converter (via) Fantastically useful tool to convert strings of characters into every unicode and/or escaping syntax you can possibly imagine. # 15th December 2009, 10:10 pm

Understanding Bidirectional (BIDI) Text in Unicode. It turns out you need to sanitise user input to ensure there are no unicode characters that switch your site’s regular text to RTL. # 15th March 2009, 4:37 am
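
One possible way to do that sanitising is to strip the explicit directional formatting characters before storing or displaying input. Whether stripping is the right policy for a given site is an assumption on my part, and `strip_bidi_controls` is a hypothetical helper. Ordinary right-to-left letters are left untouched.

```python
# Explicit embedding, override and isolate controls.
BIDI_CONTROLS = {
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}


def strip_bidi_controls(text: str) -> str:
    """Remove explicit bidi control characters, keeping everything else."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)
```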

2008

UnicodeDictWriter—write unicode strings out to Excel compatible CSV files using Python. Stuart Langridge and I spent quite a while this morning battling with Excel. The magic combination for storing unicode text in a CSV file such that Excel correctly reads it is UTF-16, a byte order mark and tab delimiters rather than commas. # 20th August 2008, 12:19 pm
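
In modern Python 3 the same combination can be sketched like this. It is not the original UnicodeDictWriter code, and `to_excel_csv_bytes` is a hypothetical name. Python's `utf-16` codec writes the byte order mark itself.

```python
import csv
import io


def to_excel_csv_bytes(rows) -> bytes:
    """Serialise rows as tab-delimited text, encoded as UTF-16 with a BOM."""
    buffer = io.StringIO(newline="")  # let the csv module control newlines
    writer = csv.writer(buffer, delimiter="\t")
    writer.writerows(rows)
    # The "utf-16" codec prepends a byte order mark in native byte order.
    return buffer.getvalue().encode("utf-16")


data = to_excel_csv_bytes([["name", "city"], ["José", "Zürich"]])
```

Writing `data` to a file opened in binary mode gives the UTF-16, BOM and tab-delimited file described above.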

Django 1.0 alpha release notes. The big features are newforms-admin, unicode everywhere, the queryset-refactor ORM improvements and auto-escaping in templates. # 22nd July 2008, 6:04 am

PortingDjangoTo3k. Martin von Loewis has started assembling a patch. His write-up illustrates some key differences between Python 2.X and Python 3—it looks like Django’s unicode handling is going to require the most work. # 19th June 2008, 5:53 pm

2007

Sam Ruby: Ruby 1.9 Strings—Updated. A follow up to yesterday’s post: Sam’s principal complaints about Ruby 1.9’s character encoding support were down to a bug which has now been fixed. # 29th December 2007, 7:34 pm

I definitely like Python 3K’s Unicode support better [...] In fact, I think I prefer Ruby 1.8’s non-support for Unicode over Ruby 1.9’s “support”. The problem is one that is all too familiar to Python programmers. You can have a fully unit tested library and have somebody pass you a bad string, and you will fall over.

Sam Ruby # 28th December 2007, 7:05 pm

Ruby 1.9—Right for You? Dave Thomas on the just-released Ruby 1.9. It’s a development release that breaks backwards compatibility in a few minor ways, but new features include the YARV virtual machine (hence significant speed improvements) and unicode support via associating encodings with bytestrings. # 26th December 2007, 12:09 pm

Unicode code converter (via) Richard Ishida’s tool for converting pretty much any unicode representation to any other. # 28th October 2007, 6:26 pm

String types in Python 3. bytes are now immutable (just like the bytestrings they are replacing) and a new mutable buffer type has been introduced. # 9th October 2007, 2:08 am
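
In the Python 3 that eventually shipped, the mutable type is called `bytearray`. A quick sketch of the distinction:

```python
data = b"abc"
immutable = False
try:
    data[0] = 65          # bytes objects do not support item assignment
except TypeError:
    immutable = True

buf = bytearray(b"abc")
buf[0] = ord("A")         # bytearray can be modified in place

print(immutable, buf)     # True bytearray(b'Abc')
```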

The larger question is why on earth, in 2007 and ten years after XML came out, we are still using text files that don’t label their encoding?

Rick Jelliffe # 8th October 2007, 12:27 pm

Sam Ruby: 2to3. Sam’s report on an attempt to port the Universal Feed Parser to Python 3.0. The 2to3 tool does most of the work, but it seems the unicode changes can be pretty tricky. # 3rd September 2007, 1:38 am

Announcing Babel. Impressive new Python i18n / l10n package, with improved message extraction and a huge amount of bundled locale data. # 20th July 2007, 12:20 pm

UnicodeBranch: Porting Applications. A checklist for porting Django applications to handle the new unicode changes. If your application only handles ASCII text at the moment you shouldn’t have to change a thing. # 4th July 2007, 2:41 pm

Unicode data in Django. Documentation for Django’s new unicode support. # 4th July 2007, 2:24 pm

Django changeset 5609. “Merged Unicode branch into trunk. This should be fully backwards compatible for all practical purposes.” # 4th July 2007, 2:22 pm

HTML Entity Character Lookup. Look up HTML entities by characters that are a similar shape. # 3rd July 2007, 3:41 pm

Django unicode-branch: testers wanted. Malcolm’s outstanding work on the unicode branch appears to be nearing completion. # 24th May 2007, 11:46 pm

2006

Javascript character set screw-ups (via) Some browsers treat JavaScript files as having the same content-type as the page from which they are linked. This could cause problems with UTF-8 encoded JSON; the workaround is serving up ASCII with unicode escape sequences. # 21st December 2006, 3:20 pm
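
Python's standard `json` module produces that ASCII-only form by default. A small sketch (the payload is made up):

```python
import json

# ensure_ascii=True is the default; it is spelled out here for clarity.
payload = json.dumps({"name": "José"}, ensure_ascii=True)

print(payload)  # {"name": "Jos\u00e9"}
```

Because the output contains only ASCII bytes, it decodes the same way whatever character set the browser assumes for the file.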

Unicode character information. Useful lookup tool. # 30th November 2006, 11:55 am