
Simon Willison’s Weblog

Practical Unicode, please!

Joel Spolsky has joined Tim Bray in the quest to educate the masses as to the importance of Unicode. Dan Sugalski kicks in as well with What the heck is: A string, a lengthy essay about string handling and why it really is a lot more complicated than you think it is.

These should all be required reading for anyone involved in programming and web development. Unfortunately, they all lack one critical aspect: practical advice. Having read all three I feel like I could lecture for an hour on code points, glyphs, ASCII, byte-order and a whole bunch of other topics. When it comes to updating my blogging system to support comments written in Japanese I’m still almost as clueless as I was before I read any of the above.

Enough of the theory: the web needs practical advice on developing Unicode enabled web pages and web applications. Is it just a case of ensuring my text editor is “saving as Unicode”? What about storage—can I throw Unicode at MySQL and expect it to come out again? If I serve a page up with Japanese characters in it, what will my users have to do to be able to read them? It’s a big, confusing world out there.
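
One way to make those questions concrete is to follow a single comment through the whole round trip. What follows is only a minimal Python sketch, not a definitive recipe: the function names `receive_comment` and `render_page` are hypothetical, and it assumes that the form was submitted as UTF-8 and that the storage layer hands back exactly the bytes it was given.

```python
import html
from typing import Tuple

# Hypothetical helper names; assumes comments arrive as UTF-8 bytes and
# that storage returns the same bytes it was given.


def receive_comment(raw_bytes: bytes) -> str:
    """Decode submitted bytes into a text string as early as possible."""
    return raw_bytes.decode("utf-8")


def render_page(comment: str) -> Tuple[str, bytes]:
    """Return a Content-Type header value and the encoded page body."""
    body = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Comments</title></head>\n'
        f"<body><p>{html.escape(comment)}</p></body></html>\n"
    )
    # The declared charset must match the encoding used to produce the bytes.
    return "text/html; charset=utf-8", body.encode("utf-8")


submitted = "茶道".encode("utf-8")                     # what the browser sends
stored = receive_comment(submitted).encode("utf-8")    # what goes into storage
content_type, page = render_page(stored.decode("utf-8"))
```

The design choice in this sketch is to keep text as decoded strings inside the program and to convert to bytes only at the edges (form input, storage, HTTP response), always with an explicitly named encoding.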

This is Practical Unicode, please! by Simon Willison, posted on 13th October 2003.

20 comments

  1. About MySQL and Unicode, MySQL will come with proper Unicode support in 4.1. Until then, you can expect to get back what you inserted, but do not expect to have support for string ordering or case-insensitive searches. More on MySQL here: http://www.mysql.com/doc/en/Charset.html

    Goba - 13th October 2003 14:04 - #

  2. Thanks. I also just found some useful information on Unicode in Python, via a link from Simon Brunning.

    Simon Willison - 13th October 2003 14:11 - #

  3. At Twisted Eric Ritz says:

    Chadou (茶道) is the tea ceremony that so often characterizes the Japanese culture.

    Michael - 13th October 2003 14:13 - #

  4. Simon Willison said:

    When it comes to updating my blogging system to support comments written in Japanese ...

    Apparently, it does. I can see the character I posted above. I guess, since the character set for the page is set to utf-8, they'll display.

    But I should add that OS X on my Mac has Japanese fonts as standard. Anyone who hasn't got the fonts will probably get a question mark in place of the Japanese characters above.

    Michael - 13th October 2003 14:19 - #

  5. I'm just seeing a pair of question marks, presumably because I don't have the correct fonts installed.

    Simon Willison - 13th October 2003 14:26 - #

  6. Judging by the Japanese characters above, it looks like your blog is already doing what I recommend: converting any characters above 127 to &#nnn; syntax, optionally using the htmlentitydefs when possible.

    Since you appear to be in the process of switching to Python, perhaps some of my code may be helpful. Feel free to lift any of the code you find in atomef.py and use as you like.

    Sam Ruby - 13th October 2003 14:46 - #
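
The approach described in comment 6 can be sketched in a few lines of Python. This is an illustration only, not the code from atomef.py: the function name `to_ascii_html` is hypothetical, and it uses `html.entities`, the Python 3 name for the `htmlentitydefs` module mentioned above.

```python
from html.entities import codepoint2name


def to_ascii_html(text: str, use_names: bool = True) -> str:
    """Replace every character above 127 with an HTML character reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                          # plain ASCII passes through
        elif use_names and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")   # named entity where one exists
        else:
            out.append(f"&#{cp};")                  # numeric reference otherwise
    return "".join(out)


print(to_ascii_html("茶道"))   # &#33590;&#36947;
print(to_ascii_html("café"))   # caf&eacute;
```

When named entities are not needed, the built-in error handler gives the same numeric result: `"茶道".encode("ascii", "xmlcharrefreplace")` returns `b'&#33590;&#36947;'`.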

  7. Whether or not the correct characters are displayed seems to depend on the browser that is used, not the fonts that are installed. I've got no Japanese fonts installed, but when using Mozilla 1.4, IE 5.5 SP2 or Opera 6/7 this page displays correctly. Opera 5 fails. I'm using the Windows Millennium platform.

    Bas Hamar de la Brethoniere - 13th October 2003 14:49 - #

  8. Principally, Unicode is the standard for representing characters as integers, and thus it has many practical implications. Many analysts believe that as the software industry becomes increasingly global, Unicode may eventually replace ASCII.

    Also, I was most satisfied several years back when the archaic Netscape 6.0 started showing the results of web designers not declaring an encoding on various websites: it noticed if a page did not have a character encoding, or used 'illegal characters'.

    If you use a modern user-agent, you should see lots of little "diamond shapes" all over the place, called the 'Replacement Character' (U+FFFD), wherever the characters on a page do not match its declared encoding.

    Robert Wellock - 13th October 2003 15:18 - #
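
The mismatch described in comment 8 is easy to reproduce. The sketch below assumes a page whose bytes are really Latin-1 but which is decoded as UTF-8. It shows Python's decoder rather than any browser's, so it only illustrates the mechanism.

```python
# A Latin-1 encoded page that is wrongly treated as UTF-8.
page_bytes = "café".encode("latin-1")        # the bytes b'caf\xe9'

# Strict decoding refuses the stray 0xE9 byte outright.
try:
    page_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD for whatever it cannot decode.
shown = page_bytes.decode("utf-8", errors="replace")

print(strict_ok)                # False
print(shown == "caf\ufffd")     # True
```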

  9. Bas Hamar de la Brethoniere says:

    Whether or not the correct characters are displayed seems to depend on the browser...

    Absolutely. I've always thought that the IPA is quite an interesting concept. Looking around awhile back, I found that it can be represented in Unicode. This makes a nice test page:

    The International Phonetic Alphabet in Unicode

    Now, in Camino I get plenty of question marks. However, in Safari I see most of the characters.

    You need the fonts, but it's not enough. I'm taking this as another (unexpected) proof of Safari's quality.

    Michael - 13th October 2003 18:57 - #

  10. It's time for you to read this: http://www.w3.org/TR/2003/WD-i18n-html-tech-20031009/

    Karl Dubost - 13th October 2003 19:35 - #

  11. Simon, I feel your pain. I bumbled along when creating the 15 or so translations I currently have, but somewhere along the way it seems to have worked.

    This is my guide on how to encode a page. It's not canonical in any way, but more meant to cut down on the "how the hell do I even begin?" sort of questions that everyone has had at one point or another.

    Dave S. - 13th October 2003 19:46 - #

  12. I should mention - after writing that, I have seen enough extra opinion on the matter to lead me to believe that if you use UTF-8 encoding for all international concerns, you will come out okay. No real need to mess around with custom ISO codes, just go with Unicode and make sure your XML language is defined.

    Dave S. - 13th October 2003 19:50 - #

  13. For what it's worth, the International Phonetic Alphabet page works fine with Mozilla Firebird on my Redhat 9 system - all the characters seem to display as expected, although I'm not sure which fonts they are being pulled from. If anyone has a way of determining this (short of compiling a debug build), I'd be interested to know.

    Fonts and character encodings also turn out to be a big problem when trying to do MathML on the internet, at least when using Mozilla. It turns out that you need to install an array of different fonts and, if you start digging through the relevant bug reports, the amount of work the backend has to do to pull the correct glyphs from the various fonts (taking into account different font formats and so on) is really quite stunning. This has the unfortunate side effect that setting up MathML is somewhat non-trivial.

    Incidentally, I know that Camino is missing all the MathML code from Mozilla (I believe it was crashing with MathML at one point), so maybe some of the font rendering code is missing too. That would explain the failure to render the IPA page. Of course, it might just be that Apple's programmers have a better grasp of OS X font handling than the Mozilla team.

    jgraham - 13th October 2003 22:35 - #

  14. UTF-8 is only part of the story. Some kind of language identification is also important. There are two reasons why I believe this is so:

    First, because of Han unification, some Chinese characters require an identification of the language (or language/locale combination) in order to be rendered correctly. Note that the simplified glyphs used in PRC share code points with the traditional glyphs used in Taiwan. When the character encoding is GB-2312 (PRC) or BIG5 (Taiwan), the language is implicitly identified. Not so with UTF-8. Also, Japanese Kanji use the same code points for the Chinese characters but also use different glyphs for certain characters.

    There may be a similar situation with other languages. I only know about languages of the Far East. (I cannot read or understand a single language of the Far East, though.)

    Second, some indication of the language is also necessary for processing by machines. Specifically, search engines need to have some indication of what constitutes a "word," and that requires some knowledge of the language.

    Doug Sauder - 14th October 2003 00:23 - #
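
To illustrate the first point in comment 14: the text carries only a code point, so any hint about which regional glyph to draw has to travel alongside it, for example in an HTML `lang` attribute. The sketch below is an assumption-laden illustration. The helper `tag_language` is hypothetical, and U+76F4 is used only because it is often cited as a character whose preferred shape differs between Chinese and Japanese typography.

```python
import html
import unicodedata


def tag_language(text: str, lang: str) -> str:
    """Wrap text in a span that carries an HTML lang attribute."""
    return f'<span lang="{html.escape(lang)}">{html.escape(text)}</span>'


ch = "\u76f4"  # a single unified Han code point

# The character database knows one identity for it, with no language attached.
print(unicodedata.name(ch))        # CJK UNIFIED IDEOGRAPH-76F4

# The same code point, labelled for two different readers.
print(tag_language(ch, "ja"))
print(tag_language(ch, "zh-Hans"))
```

Whether a browser actually picks a different font for each span depends on the fonts installed and on the browser, which matches the experience reported in the earlier comments.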

  15. Doug,

    As I understand it, UTF-8 can indeed represent all code points, but variations occur through combining characters (as discussed in the linked "what the heck is" article).

    Certainly, different languages might and do affect glyphs, but I'd expect that particular code points can be derived from UTF-8 as well as any other encoding. Your argument that a language could affect the meaning of a particular code point seems to violate the Unicode concept of a code point as the platonic ideal of a particular character. Do I understand incorrectly?

    Now, on the bit about machine processing-- I agree that it's necessary for something somewhere to know what different pertinent constructs for a language are, and how to recognize them, and so it is likely important to store the language associated with a string. But I don't think knowledge of language constructs has any business in a search engine.

    Search engines currently take the trivial case of searching for words (that is, series of alpha-numerics surrounded by whitespace), but they don't approach actual knowledge of the language. Passing a string to some subprocessor to extract search tokens (likely words) would be more appropriate than trying to make the search engine -also- be good at deciphering all the world's languages.

    There's no reason that subprocessor couldn't be reusable, as far as I understand. You might need an adapter for various languages, such that in English, the search construct(s) is a word, but in Chinese, it's a syllable, and in Klingon it's a frobnitz.

    Jeremy Dunck - 14th October 2003 04:55 - #
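
The reusable subprocessor with per-language adapters suggested in comment 15 might look roughly like this. Everything here is hypothetical (the names `TOKENIZERS` and `extract_tokens`, and the two deliberately crude tokenizers). Real word segmentation for any language is far more involved.

```python
import re
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def whitespace_words(text: str) -> List[str]:
    """Runs of word characters, lower-cased: a rough fit for English."""
    return re.findall(r"\w+", text.lower())


def single_characters(text: str) -> List[str]:
    """One token per alphanumeric character: a crude stand-in for Chinese."""
    return [ch for ch in text if ch.isalnum()]


# Adapter table: the search engine only ever calls extract_tokens().
TOKENIZERS: Dict[str, Tokenizer] = {
    "en": whitespace_words,
    "zh": single_characters,
}


def extract_tokens(text: str, lang: str) -> List[str]:
    """Pick the tokenizer for the stored language, falling back to words."""
    tokenizer = TOKENIZERS.get(lang, whitespace_words)
    return tokenizer(text)


print(extract_tokens("Practical Unicode, please!", "en"))
print(extract_tokens("茶道", "zh"))
```

The point of the adapter table is that the search engine itself stays ignorant of languages, as the comment argues. It only needs the language stored with each string in order to choose an adapter.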

  16. Just for the record, PostgreSQL fully supports Unicode. +1

    Sérgio Nunes - 14th October 2003 08:18 - #

  17. Jeremy,

    Your understanding is incorrect, or at least incomplete--Unicode places very little (arguably no, but that argument would be something of a stretch) meaning on individual characters. How a stream of code points is rendered and how they are interpreted is, with some very broad guidelines, outside the scope of the standard. It does, certainly, place meaning on classes of characters (what's a combining character, what isn't, and what order combining characters can occur in) but past that, it doesn't say much.

    It's potentially bad enough in the western languages, but in the Asian languages, courtesy of Han unification (amongst other things) there's a huge issue of interpretation of the resulting code point stream.

    Han unification, for those that don't know, is Unicode's way of cutting down on the number of unique characters in the standard. Chinese, Japanese, and Korean all use Chinese characters, though in the case of Japanese and Korean, not exclusively--they mix them in with language-specific characters. In cases where there is overlap--essentially the same character in two or more languages--Unicode just chose one.

    Simple, right? I mean, if it's the same character in Chinese and Japanese, or Chinese and Korean, why have two code points? Well...

    The problem is that it turns out that it isn't the same character, or worse it's the same but different character. So you've got a code point that maps to some character in the Han range, but what it looks like depends on whether it's the Chinese, Japanese, or Korean version, and what it means (which you need for searching and figuring word breaks) depends on what language it comes from.

    And just for fun, many of the Chinese characters have changed shape and form over the decades and centuries, but Unicode has just a single code point for the character. Unlike, say, English, where it's just a font issue, there are apparently issues of shift of meaning. (The meaning of "A" has been pretty constant these past centuries, but the meanings and representations of the various Chinese characters haven't been.)

    There are also a variety of cultural and political issues involved with Unicode, enough to make your head spin. Often made worse because the technical folks building the standards have the standard set of Geek social skills (i.e. none :) which tends to exacerbate smouldering issues, but that's a separate problem.

    Building a unified character set is, so far as I can tell, about as much fun as diving into a pool of acid filled with exploding, starved piranha, while being shot at and hit by lightning. Even if your form is perfect, you're still really screwed... :)

    Dan - 14th October 2003 15:48 - #

  18. Actually, this is all not such a big problem, but to solve it as a whole, you probably have to examine the whole set of your working software.

    We have successfully deployed a few servers with PHP/MySQL (and now Postgres) in Russia, and we are steadily moving towards Unicode. What I can say about MySQL et al. is: always consult the manual. For MySQL we did not use Unicode, but instead a "flat" charset for Windows called Win-1251, and everything worked pretty well with this solution except for the sort order. For that you have to modify the mysql.conf file, which is not always possible if you use shared hosting.

    With PostgreSQL we made the move and store all the text in Unicode. In contrast with MySQL, Postgres has extensive charset support. It is sometimes problematic when you feed in strings in Unicode, but we rarely have this problem. The other advantage of PG is that it allows you to maintain your database in one encoding and query it in another (from any application, be it PHP or something else, essentially anything built upon libpq), and the sorts will not be affected. With MySQL you are mostly limited to the charset your server is tuned for. With PG, this means you can store your data in UTF until you have a way to move your whole codebase to Unicode as well.

    I did try PHP's functions on Unicode pages, and they usually work well (especially in the newest builds, as you can specify the charset when doing entity quoting etc.)

    It is also (still) advisable, if you use only single-byte alphabets (as we do in Russian/English pages etc.), to use the most widespread charset for the language instead of Unicode, and this can easily be achieved programmatically (just set iconv as a buffer callback and you are set). If you want to use Japanese text, for instance, this can be tricky.

    But where you are really in trouble is with proprietary software, as companies often proclaim Unicode compliance but do not spend the time to implement it, fill it with bugs, etc. So if you stay with open-source tools I think you can do it pretty well with a little experimentation. And also don't forget about the locale setup on your server: it can affect some daemons in unpredictable ways as they try to figure out which charset and sort order is your favorite.

    Julik Tarkhanov - 14th October 2003 15:57 - #
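
The store-in-one-encoding, serve-in-another arrangement from comment 18 can be sketched as follows. This is an illustration only: Python's built-in codecs stand in for the iconv output callback the comment describes, `transcode` is a hypothetical helper, and `cp1251` is Python's name for the Win-1251 charset.

```python
def transcode(data: bytes, source: str = "utf-8", target: str = "cp1251",
              errors: str = "strict") -> bytes:
    """Re-encode stored bytes into the charset the page is served in."""
    return data.decode(source).encode(target, errors)


stored = "Привет".encode("utf-8")      # as kept in the database: 12 bytes
served = transcode(stored)             # as sent to the browser: 6 bytes
print(len(stored), len(served))        # 12 6

# Japanese text has no place in a single-byte Cyrillic charset, which is
# why the comment calls that case tricky.
try:
    transcode("茶道".encode("utf-8"))
    japanese_fits = True
except UnicodeEncodeError:
    japanese_fits = False
print(japanese_fits)                   # False

# One possible fallback: emit numeric character references instead.
print(transcode("茶道".encode("utf-8"), errors="xmlcharrefreplace"))
```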

  19. Dan:

    And just for fun, many of the Chinese characters have changed shape and form over the decades and centuries, but Unicode has just a single code point for the character. Unlike, say, English, where it's just a font issue, there are apparently issues of shift of meaning. (The meaning of "A" has been pretty constant these past centuries, but the meanings and representations of the various Chinese characters haven't been.)

    I am not going to claim that any of my opinion is based on a sound technical understanding of the issue at hand, but from a linguistic point of view I see no reason for code points to be mapped to different versions of the same Chinese character simply because different Eastern languages assign different pronunciations and meanings to them. Following that train of thought we may as well assign different code points for the "A" used in English, the "A" used in French, the "A" used in Spanish, and so on, because linguistically all of these "A"s do not have the same value or meaning whatsoever. The interpretation of their value is something too complex for a standard like Unicode to bear on its shoulders; all that really matters (in my opinion) is that there is a code point that maps to something that looks like "A." I, as well as my readers, will interpret the meaning of that "A" as we see fit when we happen to come across it. Likewise when I type the Chinese character sen all that matters to me is that it ends up looking like the character I want. It doesn't matter if the character is read as hsien in Mandarin Chinese or son in Korean, they are merely the same character with different pronunciations just as "A" is a character with many pronunciations.

    It is true that Chinese characters bear with them an innate meaning that alphabetic letters do not. When I see the character koi it's not just a letter, it means something. But I do not feel it is the nature of Unicode to worry itself with these meanings; Unicode was not intended to function like a dictionary for the meanings of Chinese characters.

    Eric Ritz - 14th October 2003 17:11 - #

  20. I have tried to change the table character set to utf8 (in fact I have changed it, using MySQL), but I am unable to store Unicode data (that is, Arabic data) in it. I have done the same with SQL but am unable to do it with MySQL. I am using MySQL 4.0.2. Please help me. Regards, Munazzah

    Munazzah - 7th October 2005 07:23 - #

Comments are closed.

Previously hosted at http://simon.incutio.com/archive/2003/10/13/practicalUnicode
