Saturday, April 16, 2011

UTF-8

We all know what UTF-8 is—sort of. It's a character encoding that vastly expands the number of available characters (to 221 characters) yet is still compatible with ASCII. This is a good trick and I occasionally wondered how it worked. I tried reading Pike and Thompson's paper about it when they first introduced it in Plan 9 but while the paper is good on history and “whys,” it doesn't go into much detail on how the encoding works.

Happily, via reddit, I just stumbled across the best explanation of UTF-8 that I have ever seen. It's short and simple and explains everything in a very clear way. If you don't already know all about UTF-8, you should definitely spend 5 minutes with this article.

In conjunction with the above, you might also want to read Xah Lee's excellent post on Emacs and Unicode Tips. It shows you how to get access to these extra characters in Emacs and gives some guidance on integrating UTF-8 into your day-to-day work with Emacs. For that matter, Lee has a Unicode Tutorial page that you might find useful.

No comments:

Post a Comment