The Truth About Unicode In Python

3 July 2008
10:13

The unicode support in Python is generally considered to be pretty good. And in comparison to many other languages, it's good indeed.

But compared to what is provided by the International Components for Unicode (ICU) project, there's also a lot missing, including collation, special case conversions, regular expressions, text segmentation, and bidirectional text handling. Not to mention extensive support for locale-specific formatting of dates and numbers and time calculations with different calendars.

Basically what Python does provide out of the box is “only” encoding/decoding, normalization, and some other bits such as simple case conversion and splitting on whitespace. It's the absolute minimum you need to do anything useful with unicode, but often not enough to build truly internationalized applications. (Fortunately, most applications get away without true internationalization.)
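
For reference, those built-in pieces look roughly like this:

>>> 'caf\xc3\xa9'.decode('utf-8')       # bytes to unicode
u'caf\xe9'
>>> u'caf\xe9'.encode('utf-8')          # unicode back to bytes
'caf\xc3\xa9'
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'Cafe\u0301') == u'Caf\xe9'
True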

In this post, I'm going to talk about a couple of the problems with unicode in Python. Please note that this is not intended as a criticism of Python's unicode support or the people who designed and implemented it. Most of those people probably know a whole lot more about unicode than I do, and the limitations discussed here are the result of a pragmatic approach to implementing unicode support, rather than due to a lack of knowledge.

Collation

Collation in this context refers to how strings are sorted. Though that might seem simple enough, there are a number of challenges here. First, you can't simply compare unicode code points, as the order in which characters appear in the unicode code charts says little to nothing about how they compare in collation.

Instead, basic collation is defined by the Unicode Collation Algorithm (UCA). This algorithm converts code points into collation elements based on the Default Unicode Collation Element Table (DUCET). The collation elements are then used for the comparison, rather than the code points themselves.

Unfortunately, Python does not (yet) come with support for unicode collation, and instead uses the code point comparison approach, which results in incorrect sorting as soon as your strings move beyond basic ASCII letters and digits:

>>> u'cafe' < u'café' < u'caff'
False

If string comparison adhered to the UCA, that example would have printed True. But Python compares strings based on the character code points, and because “é” (U+00E9) is greater than “f” (U+0066), “café” is sorted after “caff”.

James Tauber wrote a Python implementation of the UCA a while ago.
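
For example, sorting with his pyuca package looks roughly like this (this assumes the DUCET allkeys.txt file is available locally, and the exact API may have changed since):

>>> from pyuca import Collator
>>> coll = Collator('allkeys.txt')
>>> sorted([u'caff', u'caf\xe9', u'cafe'], key=coll.sort_key)
[u'cafe', u'caf\xe9', u'caff']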

The other challenge is that collation is often locale-specific, and even within the same country and language there may be different types of collation, such as phonebook versus dictionary sorting. The Common Locale Data Repository (CLDR) addresses locale-specific collation requirements.
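
Until something along those lines is readily available in the standard library, the most realistic way to get locale-aware collation is to wrap ICU. Purely as an illustration (the module name, locale identifier, and method names below are assumptions about the PyICU bindings and may not match the version you have installed):

# -*- coding: utf-8 -*-
from PyICU import Collator, Locale

# German "phonebook" collation treats "ü" like "ue", while the default
# (dictionary) collation sorts it like a plain "u".
collator = Collator.createInstance(Locale('de_DE@collation=phonebook'))
names = [u'Muffler', u'Müller', u'Mueller']
names.sort(key=collator.getSortKey)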

Case Conversion

Case conversion refers to string operations such as converting a string to lower case, upper case or title case. It is a bit simpler than collation: the unicode code charts include information about the lowercase/uppercase variants of each code point. But that only covers the 1:1 mappings. Some lower case characters (such as “ß”) actually map to two or more upper case code points (“SS” in this case). These special case mappings are defined in a supplemental unicode data file called SpecialCasing.txt.

The Python unicode methods lower(), upper(), and title() are restricted to 1:1 case mappings:

>>> print u'ß'.lower()
ß
>>> print u'ß'.upper()
ß
>>> print u'ß'.title()
ß

And as with collation, case mapping may depend on the locale. This time, though, the locale-specific tailoring is not specified by the CLDR, but directly in SpecialCasing.txt.
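
A correct uppercase operation has to consult those special mappings before falling back to the built-in 1:1 conversion. Here's a minimal sketch, with just two entries from SpecialCasing.txt hard-coded for illustration:

# -*- coding: utf-8 -*-
# Tiny excerpt of SpecialCasing.txt: character -> full uppercase mapping.
SPECIAL_UPPER = {
    u'\N{LATIN SMALL LETTER SHARP S}': u'SS',   # ß
    u'\N{LATIN SMALL LIGATURE FI}': u'FI',      # ﬁ
}

def full_upper(text):
    """Uppercase text, applying the 1:n special mappings first."""
    return u''.join(SPECIAL_UPPER.get(ch, ch.upper()) for ch in text)

print full_upper(u'Straße')     # -> STRASSE

A real implementation would of course load the full file, and would also have to honor its locale- and context-sensitive rules (such as the Turkish dotless i).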

Regular Expressions

Of course, regular expressions need some extensions to make them really usable in a unicode environment. For example, picture the way you'd match Wiki words using a regular expression, assuming only ASCII text. It might look something like:

>>> import re
>>> re.findall(r'(?:[A-Z][a-z]+){2,}', 'Hello WikiWord, bye!')
['WikiWord']

But what about non-ASCII Wiki words such as “TürÖffner”? That regular expression would fail, as it only matches lower and upper case ASCII letters. For such cases, the Unicode Regular Expressions technical standard (UTS #18) defines the use of character properties in regular expressions. Instead of explicitly checking for the ranges [A-Z] and [a-z], you'd use r'(?:\p{Lu}\p{Ll}+){2,}', where \p{Lu} matches all upper case characters, and \p{Ll} matches all lower case characters.

Unfortunately, unicode character property matching is not supported by the Python re module. And that's just the basics: there's a lot more to making regular expressions truly unicode-aware, such as case-insensitive matching that takes special case mappings into account.
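
Until the re module learns about character properties, one workaround is to match candidate words broadly using the re.UNICODE flag, and then check the character categories by hand with unicodedata, effectively emulating \p{Lu} and \p{Ll}. A rough sketch:

# -*- coding: utf-8 -*-
import re
import unicodedata

def is_wiki_word(word):
    # Emulate r'(?:\p{Lu}\p{Ll}+){2,}' by pattern-matching the concatenated
    # two-letter Unicode categories ('Lu', 'Ll', ...) of the characters.
    cats = ''.join(unicodedata.category(ch) for ch in word)
    return re.match(r'(?:Lu(?:Ll)+){2,}$', cats) is not None

words = re.findall(r'\w+', u'Hallo TürÖffner, tschüss!', re.UNICODE)
wiki_words = [w for w in words if is_wiki_word(w)]   # only u'TürÖffner' qualifies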

Text Segmentation

Text segmentation refers to the splitting of text into units such as user-perceived characters (grapheme clusters), words, and sentences. It is specified by the Unicode Text Segmentation annex. For example, what the user perceives as a single character may be composed of multiple unicode code points: “Ǵ” is perceived as a single character, but here it is actually two unicode code points, a “G” followed by a combining acute accent. When you segment on a user-perceived character basis, you don't want to split that into two characters.
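
The standard library doesn't implement these segmentation rules, but unicodedata is enough to demonstrate the problem, and to build a very crude approximation that only knows about combining marks (real grapheme cluster segmentation involves quite a bit more):

# -*- coding: utf-8 -*-
import unicodedata

def graphemes(text):
    """Very rough approximation of user-perceived characters: attach
    combining marks to the preceding base character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = u'G\u0301'            # "Ǵ" written as "G" plus a combining acute accent
print len(word)              # -> 2 code points
print len(graphemes(word))   # -> 1 user-perceived character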

Bi-directional Text

There are a couple of languages (mostly Arabic) that are generally written right-to-left, but that can contain certain passages written left-to-right, such as numbers or English words. Bidi text should only be a concern at input/output boundaries, as it does not affect the logical order of text, and the logical order is what should be processed internally in applications. But when you render bidi text, it needs to be reordered.

Arguably, bidi text is outside the domain of the vast majority of Python applications. They either use HTML bidi markup and rely on the browser to get the reordering right, or they use a proper text layout engine.

Locale Context

Many of the unicode algorithms (such as collation and case folding) need to be tailored for specific languages, countries, or scripts.

Python has a locale module, but unfortunately (for web applications, at least), it manages the locale as a global, process-wide setting. It's basically just a thin wrapper on top of the POSIX locale functionality. For command-line scripts and desktop applications this works okay, but in web applications the locale depends on the current request (for example, based on the Accept-Language HTTP header). The locale module is basically unusable here unless you're using a CGI-like request handling model, that is, one without multi-threading or asynchronous processing.

This module also provides some basic date and number formatting, but due to the global nature of the locale, that is not usable in web applications either.
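
For reference, this is what the locale module offers; note that the setlocale() call changes the locale for the entire process, which is exactly what makes it unsuitable for handling concurrent requests (this also assumes a de_DE locale is actually installed on the system):

>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
'de_DE.UTF-8'
>>> locale.format('%.2f', 1234.56, grouping=True)
'1.234,56'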

Internal Representation

And finally, there's the way Python represents unicode strings internally. Back in version 2.1, the unicode type in Python was simply a string of 2-byte characters, each character being the unicode code point as a short integer (that's UCS-2, not UTF-16). When the unicode standard was extended to support characters outside the Basic Multilingual Plane (BMP), fixed-length 2-byte-per-character storage was no longer sufficient. You'd either need to use 4 bytes per character (UTF-32 or UCS-4), or you'd need a variable-length character encoding (such as UTF-16 with surrogate pairs).

The Python developers chose to go with the former option: using 4 bytes per character in every unicode string. However, this was made a compile-time switch, and the default remains UCS-2 to this day (even though many Linux distributions default to UCS-4). That effectively means that on a UCS-2 Python (the common case), characters outside the BMP will screw up your string operations in the same way that multi-byte characters in UTF-8 screw up naive handling of bytestrings:

>>> char = u"\N{MUSICAL SYMBOL G CLEF}"
>>> len(char)
2
>>> import unicodedata
>>> unicodedata.name(char)
Traceback (most recent call last):
  File "…"

Another result of making this a compile-time switch is that Python extensions written against the C API that are compiled against a UCS-2 Python need to be recompiled to run on UCS-4, and vice versa.

Back in 2001 when this decision was made, Guido had this to say to back it up:

I see only one remaining argument against choosing 3 [UCS-4] over 2 [UTF-16 with surrogate pairs]: FUD about disk and primary memory space usage.

[…]

The primary memory space problem will go away with time; assuming that most textual documents contain at most a few millions of characters, it's already not that much of a problem on modern machines. Applications that are required to deal efficiently with larger documents should support some way of streaming or chunking the data anyway.

Based on some testing with Trac against both UCS-2 and UCS-4 builds of Python, I can confirm that the difference in memory usage is negligible. In fact I needed to make a ticket query request that returned a ridiculously large amount of text to get the difference into the 5% range.

But while Python 3.0 is going to (finally) make unicode the default for strings, there still doesn't seem to be interest in either making UCS-4 the default, or even dropping UCS-2.

To properly support characters outside the BMP in Python, either the UCS-2 configuration needs to be dropped, or the Python string type (and other parts such as the regular expression engine) needs to be made aware of surrogate pairs, thereby basically upgrading to UTF-16.
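
Until one of those things happens, dealing correctly with non-BMP characters on a narrow build means handling the surrogates yourself. A small sketch of what that looks like just for counting code points:

def real_len(text):
    """Length in code points, counting a UTF-16 surrogate pair as a
    single character (only makes a difference on narrow/UCS-2 builds)."""
    # Each high surrogate (U+D800..U+DBFF) starts a pair, so subtract
    # one from the code unit count for every one of them.
    return len(text) - sum(1 for ch in text if u'\ud800' <= ch <= u'\udbff')

clef = u"\N{MUSICAL SYMBOL G CLEF}"
print len(clef)       # -> 2 on a UCS-2 build, 1 on UCS-4
print real_len(clef)  # -> 1 either way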

Moving Forward

So there's a couple of things missing from the unicode support in Python.

I started the Babel project to address some of the shortcomings of Python with respect to internationalization (such as localizable date and number formatting). But the scope of a project to bring more advanced unicode features to Python would be huge, and it would largely duplicate what is already provided by the ICU project.

So here's what I'd suggest: stand on the shoulders of ICU by providing a high-level, pythonic API on top of it. Oh, and that probably isn't PyICU. Would this need to be part of the standard library? Not sure.

Is anyone else interested in reviving the I18N-SIG to take the internationalization support in Python to the next level?

Reactions

  1. John Millikin says:

    3 July 2008
    16:00

    Something struck me as odd about the Regex complaint, so I'm trying to test it (and variations) on my local system. Unfortunately, it does not work as written:

    >>> import re
    >>> re.findall(r'(?:[A-Z][a-z]+)+', 'Hello WikiWord, bye!')
    ['Hello', 'WikiWord']
    

    There's also another issue with Unicode that wasn't mentioned -- there are no \u or \U escapes in the regex engine, and Py3 r'\u0040' -> Py2 u'\\u0040', so it will be more difficult to use unicode characters in the regex pattern itself.

  2. Christopher Lenz says:

    3 July 2008
    16:13

    John: Thanks for spotting the regex error, should be fixed now.

  3. David Reid says:

    3 July 2008
    16:39

    Great post, but it might be useful to point out the most glaring problem with unicode in python. Glyph Lefkowitz says it best in his article Encoding:

    I believe that in the context of this discussion, the term "string" is meaningless. There is text, and there is byte-oriented data (which may very well represent text, but is not yet converted to it). In Python types, Text is unicode. Data is str. The idea of "non-Unicode text" is just a programming error waiting to happen.

  4. Matt Good says:

    18 August 2008
    17:52

    The Ponygurama project in the Pocoo Sandbox looks like a promising option for much improved unicode regex support. After seeing a couple of questions on work mailing lists about unicode properties in regexes I played with wrapping PCRE with ctypes. I got a basic example working (besides a bug deallocating their objects which was crashing the interpreter). However, PCRE's support for properties is very limited, and more annoyingly it only supports UTF-8 data, so Python unicode objects would need to be re-encoded before matching, which could have a considerable performance impact.