Unicode strings

Introduction to Unicode

All text that is handled by computers must be encoded: every letter in a text has to be represented by a numeric value. For a long time, it was assumed that 7 bits would provide enough values to encode all necessary letters; this was the basis for the ASCII character set. However, with the spread of computers all over the world, it became clear that this was not enough. A whole host of different encodings was designed, varying from the obscure (TISCII) to the pervasive (Latin-1). Of course, this leads to problems when you are trying to exchange texts. A Western European Latin-1 user cannot easily read a Russian KOI8 text on his system. Another problem is that those small, one-byte, eight-bit character sets don't have room for useful stuff, such as extensive mathematical symbols. The solution has been to create a monster character set consisting of at least 65000 code points, including every possible character someone might want to use. This is ISO/IEC 10646. The Unicode standard (http://www.unicode.org) is the official implementation of ISO/IEC 10646.

Unicode is an essential feature of any modern application. Unicode is mandatory for every e-mail client, for instance, but also for all XML processing, web browsers, many modern programming languages, all Windows applications (such as Word), and KDE 2.0 translation files.

Unicode is not perfect, though. Some programmers, such as Jamie Zawinski of XEmacs and Netscape fame, lament the extra bytes that Unicode needs — two bytes for every character instead of one. Japanese experts oppose the unification of Chinese characters and Japanese characters. Japanese characters are derived from Chinese characters, historically, and even their modern meaning is often identical, but there are some slight visual differences. These complainers are often very vociferous, but Unicode is the best solution we have for representing the wide variety of scripts humanity has invented.

There are a few other practical problems concerning Unicode. Since the character set is so very large, there are no fonts that include all characters. The best font available is Microsoft's Arial Unicode, which can be downloaded for free. The Unicode character set also includes interesting scripts such as Devanagari, a script where single letters combine to form complicated ligatures. The total number of Devanagari letters is fairly small, but the set of ligatures runs into the hundreds. Those ligatures are not defined in the character set, but have to be present in fonts. Scripts like Arabic or Burmese are even more complicated. For those scripts, special rendering engines have to be written in order to display a text correctly.

From version 3, Qt includes capable rendering engines for a number of scripts, such as Arabic, and promises to include more. With Qt 3, you can also combine several fonts to form a more complete set of characters, which means that you no longer have to use one monster font with tens of thousands of glyphs.

The next problem is inputting those texts. Even with remappable keyboards, it's still a monster job to support all scripts. Japanese, for instance, needs a special-purpose input mechanism with dictionary lookups that decide which combination of sounds must be represented using Kanji (Chinese-derived characters) or one of the two syllabic scripts, hiragana and katakana.

There are still more complications, which have to do with sort order and bidirectional text (Hebrew going from right to left, Latin from left to right). Then there are vexed problems with determining which language the user prefers and which country he is in (I prefer to write in English, but have the dates show up in the Dutch format, for instance). All these problems have their bearing upon programming using Unicode, but are so complicated that a separate book should be written to deal with them.

However, both Python strings and Qt strings support Unicode, and both support conversion from Unicode to legacy character sets such as the widespread Latin-1, and vice versa. As said above, Unicode is a multi-byte encoding: in its basic 16-bit form, a single Unicode character is encoded using two bytes. Of course, this doubles memory requirements compared to single-byte character sets such as Latin-1. This can be circumvented by encoding Unicode using a variable number of bytes, known as UTF-8. In this scheme, Unicode characters that are equivalent to ASCII characters use just one byte, while other characters take up to three bytes. UTF-8 is a widespread standard, and both Qt and Python support it.
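To make the size difference concrete, here is a minimal sketch that counts the bytes a few characters need in a two-byte encoding and in UTF-8. It is written for Python 3 (an assumption; the listings in this chapter are Python 2), and the helper name byte_lengths is made up for the illustration.

```python
# Python 3 sketch (assumption: the chapter's own listings are Python 2).
# Compare how many bytes one character needs in UTF-16 and in UTF-8.
samples = ["A", "\u00e9", "\u0411", "\u0928"]  # A, e-acute, Cyrillic Be, Devanagari Na

def byte_lengths(ch):
    """Return (utf16_len, utf8_len) for a single character (hypothetical helper)."""
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-8"))

for ch in samples:
    utf16_len, utf8_len = byte_lengths(ch)
    # Only ASCII is printed, so this works in any terminal.
    print("U+%04X  UTF-16: %d bytes  UTF-8: %d bytes" % (ord(ch), utf16_len, utf8_len))
```

The ASCII letter shrinks to one byte in UTF-8, while the Devanagari letter grows to three.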

I'll first describe the pitfalls of working with Unicode from Python, and then bring in the Qt complications.

Python and Unicode

Python distinguishes between Unicode strings and 'normal' strings, that is, strings where every byte represents one character. Plain Python strings are often used as character arrays representing immutable binary data. In fact, plain strings are semantically very similar to Java's byte array, or Qt's QByteArray class: they represent a simple sequence of bytes, where every byte may represent a character, but could also represent something quite different, not a human-readable text at all.
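The distinction can be sketched as follows. Note the assumption: this is Python 3, where the type called bytes plays the role of the plain string described here and str is the Unicode string; in the Python 2 of this book the corresponding types are str and unicode.

```python
# Python 3 sketch: bytes is the 'plain' string, str is the Unicode string.
text = "\u0411\u0412"        # two Cyrillic characters
raw = text.encode("utf-8")   # the same text as a sequence of bytes

print(len(text))   # counts characters
print(len(raw))    # counts bytes
print(list(raw))   # each element is just a number between 0 and 255
```

Nothing in raw itself says that it holds text; that knowledge lives in the encoding name you supply when converting.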

Creating a Unicode string is a bootstrapping problem. Whether you use BlackAdder's Scintilla editor or another editor, it will probably not support Unicode input, so you cannot type Chinese characters directly. However, there are clever ways around this problem: you can either type hex codes, or construct your strings from other sources. In the third part of this book we will create a small but fully functional Unicode editor.

String literals

You can create a Unicode string literal by prefixing the string with the letter u, or convert a plain string to Unicode with the built-in unicode function. You cannot, however, write Python code using anything but ASCII. If you look at the following script, you will notice that there is a function defined in Chinese characters (yin4shua1 means print) that tries to print the opening words of the Nala, a Sanskrit epic. Python cannot handle this, so all actual code must be in ASCII.

A Python script written in Unicode.

Of course, it would be nice if we could at least type the strings directly in UTF-8, as shown in the next screenshot:

A Python script with the strings written in Unicode.

Unfortunately, this won't work either. Hidden deep in the bowels of the Python startup process, a default encoding is set for all strings. This encoding is used to convert from Unicode whenever the Unicode string has to be presented to outside world components that don't talk Unicode, such as print. By default this is 7-bit ASCII. Running the script gives the following error:

boudewijn@maldar:~/doc/opendoc/ch4 > python unicode2.py

Traceback (most recent call last):
  File "unicode2.py", line 4, in ?
    nala()
  File "unicode2.py", line 2, in nala
    print u"आसीद राजा नलो नाम "
UnicodeError: ASCII encoding error: ordinal not in range(128)
        

The default ASCII encoding that Python assumes when creating Unicode strings means that you cannot create Unicode strings from non-ASCII text directly, without explicitly telling Python what is happening. This is because Python tries to convert the bytes from ASCII to Unicode, and every byte with a value greater than the maximum ASCII knows (127) will lead to the above error. The solution is to use an explicit encoding. The following script will work better:

Explicitly telling Python that a string literal is in the utf-8 encoding.

If you run this script in a Unicode-enabled terminal, like a modern xterm, you will see the first line of the Nala neatly printed. Quite an achievement!
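Since the script itself is only shown as a screenshot, here is a rough sketch of the underlying idea rather than the script from the figure. It assumes Python 3, and builds the utf-8 bytes in code as a stand-in for what an editor would have saved in the source file.

```python
# Python 3 sketch: a byte string only becomes text once an encoding is named.
raw = "\u0928\u0932".encode("utf-8")  # stand-in for bytes typed into a utf-8 file

try:
    raw.decode("ascii")               # what the default ASCII codec attempts
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False                  # bytes above 127 are rejected

text = raw.decode("utf-8")            # explicit encoding: succeeds
print(ascii_ok, len(text))
```

In the Python 2 of this chapter the explicit step would be spelled unicode(raw, 'utf-8'), as in the file-reading example further on.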

You can find out which encodings your version of Python supports by looking in the encodings folder of your Python installation. It will certainly include mainstays such as ascii, iso8859-1 to iso8859-15, utf-8, latin-1 and a host of Macintosh encodings as well as MS-DOS codepage encodings. Simply substitute a dash for every underscore in the filename to arrive at the string you can use in the encode() and decode() functions.

boudewijn@maldar:/usr/local/lib/python2.0/encodings > ls *py
__init__.py cp1254.py cp852.py cp869.py      iso8859_5.py    
aliases.py  cp1255.py cp855.py cp874.py      iso8859_6.py    
ascii.py    cp1256.py cp856.py cp875.py      iso8859_7.py    
charmap.py  cp1257.py cp857.py iso8859_1.py  iso8859_8.py    
cp037.py    cp1258.py cp860.py iso8859_10.py iso8859_9.py    
cp1006.py   cp424.py  cp861.py iso8859_13.py koi8_r.py       
cp1026.py   cp437.py  cp862.py iso8859_14.py latin_1.py      
cp1250.py   cp500.py  cp863.py iso8859_15.py mac_cyrillic.py 
cp1251.py   cp737.py  cp864.py iso8859_2.py  mac_greek.py    
cp1252.py   cp775.py  cp865.py iso8859_3.py  mac_iceland.py  
cp1253.py   cp850.py  cp866.py iso8859_4.py  mac_latin2.py   
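If you would rather ask Python than list a directory, something along these lines should work. It is a sketch for Python 3 (an assumption; pkgutil did not exist in Python 2.0), and the set of modules you get depends on your installation.

```python
import codecs
import encodings
import pkgutil

# Module names found in the encodings package (the same names as the files above).
names = sorted(m.name for m in pkgutil.iter_modules(encodings.__path__))
print(len(names), "codec modules, for example:", names[:5])

# The dashed and the underscored spelling resolve to the same codec.
dashed = codecs.lookup("koi8-r")
underscored = codecs.lookup("koi8_r")
print(dashed.name, underscored.name)
```

The list also contains helper modules such as aliases, which are not codecs themselves.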

Reading from files

The same problem will occur when reading text from a file. Python has to be explicitly told when the file is in an encoding different from the default encoding. Python's file object reads files as bytes and returns a plain string. If the contents are not encoded in Python's default encoding (ASCII), you will have to be explicit about it. Let's try reading the preceding script, unicode3.py, which was saved in utf-8 format.

Example 8-6. Loading an utf-8 encoded text

#
# readutf8.py - read an utf-8 file into a Python Unicode string
#

import sys, codecs

def usage():
    print """
Usage:

python readutf8.py file1 file2 ... filen
"""

def main(args):
    if len(args) < 1:
        usage()
        return

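    # First approach: read the raw bytes, then decode them explicitly.
    files=[]
    print "Reading",
    for arg in args:
        print arg,
        f=open(arg, "rb")
        s=f.read()
        f.close()
        u=unicode(s, 'utf-8')
        files.append(u)
    print

    # Second approach: let codecs.open decode while reading.
    files2=[]
    print "Reading directly as Unicode",
    for arg in args:
        print arg,
        f=codecs.open(arg, "rb", "utf-8")
        u=f.read()
        f.close()
        files2.append(u)
    print

    # Both approaches should yield identical Unicode strings.
    for i in range(len(files)):
        if files[i]==files2[i]:
            print "OK"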

if __name__=="__main__":
    main(sys.argv[1:])
          

As you can see, you either load the text in a string and convert it to a Unicode string, or use the special open function defined in the codecs module. The latter option lets you specify the encoding once, when opening the file, so that everything you read from it afterwards arrives as Unicode.
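For comparison, the same two approaches can be sketched in Python 3 (an assumption; the listing above is Python 2). Here the built-in open accepts an encoding argument and takes over the job of codecs.open. The temporary file is a throwaway stand-in for unicode3.py.

```python
import os
import tempfile

# Write a small utf-8 file to read back (a stand-in for a real input file).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("\u0411\u0412 - Cyrillic\n".encode("utf-8"))

# Way 1: read bytes, then decode explicitly.
with open(path, "rb") as f:
    decoded = f.read().decode("utf-8")

# Way 2: name the encoding when opening and let the file object decode.
with open(path, "r", encoding="utf-8") as f:
    direct = f.read()

print(decoded == direct)
```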

Other ways of getting Unicode characters into Python string objects

We've now seen how to get Unicode data in our strings from either literal text entered in the Python code or from files. There are several other ways of constructing Unicode strings. You can build strings using the Unicode escape codes, or from a sequence of Unicode characters.

For this purpose, Python offers unichr, which returns a Unicode string exactly one character long when called with a numerical argument between 0 and 65535. This can be useful when building tables. The resultant character can, of course, only be printed when encoded with the right encoding.

Example 8-7. Building a string from single Unicode characters

#
# unichar.py Building strings from single chars.
#
import codecs

CYRILLIC_BASE=0x0400

uList=[]
for c in range(255):
    uList.append(unichr(CYRILLIC_BASE + c))

# Combine the characters into a string - this is
# faster than doing u=u+unichr(c) in the loop
u=u"".join(uList)

# Let the codecs file object do the encoding...
f=codecs.open("cyrillic1.ut8", "w", "utf-8")
f.write(u)
f.close()

# ...or encode explicitly and write the resulting bytes.
f=open("cyrillic2.ut8", "wb")
f.write(u.encode("utf-8"))
f.close()
          

Note that even if you construct your Unicode string from separate Unicode characters, you will still need to provide an encoding when printing (utf-8, to be exact). Note also that when writing text to a file, you will need to explicitly tell Python that you are not using ASCII.
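A Python 3 rendering of the same idea might look like the sketch below. The assumptions are that chr stands in for unichr, and that a scratch file in a temporary directory (a hypothetical name, not one of the files above) is good enough for the demonstration.

```python
import os
import tempfile

CYRILLIC_BASE = 0x0400

# chr() is the Python 3 counterpart of unichr().
text = "".join(chr(CYRILLIC_BASE + offset) for offset in range(255))

# Hypothetical scratch file; the encoding is named when the file is opened.
path = os.path.join(tempfile.mkdtemp(), "cyrillic.utf8")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

print(len(text), os.path.getsize(path))
```

Every character in this range takes two bytes in UTF-8, so the file is twice as long as the string.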

Another way of adding the occasional Unicode character to a string is by using the \uXXXX escape codes. Here XXXX is a hexadecimal number between 0x0000 and 0xFFFF:

Python 2.1 (#1, Apr 17 2001, 20:50:35)
[GCC 2.95.2 19991024 (release)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> u=u"\u0411\u0412"
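To check what such escapes actually produce, you can compare them with the characters built from their numeric values. This is a sketch for Python 3 (an assumption), where the u prefix is optional and chr replaces unichr; unicodedata is the standard library module that knows the official character names.

```python
import unicodedata

u = "\u0411\u0412"   # Python 3: every string literal is a Unicode string

# The escapes produce exactly the characters that chr() returns.
same = (u == chr(0x0411) + chr(0x0412))

# unicodedata gives the official character names.
names = [unicodedata.name(ch) for ch in u]
print(same, names)
```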
        

About codecs and locales: With all this messing about with codecs you will no doubt have wondered why Python can't figure out that you live in, say, Germany, and want the iso-8859-1 codec by default, just like the rest of your system (such as your mail client, your word processor and your file system) uses. The answer is twofold. Python does have the ability to determine from your system which codec it should use by default. This feature, however, is disabled, because it is not one-hundred percent reliable. You can enable that code, or change the default codec system-wide, for all Python programs you use, by hacking the site.py file in your Python library directory:

# Set the string encoding used by the Unicode implementation.  The
# default is 'ascii', but if you're willing to experiment, you can
# change this.

encoding = "ascii" # Default value set by _PyUnicode_Init()

if 0:
    # Enable to support locale aware default string encodings.
    import locale
    loc = locale.getdefaultlocale()
    if loc[1]:
        encoding = loc[1]

...

if encoding != "ascii":
    sys.setdefaultencoding(encoding)
          

Either change the line encoding = "ascii" to the codec associated with the locale you live in, or enable the locale aware default string encodings by setting the line if 0: to if 1:.
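Before changing anything, it can help to see what your system reports. The following sketch assumes Python 3, where the locale module offers getpreferredencoding; the values printed depend entirely on your machine.

```python
import locale
import sys

# The encoding the operating system's locale prefers for text.
preferred = locale.getpreferredencoding(False)

# The interpreter's own default for conversions between text and bytes.
default = sys.getdefaultencoding()

print(preferred, default)
```

In Python 3 the interpreter default is fixed, so only the first value varies from system to system.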

It would be nice if you could call sys.setdefaultencoding(encoding) to set a default encoding for your application, such as utf-8. But, and you don't want to hear this, this useful function is intentionally deleted from the sys module when Python is started, just after the file site.py is run on startup.

What can one do? Of course, it's all very well to assume that all users on a system work with one encoding and never make trips to other encodings, or to assume that developers don't need to set a default encoding per application because the system will take care of that, but I'd still like the power.

Fortunately, there's a solution. I'll probably get drummed out of the regiment for suggesting it, but it's so useful, I'll tell it anyway. Create a file called sitecustomize.py as follows:

Example 8-8. sitecustomize.py — saving a useful function from wanton destruction

#
# sitecustomize.py - saving a useful function. Copy it
# somewhere on the Python path, like the site-packages directory
#
import sys

sys.setappdefaultencoding=sys.setdefaultencoding
          

Make this file a part of your application distribution and have it somewhere on the Python path which is used for your application. It is imported automatically by site.py during startup, before site.py deletes setdefaultencoding, and saves the useful function under another name. Since functions are simply references to objects and those objects are only deleted when the last reference is deleted, the function is saved for use in your applications.

Now you can set UTF-8 as the default encoding for your application by calling the function as soon as possible in the initialization part of your application:

Example 8-9. uniqstring3.py - messing with Unicode strings using utf-8 as default encoding

#
# uniqstring3.py - coercing Python strings into and from QStrings
#
from qt import QString
import sys

sys.setappdefaultencoding("utf-8")

s="A string that contains just ASCII characters"
u=u"\u0411\u0412 - a string with a few Cyrillic characters"

print s
print u
            

Qt and Unicode

As mentioned earlier, QString is the equivalent of a Python Unicode string. You can coerce any Python string or any Python Unicode object into a QString, and vice versa: you can convert a QString to either a Python string object, or to a Python Unicode object.

If you want to create a plain Python string from a QString object, you can simply apply the str() function to it: this is done automatically when you print a QString.

Unfortunately, there's a snake in the grass. If the QString contains characters outside the ASCII range, you will hit the limits dictated by the default ASCII codec defined in Python's site.py.

Example 8-10. uniqstring1.py - coercing Python strings into and from QStrings

#
# uniqstring1.py - coercing Python strings into and from QStrings
#
from qt import QString

s="A string that contains just ASCII characters"
u=u"\u0411\u0412 - a string with a few Cyrillic characters"

qs=QString(s)
qu=QString(u)

print str(qs)
print str(qu)
        
boud@calcifer:~/doc/opendoc/ch4 > python uniqstring1.py
A string that contains just ASCII characters

Traceback (most recent call last):
  File "uniqstring1.py", line 13, in ?
    print qu
  File "/usr/local/lib/python2.1/site-packages/qt.py", line 954, in __str__
    return str(self.sipThis)
UnicodeError: ASCII encoding error: ordinal not in range(128)
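The same failure can be reproduced without Qt. The sketch below assumes Python 3 and uses an explicit encode("ascii") as a stand-in for what the default codec does behind the scenes in the traceback above.

```python
# Python 3 sketch of the same failure, with no Qt involved.
text = "\u0411\u0412 - a string with a few Cyrillic characters"

try:
    text.encode("ascii")            # roughly what str() attempted above
    failed = False
except UnicodeEncodeError:
    failed = True                   # the Cyrillic characters have no ASCII form

as_utf8 = text.encode("utf-8")      # naming a capable codec avoids the error
print(failed, len(as_utf8))
```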
        

If there's a chance that there are non-ASCII characters in the QString you want to convert to Python, you should create a Python unicode object, instead of a string object, by applying unicode to the QString.

Example 8-11. uniqstring2.py - coercing Python strings into and from QStrings

#
# uniqstring2.py - coercing Python strings into and from QStrings
#
from qt import QString

s="A string that contains just ASCII characters"
u=u"\u0411\u0412 - a string with a few Cyrillic characters"

qs=QString(s)
qu=QString(u)

print unicode(qs)
print unicode(qu)