The Updated Guide to Unicode on Python

I figured that it might be the right time to do an updated introduction to unicode in Python, primarily because the unicode story gained a whole lot of new confusing aspects on Python 3 that a developer needs to know about.

Let’s start first with how unicode worked on Python 2.

Unicode on Python 2

Unicode on Python 2 is a fairly simple thing. There are two types of string literals: bytestrings (which look like this on 2.x: 'foo') and unicode strings (which have a leading u prefix like this: u'foo'). Since 2.6 you can also be explicit about bytestrings and write them with a leading b prefix like this: b'foo'.

Python 2’s biggest problem with unicode was that some APIs did not support it. The most common ones were many filesystem operations, the datetime module, the csv reader and quite a few interpreter internals. In addition to that, a few APIs only ever worked with non-unicode strings or caused a lot of confusion if you introduced unicode. For instance docstrings break some tools if they are unicode instead of bytestrings, the return value of __repr__ must only ever be bytes and not unicode strings, etc.

Aside from that, Python 2 had one feature that usually confused developers: a byte string, for as long as it only contained ASCII characters, could be upgraded to a unicode string implicitly. If however it was not ASCII safe, it would cause some form of UnicodeError: either a UnicodeEncodeError or a UnicodeDecodeError depending on when it failed.
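
For illustration, this is roughly what the implicit coercion looked like on a 2.x interpreter:

    >>> u'foo' + 'bar'    # ASCII-only bytestring is upgraded implicitly
    u'foobar'
    >>> u'foo' + '\xff'   # non-ASCII bytes cannot be decoded as ASCII
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
    ordinal not in range(128)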

Because of all that the rule of thumb on 2.x was this:

- The first time you know your encoding, properly decode from bytes into unicode.
- When it’s most convenient for you and you know the target encoding, encode back to bytes.
- Internally feel free to use byte literals for as long as they are restricted to the ASCII subset.

This worked really well for many 2.x libraries. On Flask for instance you will only encounter unicode issues if you try to pass byte string literals with non-ASCII characters to the templates or if you try to use Flask with APIs that do not support unicode. Aside from that it takes a lot of work to create a unicode error.

This is accomplished because the whole WSGI layer is byte based and the whole Flask layer is unicode based (for text). As such Flask just does the decoding when data transfers from WSGI over to Flask. Likewise the return value is inspected and if the return type is unicode it will automatically be encoded before handing data back to the WSGI layer.
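
As a minimal sketch of that pattern (hypothetical Python 2 style code, not Flask’s actual implementation):

    def application(environ, start_response):
        # decode once at the boundary (WSGI hands us bytes) ...
        name = environ.get('QUERY_STRING', '').decode('utf-8')
        # ... work with unicode internally ...
        body = u'Hello ' + name
        start_response('200 OK',
                       [('Content-Type', 'text/plain; charset=utf-8')])
        # ... and encode once on the way back out to WSGI
        return [body.encode('utf-8')]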

Basic Unicode on Python 3

On Python 3 two things happened that make unicode a whole lot more complicated. The biggest one is that the bytestring was removed. It was replaced with an object called bytes which is created by the Python 3 bytes syntax: b'foo'. It might look like a string at first, but it’s not. Unfortunately it does not share much of the API with strings.

The following code example shows that the bytes object is indeed very different from the string object:

    >>> 'key=%s' % 'value'
    'key=value'
    >>> 'key=%s' % b'value'
    "key=b'value'"
    >>> b'key=%s' % b'value'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for %: 'bytes' and 'bytes'
    >>> str(10)
    '10'
    >>> bytes(10)
    b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
    >>> list('foo')
    ['f', 'o', 'o']
    >>> list(b'foo')
    [102, 111, 111]
    >>> 'foo' == b'foo'
    False

One could argue that that’s fine, because you will no longer mix bytes and unicode, but unfortunately that’s not the case. The reason for this is that a whole bunch of APIs work on bytes and unicode strings interchangeably. For instance all the filesystem APIs operate on both unicode and bytes:

    >>> os.listdir('/tmp/test')
    ['Scheiß_Encoding']
    >>> os.listdir(b'/tmp/test')
    [b'Schei\xc3\x9f_Encoding']

That might not seem like a big deal at first, but such APIs have a tendency to spread further. For instance opening a file will set the name attribute to a “string” of that type:

    >>> open(b'/tmp/test/Schei\xc3\x9f_Encoding').name
    b'/tmp/test/Schei\xc3\x9f_Encoding'

As a result every user of the .name attribute will have to force it to the right type before interacting with it. The same thing has also been true on 2.x, however on 3.x this behavior is mostly undocumented.
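
One way to normalize such values on 3.x is os.fsencode/os.fsdecode (available since Python 3.2), which apply the filesystem encoding with surrogate escaping; the output below assumes a UTF-8 system:

    >>> import os
    >>> os.fsdecode(b'/tmp/test/Schei\xc3\x9f_Encoding')
    '/tmp/test/Scheiß_Encoding'
    >>> os.fsencode('/tmp/test/Scheiß_Encoding')
    b'/tmp/test/Schei\xc3\x9f_Encoding'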

It’s not just file operations, it also happens on other APIs like the urllib parsing module which can produce both bytes and unicode strings:

    >>> import urllib.parse as x
    >>> x.parse_qs(b'foo=bar')
    {b'foo': [b'bar']}
    >>> x.parse_qs('foo=bar')
    {'foo': ['bar']}

Magic Defaults in 3.x

Python 3 unfortunately made the choice of guessing a little bit too much with unicode in some places. When I asked at a conference what people believed the default encoding for text files on Python 3 was, most replied UTF-8. This is correct on some operating systems. It’s definitely true for OS X and it’s true for most Linux distributions I tried. But how does Python determine that encoding? The answer is by looking at the locale settings in the environment variables.
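
You can check what Python detected yourself; the result below assumes a UTF-8 locale, while under the C locale you would get something like 'ANSI_X3.4-1968' (i.e. ASCII):

    >>> import locale
    >>> locale.getpreferredencoding()
    'UTF-8'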

Unfortunately those break very quickly. A good example is SSH’ing from a German locale into a US Linux box that does not support the German locale. Linux will then attempt to set the locale, fail, and fall back to C, which implies ASCII. Python then very happily opens a file in ASCII mode. Here is the logic that Python applies to guessing the default encoding for files:

- It first starts out finding the device the file is located on and will try to get the encoding from that device. This function currently only ever does something for terminals. As far as I know it only ever does something really interesting on Windows, where it might return a codepage (which totally is not unicode, but that’s expected).
- The same function that finds out the device encoding might also call nl_langinfo(CODESET), which returns the current encoding that the locale system is aware of. Traditionally locale support was not initialized in the Python interpreter, but it definitely gets initialized somewhere. This call is also the one that can fail when a locale is set but not available (the SSH example from above).
- If for whatever reason device_encoding does not return anything (for instance because the device was not a terminal), it will try to import the locale module (which, by the way, is written in Python; it is always interesting to see code written in C import a Python module), call into the locale.getpreferredencoding function and use the return value of that.

Because it does not set the locale there, it basically only calls into nl_langinfo(CODESET) again. Because that call sometimes fails on OS X, it converts the return value for OS X into utf-8 if it does not otherwise get a useful result.

I am not a fan of that behavior and I strongly recommend explicitly passing the encoding of text files as third parameter. That’s how we did it on 2.x and that’s also how I recommend doing it on Python 3. I really wish the default encoding was changed to utf-8 in all cases except for terminal devices, and maybe have some encoding='auto' flag that guesses.

I failed installing a package on Python 3 a while ago because a contributor’s name contained non-ASCII characters and the setup.py file was opening the README file for the docstring. It worked fine on OS X and normal Linux, but broke hard when I SSH’ed into my Linux box from an Austrian OS X. I am not sure how many people run into that (I assume not a lot) but it’s annoying when it happens, and there is literally nothing that guarantees that a file opened in text mode and without a defined encoding is UTF-8. So do the world a favor and open text files like this:

    with open(filename, 'r', encoding='utf-8') as f:
        ...

Different Types of Unicode Strings

In addition to regular unicode strings, on Python 3 you have to deal with two additional types of unicode strings. The reason for this is that a library (or the Python interpreter) does not have enough knowledge about the encoding, so it has to apply some tricks. Where in Python 2.x we made a string stick to being bytes in that case, on Python 3 there are two more choices. These strings don’t have proper names and look like regular unicode strings, so I am going to give them names for the sake of the argument. Let’s call the regular unicode string a “text” string. Each character in that string is correctly represented internally and no surprises are to be expected.

In addition to that there are strings I would call “transport decoded” strings. Those strings are used in a few places. The most common place where you are dealing with them is the WSGI protocol and most things that interface with HTTP. WSGI declares that strings in the WSGI environment are represented as incorrectly decoded latin1 strings. In other words, all unicode strings in the Python 3 WSGI environment are actually incorrectly decoded for any codepoint above ASCII. In order to properly decode such strings you will need to encode the string back to latin1 and decode it from the intended encoding. Werkzeug internally refers to such strings as “dance encoded” strings. The following logic has to be applied to properly re-decode them to the actual character set:

    def wsgi_decoding_dance(s, charset='utf-8', errors='replace'):
        return s.encode('latin1').decode(charset, errors)

    def wsgi_encoding_dance(s, charset='utf-8', errors='replace'):
        if isinstance(s, bytes):
            return s.decode('latin1', errors)
        return s.encode(charset).decode('latin1', errors)
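
To see the round trip, here is a dance encoded value being repaired (a made-up example value, assuming the intended charset is UTF-8):

    >>> wsgi_value = 'Scheiß'.encode('utf-8').decode('latin1')
    >>> wsgi_value    # what the WSGI environ actually hands you
    'ScheiÃ\x9f'
    >>> wsgi_decoding_dance(wsgi_value)
    'Scheiß'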

This logic is not just required for WSGI however; the same requirement comes up for any MIME and HTTP header. Theoretically it’s not a problem for these headers because they are limited to latin1 out of the box and use explicit encoding information if a string does not fit into latin1. Unfortunately, in practical terms it’s not uncommon for certain headers to be utf-8 encoded. This is incredibly common with custom headers emitted by applications, as well as the cookie headers if the cookie header is set via JavaScript, as the browser API does not provide automatic encoding.

The second string type that is common on Python 3 is the “surrogate escaped string”. These are unicode strings that cannot be encoded to a unicode encoding because they are actually invalid. Such strings are created by APIs that assume an encoding is a specific one but cannot guarantee it because the underlying system does not fully enforce it. This functionality is provided by the 'surrogateescape' error handler:

    >>> letter = '\N{LATIN CAPITAL LETTER U WITH DIAERESIS}'.encode('latin1')
    >>> decoded_letter = letter.decode('utf-8', 'surrogateescape')
    >>> decoded_letter
    '\udcdc'

This for instance happens with os.environ as well as all the unicode based filesystem functions. If you try to encode such a string to utf-8 you will receive a UnicodeEncodeError:

    >>> decoded_letter.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character
      '\udcdc' in position 0: surrogates not allowed

To solve this problem you need to encode such strings with the encoding error handling set to 'surrogateescape'. By extension, this means that strings received from functions that might carry surrogates need to be resolved before being passed to APIs that do not deal with such strings.
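
For instance the escaped letter from above can be turned back into the original byte that way:

    >>> decoded_letter.encode('utf-8', 'surrogateescape')
    b'\xdc'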

This primarily means that you have two options: change all your encode() error handling anywhere in your codebase from 'strict' (which is the default) to 'surrogateescape', or remove surrogates from your strings. The easiest way I believe is going through an encode/decode dance. I believe that currently that’s also the only simple way to check if something was indeed surrogate escaped.

My suggestion is that every time you deal with an API that might produce surrogate escaped strings (os.environ etc.) you should just do a basic check if the value is surrogate escaped and raise an error (or remove the surrogate escaping and call it a day). But don’t forward those strings onwards as it will make it very painful to figure out what’s wrong later.

If you for instance pass such a string to a template engine you will get an error somewhere else entirely, and because the encoding happens at a much later stage you no longer know why the string was incorrect. If you detect that error when it happens, the issue becomes much easier to debug (basically restores 2.x behavior).

These functions might be useful:

    def remove_surrogate_escaping(s, method='ignore'):
        assert method in ('ignore', 'replace'), 'invalid removal method'
        return s.encode('utf-8', method).decode('utf-8')

    def is_surrogate_escaped(s):
        try:
            s.encode('utf-8')
        except UnicodeEncodeError as e:
            if e.reason == 'surrogates not allowed':
                return True
            raise
        return False
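
Usage looks like this:

    >>> s = b'\xff'.decode('utf-8', 'surrogateescape')
    >>> is_surrogate_escaped(s)
    True
    >>> remove_surrogate_escaping(s)
    ''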

Both “transport decoded” and “surrogate escaped” strings are the same type as regular strings, so the best way to keep them apart is to memorize where they come from. In Werkzeug I wrote helper functions that fetch the strings from their container (the WSGI environ) and immediately decode them, so that a user never has to deal with the low level details.

The following interfaces produce some of those strings:

    API                     String Type
    ----------------------  -------------------------------------
    os.environ              surrogate escaped
    os.listdir              surrogate escaped
    WSGI environ            transport decoded (latin1)
    HTTP/MIME headers       transport decoded (latin1)
    email text payload      surrogate escaped
    nntplib (all data)      surrogate escaped
    os.exec* functions      surrogate escaped (except on Windows)
    subprocess environ      surrogate escaped (except on Windows)
    subprocess arguments    surrogate escaped (except on Windows)

There are also some special cases in the stdlib where strings are very confusing. The cgi.FieldStorage module, which WSGI applications are sometimes still using for form data parsing, now treats QUERY_STRING as surrogate escaped, but instead of using utf-8 as the charset for URLs (as browsers do) it uses the encoding returned by locale.getpreferredencoding(). I have no idea why it would do that, but it’s incorrect. As a workaround I recommend not using cgi.FieldStorage for query string parsing.
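
A minimal sketch of what such a workaround can look like, reusing the wsgi_decoding_dance helper from above (the get_query_args name is made up for illustration):

    from urllib.parse import parse_qs

    def get_query_args(environ, charset='utf-8'):
        # QUERY_STRING in a Python 3 WSGI environ is transport decoded
        # (latin1), so dance-decode it first; parse_qs then unquotes
        # percent escapes as UTF-8, like browsers do.
        qs = wsgi_decoding_dance(environ.get('QUERY_STRING', ''), charset)
        return parse_qs(qs)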

Unfortunately the docs are generally very quiet about where surrogate escaping is used and where not. Currently the best way to find out is to look at the source.

Detecting Errors

On Python 2.x, detecting misuse of unicode was quite simple. Generally if you did dodgy things you got some form of UnicodeError or UnicodeWarning: usually either a fatal UnicodeEncodeError or UnicodeDecodeError, or a logged UnicodeWarning. The latter for instance happened when comparing bytes and unicode where the bytes could not be decoded from ASCII. On Python 3 the situation unfortunately looks very different.

- AttributeError: this usually happens if you try to use a string-only API on a bytes object. Usually this happens for calls to casefold(), encode(), or format().
- TypeError: this can happen for a variety of different reasons. The most common one is string formatting, which does not work on bytes. If you try to do foo % bar and foo turns out to be a bytes object you will get a TypeError. Another form of this is that something iterates over a string and expects a one-character string to be returned but actually gets an integer.
- UnicodeEncodeError: usually happens now due to surrogate escaping problems, when you are not using the 'surrogateescape' error handler on encoding strings or forget to remove surrogates from strings.
- Garbled unicode: happens if you are not dealing with transport decoded strings properly. This usually happens with WSGI. The best way to catch this is to never expose WSGI strings directly and always go through an extra level of indirection. That way you don’t accidentally mix unicode strings of different types.
- No error: this happens for instance when you compare bytes and strings; the comparison will just return False without giving a warning. This can be remedied by running the Python interpreter with the -b flag, which will emit warnings for bytes and text comparisons.
- Running out of memory / huge strings: this happens when you try to pass a large integer to the bytes() constructor. I have seen this happen a few times when porting to Python 3, where the pattern was a form of “if the object is not an instance of bytes, call bytes() on it”. This is dangerous because integers are valid input values to the bytes() constructor, which will allocate as many null bytes as the integer passed. The recommendation is to stop using that pattern and write a soft_bytes function that catches integer parameters before passing them to bytes (see the sketch below).
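
A minimal sketch of such a soft_bytes helper (the name and exact behavior are assumptions, not a stdlib API):

    def soft_bytes(value, encoding='utf-8'):
        """Like bytes(), but refuses the dangerous integer form."""
        if isinstance(value, bytes):
            return value
        if isinstance(value, str):
            return value.encode(encoding)
        # bytes(10**9) would happily allocate a billion null bytes,
        # so reject anything that is not already string-like.
        raise TypeError('expected bytes or str, got %r'
                        % type(value).__name__)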

Writing Unicode/Bytes Combination APIs

Because there are so many cases where an API can return either bytes or unicode strings depending on where they come from, new patterns need to be created. In Python 2 that problem solved itself because bytestrings were promoted to unicode strings automatically. On Python 3 that is no longer the case, which makes it much harder to implement APIs that do both.

Werkzeug and Flask use the following helpers to provide (or work with) APIs that deal with both strings and bytes:

    def normalize_string_tuple(tup):
        """Ensures that all types in the tuple are either strings
        or bytes.
        """
        tupiter = iter(tup)
        is_text = isinstance(next(tupiter, None), str)
        for arg in tupiter:
            if isinstance(arg, str) != is_text:
                raise TypeError('Cannot mix str and bytes arguments (got %s)'
                                % repr(tup))
        return tup

    def make_literal_wrapper(reference):
        """Given a reference string it returns a function that can be
        used to wrap ASCII native-string literals to coerce it to the
        given string type.
        """
        if isinstance(reference, str):
            return lambda x: x
        return lambda x: x.encode('ascii')

These functions together go quite far to make APIs work for both strings and bytes. For instance this is how URL joining works in Werkzeug, which is enabled by the normalize_string_tuple and make_literal_wrapper helpers:

    def url_unparse(components):
        scheme, netloc, path, query, fragment = \
            normalize_string_tuple(components)
        s = make_literal_wrapper(scheme)
        url = s('')
        if netloc or (scheme and path.startswith(s('/'))):
            if path and path[:1] != s('/'):
                path = s('/') + path
            url = s('//') + (netloc or s('')) + path
        elif path:
            url += path
        if scheme:
            url = scheme + s(':') + url
        if query:
            url = url + s('?') + query
        if fragment:
            url = url + s('#') + fragment
        return url
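
A quick interactive illustration of the same function handling both string types:

    >>> url_unparse(('http', 'example.com', '/index', '', ''))
    'http://example.com/index'
    >>> url_unparse((b'http', b'example.com', b'/index', b'', b''))
    b'http://example.com/index'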

This way the function only needs to be written once to handle both bytes and strings, which in my mind is a nicer solution than what the standard library does: implementing every function twice, which means a lot of copy/pasting.

Another problem is wrapping file objects in Python 3, because they now only support either text or bytes, but there is no documented interface to figure out which one they accept. Flask uses the following workaround:

    def is_text_reader(s):
        """Given a file object open for reading this function checks if
        the reader is text based.
        """
        return type(s.read(0)) is str

    def is_bytes_reader(s):
        """Given a file object open for reading this function checks if
        the reader is bytes based.
        """
        return type(s.read(0)) is bytes

    def is_text_writer(s):
        """Given a file object open for writing this function checks if
        the writer is text based.
        """
        try:
            s.write('')
            return True
        except TypeError:
            return False

    def is_bytes_writer(s):
        """Given a file object open for writing this function checks if
        the writer is bytes based.
        """
        try:
            s.write(b'')
            return True
        except TypeError:
            return False
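
For example:

    >>> import io
    >>> is_text_reader(io.StringIO())
    True
    >>> is_bytes_reader(io.BytesIO())
    True
    >>> is_text_writer(io.BytesIO())
    False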

For instance Flask uses this to make JSON work with both text and bytes again, similar to how it worked in 2.x:

    import io
    import json as _json

    def load(fp, **kwargs):
        encoding = kwargs.pop('encoding', None) or 'utf-8'
        if is_bytes_reader(fp):
            fp = io.TextIOWrapper(io.BufferedReader(fp), encoding)
        return _json.load(fp, **kwargs)

    def dump(obj, fp, **kwargs):
        encoding = kwargs.pop('encoding', None)
        if encoding is not None and is_bytes_writer(fp):
            fp = io.TextIOWrapper(fp, encoding)
        _json.dump(obj, fp, **kwargs)

Unicode is Hard

Unicode is still hard, and in my experience it’s not much easier on 3.x than it was on 2.x. While the transition forced me to make some APIs work better with unicode (and be more correct now), I still had to add a lot of extra code that was not necessary on Python 2. If someone designs another dynamic language in the future, I believe the correct solution would be this:

- Take the approach of Python 2.x and allow mixing of bytes and unicode strings.
- Make 'foo' mean unicode strings and b'foo' mean byte strings.
- Make byte strings have an encoding attribute that defaults to ASCII.
- Add a method to replace the encoding information (e.g. b'foo'.replace_encoding_hint('latin1')).
- When comparing strings and bytes, use the encoding hint instead of the ASCII default (or more correctly the system default encoding, which for better or worse was always ASCII).
- Have a separate bytes type that works exactly like strings but is not hashable, cannot carry encoding information, and generally just barks when trying to convert it to strings. That way you can tag true binary data, which can be useful sometimes (for instance for serialization interfaces).

If someone wants to see how much complexity the new unicode support in Python 3 caused, have a look at the code of the os module on 3.x, the internal io module file operation utilities, and things like urllib.parse.

On the bright side: not much changes for high level users of Python. I think Flask for instance provides a painless unicode experience on both 2.x and 3.x. Users are almost entirely shielded from the complexities of unicode handling. The higher level the API, the smaller the role encoding plays in it.
