More About Unicode in Python 2 and 3

It’s becoming increasingly hard to have reasonable discussions about the differences between Python 2 and 3 because one language is dead and the other is actively developed. So when someone starts a discussion about the Unicode support between those two languages it’s not an even playing field. So I won’t discuss the actual Unicode support of those two languages but the core model of how to deal with text and bytes in both.

I will use this post to show, from the pure design of the language and standard library, why Python 2 is the better language for dealing with text and bytes.

Since I have to maintain lots of code that deals exactly with the path between Unicode and bytes, this regression from 2 to 3 has caused me lots of grief. Seeing slides by core Python maintainers about how I should trust them that 3.3 is better than 2.7 makes me more than angry.

The Text Model

The main difference between Python 2 and Python 3 is the basic types that exist to deal with texts and bytes. On Python 3 we have one text type: str which holds Unicode data, and two byte types: bytes and bytearray.

On the other hand, on Python 2 we have two text types: str which for all intents and purposes is limited to ASCII + some undefined data above the 7 bit range, unicode which is equivalent to the Python 3 str type, and one byte type bytearray which it inherited from Python 3.

Looking at that you can see that Python 3 removed something: support for non-Unicode text data. For that sacrifice it gained a hashable byte type, the bytes object. bytearray is a mutable type, so it’s not suitable for hashing. I very rarely use true binary data as dictionary keys though, so it does not show up as a big problem. Especially not because in Python 2 you can just put bytes into the str type without issues.
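
To see the hashability difference in a Python 3 session (a quick illustrative sketch, the key is made up):

>>> cache = {b"raw-key": "value"}     # bytes is hashable, works as a dict key
>>> cache[b"raw-key"]
'value'
>>> {bytearray(b"raw-key"): "value"}  # bytearray is mutable, so it is not
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'bytearray'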

The Lost Type

Python 3 essentially removed the byte-string type which in 2.x was called str. On paper there is nothing inherently wrong with it. From a purely theoretical point of view, text that is always Unicode sounds awesome. And it is, if your whole world is just your interpreter. Unfortunately that’s not how it works in the real world, where you need to interface with bytes and different encodings on a regular basis, and for that the Python 3 model completely breaks down.

Let me be clear upfront: Python 2’s way of dealing with Unicode is error prone and I am all in favour of improving it. My point though is that the one in Python 3 is a step backwards and brought so many more issues that I absolutely hate working with it.

Unicode Errors

Before I go into the details, we need to understand what the differences in Unicode support between Python 2 and 3 are, and why the decision was made to change it.

Python 2, like many languages before it, was created without support for dealing with strings of different encodings. A string was a string and it contained bytes. It was up to the developer to properly deal with different encodings manually. This actually works remarkably fine for many situations. The Django framework for many years did not support Unicode at all and used the byte-string interface in Python exclusively.

Python 2 however also gained better and better support for Unicode internally over the years, and through this Unicode support it gained support for different encodings to represent that Unicode data.

In Python 2 the way of dealing with strings of a specific encoding was actually remarkably simple when it started out. You took a string you got from somewhere (which was a byte-string) and decoded it from the encoding you got from a side-channel (header data, metadata, specification) into a Unicode string. Once it was a Unicode string, it supported the same operations as a regular byte-string but it supported a much larger character range. When you needed to send that string elsewhere for processing you usually encoded it back into an encoding that the other system could deal with and it became a byte-string again.
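
A sketch of that flow in a Python 2 session (the encodings here are just illustrative):

>>> raw = 'Hello W\xc3\xb6rld'      # byte-string from the wire, UTF-8 per the headers
>>> text = raw.decode('utf-8')      # decode at the boundary
>>> text
u'Hello W\xf6rld'
>>> text.encode('latin1')           # encode again for a latin1-only consumer
'Hello W\xf6rld'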

So what were the issues with that? At the core this worked; unfortunately Python 2 needed to provide a nice migration path from the non-Unicode into the Unicode world. This was done by allowing coercion of byte-strings and non byte-strings. When does this happen and how does it work?

Essentially, when you have an operation involving a byte-string and a Unicode-string, the byte-string is promoted into a Unicode string by going through an implicit decoding process that uses the “default encoding”, which is set to ASCII. Python did provide a way to change this encoding at one point, but nowadays the site.py module removes the function to set this encoding after it sets the encoding to ASCII. If you start Python with the -S flag, the sys.setdefaultencoding function is still there and you can experiment with what happens if you set your Python default encoding to UTF-8 for instance.
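
For instance (a Python 2 session started with the -S flag; shown purely for experimentation):

$ python2 -S
>>> import sys
>>> sys.setdefaultencoding('utf-8')
>>> "\xc3\xbc" + u"ber"     # the byte-string is now implicitly decoded as UTF-8
u'\xfcber'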

So here are some situations where the default encoding kicks in:

    Implicit encoding upon string concatenation:

    >>> "Hello " + u"World"u'Hello World'

    Here the string on the left is decoded by using the default system encoding into a Unicode string. If it contained non-ASCII characters this would normally blow up with a UnicodeDecodeError because the default encoding is set to ASCII.

    Implicit encoding through comparison:

    >>> "Foo" == u"Foo"True

    This sounds more evil than it is. Essentially it decodes the left side to Unicode and then compares. In case the left side cannot be decoded it will warn and return False. This is actually surprisingly sane behavior even though it sounds insane at first.

    Implicit decoding as part of a codec.

    This one is an evil one and most likely the source of all confusion about Unicode in Python 2. Confusing enough that Python 3 took the absolutely insanely radical step and removed .decode() from Unicode strings and .encode() from byte strings, and caused me major frustration. In my mind this was an insanely stupid decision but I have been told more than once that my point of view is wrong and it won’t be changed back.

    The implicit decoding as part of a codec operation looks like this:

    >>> "foo".encode('utf-8')'foo'

    Here the string is obviously a byte-string. We ask it to encode to UTF-8. This by itself makes no sense because the UTF-8 codec encodes from Unicode to UTF-8 bytes. So how does this work? It works because the UTF-8 codec sees that the object is not a Unicode string and first performs a coercion to Unicode through the default codec. Since "foo" is ASCII only and the default encoding is ASCII, this coercion will succeed and then the resulting u"foo" string will be encoded through UTF-8.
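
In other words, on Python 2 the call above behaves roughly like this explicit two-step chain (my reconstruction of the implicit coercion):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> "foo".decode('ascii').encode('utf-8')   # what "foo".encode('utf-8') does implicitly
'foo'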

Codec System

So you now know that Python 2 has two ways to represent strings: in bytes and in Unicode. The conversion between those two happens by using the Python codec system. However the codec system does not enforce that a conversion always needs to take place between Unicode and bytes or the other way round. A codec can implement a transformation between bytes and bytes, or Unicode and Unicode. In fact, the codec system itself can implement a conversion between any Python types. You could have a JSON codec that decodes from a string into a complex Python object if you so desire.
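
Such a codec is easy to register yourself on Python 2. Here is a minimal sketch of that hypothetical JSON codec (the name 'myjson' and the helpers are made up for illustration):

import codecs
import json

def _json_decode(input, errors='strict'):
    # "decode" a byte-string into an arbitrary Python object
    return json.loads(input), len(input)

def _json_encode(input, errors='strict'):
    return json.dumps(input), 1

def _search(name):
    if name == 'myjson':
        return codecs.CodecInfo(_json_encode, _json_decode, name='myjson')

codecs.register(_search)

print codecs.lookup('myjson').decode('{"hello": [1, 2, 3]}')[0]
# {u'hello': [1, 2, 3]}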

That this might cause issues at one point has been understood from the very start. There is a codec called 'undefined' which can be set as default encoding, in which case any string coercion is disabled:

>>> import sys
>>> sys.setdefaultencoding('undefined')
>>> "foo" + u"bar"
Traceback (most recent call last):
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding

This is implemented as a codec that raises errors for any operation. The sole purpose of that module is to disable the implicit coercion.

So how did Python 3 fix this? Python 3 removed all codecs that don’t go from bytes to Unicode or vice versa, and removed the now useless .encode() method on bytes and .decode() method on strings. Unfortunately that turned out to be a terrible decision because there are many, many codecs that are incredibly useful. For instance it’s very common to decode with the hex codec in Python 2:

>>> "\x00\x01".encode('hex')'0001'

While you might argue that this particular case can also be handled by a module like binascii, there is a deeper problem with that, which is that the codec module is also separately available. For instance libraries implementing reading from sockets used the codec system to perform partial decoding of zlib streams:

>>> import codecs
>>> decoder = codecs.getincrementaldecoder('zlib')('strict')
>>> decoder.decode('x\x9c\xf3H\xcd\xc9\xc9Wp')
'Hello '
>>> decoder.decode('\xcdK\xceO\xc9\xccK/\x06\x00+\xad\x05\xaf')
'Encodings'

This was eventually recognized and Python 3.3 restored those codecs. Now however we’re in the land of user confusion again because these codecs don’t provide the meta information before the call about what types they can deal with. Because of this you can now trigger errors like this on Python 3:

>>> "Hello World".encode('zlib_codec')Traceback (most recent call last):  File "", line 1, in TypeError: 'str' does not support the buffer interface

(Note that the codec is now called zlib_codec instead of zlib because Python 3.3 does not have the old aliases set up.)
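
For what it’s worth, the bytes-to-bytes direction does work on Python 3 if you bypass the methods and go through the codecs module functions directly (a sketch):

>>> import codecs
>>> compressed = codecs.encode(b'Hello World', 'zlib_codec')
>>> codecs.decode(compressed, 'zlib_codec')
b'Hello World'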

So given the current state of Python 3.3, what exactly would happen if we got the .encode() method on byte strings back, for instance? This is easy to test, even without having to hack the Python interpreter. Let’s just settle for a function with the same behavior for the moment:

import codecs

def encode(s, name, *args, **kwargs):
    codec = codecs.lookup(name)
    rv, length = codec.encode(s, *args, **kwargs)
    if not isinstance(rv, (str, bytes, bytearray)):
        raise TypeError('Not a string or byte codec')
    return rv

Now we can use this as a replacement for the .encode() method we had on byte strings:

>>> b'Hello World'.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
>>> encode(b'Hello World', 'latin1')
Traceback (most recent call last):
  File "<stdin>", line 4, in encode
TypeError: Can't convert 'bytes' object to str implicitly

Oha! Python 3 can already deal with this. And we get a nice error. I would even argue that “Can’t convert ‘bytes’ object to str implicitly” is a lot nicer than “’bytes’ object has no attribute ‘encode’”.

Why do we still not have those encoding methods back? I really don’t know and I no longer care either. I have been told multiple times now that my point of view is wrong and I don’t understand beginners, or that the “text model” has been changed and my request makes no sense.

Byte-Strings are Gone

Aside from the codec system regression there is also the case that all text operations now are only defined for Unicode strings. In a way this seems to make sense, but it does not really. Previously the interpreter had implementations for operations on byte strings and Unicode strings. This was pretty obvious to the programmer, as custom objects had to implement both __str__ and __unicode__ if they wanted to be formatted into either. Again, there was implicit coercion going on which confused newcomers, but at least we had the option for both.
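
On Python 2 that looked roughly like this (a minimal sketch; the class is made up):

class User(object):
    def __init__(self, username):
        self.username = username        # unicode

    def __unicode__(self):
        # used by unicode(obj) and u'%s' formatting
        return self.username

    def __str__(self):
        # used by str(obj) and '%s' formatting; returns a byte-string
        return self.username.encode('utf-8')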

Why was this useful? Because for instance if you write low-level protocols you often need to deal with formatting numbers out into byte strings.
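
On Python 2 this just works because str is a byte-string; on Python 3 (as of 3.3) there is no formatting for bytes at all (the header below is just an illustration):

# Python 2: formatting straight into a byte-string
>>> "Content-Length: %d\r\n" % 42
'Content-Length: 42\r\n'

# Python 3.3: no equivalent for bytes
>>> b"Content-Length: %d\r\n" % 42
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'int'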

Python’s own version control system is still not on Python 3 because, for years now, the Python team has not brought back string formatting for bytes.

This is getting ridiculous now though, because it turned out that the model chosen for Python 3 just does not work in reality. For instance in Python 3 the developers just “upgraded” some APIs to Unicode only, making them completely useless for real-world situations. For instance you could no longer parse byte-only URLs with the standard library; the implicit assumption was that every URL is Unicode (for that matter, you could not handle non-Unicode mails any more either, completely ignoring that binary attachments exist).

This was fixed obviously, but because byte strings are gone, the URL parsing library now ships two implementations: one for Unicode strings and one for byte objects. Two implementations behind the same function though, and the return value is vastly different:

>>> from urllib.parse import urlparse
>>> urlparse('http://www.google.com/')
ParseResult(scheme='http', netloc='www.google.com',
            path='/', params='', query='', fragment='')
>>> urlparse(b'http://www.google.com/')
ParseResultBytes(scheme=b'http', netloc=b'www.google.com',
                 path=b'/', params=b'', query=b'', fragment=b'')

Looks similar? Not at all, because they are made of different types. One is a tuple of strings, the other is more like a tuple of arrays of integers. I have written about this before already and it still pains me. It makes writing code for Python incredibly frustrating now or hugely inefficient, because you need to go through multiple encode and decode steps. Aside from that, it’s really hard to write fully functional code now. The idea that everything can be Unicode is nice in theory, but totally not applicable for the real world.
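
In practice, code that has to accept both ends up normalizing by hand, which is exactly the kind of encode/decode shuffle I mean (a sketch; the ASCII assumption is mine):

from urllib.parse import urlparse

def get_host(url):
    # the same function hands back str or bytes pieces depending on
    # what went in, so every caller ends up normalizing manually
    netloc = urlparse(url).netloc
    if isinstance(netloc, bytes):
        netloc = netloc.decode('ascii')   # assuming an ASCII hostname
    return netloc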

Python 3 is riddled with weird workarounds now for situations where you cannot use Unicode strings, and for someone like me, who has to deal with those situations a lot, it’s ridiculously annoying.

Our Workarounds Break

The Unicode support in 2.x was not perfect, far from it. There were missing APIs and problems left and right, but we as programmers made it work. Unfortunately many of the ways in which we made it work do not transfer well to Python 3 any more, and some of the APIs would have had to be changed to work well on Python 3.

My favourite example now is the file streams which, like before, are either text or bytes, but there is no way to reliably figure out which one is which. The trick which I helped to popularize is to read zero bytes from the stream to figure out which type it is. Unfortunately those workarounds don’t work reliably either. For instance passing a urllib request object to Flask’s JSON parse function breaks on Python 3 but works on Python 2 as a result of this:

>>> from urllib.request import urlopen
>>> r = urlopen('https://pypi.python.org/pypi/Flask/json')
>>> from flask import json
>>> json.load(r)
Traceback (most recent call last):
  File "decoder.py", line 368, in raw_decode
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: No JSON object could be decoded
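
For reference, the zero-read trick mentioned above looks roughly like this on Python 3 (a sketch; as the traceback shows, not every stream object plays along):

def is_text_stream(stream):
    # reading zero bytes/characters consumes nothing, but the type of
    # the returned empty object tells text and byte streams apart
    return isinstance(stream.read(0), str)
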
The Outlook

There are many more problems with Python 3’s Unicode support than just those. I started unfollowing Python developers on Twitter because I got so fed up with having to read about how amazing Python 3 is, which is in such conflict with my own experiences. Yes, lots of things are cool in Python 3, but the core flow of dealing with Unicode and bytes is not.

(The worst of all of this is that many of the features in Python 3 which are genuinely cool could just as well work on Python 2. Things like yield from, nonlocal, SNI SSL support etc.)

In light of only about 3% of all Python developers using Python 3 properly, and developers proudly declaring on Twitter that “the migration is going as planned”, I got so incredibly frustrated that I nearly published a multi-page rant about my experience with Python 3 and how we should kill it.

I won’t do that now, but I do wish Python 3 core developers would become a bit more humble. For 97% of us, Python 2 is our beloved world for years to come, and telling us constantly how amazing Python 3 is, is not just painful, it’s also wrong in light of the many regressions. With people starting to discuss Python 2.8, Stackless Python preparing a new release with new features, and these bad usage numbers, I don’t know what failure is, if not that.
