Porting to Python 3 Redux

After a very painful porting experience with Jinja2 to Python 3 Ibasically left the project idling around for a while because I was tooafraid of breaking Python 3 support. The approach I used was one codebasethat was written in Python 2 and translated with 2to3 to Python 3 oninstallation. The unfortunate side-effect of that is that any change youdo requires about a minute of translation which destroys all youriteration speeds. Thankfully it turns out that if you target the rightversions of Python you can do much better.

Thomas Waldmann from the MoinMoin project started running Jinja2 throughmy python-modernizewith the right parameters and ended up with a unified codebase that runson 2.6, 2.7 and 3.3. With a bit of cleanup afterwards we were able tocome up with a nice codebase that runs in all versions of Python and alsolooks like regular Python code for the most time.

Motivated by the results from this I went through the code a few moretimes and also started migrating some other code over to experiment morewith unified codebases.

This is a selection of some tips and tricks I can now share in regards toaccomplishing something similar.

Drop 2.5, 3.1 and 3.2

This is the most important tip. Dropping 2.5 by now is more than possiblesince very few people are still on it and dropping 3.1 and 3.2 are nobrainers anyways considering the low adoption of Python 3 so far. But whywould you drop those versions? The basic answer is that 2.6 and 3.3 havea lot of overlapping syntax and features that allow for code that workswell with both:

Compatible string literals. 2.6 and 3.3 support the same syntax forstrings. You can use 'foo' to refer to a native string (bytestring on 2.x and a Unicode string on 3.x), u'foo' to refer to aUnicode string and b'foo' to refer to a bytestring or bytesobject.

Compatible print syntax. In case you have a few print statementssitting around you can from __future__ import print_function andstart using the print function without having to bind it to adifferent name or suffering from other inconsistencies.

Compatible exception catching syntax. Python 2.6 introduced exceptException as e which is also the syntax used on 3.x to catch downexceptions.

Class decorators are available. They are incredible useful toautomatically correct moved interfaces without leaving a footprint inthe class structure. For instance they can be used to automaticallyrename the iteration method from next to __next__ or__str__ to __unicode__ for Python 2.x.

Builtin next() function to invoke __next__ or next. Thisis helpful because they are performing about the same speed as callingthe method directly so you don’t pay much of a performance penaltycompared to putting runtime checks into places or making a wrapperfunction yourself.

Python 2.6 added the bytearray type which has the same interfacein that version of Python as the one in 3.3. This is useful becausewhile Python 2.6 lacks the Python 3.3 bytes object it does havea builtin with that name but that’s just another name for strwhich has vastly different behavior.

Python 3.3 reintroduces bytes-to-bytes and string-to-string codecsthat were broken in 3.1 and 3.2. Unfortunately the interface for themis clunkier now and the aliases are missing, but it’s much closer towhat we had in 2.x than before.

This is particularly useful if you did stream based encoding anddecoding. That functionality was plain missing between 3.0 up until3.3.

Yes, the six module can get you quite far, but don’t underestimate theimpact of looking at nice code. With the Python 3 port I basically lostinterest in maintaining Jinja2 because the codebase started to frustrateme. Back then a unified codebase was looking ugly and had a performanceimpact (six.b('foo') and six.u('foo') everywhere) or was plaguedunder the bad iteration speeds of 2to3. Not having to deal with any ofthat brings the joy back. Jinja2’s codebase now looks like very clean andyou have to find the Python 2/3 compatibility support. Very few paths inthe code actually do something like if PY2:.

The rest of this article assumes that these are the Python versions youwant to support. Also attempting to support Python 2.5 is a very painfulthing to do and I strongly recommend against it. Supporting 3.2 ispossible if you are willing to wrap all your strings in function callswhich I don’t recommend doing for aesthetic and performance reasons.

Skip Six

Six is a pretty neat library and this is where the Jinja2 port started outwith. There however at the end of the day there is not much in six thatactually is required for getting a Python 3 port running and a few thingsare missing. Six is definitely required if you want to support Python 2.5but from 2.6 or later there really is not much of a reason to use six.Jinja2 ships a _compat module that contains the few helpers required.Including a few lines of non Python 3 code the whole compatibility moduleis less than 80 lines of code.

This saves you from the troubles where users might expect a differentversion of six because of another library or pulling in another dependencyinto your project.

Start with Modernize

To start with the port the python-modernize library is a good start.It is a version of 2to3 that produces code that runs in either. Whileit’s pretty buggy still and the default options are not particularlygreat, it can get you quite far with regards to doing the boring parts foryou. You will still need to go over the result and clean up some importsand ugly results.

Fix your Tests

Before you do anything else walk through your tests and make sure thatthey still make sense. A lot of the problems in the Python standardlibrary in 3.0 and 3.1 came from the fact that the behavior of thetestsuite changed through the conversion to Python 3 in unintended ways.

Write a Compatibility Module

So you’re going to skip six, can you live without helpers? The answer is“no”. You will still need a small compatibility module but that is smallenough that you can just keep it in your package. Here are some basicexamples of what such a compatibility module could look like:

import sysPY2 = sys.version_info[0] == 2if not PY2:    text_type = str    string_types = (str,)    unichr = chrelse:    text_type = unicode    string_types = (str, unicode)    unichr = unichr

The exact contents of that module will depend on how much actually changedfor you. In case of Jinja2 I put a whole bunch of functions in there.For instance it contains ifilter, imap and similar itertools functionsthat became builtins in 3.x. (I stuck with the Python 2.x functions tomake it clear for the reader of the code that the iterator behavior isintended and not a bug).

Test for 2.x not 3.x

At one point there will be the requirement to check if you are executingon 2.x or 3.x. In that cases I would recommend checking for Python 2first and putting Python 3 into your else branch instead of the other wayround. That way you will have less ugly surprises when a Python 4 comesaround at one point.

Good:

if PY2:    def __str__(self):        return self.__unicode__().encode('utf-8')

Less ideal:

if not PY3:    def __str__(self):        return self.__unicode__().encode('utf-8')
String Handling

The biggest change in Python 3 is without doubt the changes on the Unicodeinterface. Unfortunately these changes are very painful in some placesand also inconsistently handled throughout the standard library. Themajority of the time porting will clearly be wasted on this topic. Thistopic is a whole article by itself but here is a quick cheat sheet forporting that Jinja2 and Werkzeug follow:

'foo' always refers to what we call the native string of theimplementation. This is the string used for identifiers, sourcecode,filenames and other low-level functions. Additionally in 2.x it’spermissible as a literal in Unicode strings for as long as it’slimited to ASCII only characters.

This property is very useful for unified codebases because the generaltrend with Python 3 is to introduce Unicode in some interfaces thatpreviously did not support it, but never the inverse. Since nativestring literals “upgrade” to Unicode but still somewhat supportUnicode in 2.x this string literal is very flexible.

For instance the datetime.strftime function strictly does notsupport Unicode in Python 2 but is Unicode only in 3.x. Because inmost cases the return value on 2.x however was ASCII only things likethis work really well in 2.x and 3.x:

>>> u'Current time: %s' % datetime.datetime.utcnow().strftime('%H:%M')u'Current time: 23:52'

The string passed to strftime is native (so bytes in 2.x and Unicodein 3.x). The return value is a native string again and ASCII only.As such both on 2.x and 3.x it will be a Unicode string once stringformatted.

u'foo' always refers to a Unicode string. Many libraries alreadyhad pretty excellent Unicode support in 2.x so that literal should notbe surprising to many.

b'foo' always refers to something that can hold arbitrary bytes.Since 2.6 does not actually have a bytes object like Python 3.3has and Python 3.3 lacks an actual bytestring the usefulness of thisliteral is indeed a bit limited. It becomes immediately more usefulwhen paired with the bytearray object which has the same interfaceon 2.x and 3.x:

>>> bytearray(b' foo ').strip()bytearray(b'foo')

Since it’s also mutable it’s quite efficient at modifying raw bytesand you can trivially convert it to something more conventional bywrapping the final result in bytes() again.

In addition to these basic rules I also add text_type, unichrand string_types variables to my compatibility module as shown above.With those available the big changes are:

isinstance(x, basestring) becomes isinstance(x, string_types).isinstance(x, unicode) becomes isinstance(x, text_type).isinstance(x, str) with the intention of catching bytes becomesisinstance(x, bytes) or isinstance(x, (bytes, bytearray)).

I also created a implements_to_string class decorator that helpsimplementing classes with __unicode__ or __str__ methods:

if PY2:    def implements_to_string(cls):        cls.__unicode__ = cls.__str__        cls.__str__ = lambda x: x.__unicode__().encode('utf-8')        return clselse:    implements_to_string = lambda x: x

The idea is that you just implement __str__ on both 2.x and 3.x andlet it return Unicode strings (yes, looks a bit odd in 2.x) and thedecorator automatically renames it to __unicode__ for 2.x and adds a__str__ that invokes __unicode__ and encodes the return value toutf-8. This pattern has been pretty common in the past with 2.x modules.For instance Jinja2 and Django use it.

Here is an example for the usage:

@implements_to_stringclass User(object):    def __init__(self, username):        self.username = username    def __str__(self):        return self.username
Metaclass Syntax Changes

Since Python 3 changed the syntax for defining the metaclass to use in anincompatible way this makes porting a bit harder than it should be. Sixhas a with_metaclass function that can work around this issue but itgenerates a dummy class that shows up in the inheritance tree. For Jinja2I was not happy enough with that solution and modified it a bit. Theexternal API is the same but the implementation uses a temporary classto hook in the metaclass. The benefit is that you don’t have to pay aperformance penalty for using it and your inheritance tree stays nice.

The code is a bit hard to understand. The basic idea is exploiting theidea that metaclasses can customize class creation and are picked by bythe parent class. This particular implementation uses a metaclass toremove its own parent from the inheritance tree on subclassing. The endresult is that the function creates a dummy class with a dummy metaclass.Once subclassed the dummy classes metaclass is used which has aconstructor that basically instances a new class from the original parentand the actually intended metaclass. That way the dummy class and dummymetaclass never show up.

This is what it looks like:

def with_metaclass(meta, *bases):    class metaclass(meta):        __call__ = type.__call__        __init__ = type.__init__        def __new__(cls, name, this_bases, d):            if this_bases is None:                return type.__new__(cls, name, (), d)            return meta(name, bases, d)    return metaclass('temporary_class', None, {})

And here is how you use it:

class BaseForm(object):    passclass FormType(type):    passclass Form(with_metaclass(FormType, BaseForm)):    pass
Dictionaries

One of the more annoying changes in Python 3 are the changes on thedictionary iterator protocols. In Python 2 all dictionaries hadkeys(), values() and items() that returned lists anditerkeys(), itervalues() and iteritems() that returnediterators. In Python 3 none of that exists any more. Instead they werereplaced with new methods that return view objects.

keys() returns a key view which behaves like some sort of read-onlyset, values() which returns a read-only container and iterable (not aniterator!) and items() which returns some sort of read-only set-likeobject. Unlike regular sets it however can also point to mutable objectsin which case some methods will fail at runtime.

On the positive side a lot of people missed that views are not iteratorsso in many cases you can just ignore that. Werkzeug and Django implementa bunch of custom dictionary objects and in both cases the decision wasmade to just ignore the existence of view objects and let keys() andfriends return iterators.

This is currently the only sensible thing to do due to limitations of thePython interpreter. There are a few problems with it:

The fact that the views are not iterators by themselves mean that inthe average case you create a temporary object for no good reason.The set-like behavior of the builtin dictionary views cannot bereplicated in pure Python due to limitations in the interpreter.Implementing views for 3.x and iterators for 2.x would mean a lot ofcode duplication.

This is what the Jinja2 codebase went with for iterating overdictionaries:

if PY2:    iterkeys = lambda d: d.iterkeys()    itervalues = lambda d: d.itervalues()    iteritems = lambda d: d.iteritems()else:    iterkeys = lambda d: iter(d.keys())    itervalues = lambda d: iter(d.values())    iteritems = lambda d: iter(d.items())

For implementing dictionary like objects a class decorator can becomeuseful again:

if PY2:    def implements_dict_iteration(cls):        cls.iterkeys = cls.keys        cls.itervalues = cls.values        cls.iteritems = cls.items        cls.keys = lambda x: list(x.iterkeys())        cls.values = lambda x: list(x.itervalues())        cls.items = lambda x: list(x.iteritems())        return clselse:    implements_dict_iteration = lambda x: x

In that case all you need to do is to implement the keys() and friendsmethod as iterators and the rest happens automatically:

@implements_dict_iterationclass MyDict(object):    ...    def keys(self):        for key, value in iteritems(self):            yield key    def values(self):        for key, value in iteritems(self):            yield value    def items(self):        ...
General Iterator Changes

Since iterators changed in general a bit of help is needed to make thispainless. The only change really is the transition from next() to__next__. Thankfully this is already transparently handled. The onlything you really need to change is to go from x.next() to next(x)and the language does the rest.

If you plan on defining iterators a class decorator again becomes helpful:

if PY2:    def implements_iterator(cls):        cls.next = cls.__next__        del cls.__next__        return clselse:    implements_iterator = lambda x: x

For implementing this class just name the iteration step method__next__ in all versions:

@implements_iteratorclass UppercasingIterator(object):    def __init__(self, iterable):        self._iter = iter(iterable)    def __iter__(self):        return self    def __next__(self):        return next(self._iter).upper()
Transformation Codecs

One of the nice features of the Python 2 encoding protocol was that it wasindependent of types. You could register an encoding that would transforma csv file into a numpy array if you would have preferred that. Thisfeature however was not well known since the primary exposed interface ofencodings was attached to string objects. Since they got stricter in 3.xa lot of that functionality was removed in 3.0 but later reintroduced in3.3 again since it proved useful. Basically all codecs that did notconvert between Unicode and bytes or the other way round were unavailableuntil 3.3. Among those codecs are the hex and base64 codec.

There are two use cases for those codecs: operations on strings andoperations on streams. The former is well known as str.encode() in2.x but now looks different if you want to support 2.x and 3.x due to thechanged string API:

>>> import codecs>>> codecs.encode(b'Hey!', 'base64_codec')'SGV5IQ==\n'

You will also notice that the codecs are missing the aliases in 3.3 whichrequires you to write 'base64_codec' instead of 'base64'.

(These codecs are preferable over the functions in the binascii modulebecause they support operations on streams through the incrementalencoding and decoding support.)

Other Notes

There are still a few places where I don’t have nice solutions for yet orare generally annoying to deal with but they are getting fewer. Some ofthem are unfortunately now part of the Python 3 API are hard to discoveruntil you trigger an edge case.

Filesystem and file IO access continues to be annoying to deal with onLinux due to it not being based on Unicode. The open() functionand the filesystem layer have dangerous platform specific defaults.If I SSH into a en_US machine from an de_AT one for instancePython loves falling back to ASCII encoding for both file system andfile operations.

Generally I noticed the most reliable way to do text on Python 3 thatalso works okay on 2.x is just to open files in binary mode andexplicitly decode. Alternatively you can use the codecs.open orio.open function on 2.x and the builtin open on Python 3 withan explicit encoding.

URLs in the standard library are represented incorrectly as Unicodewhich causes some URLs to not be dealt with correctly on 3.x.

Raising exceptions with a traceback object requires a helper functionsince the syntax changed. This is very uncommon in general and easyenough to wrap. Since the syntax changed this is one of thesituations where you will have to move code into an exec block:

if PY2:    exec('def reraise(tp, value, tb):\n raise tp, value, tb')else:    def reraise(tp, value, tb):        raise value.with_traceback(tb)

The previous exec trick is useful in general if you have some codethat depends on different syntax. Since exec itself has a differentsyntax now you won’t be able to use it to execute something against anarbitrary namespace. This is not a huge deal because eval withcompile can be used as a drop-in that works on both versions.Alternatively you can bootstrap an exec_ function through execitself.

exec_ = lambda s, *a: eval(compile(s, '', 'exec'), *a)

If you have a C module written on top of the Python C API: shootyourself. There is no tooling available for that yet from what I knowand so much stuff changed. Take this as an opportunity to ditch theway you build modules and redo it on top of cffi or ctypes. Ifthat’s not an option because you’re something like numpy then you willjust have to accept the pain. Maybe try writing some abomination ontop of the C-preprocessor that makes porting easier.

Use tox for local testing. Beingable to run your tests against all python versions at once is veryhelpful and will find you a lot of issues.

Outlook

Unified codebases for 2.x and 3.x are definitely within reach now. Themajority of the porting time will still be spend trying to figure out howAPIs are going to behave with regards to Unicode and interoperability withother modules that might have changed their API. In any case if you wantto consider porting libraries don’t bother with versions outside below2.5, 3.0-3.2 and it will not hurt as much.

Porting to Python 3 Redux

相关文章:

你感兴趣的文章:

标签云: