Wednesday, January 03, 2007

Unicode headaches ... and a solution

Johannes and I restarted work on Crunchy over the holidays, after a hiatus of a few months. Hopefully, we'll be in a position to do a new release soon (just a few more features...). Among the changes, Crunchy now has a proper embedded editor, EditArea. See this example to get an idea of what EditArea can do. Actually, the Python support has since been improved by Christophe Dolivet, the creator of EditArea, prompted by a few suggestions of my own, but the new version has not been made publicly available yet. It is, however, already used in the development version of Crunchy and, with a few additions of my own, will definitely be showcased during my Pycon presentation.

Other Crunchy changes include proper handling of English and French translations. Now, since EditArea's tooltips are available in many more languages, I thought I should use a more complete encoding (like utf-8) rather than the one I normally use (latin-1). After all, if rur-ple has been found useful enough to be adapted to 6 languages (with Italian and Chinese in the pipeline), I figured that Crunchy is likely to eventually see the same kind of adaptation.

However, I soon encountered a most puzzling bug. All translated strings were properly rendered by Firefox except for those coming out of an interpreter session (some tracebacks have been customized and translated into French as well as being available in English). When French was selected as the default language, whenever an accented letter, like à, was supposed to appear in Python output, a ? was displayed instead.

After trying all kinds of encoding/decoding tricks over the course of a few hours, I remembered that I had set the default system encoding on my computer to latin-1. I decided to change it to utf-8 and, sure enough, everything worked as expected. Success at last!

However, this was only the beginning of my problems. My favourite editor, SPE, stopped working. I also tried to run rur-ple, which failed miserably. [Idle, on the other hand, which I use very rarely, was still working perfectly.] Clearly, changing the site default encoding was not an appropriate solution: I certainly could not depend on having a Crunchy user set his or her site customization to utf-8. A different approach was needed.

After reverting to latin-1 as my default Python site encoding (so that paths that included my name, André, were properly read by SPE) and poring over the code, I finally figured out a more general solution.

Whenever Crunchy executes some Python code written by a user (or provided by a tutorial writer), it starts a Python interpreter in a separate thread. This Python interpreter uses the default system encoding for all its computations. When the result needs to be sent back by the Crunchy server to Firefox, it needs to have its encoding changed as follows:

result = result.decode(sys.getdefaultencoding()).encode('utf-8')
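The same step can be sketched as a small standalone helper. This is only an illustration, not Crunchy's actual code: the name to_utf8 and the explicit source_encoding parameter are hypothetical, and the sketch is written for Python 3, where the interpreter output must be a bytes object for decode to apply (in the Python 2 line above, result is an ordinary byte string).

```python
def to_utf8(raw, source_encoding="latin-1"):
    """Re-encode a byte string from source_encoding to UTF-8.

    Hypothetical helper: 'raw' stands in for the bytes produced by the
    interpreter session, and 'source_encoding' for whatever
    sys.getdefaultencoding() reported on the machine that produced them.
    """
    text = raw.decode(source_encoding)  # bytes -> text
    return text.encode("utf-8")         # text -> UTF-8 bytes for the browser


# The letter "à" is the single byte 0xE0 in latin-1,
# but the two bytes 0xC3 0xA0 in UTF-8.
sample = b"\xe0"
converted = to_utf8(sample)
```

Passing the source encoding explicitly, rather than reading it from sys.getdefaultencoding() inside the helper, is a choice made here so that the sketch behaves the same on any machine.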

Of course, in retrospect, it all makes sense, but it did stump me for quite a few hours; perhaps the information included here will save someone else a few minutes or hours.


Unknown said...

I haven't looked at the rest of your code, so I might be completely off here, but this somehow looks wrong:

result = result.decode(sys.getdefaultencoding()).encode('utf-8')

The reason I say this is that you're decoding and encoding in the same place. Since Python's unicode support is so good, it's generally a good idea to decode any user input you get to unicode as early as possible, and to encode only as late as possible when outputting strings. You're doing complicated web UI stuff here, so it may be that you're not doing anything with 'result' between input and returning it to the browser, but if you are, the string should already have been decoded by the time it gets to this line. Otherwise this will bite you any time you try to do anything with the string, even simple concatenation.
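The decode-early, encode-late pattern described here might look roughly like the sketch below. It is an illustration only, written for Python 3; the function name handle_output, the label being prepended and the latin-1 default are all assumptions, not anything taken from Crunchy.

```python
def handle_output(raw, source_encoding="latin-1"):
    """Hypothetical pipeline: decode on input, encode on output."""
    # Decode once, at the boundary where bytes enter the program.
    text = raw.decode(source_encoding)

    # Everything in between works on text, so concatenation with
    # other non-ASCII text is safe ("R\u00e9sultat" is "Résultat").
    text = "R\u00e9sultat : " + text

    # Encode once, at the boundary where data leaves for the browser.
    return text.encode("utf-8")


page_bytes = handle_output(b"\xe0")  # b"\xe0" is "à" in latin-1
```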

André Roberge said...

This is a good point in general. However, in this case, I take the result straight from an interpreter session and inject it into a web browser. The line of code above is the one step inserted in between.