Thursday, December 27, 2007

Crunchy and Python 3.0a2

Continuing my experiment of adapting Crunchy to Python 3.0, I managed to get Crunchy to start with Python 3.0a2 and to run some code from the editor, but not from the interpreter or the doctest. Most of the problems I have involve bytes-to-string and string-to-bytes conversion. As Guido van Rossum wrote last June:
  • We're switching to a model known from Java: (immutable) text strings are Unicode, and binary data is represented by a separate mutable "bytes" data type. In addition, the parser will be more Unicode-friendly: the default source encoding will be UTF-8, and non-ASCII letters can be used in identifiers
Later, in a comment on that post, we find:
  • > In your presentation last night you had one slide which
    > talked about the "str" vs "bytes" types in Python 3000. On
    > the bottom of that slide was something like:
    >
    > str(b"asdf") == "b'asdf'"
    >
    > However, in discussing this slide (very briefly) you said
    > that a type constructors like "str" could be used to do
    > conversion. It seems like "str" is behaving more like
    > "repr" in this case, which seems unusual and less useful
    > to me. Was this a typo, or is this actually the way it's
    > supposed to work? What's the rationale?

    To be honest, this is an open issue. The slide was wrong compared to the current implementation; but the implementation currently defaults to utf8 (so str(b'a') == 'a'), which is not right either. The problem is that there are conflicting requirements: str() of any object should ideally always return something, but we don't want str() to assume a specific default encoding.

    To be continued...
This change seems innocuous enough...

As a web server, Crunchy exchanges information with the browser as "binary data", or "bytes". As a generalized Python interpreter, Crunchy manipulates that information as "strings". The "bytes" implementation appears to be much more complete in Python 3.0a2 than it was in Python 3.0a1, and this is the source of many problems.

For example, the path to which a Python file should be saved and the file's content are sent from the browser to Crunchy as follows:

'/Users/andre/.crunchy/temp.py_::EOF::_from Tkinter import *\nroot = Tk()\nw = Label(root, text="Crunchy!")\nw.pack()\nroot.mainloop()'

This is sent as a binary stream, which needs to be converted to the string written above. The conversion is done via str(...). Using Python 3.0a1 (and 2.4 and 2.5), the result was as above; splitting the string on the "_::EOF::_" separator gave the following:

['/Users/andre/.crunchy/temp.py', 'from Tkinter import *\nroot = Tk()\nw = Label(root, text="Crunchy!")\nw.pack()\nroot.mainloop()']

Now, with Python 3.0a2, it gets slightly more complicated. The converted string acquires a "b'" prefix and a closing quote, and its newlines come out escaped (this is the behaviour discussed in the comment from Guido's blog quoted above). After splitting, the result is

["b'/Users/andre/.crunchy/temp.py", 'from Tkinter import *\\nroot = Tk()\\nw = Label(root, text="Crunchy!")\\nw.pack()\\nroot.mainloop()\'']

So, we now have a first string with a "b'" prefix embedded in it, and a second one without. It seems that each case will have to be handled carefully on its own. And I suspect more problems will show up as we get closer to the final 3.0 release.
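To make the difference concrete, here is a minimal sketch of the two conversions. It is written against the released Python 3 behaviour rather than 3.0a2 specifically, the payload is a shortened version of the one above, and the names SEPARATOR, raw, as_repr and as_text are mine, not Crunchy's. Decoding with an explicit encoding (I assume UTF-8 here) is one way to get back the text that str(...) used to give:

```python
# Shortened, hypothetical version of the payload shown above.
SEPARATOR = "_::EOF::_"
raw = b'/Users/andre/.crunchy/temp.py_::EOF::_from Tkinter import *\nroot = Tk()'

# str() on a bytes object returns its repr-like form: the b'...' wrapper
# is included and the newline is written out as a backslash followed by n.
as_repr = str(raw)
repr_parts = as_repr.split(SEPARATOR)

# decode() with an explicit encoding (assumed UTF-8) returns the real text.
as_text = raw.decode("utf-8")
path, content = as_text.split(SEPARATOR)

print(repr_parts)
print([path, content])
```

With the decoded version, the path and the content come out clean and no case-by-case stripping of prefixes and quotes is needed.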

I know, I know, I'm really not following the "recommended" practice, as quoted on Guido's blog. I should probably wait for Python 2.6 to come out first. Then I should have complete unit test coverage and use the conversion tool to create a Python 3.0 version... However, I am not convinced that the conversion tool will be smart enough to know when a function (that I write) expects a "str" object and when it expects a "bytes" one. Furthermore, the few unit tests I had worked fine under both Python 2.5 and 3.0... but some functions that I had written with the expectation that they would receive string arguments did not work in "production code", as they were getting bytes arguments. And this failed completely silently...
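Here is a sketch of one way such a silent failure can arise. It is not necessarily what happened in Crunchy: the functions is_python_command and ensure_text are hypothetical, and the behaviour shown is that of released Python 3. Comparing a bytes object with a str raises no error; it simply returns False, so a function written for strings quietly takes the wrong branch:

```python
def is_python_command(command):
    # Hypothetical function written with str arguments in mind.
    # If bytes arrive, the comparison is simply False; nothing is raised.
    return command == "python"


def ensure_text(value, encoding="utf-8"):
    # Hypothetical helper: normalise to str at the boundary where data
    # comes in from the browser, assuming it is UTF-8 encoded.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(is_python_command("python"))                 # str input
print(is_python_command(b"python"))                # bytes input, no error
print(is_python_command(ensure_text(b"python")))   # normalised first
```

Unit tests that only ever pass str arguments would not catch this, which matches what I saw: the tests pass, and the "production code" misbehaves.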

If I had to give some advice to someone about creating Python programs that can work with both Python 2.x and Python 3.x, I would say, like Guido: don't. :-) Unless of course you are like me and are doing this for fun and to learn about the differences between Python 2.x and 3.x along the way. But then, "be prepared for the unexpected", such as the following: turning on a few print statements (via a "debug flag") can break some code; turn them off and the code works again... Yes, it did happen to me - I still have to figure out how...
