Sunday, January 17, 2010

Profiling adventures and cython - Faster drawing and conclusion

In the last blog post, I made use of cython to speed up some calculations involving the iterative equation defining the Mandelbrot set. From the initial profiling run with 1000 iterations in the second post to the last run in the previous post, the profiling time went from 575 seconds down to 21 seconds, which is a reduction by a factor of 27. This is nothing to sneer at. Yet, I can do better.
Let's start by having a sample profile run with 1000 iterations with the program as it was last time, but keeping track of more function calls in the profile.
      5441165 function calls in 20.484 CPU seconds

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
494604    8.461    0.000   14.218    0.000 Tkinter.py:2135(_create)
    11    3.839    0.349   20.315    1.847 mandel2g_cy.pyx:27(create_fractal)
494632    2.053    0.000    5.309    0.000 Tkinter.py:1046(_options)
494657    1.809    0.000    2.811    0.000 Tkinter.py:77(_cnfmerge)
494593    1.522    0.000   16.475    0.000 mandel2g_cy.pyx:21(draw_pixel)
989240    0.752    0.000    0.752    0.000 {method 'update' of 'dict' objects}
494593    0.736    0.000   14.953    0.000 Tkinter.py:2155(create_line)
989257    0.698    0.000    0.698    0.000 {_tkinter._flatten}
494632    0.315    0.000    0.315    0.000 {method 'items' of 'dict' objects}
     1    0.158    0.158    0.158    0.158 {_tkinter.create}
494641    0.131    0.000    0.131    0.000 {callable}
    37    0.009    0.000    0.009    0.000 {built-in method call}
    11    0.001    0.000   20.317    1.847 viewer2b.py:20(draw_fractal)
    12    0.000    0.000    0.000    0.000 viewer.py:60(info)
I notice that there are many function calls: over 5 millions of them. While most of them appear to take very little time, they do add up in the end. It is time to adopt a smarter drawing strategy.
Currently, a "line" is potentially drawn for each pixel. If I look at a given fractal drawing, I can see that it could be drawn using "longer lines", when consecutive pixels are to be drawn with the same colour. I can easily implement this as follows.
def create_fractal(int canvas_width, int canvas_height,
                       double min_x, double min_y, double pixel_size,
                       int nb_iterations, canvas):
    cdef int x, y, start_y, end_y
    cdef double real, imag
    cdef bint start_line

    for x in range(0, canvas_width):
        real = min_x + x*pixel_size
        start_line = False
        for y in range(0, canvas_height):
            imag = min_y + y*pixel_size
            if mandel(real, imag, nb_iterations):
                if not start_line:
                    start_line = True
                    start_y = canvas_height - y
            else:
                if start_line:
                    start_line = False
                    end_y = canvas_height - y
                    canvas.create_line(x, start_y, x, end_y, fill="black")
        if start_line:
            end_y = canvas_height - y
            canvas.create_line(x, start_y, x, end_y, fill="black")
Note that I no longer need the function draw_pixel(). The result is a reduction from 20 seconds down to 4 seconds:
      79092 function calls in 3.959 CPU seconds

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    11    3.552    0.323    3.815    0.347 mandel2h_cy.pyx:21(create_fractal)
  7856    0.150    0.000    0.250    0.000 Tkinter.py:2135(_create)
     1    0.131    0.131    0.131    0.131 {_tkinter.create}
  7884    0.037    0.000    0.091    0.000 Tkinter.py:1046(_options)
  7909    0.030    0.000    0.047    0.000 Tkinter.py:77(_cnfmerge)
  7845    0.014    0.000    0.263    0.000 Tkinter.py:2155(create_line)
 15761    0.014    0.000    0.014    0.000 {_tkinter._flatten}
 15744    0.013    0.000    0.013    0.000 {method 'update' of 'dict' objects}
    37    0.009    0.000    0.009    0.000 {built-in method call}
  7884    0.005    0.000    0.005    0.000 {method 'items' of 'dict' objects}
  7893    0.002    0.000    0.002    0.000 {callable}
    11    0.001    0.000    3.818    0.347 viewer2b.py:20(draw_fractal)
    12    0.000    0.000    0.000    0.000 viewer.py:60(info)
    22    0.000    0.000    0.001    0.000 Tkinter.py:1172(_configure)
And it is now again my own code in create_fractal() that appears to be the limiting factor. Thinking back of when I increased the number of iterations from 100 to 1000, thus only affecting the execution time of mandel(), it seemed like this might be a good place to look at for possible time improvements. Let's recall what the code looks like.
cdef inline bint mandel(double real, double imag, int max_iterations=20):
    '''determines if a point is in the Mandelbrot set based on deciding if,
       after a maximum allowed number of iterations, the absolute value of
       the resulting number is greater or equal to 2.'''
    cdef double z_real = 0., z_imag = 0.
    cdef int i

    for i in range(0, max_iterations):
        z_real, z_imag = ( z_real*z_real - z_imag*z_imag + real,
                           2*z_real*z_imag + imag )
        if (z_real*z_real + z_imag*z_imag) >= 4:
            return False
    return (z_real*z_real + z_imag*z_imag) < 4
I used a Pythonic tuple assignement to avoid the use of temporary variables. However, in a typical iteration, there will be 4 multiplications for the tuple re-assigment and two more for the "if" statement, for a total of 6. It is certainly possible to reduce the number of multiplications by using temporary variables, as follows:
cdef inline bint mandel(double real, double imag, int max_iterations=20):
    '''determines if a point is in the Mandelbrot set based on deciding if,
       after a maximum allowed number of iterations, the absolute value of
       the resulting number is greater or equal to 2.'''
    cdef double z_real = 0., z_imag = 0.
    cdef int i
    cdef double zr_sq, zi_sq, z_cross

    for i in range(0, max_iterations):
        zr_sq = z_real*z_real
        zi_sq = z_imag*z_imag
        z_cross = 2*z_real*z_imag

        z_real = zr_sq - zi_sq + real
        z_imag = z_cross + imag
        if (zr_sq + zi_sq) >= 4:
            return False
    return (zr_sq + zi_sq) < 4
So, there are now fewer multiplications to compute. Surely, this will speed up the code:
      78982 function calls in 4.888 CPU seconds

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    11    4.478    0.407    4.748    0.432 mandel2i_cy.pyx:26(create_fractal)
  7845    0.153    0.000    0.256    0.000 Tkinter.py:2135(_create)
     1    0.128    0.128    0.128    0.128 {_tkinter.create}
  7873    0.040    0.000    0.095    0.000 Tkinter.py:1046(_options)
  7898    0.031    0.000    0.048    0.000 Tkinter.py:77(_cnfmerge)
  7834    0.014    0.000    0.270    0.000 Tkinter.py:2155(create_line)
 15739    0.013    0.000    0.013    0.000 {_tkinter._flatten}
 15722    0.013    0.000    0.013    0.000 {method 'update' of 'dict' objects}
    37    0.009    0.000    0.009    0.000 {built-in method call}
  7873    0.005    0.000    0.005    0.000 {method 'items' of 'dict' objects}
  7882    0.002    0.000    0.002    0.000 {callable}
    11    0.001    0.000    4.750    0.432 viewer2b.py:20(draw_fractal)
    12    0.000    0.000    0.000    0.000 viewer.py:60(info)
     4    0.000    0.000    0.000    0.000 {posix.stat}
Alas, that is not the case, as the previous profiling run was slightly below 4 seconds. [Note that I did run each profiling test at least three times to prevent any anomalous result.] Apparently my intuition is not a very good guide when it comes to predicting how cython will be able to optimize a given function.
So far, the pictures the program has been able to produce have only been in black and white. It is time to spruce things up and add colour. To do this, we will need to make three general changes:
  1. We will modify mandel() so that it returns the number of iterations required to evaluate that a given point does not belong to the set; if it does belong, we will return -1.
  2. We will create a colour palette as a Python list. For a given number of iterations required by mandel(), we will pick a given colour, cycling through the colours from the palette.
  3. We will need to change our line drawing method so that we keep track of the colour (number of iteration) rather than simply whether or not the point is in the set ("black") or not.
The code is a bit trickier to set up than the previous version, but it uses a similar logic. To determine what colour to draw, we first calculate the colour of the first pixel, and then loop through all the others along a same line. Each time we see a colour change, it signals that we need to draw a line up to that point in the colour used up to that point ("current_colour") and assign the new colour as the new "current" one. When we reach the end of a line (column in the code...), we need to ensure that we draw the last line segment. Without further ado, here is the complete code of the revised cython module.
# mandel3cy.pyx
# cython: profile=True

import cython

def make_palette():
    '''sample coloring scheme for the fractal - feel free to experiment'''
    colours = []

    for i in range(0, 25):
        colours.append('#%02x%02x%02x' % (i*10, i*8, 50 + i*8))
    for i in range(25, 5, -1):
        colours.append('#%02x%02x%02x' % (50 + i*8, 150+i*2,  i*10))
    for i in range(10, 2, -1):
        colours.append('#00%02x30' % (i*15))
    return colours

colours = make_palette()
cdef int nb_colours = len(colours)

@cython.profile(False)
cdef inline int mandel(double real, double imag, int max_iterations=20):
    '''determines if a point is in the Mandelbrot set based on deciding if,
       after a maximum allowed number of iterations, the absolute value of
       the resulting number is greater or equal to 2.'''
    cdef double z_real = 0., z_imag = 0.
    cdef int i

    for i in range(0, max_iterations):
        z_real, z_imag = ( z_real*z_real - z_imag*z_imag + real,
                           2*z_real*z_imag + imag )
        if (z_real*z_real + z_imag*z_imag) >= 4:
            return i
    return -1

def create_fractal(int canvas_width, int canvas_height,
                       double min_x, double min_y, double pixel_size,
                       int nb_iterations, canvas):
    global colours, nb_colours
    cdef int x, y, start_y, end_y, current_colour, new_colour
    cdef double real, imag

    for x in range(0, canvas_width):
        real = min_x + x*pixel_size
        start_y = canvas_height
        current_colour = mandel(real, min_y, nb_iterations)
        for y in range(1, canvas_height):
            imag = min_y + y*pixel_size
            new_colour = mandel(real, imag, nb_iterations)

            if new_colour != current_colour:
                if current_colour == -1:
                    canvas.create_line(x, start_y, x, canvas_height-y,
                                        fill="black")
                else:
                    canvas.create_line(x, start_y, x, canvas_height-y,
                                        fill=colours[current_colour%nb_colours])
                current_colour = new_colour
                start_y = canvas_height - y

        if current_colour == -1:
            canvas.create_line(x, start_y, x, 0, fill="black")
        else:
            canvas.create_line(x, start_y, x, 0,
                                        fill=colours[current_colour%nb_colours])
If we profile this code, we find out that it takes about three times as long to generate a colour picture than it did to generate a black and white one - at least, for the starting configuration...
      2370682 function calls in 12.638 CPU seconds

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    11    4.682    0.426   12.184    1.108 mandel3_cy.pyx:36(create_fractal)
237015    4.310    0.000    7.084    0.000 Tkinter.py:2135(_create)
237043    1.019    0.000    2.575    0.000 Tkinter.py:1046(_options)
237068    0.877    0.000    1.353    0.000 Tkinter.py:77(_cnfmerge)
     1    0.443    0.443    0.443    0.443 {_tkinter.create}
237004    0.418    0.000    7.502    0.000 Tkinter.py:2155(create_line)
474062    0.361    0.000    0.361    0.000 {method 'update' of 'dict' objects}
474079    0.313    0.000    0.313    0.000 {_tkinter._flatten}
237043    0.143    0.000    0.143    0.000 {method 'items' of 'dict' objects}
237052    0.061    0.000    0.061    0.000 {callable}
    37    0.009    0.000    0.009    0.000 {built-in method call}
    11    0.000    0.000   12.186    1.108 viewer3.py:20(draw_fractal)
    12    0.000    0.000    0.000    0.000 viewer.py:60(info)
     4    0.000    0.000    0.000    0.000 {posix.stat}
We also generate some nice pictures! First, using only 100 iterations.

Notice how the Mandelbrot set boundary seems rather smooth ... this is clearly a sign that we might be missing something.  Increasing the number of iterations to 1000 reveals a different picture.


We see much more details and many points that were wrongly identified as being part of the Mandelbrot sets are correctly excluded (and colored!).  What happens if we look at a region that contains more points inside the Mandelbrot set (thus requiring more iterations) and increase the number of iterations to 1500 (as I found that 1000 was not enough in this region).



Unfortunately, the profiling and the timing information displayed does not tell the entire story. In practice, I found that it would take many more seconds (sometimes more than 10) for the canvas to be updated than the timing information given for each of the three pictures displayed above. Something is happening behind the scene when the picture is updated on the screen and which is not being recorded by my simple timing method.

For comparison, I had a look at Aptus, a program that uses a hand-coded C extension for speed. Actually, I had tried Aptus a few years ago, when Ned Batchelder first mentioned it, and tried it again for fun as I was working on my own version. Aptus can produce really nice pictures, really fast. Here's an example that was produced, according to the timing given by Aptus itself in 0.25 seconds.




Note that the timing given by Aptus seems to be truly representative, unlike the timing for my program. I should mention that, in addition to using a hand-crafted C extension, Aptus uses wxPython instead of Tkinter, and it also uses PIL and numpy, both of which are known for their speed. It might be possible that using PIL and numpy with my program would improve the speed significantly. However, all three libraries are not part of the standard library and so do not meet the constraints I had decided upon at the beginning.

This concludes this profiling experiment ... at least for now. I should emphasize that the goal of these posts was to record a profiling experiment using cython. I do not pretend that this code was especially hand-crafted for speed ... even though I did try some simple changes to improve its speed. I may revisit this application at a later time, especially if someone more experienced can point out ways to increase the speed significantly, preferably while staying within the constraints I set up at the beginning: other than cython, use only modules from the standard library. However, I would be interested if anyone adapts this code to use PIL and/or numpy in a straightforward way and, in doing so, increases the speed significantly.

6 comments:

Markus said...

Nice post series, thanks for the quick dive into cython. Seems nice all in all. Acknowledging that hand-written algos might be more efficient, yet the time you needed to get it going that fast was incredibly short. A major +1 for cython imho.

It might not be suitable for any apps but surely at least for some where simple stuff needs a heavy speedup

Ryan Ginstrom said...

Just from examining the generated C code, the following mandel function looks tighter because it avoid dipping into lots of CPython stuff.

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.profile(False)
cdef inline int mandel(double real, double imag, int max_iterations=20):
'''determines if a point is in the Mandelbrot set based on deciding if,
after a maximum allowed number of iterations, the absolute value of
the resulting number is greater or equal to 2.'''
cdef double z_real = 0., z_imag = 0.
cdef int i
cdef double threshold = 4.

for i from 0 <= i < max_iterations:
z_real, z_imag = ( z_real*z_real - z_imag*z_imag + real,
2*z_real*z_imag + imag )
if (z_real*z_real + z_imag*z_imag) >= threshold:
return i
return -1

You could likely do something similar for the rest. There are good hints for speeding up code with cython in the SciPy 2009 proceedings:
http://conference.scipy.org/proceedings/SciPy2009/

André Roberge said...

Ryan:
Thanks for the suggestion and the link. I tried to make this change with the latest version but did not notice any speedup. I will read the papers you linked to and see if I can make some improvements.

Anonymous said...

You might be able to speed up the drawing quite a bit by getting rid of the drawing calls. Instead, you can use an array to store the image, and alter it pixel by pixel with Cython.

I'd use a numpy array to do it, but if you don't want to use any external libs, you can use cython to create a regular old C array, and use that to store your image.

I'm not sure how to get an array into an image with Tk to display it, but I'll bet it can be done.

-Chris Barker

Anonymous said...

OK -- I was thinking about it, so I did it. I wrote a version that uses a numpy array to create and store the image, and then, since I'm familiar with wx, I displayed it in a wx app.

I wrote a Wiki page about it in teh Cython Wiki:

http://wiki.cython.org/examples/mandelbrot

I never got your TK version working on my machine, so I don't know how it compares, but I think I got the image drawing bit down to totally negligible.
-Chris Barker

André Roberge said...

Chris: Thanks for doing this. I saw your post on the cython users list. Unfortunately, I can't see the cython wiki - it is denying any request to see it. Hopefully it won't do the same to others (if so, look up the cython-users list).