Sunday, January 17, 2010

Profiling adventures and cython - Introducing cython

In the previous blog post, I made some attempts at speeding up the function mandel() by making changes in the Python code. While I had some success in doing so, it was clearly not enough for my purpose. As a result, I will now try to use cython. Before I do this, I note again the result from the last profiling run, limiting the information to the 5 longest-running functions or methods.

       3673150 function calls in 84.807 CPU seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
3168000   73.210    0.000   73.210    0.000 mandel1c.py:7(mandel)
     11   10.855    0.987   84.204    7.655 viewer1.py:23(draw_fractal)
      1    0.593    0.593    0.593    0.593 {_tkinter.create}
 504530    0.137    0.000    0.137    0.000 viewer1.py:17(draw_pixel)
     37    0.009    0.000    0.009    0.000 {built-in method call}

The goal of cython could be described as providing an easy way to convert a Python module into a C extension. This is what I will do. [There are other ways to work with cython extensions than what I use here; for more information, please consult the cython web site.] Note that I am NOT a cython expert; this is only the first project for which I use cython. While I am not interested in creating an application for distribution, and hence do not use the setup method for cython, it is quite possible that there are better ways to use cython than what I explore here.
I first start by taking my existing module and copying it into a new file, with a ".pyx" extension instead of the traditional ".py".

# mandel2cy.pyx
# cython: profile=True

def mandel(c, max_iterations=20):
    '''determines if a point is in the Mandelbrot set based on deciding if,
       after a maximum allowed number of iterations, the absolute value of
       the resulting number is greater or equal to 2.'''
    z = 0
    for i in range(0, max_iterations):
        z = z**2 + c
        if abs(z) >= 4:
            return False
    return abs(z) < 2

Note that I have removed the equivalence between range and xrange. The reason I have done this is because with xrange present like this in the file results in a compilation error when running cython with Python 3.1. Furthermore, as will be seen later, it is not really needed even for Python 2.x when using cython properly.
I have also included a commented line stating that 'profile' was equal to True; this is a cython directive that will enable the Python profiler to also include cython functions in its tally.

In order to import this module, I also need to modify the viewer to import the cython module. Here is the new version.
# viewer2.py

import pyximport
pyximport.install()

from mandel2_cy import mandel
from viewer import Viewer
import time

import sys
if sys.version_info < (3,):
    import Tkinter as tk
    range = xrange
else:
    import tkinter as tk

class FancyViewer(Viewer):
    '''Application to display fractals'''

    def draw_pixel(self, x, y):
        '''Simulates drawing a given pixel in black by drawing a black line
           of length equal to one pixel.'''
        return
        #self.canvas.create_line(x, y, x+1, y, fill="black")

    def draw_fractal(self):
        '''draws a fractal on the canvas'''
        self.calculating = True
        begin = time.time()
        # clear the canvas
        self.canvas.create_rectangle(0, 0, self.canvas_width,
                                    self.canvas_height, fill="white")
        for x in range(0, self.canvas_width):
            real = self.min_x + x*self.pixel_size
            for y in range(0, self.canvas_height):
                imag = self.min_y + y*self.pixel_size
                c = complex(real, imag)
                if mandel(c, self.nb_iterations):
                    self.draw_pixel(x, self.canvas_height - y)
        self.status.config(text="Time required = %.2f s  [%s iterations]  %s" %(
                                (time.time() - begin), self.nb_iterations,
                                                                self.zoom_info))
        self.status2.config(text=self.info())
        self.calculating = False

if __name__ == "__main__":
    root = tk.Tk()
    app = FancyViewer(root)
    root.mainloop()

Other than the top few lines, nothing has changed. Time to run the profiler with this new version.
       6841793 function calls in 50.145 CPU seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
3168000   35.913    0.000   35.913    0.000 mandel2_cy.pyx:4(mandel)
     11   10.670    0.970   48.754    4.432 viewer2.py:26(draw_fractal)
3168000    2.001    0.000   37.914    0.000 {mandel2_cy.mandel}
      1    1.356    1.356    1.356    1.356 {_tkinter.create}
 505173    0.167    0.000    0.167    0.000 viewer2.py:20(draw_pixel)

A reduction from 85 to 50 seconds; cython must be doing something right! Note that the calls to abs() have been eliminated by using cython. All I did is import the module via Python without making any other change to the code.
Note also that mandel appears twice: once (the longest running) as the function defined on line 8 of mandel2_cy.pyx, and once as a object belonging to the module mandel2_cy. I will come back to this later but, for now, I will do some changes to help cython do even better.
As mentioned before, cython is a tool to help create C extensions. One of the differences between C and Python is that variables have a declared type in C. If one tells cython about what type a given variable is, cython can often use that information to make the code run faster. As an example, I know that two of the variables are of type integers which is a native C type; I can add this information as follows.
# mandel2a_cy.pyx
# cython: profile=True

def mandel(c, int max_iterations=20):
    '''determines if a point is in the Mandelbrot set based on deciding if,
       after a maximum allowed number of iterations, the absolute value of
       the resulting number is greater or equal to 2.'''
    cdef int i
    z = 0
    for i in range(0, max_iterations):
        z = z**2 + c
        if abs(z) >= 2:
            return False
    return abs(z) < 2

Running the profiler with this change yields the following:
       6841793 function calls in 39.860 CPU seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
3168000   27.431    0.000   27.431    0.000 mandel2a_cy.pyx:4(mandel)
     11    9.869    0.897   39.339    3.576 viewer2.py:26(draw_fractal)
3168000    1.906    0.000   29.337    0.000 {mandel2a_cy.mandel}
      1    0.511    0.511    0.511    0.511 {_tkinter.create}
 505173    0.131    0.000    0.131    0.000 viewer2.py:20(draw_pixel)

Another significant time reduction, this time of the order of 20%. And we didn't tell cython that "z" and "c" are complex yet.

Actually, C does not have a complex data type. So, I can choose one of two strategies:
  1. I can change the code so that I deal only with real numbers, by working myself how to multiply and add complex numbers.
  2. I can use some special cython technique to extract all the relevant information about the Python built-in complex data type without changing the code inside the function (other than adding some type declaration).
I will choose the second of these methods and see what it gives. The required changes are as follows:
# mandel2b_cy.pyx
# cython: profile=True

cdef extern from "complexobject.h":

    struct Py_complex:
        double real
        double imag

    ctypedef class __builtin__.complex [object PyComplexObject]:
        cdef Py_complex cval


def mandel(complex c, int max_iterations=20):
    '''determines if a point is in the Mandelbrot set based on deciding if,
       after a maximum allowed number of iterations, the absolute value of
       the resulting number is greater or equal to 2.'''
    cdef int i
    cdef complex z

    z = 0. + 0.j

    for i in range(0, max_iterations):
        z = z**2 + c
        if abs(z) >= 2:
            return False
    return abs(z) < 2

The timing results are the following:
       6841793 function calls in 38.424 CPU seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
3168000   26.771    0.000   26.771    0.000 mandel2b_cy.pyx:14(mandel)
     11    9.435    0.858   38.209    3.474 viewer2.py:26(draw_fractal)
3168000    1.865    0.000   28.636    0.000 {mandel2b_cy.mandel}
      1    0.205    0.205    0.205    0.205 {_tkinter.create}
 505173    0.136    0.000    0.136    0.000 viewer2.py:20(draw_pixel)

The time difference between this run and the previous one is within the variation I observe from one profiling run to the next (using exactly the same program). Therefore, I conclude that this latest attempt didn't speed up the code. It is possible that I have overlooked something to ensure that cython could make use of the information about the complex datatype more efficiently ... It seems like I need a different strategy. I will resort to doing the complex algebra myself, and work only with real numbers. Here's the modified code for the mandel module.
# mandel2c_cy.pyx
# cython: profile=True

def mandel(double real, double imag, int max_iterations=20):
    '''determines if a point is in the Mandelbrot set based on deciding if,
       after a maximum allowed number of iterations, the absolute value of
       the resulting number is greater or equal to 2.'''
    cdef double z_real = 0., z_imag = 0.
    cdef int i

    for i in range(0, max_iterations):
        z_real, z_imag = ( z_real*z_real - z_imag*z_imag + real,
                     2*z_real*z_imag + imag )
        if (z_real*z_real + z_imag*z_imag) >= 4:
            return False
    return (z_real*z_real + z_imag*z_imag) < 4

I also change the call within draw_fractal() so that I don't use complex variables. The result is extremely encouraging:
       6841793 function calls in 7.205 CPU seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     11    4.379    0.398    7.066    0.642 viewer2a.py:26(draw_fractal)
3168000    1.557    0.000    2.570    0.000 {mandel2c_cy.mandel}
3168000    1.013    0.000    1.013    0.000 mandel2c_cy.pyx:4(mandel)
      1    0.130    0.130    0.130    0.130 {_tkinter.create}
 505173    0.114    0.000    0.114    0.000 viewer2a.py:20(draw_pixel)

This total execution time has been reduced from 38 to 7 seconds. mandel() is no longer the largest contributor to the overall execution time; draw_fractal() is. However, the program is still a bit too slow: without actually doing any drawing, it takes approximately 0.6 seconds to generate one fractal image. However, I can do better. Looking at the code, I notice that draw_fractal() contains two embedded for loops, resulting to all those calls to mandel(). Remember how telling cython about integer types used in loops sped up the code? This suggest that perhaps I should do something similar and move some of the code of draw_fractal() to the cython module. Here's a modified viewer module.
# viewer2b.py

import pyximport
pyximport.install()

from mandel2d_cy import create_fractal
from viewer import Viewer
import time

import sys
if sys.version_info < (3,):
    import Tkinter as tk
    range = xrange
else:
    import tkinter as tk

class FancyViewer(Viewer):
    '''Application to display fractals'''

    def draw_fractal(self):
        '''draws a fractal on the canvas'''
        self.calculating = True
        begin = time.time()
        # clear the canvas
        self.canvas.create_rectangle(0, 0, self.canvas_width,
                                    self.canvas_height, fill="white")
        create_fractal(self.canvas_width, self.canvas_height,
                       self.min_x, self.min_y, self.pixel_size,
                       self.nb_iterations, self.canvas)
        self.status.config(text="Time required = %.2f s  [%s iterations]  %s" %(
                                (time.time() - begin), self.nb_iterations,
                                                                self.zoom_info))
        self.status2.config(text=self.info())
        self.calculating = False

if __name__ == "__main__":
    root = tk.Tk()
    app = FancyViewer(root)
    root.mainloop()

And here is the new cython module, without any additional type declaration.
# mandel2d_cy.pyx
# cython: profile=True

def mandel(double real, double imag, int max_iterations=20):
    '''determines if a point is in the Mandelbrot set based on deciding if,
       after a maximum allowed number of iterations, the absolute value of
       the resulting number is greater or equal to 2.'''
    cdef double z_real = 0., z_imag = 0.
    cdef int i

    for i in range(0, max_iterations):
        z_real, z_imag = ( z_real*z_real - z_imag*z_imag + real,
                     2*z_real*z_imag + imag )
        if (z_real*z_real + z_imag*z_imag) >= 4:
            return False
    return (z_real*z_real + z_imag*z_imag) < 4

def draw_pixel(x, y, canvas):
    '''Simulates drawing a given pixel in black by drawing a black line
       of length equal to one pixel.'''
    return
    #canvas.create_line(x, y, x+1, y, fill="black")

def create_fractal(canvas_width, canvas_height,
                       min_x, min_y, pixel_size,
                       nb_iterations, canvas):
    for x in range(0, canvas_width):
        real = min_x + x*pixel_size
        for y in range(0, canvas_height):
            imag = min_y + y*pixel_size
            if mandel(real, imag, nb_iterations):
                draw_pixel(x, canvas_height - y, canvas)

The profiling result is as follows:
       3673815 function calls in 3.873 CPU seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     11    2.632    0.239    3.706    0.337 mandel2d_cy.pyx:24(create_fractal)
3168000    1.002    0.000    1.002    0.000 mandel2d_cy.pyx:4(mandel)
      1    0.155    0.155    0.155    0.155 {_tkinter.create}
 505173    0.072    0.000    0.072    0.000 mandel2d_cy.pyx:18(draw_pixel)
     37    0.009    0.000    0.009    0.000 {built-in method call}

Simply by moving over some of the code to the cython module, I have reduced the profiling time to almost half of it previous value. Looking more closely at the profiling results, I also notice that calls to mandel() now only appear once; some overhead in calling cython functions from python modules has disappeared. Let's see what happens if I now add some type information.
def create_fractal(int canvas_width, int canvas_height,
                       double min_x, double min_y, double pixel_size,
                       int nb_iterations, canvas):
    cdef int x, y
    cdef double real, imag

    for x in range(0, canvas_width):
        real = min_x + x*pixel_size
        for y in range(0, canvas_height):
            imag = min_y + y*pixel_size
            if mandel(real, imag, nb_iterations):
                draw_pixel(x, canvas_height - y, canvas)

The result is only slightly better:
       3673815 function calls in 3.475 CPU seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     11    2.189    0.199    3.308    0.301 mandel2e_cy.pyx:24(create_fractal)
3168000    1.046    0.000    1.046    0.000 mandel2e_cy.pyx:4(mandel)
      1    0.135    0.135    0.135    0.135 {_tkinter.create}
 505173    0.074    0.000    0.074    0.000 mandel2e_cy.pyx:18(draw_pixel)
     37    0.028    0.001    0.028    0.001 {built-in method call}

However, one thing I remember from the little I know about C it that, not only do variables have to be declared to be of a certain type, but the same has to be done to functions as well. Here, mandel() has not been declared to be of a specific type, so cython assumes it to be a generic Python object. After reading the cython documentation, and noticing that mandel() is only called from within the cython module, I conclude that not only should I specify the type for mandel() but that it probably makes sense to specify that it can be "inlined"; I also do the same for draw_pixel().
# mandel2f_cy.pyx
# cython: profile=True

cdef inline bint mandel(double real, double imag, int max_iterations=20):
    '''determines if a point is in the Mandelbrot set based on deciding if,
       after a maximum allowed number of iterations, the absolute value of
       the resulting number is greater or equal to 2.'''
    cdef double z_real = 0., z_imag = 0.
    cdef int i

    for i in range(0, max_iterations):
        z_real, z_imag = ( z_real*z_real - z_imag*z_imag + real,
                     2*z_real*z_imag + imag )
        if (z_real*z_real + z_imag*z_imag) >= 4:
            return False
    return (z_real*z_real + z_imag*z_imag) < 4

cdef inline void draw_pixel(x, y, canvas):
    '''Simulates drawing a given pixel in black by drawing a black line
       of length equal to one pixel.'''
    return
    #canvas.create_line(x, y, x+1, y, fill="black")

This yields a nice improvement.
       3673815 function calls in 2.333 CPU seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     11    1.190    0.108    2.194    0.199 mandel2f_cy.pyx:24(create_fractal)
3168000    0.930    0.000    0.930    0.000 mandel2f_cy.pyx:4(mandel)
      1    0.127    0.127    0.127    0.127 {_tkinter.create}
 505173    0.074    0.000    0.074    0.000 mandel2f_cy.pyx:18(draw_pixel)
     37    0.009    0.000    0.009    0.000 {built-in method call}

However... I asked cython to "inline" mandel, thus treating them as a pure C function. Yet, they both appear in the Python profiling information, which was not the case for abs() once I used cython for the first time. The reason it appears is that cython has been instructed to profile all functions in the module, via the directive at the top. I can selectively turn off the profiling for an individual function by importing the "cython module" and using a special purpose decorator as follows.
# mandel2g_cy.pyx
# cython: profile=True

import cython

@cython.profile(False)
cdef inline bint mandel(double real, double imag, int max_iterations=20):
    '''determines if a point is in the Mandelbrot set based on deciding if,
       after a maximum allowed number of iterations, the absolute value of
       the resulting number is greater or equal to 2.'''
    cdef double z_real = 0., z_imag = 0.
    cdef int i

    for i in range(0, max_iterations):
        z_real, z_imag = ( z_real*z_real - z_imag*z_imag + real,
                     2*z_real*z_imag + imag )
        if (z_real*z_real + z_imag*z_imag) >= 4:
            return False
    return (z_real*z_real + z_imag*z_imag) < 4

cdef inline void draw_pixel(x, y, canvas):
    '''Simulates drawing a given pixel in black by drawing a black line
       of length equal to one pixel.'''
    return
    #canvas.create_line(x, y, x+1, y, fill="black")

The result is even better than I would have expected!
      505519 function calls in 0.817 CPU seconds

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    11    0.605    0.055    0.676    0.061 mandel2g_cy.pyx:27(create_fractal)
     1    0.128    0.128    0.128    0.128 {_tkinter.create}
504877    0.070    0.000    0.070    0.000 mandel2g_cy.pyx:21(draw_pixel)
    37    0.010    0.000    0.010    0.000 {built-in method call}
    11    0.001    0.000    0.678    0.062 viewer2b.py:20(draw_fractal)

From 85 seconds (at the beginning of this post) down to 0.8 seconds: a reduction by a factor of 100 ...thank you cython!  :-)

However, increasing the number of iterations to 1000 (from the current value of 100 used for testing) does increase the time significantly.
      495235 function calls in 3.872 CPU seconds

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    11    3.653    0.332    3.723    0.338 mandel2g_cy.pyx:27(create_fractal)
     1    0.136    0.136    0.136    0.136 {_tkinter.create}
494593    0.071    0.000    0.071    0.000 mandel2g_cy.pyx:21(draw_pixel)
    37    0.009    0.000    0.009    0.000 {built-in method call}
    11    0.001    0.000    3.726    0.339 viewer2b.py:20(draw_fractal)

It is probably a good time to put back the drawing to see what the overall time profile looks like in a more realistic situation.
      5441165 function calls in 20.747 CPU seconds

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
494604    8.682    0.000   14.427    0.000 Tkinter.py:2135(_create)
    11    3.863    0.351   20.572    1.870 mandel2g_cy.pyx:27(create_fractal)
494632    2.043    0.000    5.326    0.000 Tkinter.py:1046(_options)
494657    1.845    0.000    2.861    0.000 Tkinter.py:77(_cnfmerge)
494593    1.548    0.000   16.709    0.000 mandel2g_cy.pyx:21(draw_pixel)

Clearly, the limiting time factor is now the Tkinter based drawing, and not the other code. It is time to think of a better drawing strategy. However, this will have to wait until next post.

4 comments:

Anonymous said...

You should likely use

cdef double complex z

instead of

cdef complex z

The former makes z a low-level "unboxed" number (fast), whereas the latter AFAIK makes it a Python complex object (much slower since all arithmetic goes through the PyNumber API).

André Roberge said...

Anonymous: thanks for the suggestion. I did that (and the same for the variable "c"). In addition, I had to replace z**2 by z*z for this to work, as I got an error message about "complex powers not yet supported".

Simply replacing z**2 by z*z brought down the time from 38 to 29 seconds (approximately); adding the "double" type declaration brought down the time to 22 seconds: nice, but not as much as can be obtained by working with real numbers only.

Anonymous said...

Now that's a surprise. Examining the assembler output, it seems that gcc does not inline complex multiplication, which apparently gives a factor of 10 difference.

With `-ffast-math` flag there is no performance difference (of plain C) double vs. complex routines.

patfla said...

The program wasn't actually drawing anything until I noticed that all instances on this page of

#self.canvas.create_line(x, y, x+1, y, fill="black")

are commented out.

I might also point out that, as best I can tell, the posts don't include actual instructions to build (or profile) with cython. I figured out how to do so of course looking elsewhere on the web. E.g.

Building Cython code

Other than that the exercise has been very helpful.

Je vous remercie - Patrick