A Stroll Down Optimization Lane

March 9, 2013

As promised, this is a tutorial demonstrating a case where using the NumPy C API does buy you something. I'm going to work through five different versions of a simple classification function that served as the backbone of a likelihood function. In this case, the C implementation buys you about a factor of two over optimized Cython. I implemented the whole likelihood function in C, which gave me very nice speed, but the learning curve was pretty steep. As a teaser, there is a trick that, in many cases, lets you simply ignore the issue of reference counting without losing anything. As always, the code can be found at my GitHub under OptimizingLane. [1] Unfortunately, once we were able to test this method thoroughly, we realized the idea didn't work. The code is still worthwhile as an example of how an optimization might progress from pure Python all the way to the C API.

The classifier in question is simply a piecewise linear function:

f(x) = \begin{cases} 0 & \text{if } x < c-b \\ \frac{x-c+b}{2b} & \text{if } c-b \leq x \leq c+b \\ 1 & \text{if } x > c+b \end{cases}

Let’s start with a simple pure Python implementation:

def classify(x, c, b):
  if x < c-b:
    return 0
  elif x > c+b:
    return 1
  else:
    # Guard against dividing by a b that is numerically zero.
    if b > 10**-7:
      return (x-c+b)/2/b
    else:
      return 0.5
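A few quick sanity checks, worked out by hand for c=0.5 and b=0.25, confirm all three branches:

classify(0.1, 0.5, 0.25)   # 0.1 < c-b = 0.25, so this returns 0
classify(0.5, 0.5, 0.25)   # on the ramp: (0.5-0.5+0.25)/2/0.25 = 0.5
classify(0.9, 0.5, 0.25)   # 0.9 > c+b = 0.75, so this returns 1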

Now, we need some data to analyze:

import numpy as np

DATA = np.random.rand(1000)

That code generates a thousand uniform random numbers in the interval [0, 1). We will use the parameters c=0.5 and b=0.25 for our classifier function. Let's start by running our pure Python function in the standard pythonic manner:

%timeit [pp_class(x, 0.5, 0.25) for x in DATA]
100 loops, best of 3: 2.64 ms per loop

pp_class is just the local name for our pure Python implementation. Since this function is going to get called many, many times, we need it to run faster. The first optimization we'll try is NumPy's vectorize:

pp_vect_class = np.vectorize(pp_class)

That little trick actually buys you better than a factor of 6:

%timeit pp_vect_class(DATA, 0.5, 0.25)
1000 loops, best of 3: 416 us per loop

Unfortunately, that factor of 6 mostly reflects how slow it is to index into a NumPy array from pure Python; convert DATA to a plain list and the list comprehension is just as fast as vectorize:

listDATA = DATA.tolist()
%timeit [pp_class(x, 0.5, 0.25) for x in listDATA]
1000 loops, best of 3: 401 us per loop
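As an aside, this particular classifier is just a linear ramp clipped to [0, 1], so it can also be written as one vectorized NumPy expression. Here's a minimal sketch (np_classify is a name I'm making up; it isn't in the original code):

def np_classify(data, c, b):
    if b > 10**-7:
        # The middle branch is a linear ramp; clip pins the outer branches to 0 and 1.
        return np.clip((data - c + b) / (2 * b), 0.0, 1.0)
    # Degenerate ramp: a step function with 0.5 on the (tiny) middle plateau.
    return np.where(data < c - b, 0.0, np.where(data > c + b, 1.0, 0.5))

For a single function like this, that's hard to beat; the compiled versions below earn their keep because the real likelihood chained a number of functions of this type together.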

Now, there is an obvious algorithmic optimization we could do. Our algorithm is going to be O(n) no matter what, since we have to produce an output for every element, but if we keep the data in a sorted structure, we only need O(log(n)) comparisons: two binary searches locate the first and last elements in the range (c-b, c+b), everything below that range maps to 0, and everything above maps to 1. The sort would only have to be done once, while the classification function ran many times. Let's use line_profiler to see what that would buy us:

Line #      Hits   Time  Per Hit   % Time  Line Contents
========================================================
     4                                     @profile
     5                                     def classify(x, c, b):
     6    100001  84977      0.8     34.7    if x<c-b:
     7     24861  14866      0.6      6.1      return 0
     8     75140  54415      0.7     22.2    elif x>c+b:
     9     25326  14910      0.6      6.1      return 1
    10                                       else:
    11     49814  30855      0.6     12.6      if b>10**-7:
    12     49814  44963      0.9     18.4        return (x-c+b)/2/b
    13                                         else:
    14                                           return 0.5

As long as we are still using pure Python, an optimization based on reducing the number of comparisons will buy us at most a factor of 2; the profile shows the two range comparisons account for a bit over half the runtime. I didn't go down that road because the full likelihood function had paired observations, so any sorted data structure would have needed a mapping to preserve that pairing. Since I needed something compiled either way, I punted the algorithmic optimization down the road. If the project had gone farther, I would have looked a lot harder at that kind of optimization.
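To make the idea concrete, here is roughly what that optimization might look like in NumPy. This is a sketch, not code from the project: it assumes the data was sorted once up front, and sorted_classify is a name I'm inventing for illustration.

def sorted_classify(sorted_data, c, b):
    # Two binary searches replace the two per-element comparisons.
    lo = np.searchsorted(sorted_data, c - b, side='left')   # first index with x >= c-b
    hi = np.searchsorted(sorted_data, c + b, side='right')  # first index with x > c+b
    out = np.zeros(len(sorted_data))
    out[hi:] = 1.0                  # everything above c+b classifies as 1
    if b > 10**-7:
        out[lo:hi] = (sorted_data[lo:hi] - c + b) / (2 * b)
    else:
        out[lo:hi] = 0.5
    return out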

Since we went over basic Cython in the last post, let’s jump straight to the optimized version.

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
def cy_classify(np.ndarray[np.float_t, ndim=1] data, double c, double b):
    cdef unsigned int i
    # Typed output buffer so the assignment below compiles to a raw C store.
    cdef np.ndarray[np.float_t, ndim=1] out = np.zeros(data.shape[0], dtype=float)
    for i in xrange(data.shape[0]):
        out[i] = c_classify(data[i], c, b)
    return out

cdef double c_classify(double x, double c, double b):
    if x < c - b:
        return 0
    elif x > c + b:
        return 1
    else:
        if b > 1e-7:
            return (x - c + b) / 2 / b
        else:
            return 0.5

There are two functions in the above code. The cdef double before c_classify tells Cython that c_classify is a function that returns a double and should not be exposed to Python (np.float_t is a C double, so declaring it float would silently truncate precision). The function is compiled to plain C and has much lower function call overhead. Almost all of the extra effort in the cy_classify function goes toward getting past slow Python indexing. Declaring our input to be a 1-d NumPy ndarray of floats (np.ndarray[np.float_t, ndim=1] data) goes a long way. The next optimization is to declare our iterator i to be an unsigned int. That way, Cython knows it doesn't have to worry about negative indexing. Finally, we added the decorator @cython.boundscheck(False) to turn off bounds checking.
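If you haven't built a Cython extension before, the compile step is just a small setup script. A minimal sketch, with placeholder module and file names:

# setup.py -- build in place with: python setup.py build_ext --inplace
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np

setup(
    cmdclass={'build_ext': build_ext},
    # cy_class.pyx holds the cy_classify code above; the NumPy headers are
    # needed because we cimport numpy.
    ext_modules=[Extension('cy_class', ['cy_class.pyx'],
                           include_dirs=[np.get_include()])],
)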

After compiling our Cython code, we are finally in the ballpark:

%timeit cy_class(DATA, 0.5, 0.25)
100000 loops, best of 3: 7.47 us per loop

Let's see what we can get when we move to C. The hardest part of dealing with the Python C API is reference counting. As you probably know, Python is a garbage-collected language. Its garbage collector works by counting every reference to an object and freeing the associated memory when the reference count reaches zero. Because Python can't introspect your C code, you have to manually increment and decrement the reference count of every Python object you interact with. If you don't properly increment the count, Python might free the object before you are done with it. If you don't properly decrement the count, Python will hold onto the memory forever, leaking it into the abyss. Here's an example of some leaky code:

static PyObject* classify_bad_classify(PyObject* self, PyObject* args) {
  double c, b;
  int iN;
  PyObject *x_obj;

  if (!PyArg_ParseTuple(args, "Odd", &x_obj, &c, &b))
    return NULL;

  PyArrayObject *x_array = 
    (PyArrayObject *) PyArray_FROM_OTF(x_obj, NPY_DOUBLE, NPY_IN_ARRAY);
  if (x_array == NULL) {
    Py_XDECREF(x_array);
    return NULL;
  }

  int nd = PyArray_NDIM(x_array);
  if (nd != 1) {
    Py_XDECREF(x_array);
    PyErr_SetString(PyExc_TypeError,"Array is wrong shape!\n");
    return NULL;
  }
  npy_intp N = PyArray_DIM(x_array, 0);

  npy_intp shape[] = {N};
  PyArrayObject *vecout = (PyArrayObject *) PyArray_SimpleNew(1, shape, PyArray_DOUBLE);
  double *OUT = (double *) PyArray_DATA(vecout);
  double *x = (double *) PyArray_DATA(x_array);

  for (iN=0; iN<N; iN++) {
      OUT[iN] = classify(x[iN], c, b);
  }

  return Py_BuildValue("N",vecout);
}

Can you see the leak? Yeah, it took me a little while to find it, too. Here's a short synopsis of how this thing works. PyArg_ParseTuple converts the Python arguments into C values. PyArray_FROM_OTF converts the object in x_obj into a NumPy ndarray; literally, you could read that line as x_array = np.asarray(x_obj, dtype=float). NPY_IN_ARRAY is a flag that guarantees our data is stored in a contiguous array. PyArray_NDIM is just a convenience function to check the number of dimensions, and PyArray_DATA is a convenience function that returns the underlying buffer, which we can cast to a C array; the cast is safe precisely because we used NPY_IN_ARRAY. Finally, Py_BuildValue packages vecout up as the return value.

The first two refcounts are pretty easy. PyArg_ParseTuple simply borrows references, and Python isn't going to free your arguments while the function is running, so you don't have to touch anything in that section. PyArray_SimpleNew hands you a new reference to vecout, and we pass ownership of that reference back to Python with the "N" format code in Py_BuildValue. That leaves x_array: PyArray_FROM_OTF (a macro wrapping PyArray_FromAny) also returns a new reference, and nothing in the function ever releases it. We need to add a Py_DECREF(x_array) to our code right before we return.
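If you would rather catch a leak like this empirically than by auditing refcounts, one crude check is to call the function in a tight loop and watch the process's peak memory climb. A sketch, assuming the leaky extension was built as a module named classify exposing bad_classify (names made up to match the C function above); on Linux, ru_maxrss is reported in kilobytes:

import resource
import numpy as np
import classify  # hypothetical module built from the leaky code above

# A plain list forces PyArray_FROM_OTF to allocate a brand new array on
# every call, and each of those arrays leaks.
LIST_DATA = np.random.rand(1000).tolist()

def peak_rss_kb():
    # Peak resident set size of this process so far (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
for _ in range(10000):
    classify.bad_classify(LIST_DATA, 0.5, 0.25)
# Each call leaks one 1000-element double array (~8 KB), so expect growth
# on the order of 80 MB here.
print("peak RSS grew by %d KB" % (peak_rss_kb() - before))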

After all that work, let’s see what we get:

%timeit c_class(DATA, 0.5, 0.25)
100000 loops, best of 3: 3.85 us per loop

We get a speedup of about a factor of two compared to optimized Cython. That isn't phenomenal, but it certainly isn't nothing. The more often you have to access data in an array, and the more intermediate structures you need to build, the more advantage you will see from using the C API.

Now, let's assume that you don't want to spend a ton of time counting references. The trick is to handle the "type safety" work at a higher level and write x_array = np.asarray(x_obj, dtype=float) explicitly in Python. We could write a pretty thin wrapper like this:

def classify(data, c, b):
    data = np.asarray(data, dtype=float)
    return _classify.classify(data, c, b)

We can use that with a simpler C extension _classify:

static PyObject* _classify_classify(PyObject* self, PyObject* args) {
  double c, b;
  int iN;
  PyArrayObject *x_obj;

  if (!PyArg_ParseTuple(args, "Odd", &x_obj, &c, &b))
    return NULL;

  int nd = PyArray_NDIM(x_obj);
  if (nd != 1) {
    PyErr_SetString(PyExc_TypeError,"Array is wrong shape!\n");
    return NULL;
  }
  npy_intp N = PyArray_DIM(x_obj, 0);

  npy_intp shape[] = {N};
  PyArrayObject *vecout = (PyArrayObject *) PyArray_SimpleNew(1, shape, PyArray_DOUBLE);
  double *OUT = (double *) PyArray_DATA(vecout);
  double *x = (double *) PyArray_DATA(x_obj);

  for (iN=0; iN<N; iN++) {
      OUT[iN] = classify(x[iN], c, b);
  }

  return Py_BuildValue("N",vecout);
}

Because we can now just assume that x_obj is an ndarray, we can cast the data with impunity. Our only two references are the ones we borrow from PyArg_ParseTuple and the reference we hand over to Python for vecout. How are we doing on speed?

%timeit c_wrap_class(DATA, 0.5, 0.25)
100000 loops, best of 3: 4.77 us per loop

Our run time goes up by about 25%. Unless you need to build intermediate Python structures inside the C code, it is often easier to pass ndarrays through thin wrappers like this. Even for wrappers that handle optional arguments, the overhead is O(1) in the size of the array, while the ease and maintainability of your code is significantly higher.
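For example, a slightly thicker wrapper, sketched here with made-up default values, can absorb optional arguments and the shape check without touching the C side:

def classify(data, c=0.5, b=0.25):
    # Mirror what NPY_IN_ARRAY did for us: a contiguous 1-d double array.
    data = np.ascontiguousarray(data, dtype=float)
    if data.ndim != 1:
        raise TypeError("Array is wrong shape!")
    return _classify.classify(data, c, b)

Duplicating the dimension check at the Python level also means callers get a friendly exception before the C code ever runs.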

To be clear, I did profile my program extensively and made a number of optimizations before resorting to C. Also note that the C code here is more flexible than the Cython code: the Cython code cannot handle lists, tuples, or anything else array-like, while we have made an effort to ensure that the C code can convert those data types. There were a number of functions of this type chained together, and moving to some compiled implementation was necessary. The take-home is that there is a real, if modest, benefit to a full C implementation when you are applying a function across an array.

1. I will forewarn you that I am pretty sure this code is not NumPy 1.7 compliant. If you want a NumPy 1.7 compliant version, either leave a comment or email me, and I will make the necessary changes to the C code.



Comments
  1. Rob

    Interesting posts regarding Cython optimization, Daniele. I am curious how these optimizations compare against using something like NumExpr to parallelize elementwise computations and make use of multicore processors. I'm a novice at Cython/C, but I've spent a good bit of time optimizing some code involving a large number of element-by-element operations across a few NumPy arrays.

    While Cython provided some speed benefit, I saw the largest boost by offloading the element-wise calculations to a Python function using NumExpr (calling functions like "exp") and using Cython to improve the speed of some loops. I think this is largely a result of the calculations being far faster when using all the cores on my computer with NumExpr. I doubt I would see the same speeds if I wrote this in native C unless I also implemented threading, and frankly I think the folks developing tools like NumExpr are probably going to do a far better job of writing efficient, multi-threaded code than me.

    • I think you are in some ways comparing apples and oranges. Comparing single-threaded optimizations to multi-threaded versions is very difficult. In terms of developer effort, it could be a useful comparison of which approach provides the biggest bang for the cost. As a crude approximation, the runtime of a perfectly parallel algorithm is proportional to the number of CPU instructions divided by the number of cores.* The optimizations here try to reduce the number of instructions, whereas multithreaded technologies try to increase the number of cores. If the NumExpr people can do both, and it looks like they can, that could be really exciting.

      * Yes, there's more to it than that due to things like branch prediction and cache behavior, but I'm just trying to get a concept across.

