As promised, this is a tutorial demonstrating a case where using the NumPy C API does buy you something. I’m going to work through five different versions of a simple classification function that served as the backbone of a likelihood function. In this case, the C implementation buys you about a factor of two. I implemented the whole likelihood function in C, which gave me very nice speed, but the learning curve was pretty steep. As a teaser, there is a trick that allows you to (in many cases) simply ignore the issue of reference counting without losing anything. As always, the code can be found at my GitHub under OptimizingLane. [1] Unfortunately, once we were able to test this method thoroughly, we realized the idea didn’t work. The code is still worthwhile as an example of how an optimization might progress from pure Python all the way to the C API.

The classifier in question is simply a piecewise linear function:

$f(x) = \begin{cases} 0 & \mbox{if } x < c-b \\ \frac{x-c+b}{2b} & \mbox{if } c-b \le x \le c+b \\ 1 & \mbox{if } x > c+b\end{cases}$

def classify(x, c, b):
    if x < c - b:
        return 0
    elif x > c + b:
        return 1
    else:
        if b > 10**-7:
            return (x - c + b) / 2 / b
        else:
            return 0.5


Now, we need some data to analyze:

import numpy as np

DATA = np.random.rand(1000)


That code generates a thousand uniform random numbers in the half-open interval [0, 1). We will use the parameters c=0.5 and b=0.25 for our classifier function. Let’s start by running our pure Python function in the standard pythonic manner:

%timeit [pp_class(x, 0.5, 0.25) for x in DATA]
100 loops, best of 3: 2.64 ms per loop


pp_class is just the local name for our pure Python implementation. Since that function is going to get called many, many times, we need it to run faster. The first optimization we’re going to try is NumPy’s vectorize:

pp_vect_class = np.vectorize(pp_class)


That little trick actually buys you close to a factor of 5:

%timeit pp_vect_class(DATA, 0.5, 0.25)
1000 loops, best of 3: 416 us per loop


Unfortunately, that factor of 5 mostly reflects how slow per-element access to a NumPy array is; running the same loop over a plain Python list is just as fast:

listDATA = DATA.tolist()
%timeit [pp_class(x, 0.5, 0.25) for x in listDATA]
1000 loops, best of 3: 401 us per loop
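Part of the reason for that gap is that every scalar read from an ndarray boxes the value into a brand-new NumPy scalar object, while a list already holds plain Python floats. A quick sketch to see this for yourself:

```python
import numpy as np

arr = np.random.rand(1000)
lst = arr.tolist()

# each scalar read from the ndarray allocates a fresh NumPy scalar object,
# while the list hands back the Python float it already holds
elem_from_array = arr[0]   # a numpy.float64, freshly boxed on every access
elem_from_list = lst[0]    # a plain Python float
```

That per-element boxing is paid on every iteration of the pure Python loop.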


Now, there is an obvious algorithmic optimization we could make. Filling the output is O(n) no matter what, but if we keep the data in a sorted structure, we only need O(log(n)) comparisons: locate the first and last elements in the range (c-b, c+b) by binary search, and everything outside that range is 0 or 1 automatically. The sort would only have to be done once, while the classification function would be run many times. Let’s use line_profiler to see what that would buy us:

Line #      Hits    Time  Per Hit   % Time  Line Contents
==============================================================
     4                                      @profile
     5                                      def classify(x, c, b):
     6    100001   84977      0.8     34.7      if x<c-b:
     7     24861   14866      0.6      6.1          return 0
     8     75140   54415      0.7     22.2      elif x>c+b:
     9     25326   14910      0.6      6.1          return 1
    10                                          else:
    11     49814   30855      0.6     12.6          if b>10**-7:
    12     49814   44963      0.9     18.4              return (x-c+b)/2/b
    13                                              else:
    14                                                  return 0.5


As long as we are still using pure Python, an optimization based on reducing the number of comparisons will buy us at most a factor of 2, since the comparison lines account for only about half of the run time. I didn’t go down that road because the full likelihood function had paired observations, so any sorted data structure would have needed an extra mapping to preserve that pairing. Since I needed something compiled either way, I punted the algorithmic optimization down the road. If the project had gone further, I would have looked much harder at that kind of optimization.
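For completeness, here is a sketch of what that sorted-data approach could look like using np.searchsorted. This is my illustration, not code from the project: sort once, binary-search the two boundaries, and only apply the linear formula to the middle slice.

```python
import numpy as np

def classify_sorted(sorted_data, c, b):
    """Classify a pre-sorted 1-d array; boundary lookup is O(log n)."""
    # first index with x >= c-b, and first index with x > c+b
    lo = np.searchsorted(sorted_data, c - b)
    hi = np.searchsorted(sorted_data, c + b, side='right')
    out = np.empty_like(sorted_data)
    out[:lo] = 0.0     # everything left of the window classifies to 0
    out[hi:] = 1.0     # everything right of the window classifies to 1
    mid = sorted_data[lo:hi]
    out[lo:hi] = (mid - c + b) / (2 * b) if b > 1e-7 else 0.5
    return out

data = np.sort(np.random.rand(1000))   # the sort is paid once, up front
result = classify_sorted(data, 0.5, 0.25)
```

Note that the output fill is still O(n); only the comparisons drop to O(log n), which is why the payoff in pure Python is capped.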

Since we went over basic Cython in the last post, let’s jump straight to the optimized version.

@cython.boundscheck(False)
def cy_classify(np.ndarray[np.float_t, ndim=1] data, float c, float b):
    cdef unsigned int i
    cdef np.ndarray[np.float_t, ndim=1] out = np.zeros(data.shape[0], dtype=float)
    for i in xrange(data.shape[0]):
        out[i] = c_classify(data[i], c, b)
    return out

cdef float c_classify(float x, float c, float b):
    if x < c - b:
        return 0
    elif x > c + b:
        return 1
    else:
        if b > 10**-7:
            return (x - c + b) / 2 / b
        else:
            return 0.5


There are two functions in the above code. The cdef float before c_classify tells Cython that c_classify returns a float and should not be exposed to Python; it compiles down to a plain C function with much lower call overhead. Almost all of the extra effort in cy_classify goes toward getting past slow Python indexing. Declaring our input to be a 1-d NumPy ndarray of floats (np.ndarray[np.float_t, ndim=1] data) goes a long way. The next optimization is to declare our iterator i to be an unsigned int, so Cython knows not to worry about negative indexing. Finally, we added the decorator @cython.boundscheck(False) to disable bounds checking.
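For anyone following along at home, a minimal build script for the Cython module might look like the following; the module and file names here are my assumptions, not taken from the repository:

```python
# setup.py -- minimal build script for the Cython extension (names assumed)
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("cy_classify", ["cy_classify.pyx"],
                           # NumPy headers are needed for "cimport numpy"
                           include_dirs=[np.get_include()])],
)
```

Build it in place with python setup.py build_ext --inplace.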

After compiling our Cython code, we are finally in the ballpark:

%timeit cy_class(DATA, 0.5, 0.25)
100000 loops, best of 3: 7.47 us per loop


Let’s see what we can get when we move to C. The hardest part of dealing with the Python C API is reference counting. As you probably know, Python is a garbage-collected language. Its garbage collector works by counting every reference made to an object and freeing the associated memory when the reference count reaches zero. Because Python can’t introspect your C code, you have to manually increment and decrement the reference count of every Python object you interact with. If you don’t properly increment the count, Python might free the object before you are done with it. If you don’t properly decrement the count, Python will hold onto the memory forever, leaking it into the abyss. Here’s an example of some leaky code:

static PyObject* classify_bad_classify(PyObject* self, PyObject* args) {
    double c, b;
    int iN;
    PyObject *x_obj;

    if (!PyArg_ParseTuple(args, "Odd", &x_obj, &c, &b))
        return NULL;

    PyArrayObject *x_array =
        (PyArrayObject *) PyArray_FROM_OTF(x_obj, NPY_DOUBLE, NPY_IN_ARRAY);
    if (x_array == NULL) {
        Py_XDECREF(x_array);
        return NULL;
    }

    int nd = PyArray_NDIM(x_array);
    if (nd != 1) {
        Py_XDECREF(x_array);
        PyErr_SetString(PyExc_TypeError, "Array is wrong shape!\n");
        return NULL;
    }
    npy_intp N = PyArray_DIM(x_array, 0);

    npy_intp shape[] = {N};
    PyArrayObject *vecout = (PyArrayObject *) PyArray_SimpleNew(1, shape,
                                                                PyArray_DOUBLE);
    double *OUT = (double *) PyArray_DATA(vecout);
    double *x = (double *) PyArray_DATA(x_array);

    for (iN = 0; iN < N; iN++) {
        OUT[iN] = classify(x[iN], c, b);
    }

    return Py_BuildValue("N", vecout);
}


Can you see the leak? Yeah, it took me a little while to find it, too. Here’s a short synopsis of how this thing works. PyArg_ParseTuple converts the Python objects into C objects. PyArray_FROM_OTF converts the object in x_obj into a NumPy ndarray; you could read that line as x_array = np.asarray(x_obj, dtype=float). NPY_IN_ARRAY is a flag that ensures our data is stored in a contiguous array. PyArray_NDIM is just a convenience function to check the number of dimensions, and PyArray_DATA is a convenience function that returns the underlying data buffer, which we can cast to a C array; the cast is safe because we requested NPY_IN_ARRAY. Finally, Py_BuildValue converts our data into a Python object for returning. The first two refcounts are pretty easy. PyArg_ParseTuple only borrows references, and Python isn’t going to free those objects while your function is running, so you don’t have to touch anything in that section. PyArray_SimpleNew hands us a new reference to vecout, and we hand that reference off to Python using the “N” format code in Py_BuildValue. The leak is x_array: PyArray_FROM_OTF also returns a new reference that we own, so we need to add a Py_DECREF(x_array) to our code right before we return.

After all that work, let’s see what we get:

%timeit c_class(DATA, 0.5, 0.25)
100000 loops, best of 3: 3.85 us per loop


We get a speedup of about a factor of two compared to optimized Cython. That isn’t phenomenal, but it certainly isn’t nothing. The more times you have to access data from an array, or if you need to build intermediate structures, the more advantage you are going to see from using the C API.

Now, let’s assume that you don’t want to spend a ton of time counting references. The trick is to handle all the “type safety” things at a higher level and just write x_array = np.asarray(x_obj, dtype=float) explicitly in Python. We could write a pretty thin wrapper like this:

def classify(data, c, b):
    data = np.asarray(data, dtype=float)
    return _classify.classify(data, c, b)


We can use that with a simpler C extension _classify:

static PyObject* _classify_classify(PyObject* self, PyObject* args) {
    double c, b;
    int iN;
    PyArrayObject *x_obj;

    if (!PyArg_ParseTuple(args, "Odd", &x_obj, &c, &b))
        return NULL;

    int nd = PyArray_NDIM(x_obj);
    if (nd != 1) {
        PyErr_SetString(PyExc_TypeError, "Array is wrong shape!\n");
        return NULL;
    }
    npy_intp N = PyArray_DIM(x_obj, 0);

    npy_intp shape[] = {N};
    PyArrayObject *vecout = (PyArrayObject *) PyArray_SimpleNew(1, shape,
                                                                PyArray_DOUBLE);
    double *OUT = (double *) PyArray_DATA(vecout);
    double *x = (double *) PyArray_DATA(x_obj);

    for (iN = 0; iN < N; iN++) {
        OUT[iN] = classify(x[iN], c, b);
    }

    return Py_BuildValue("N", vecout);
}


Because we can now just assume that x_obj is an ndarray, we can cast the data with impunity. Our only two references are the ones we borrow from PyArg_ParseTuple and the reference we hand over to Python for vecout. How are we doing on speed?

%timeit c_wrap_class(DATA, 0.5, 0.25)
100000 loops, best of 3: 4.77 us per loop


Our run time goes up by about 25%. Unless you need to build an intermediate Python structure, it is often easier to pass ndarrays through thin wrappers like this. For wrappers that handle optional arguments, the overhead is O(1), while the code becomes significantly easier to write and maintain.
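As an illustration of that wrapper pattern with optional arguments (the names here are hypothetical, and the compiled routine is replaced with a pure-Python stand-in so the sketch runs on its own):

```python
import numpy as np

def _classify_impl(data, c, b):
    # stand-in for the compiled _classify.classify routine; a vectorized
    # equivalent of the piecewise function, assuming b is above the tiny-b threshold
    return np.clip((data - c + b) / (2 * b), 0.0, 1.0)

def classify(data, c=0.5, b=0.25):
    """Thin wrapper: coerce and validate once, then call the compiled code."""
    data = np.asarray(data, dtype=float)   # accepts lists, tuples, ndarrays, ...
    if data.ndim != 1:
        raise TypeError("Array is wrong shape!")
    return _classify_impl(data, c, b)
```

All the type coercion, shape checking, and default handling lives in a few lines of Python, so the C side never has to reason about reference counts for anything but its output array.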

To be clear, I did profile my program extensively and made a number of optimizations before resorting to C. Also note that the C code here is more flexible than the Cython code: the Cython function cannot handle lists, tuples, or anything else array-like, while we made an effort to ensure that the C code can convert those data types. There were a number of functions of this type chained together, and moving to some compiled implementation was necessary. The take-home is that there is a real, if modest, benefit to a full C implementation when you are applying a function to an array.

1. I will forewarn you that I am pretty sure this code is not NumPy 1.7 compliant. If you want a NumPy 1.7 compliant version, either leave a comment or email me, and I will make the necessary changes to the C code.
