Whenever you try to interface two incompatible technologies, say a relational database and an object-oriented type system, you are going to run into difficulties. One problem I ran into recently was dealing with polymorphic classes in an ORM. The syntax here will be specific to Django, but the ideas should be applicable to many ORMs.

We have two Django models, which do basically the same thing:

from django.db import models

class Foo(models.Model):
    foo_field = models.CharField(max_length=32)

    def run(self):
        ...implementation based on foo_field...

class Bar(models.Model):
    bar_field = models.CommaSeparatedIntegerField(max_length=32)

    def run(self):
        ...implementation based on bar_field...


Let’s say we have another Django model which attaches to our foo and bar models. What we really want is to write:

class Baz(models.Model):
    ...stuff...
    foo_bar = models.ForeignKey(Foo or Bar)

    ...methods...


In particular, I want to make sure that baz, and all of its baz-like cousins, get deleted whenever I call delete() on the relevant foo or bar instance. Also, if I were to write a new bazzy class, I shouldn’t have to manipulate foo and bar to make everyone play nice. That is what the ORM is for. So, in my first GoF mention in this space, we need an adapter pattern, or a wrapper as the Pythonic part of the engineering universe calls it.

Here is what my foo_bar_wrapper looks like:

class FooBarWrapper(models.Model):
    foo_bar_type = models.CharField(max_length=32)
    foo_bar_id = models.IntegerField()

    class Meta:
        unique_together = ('foo_bar_type', 'foo_bar_id')


Now, we only have one class to override our delete() and save() functions for:

class Foo(models.Model):
    ...foo-y stuff...

    def save(self, *args, **kwargs):
        create_wrapper = False
        if not self.id:
            create_wrapper = True
        super(Foo, self).save(*args, **kwargs)
        if create_wrapper:
            foo_wrapper = FooBarWrapper(
                foo_bar_type=self.__class__.__name__,
                foo_bar_id=self.id)
            foo_wrapper.save()


The delete() method would be pretty similar. I implemented the delete() overload on the wrapper class itself; implementing it on Foo and Bar as well would be a safe redundancy.
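To make the lifecycle coupling concrete, here is a framework-free sketch of the save()/delete() pairing. Everything here (WrappedModel, the registry dict) is an illustrative stand-in of my own, not the Django code above:

```python
class WrappedModel:
    """Stand-in for a Django model that keeps a wrapper registry in sync."""
    registry = {}  # (type name, id) -> wrapper record, shared by all subclasses

    def __init__(self, id):
        self.id = id

    def save(self):
        # Create the wrapper entry only on first save, like the save() override above.
        self.registry.setdefault((type(self).__name__, self.id), {})

    def delete(self):
        # Deleting the model removes the wrapper, and with it anything keyed off it.
        self.registry.pop((type(self).__name__, self.id), None)
```

In the real version, a FooBarWrapper.objects.filter(...).delete() plays the role of the registry pop, and the ForeignKey cascade takes care of the dependents.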

Outside of Django model forms, I avoided using the original foo and bar classes as much as possible. To pull that off, we need a way to reach the underlying foo or bar object from the wrapper:

from django.db.models.loading import get_model

class FooBarWrapper(models.Model):
    ...stuff...

    @property
    def foo_bar(self):
        # get_model() also takes an app label; 'myapp' is a placeholder
        return get_model('myapp', self.foo_bar_type).objects.get(
            id=self.foo_bar_id)


An initial draft used a __getattr__ trick, but I believe this implementation makes it very clear you are accessing the foo or bar object indirectly through the wrapper. At this point, you can either write your own getter to obtain a foo_bar_wrapper from a foo or bar object, or you can hack foobarwrapper_set to get the object.

We’re even able to safely manipulate the universe attached to FooBarWrapper:

class FooBarWrapper(models.Model):
...

def manipulatae_unverse(self):
for uni in self.universe_set.all():
uni.manipulate()


We have to be concerned about grep -r safety here, so I would love any input about universe_set.all() versus Universe.objects.filter(foo_bar_wrapper=self). My bias is towards the latter because of the aforementioned grep issues, but the former seems to be more idiomatically Django.

We’ve ended up with an ugly one-to-one database relationship, which is often a very bad call. However, by putting FooBarWrapper in the middle, we have one class which is connected to the rest of the universe. Any and all new dependencies can ForeignKey off of FooBarWrapper with impunity, and a new Foo-ish class can be easily hooked into FooBarWrapper without modifying the rest of the universe.

I’m finally writing the announcement I promised. At the beginning of June, I started at Schrodinger, Inc. as a software engineer. I work on the enterprise services team, building tools that enable chemists to extract insights from data they already have.

This post was delayed by the headaches and hassles associated with moving to New York City and my first month at Schrodinger. Unfortunately, I don’t think I will have time for the detailed tests and comparisons I’ve been posting here, but that might change when our product is more developed and we shift gears towards optimization. In the meantime, I’ll be posting some ideas on getting disparate packages and libraries to play nice with one another.

At long last, I am writing up a comparison between a Cython skip list class and a similar C++ class using Boost.Python as a wrapper. There are other tools for wrapping C++ classes, but I don’t have time at the moment to compare them. In a previous post, I suggested that you should use Cython as a first go-to whenever you need to speed up your functions. My recommendation for classes is going to be similar, but more tempered. As a first pass, directly compiling your Python code with Cython buys you some speed for almost no effort. However, if you are comfortable with C++, Cython extension types don’t save you many headaches and don’t speed your code up all that much.

I have come to restore the good name of C. In my previous post, I compared the implementation of a classification function with the C API and Cython:

$f(x) = \begin{cases} 0 & \mbox{if } x < c-b \\ \frac{x-c+b}{2b} & \mbox{if } c-b \le x \le c+b \\ 1 & \mbox{if } x > c+b \end{cases}$
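Spelled out in Python, with the same guard against a vanishingly small b that the C version uses:

```python
def classify(x, c, b):
    """Ramp classifier: 0 below c-b, 1 above c+b, linear in between."""
    if x < c - b:
        return 0
    elif x > c + b:
        return 1
    elif b > 1e-7:
        return (x - c + b) / 2 / b
    else:
        return 0.5  # degenerate ramp: b is effectively zero
```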

The C API ended up being faster, but only slightly so. After learning a little more about optimizing programs in C, my newest C API implementation is a factor of three faster than the Cython implementation. This is just going to be a quick post outlining the C optimization. The optimization was done on a machine running Ubuntu on the x86_64 architecture; for other machines, YMMV.

I did attempt some optimizations on the Cython code with unspectacular results. Applying the same optimizations I used for C to the Cython code speeds things up by less than 10%. Running the classifiers on a NumPy array with 1k floats:

%timeit cyu_class(data, 0.5, 0.1)
100000 loops, best of 3: 6.04 us per loop

%timeit cy_class(data, 0.5, 0.1)
100000 loops, best of 3: 6.46 us per loop


I also worked through Jake VanderPlas’s various Cython optimizations with no particular luck.

For reference, here is the timing result for our old C API code:

%timeit c_class(data, 0.5, 0.1)
100000 loops, best of 3: 4.21 us per loop


Let’s start by running our optimizations in straight C and then port them to the Python API. All of these programs are compiled with the -O3 optimization level with gcc 4.7.2. Our first test program looks like:

#include <stdio.h>
#include <stdlib.h>

double classify(double x, double c, double b);

int main(int argc, char* argv[]) {
    int iN, N = 10000;
    int iM, M = atoi(argv[1]);

    double c = 0.5, b = 0.1;
    double x[N];
    double out[N];

    // Make runs repeatable for timing
    srand(1234);
    for (iN = 0; iN < N; iN++)
        x[iN] = rand() / (double)RAND_MAX;

    for (iM = 0; iM < M; iM++) {
        for (iN = 0; iN < N; iN++) {
            out[iN] = classify(x[iN], c, b);
        }
    }
    return 0;
}

double classify(double x, double c, double b) {
    if (x < (c - b))
        return 0;
    else if (x > (c + b))
        return 1;
    else {
        if (b > 1E-7)
            return (x - c + b) / 2 / b;
        else
            return 0.5;
    }
}


Those two if statements inside classify() get called a whole bunch of times, so let’s look and see how well branch prediction is working and how we’re doing on instructions per cycle. Running perf on our program gives the following output:

perf stat -e instructions -e cycles -e branches -e branch-misses ./test 100000

Performance counter stats for './test 100000':

10,765,154,806 instructions              #    0.69  insns per cycle
15,637,304,307 cycles                    #    0.000 GHz
2,597,977,423 branches
472,634,207 branch-misses             #   18.19% of all branches

5.138279674 seconds time elapsed


In a word, that says our program is performing badly. Very badly. We’ll try to get that branch miss rate down in a moment, but first, let’s see if we can get our processor to vectorize some of our operations. We can unroll the loop a little bit, details here, to help our processor out. Our new inner loop looks like:

#define ROUND_DOWN(x, s) ((x) & ~((s)-1))
/* Other code */
const int stepsize = 8;
for (iM = 0; iM < M; iM++) {
    for (iN = 0; iN < ROUND_DOWN(N, stepsize); iN += stepsize) {
        out[iN] = classify(x[iN], c, b);
        out[iN+1] = classify(x[iN+1], c, b);
        out[iN+2] = classify(x[iN+2], c, b);
        out[iN+3] = classify(x[iN+3], c, b);
        out[iN+4] = classify(x[iN+4], c, b);
        out[iN+5] = classify(x[iN+5], c, b);
        out[iN+6] = classify(x[iN+6], c, b);
        out[iN+7] = classify(x[iN+7], c, b);
    }
    for (; iN < N; iN++)
        out[iN] = classify(x[iN], c, b);
}


Let’s run our new code with perf:

perf stat -e instructions -e cycles -e branches -e branch-misses ./test 100000

Performance counter stats for './test 100000':

8,555,370,759 instructions              #    0.92  insns per cycle
9,336,312,035 cycles                    #    0.000 GHz
1,722,405,624 branches
286,243,754 branch-misses             #   16.62% of all branches

3.082999950 seconds time elapsed


Somehow or another, we saved a bunch of instructions, and as expected, our instructions per cycle went up a bunch. Those add up to a factor of two speedup. Our branch prediction is still failing pretty spectacularly. At least for the gcc 4.7 compiler on my machine, branch prediction works a lot better with the ternary ? operator than with if … else … blocks. Let’s try a new version of our classification function:

double m = c - b, w = 1/2.0/b;

double o_classify(double x, double m, double w) {
    double out = (x - m) * w;
    out = (out < 0) ? 0 : out;
    out = (out > 1) ? 1 : out;
    return out;
}


I’ve pulled the check on b out of the function, but I doubt that has any performance effect since ‘always go one way’ is pretty easy for the branch prediction method to figure out. We’ve added more floating point operations for 0 or 1 classified inputs, but maybe we’ll get lucky with better branch performance. Let’s see:

perf stat -e instructions -e cycles -e branches -e branch-misses ./test 100000
Run took 1.380000 seconds.

Performance counter stats for './test 100000':

9,225,082,688 instructions              #    2.18  insns per cycle
4,227,577,833 cycles                    #    0.000 GHz
1,529,437,617 branches
22,990,447 branch-misses             #    1.50% of all branches

1.396654857 seconds time elapsed


That’s more than another factor of two, and now our branch prediction is working pretty well. The processor is vectorizing some calculations, and we’re getting more than two instructions per cycle. Nice. Our cache behavior is still crap (data not shown), but I’m going to leave that for another day. Let’s copy and paste our C code into our Python API file and see the speed:

%timeit f_class(data, 0.5, 0.1)
100000 loops, best of 3: 2.96 us per loop


Wait, what? I know the Python overhead means we might not get our full factor of four, but only a 25% reduction in runtime? [1] I feel cheated. Let’s run perf on a Python test script:

import numpy as np
import sys
import f_classify

f_class = f_classify.classify

np.random.seed(1234)
data = np.random.rand(10000)
for i in xrange(int(sys.argv[1])):
    f_class(data, 0.5, 0.1)


Running that through perf:

perf stat -e instructions -e cycles -e branches -e branch-misses ./test.py 10000

Performance counter stats for './test.py 10000':

1,827,741,558 instructions              #    0.95  insns per cycle
1,917,740,093 cycles                    #    0.000 GHz
369,266,029 branches
35,360,923 branch-misses             #    9.58% of all branches

0.651237630 seconds time elapsed


We’ve lost our nice 2+ instructions per cycle, and our branch prediction performance has regressed. Apparently, with everything else going on, the compiler wasn’t able to figure out that our o_classify function should be inlined. Let’s change the declaration and see what happens:

inline double o_classify(double x, double m, double w);

perf stat -e instructions -e cycles -e branches -e branch-misses ./test.py 10000

Performance counter stats for './test.py 10000':

1,366,976,017 instructions              #    1.87  insns per cycle
730,149,538 cycles                    #    0.000 GHz
208,851,791 branches
5,135,634 branch-misses             #    2.46% of all branches

0.251113049 seconds time elapsed


That looks a lot better. Without digging into the Python internals just yet, Python is interfering with our nice pipelining, but this is pretty good. Finally, let’s see the timing inside Python:

%timeit f_class(data, 0.5, 0.1)
100000 loops, best of 3: 2.05 us per loop


Nice. Our C code is now 3x as fast as Cython, and all is right with the world. Obviously, I did not do nearly as good of a job optimizing Cython, so I would love it if someone came along and found a better Cython implementation. As usual, code can be found at GitHub. Feel free to play around with anything, just let me know what you find out!
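As an aside, the final clamp formulation also translates directly into vectorized NumPy, which is worth benchmarking before writing any C at all. This is just a sketch of mine, not one of the implementations timed above:

```python
import numpy as np

def np_classify(data, c, b):
    # Same transformation as o_classify(): shift, scale, then clamp to [0, 1].
    m, w = c - b, 1.0 / (2.0 * b)
    return np.clip((data - m) * w, 0.0, 1.0)
```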

UPDATE: I ran things real quick on my work machine, and the differences are even more stark. Interestingly enough, the branch prediction on that (older) machine is complete crap:

perf stat -e cycles -e instructions -e branch-misses -e branches ./test 10000
Run took 0.670000 seconds.

Performance counter stats for './test 10000':

1,608,587,494 cycles                    #    0.000 GHz                     [49.88%]
922,742,676 instructions              #    0.57  insns per cycle         [75.25%]
44,245,065 branch-misses             #   28.87% of all branches         [75.26%]
153,251,625 branches                                                     [74.87%]

0.679764354 seconds time elapsed


However, with gcc 4.7.3, the aforementioned cache issues completely disappear:

perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses ./test 10000
Run took 0.670000 seconds.

Performance counter stats for './test 10000':

923,611,151 instructions              #    0.57  insns per cycle         [74.71%]
1,611,770,402 cycles                    #    0.000 GHz                     [75.16%]
1,592,489 L1-dcache-load-misses     #    0.60% of all L1-dcache hits   [49.52%]

0.680825297 seconds time elapsed


The loop unrolling and if … else … classification optimizations in Cython now have a huge effect:

%timeit cy_class(data, 0.5, 0.1)
100000 loops, best of 3: 18.8 us per loop

%timeit cyu_class(data, 0.5, 0.1)
100000 loops, best of 3: 9.15 us per loop


And the C optimizations also have a larger effect size in our Python extension:

%timeit c_class(data, 0.5, 0.1)
100000 loops, best of 3: 13.8 us per loop

%timeit f_class(data, 0.5, 0.1)
100000 loops, best of 3: 3.13 us per loop


I’m going to have to play around at home and see if a newer compiler can get rid of the L1 cache problems while retaining the nice vectorization.

[1] I also added the -O3 flag to our compilation, but that did not appear to have a significant effect.

We all know testing is important. And hard. Writing good tests that cover all likely use cases and run in a reasonable amount of time is harder still. You know what makes it even harder? When your function returns a different value every time, even with the same input. Random data structures and algorithms provide a unique challenge for writing good tests. This post is going to discuss how I went about writing unit tests for my C++ skip list implementation.

Before getting into the actual details, let’s start with how to get Boost.Test up and running. I would recommend you start here. Sometime between the Boost 1.34.1 he used and Boost 1.48, the Boost unit testing library gained another required macro, so Jason Sankey’s hello world needs to be modified by adding #define BOOST_TEST_MAIN:

#define BOOST_TEST_DYN_LINK
#define BOOST_TEST_MAIN
#define BOOST_TEST_MODULE Hello
#include <boost/test/unit_test.hpp>

int add(int i, int j)
{
    return i + j;
}

BOOST_AUTO_TEST_CASE(universeInOrder)
{
    BOOST_CHECK(add(2, 2) == 4);
}


Before going over the tests, let’s take a look at the code. First, the signature:

template <class T>
class Node {
protected:
    T value;
public:
    Node * left;
    Node * right;
    Node * up;
    Node * down;
    Node(T value) : value(value) {
        left = right = up = down = NULL;
    };
    T get_value() {return value;};
};

template <class T>
class SkipList {
    // Insert node with value to the right of node
    void list_insert(Node<T>* node, T value);
    void raise_level(Node<T>* node, T value);
    void insert_level();
    bool raise_check(void)
    { return (p_up > distribution(generator)); };
    void delete_list(Node<T>* node);
    void delete_node(Node<T>* node);
    void copy(SkipList<T>& new_, const SkipList<T>& other);
    std::mt19937 generator;
    std::uniform_int_distribution<unsigned int> distribution;
    unsigned int rand_check;

public:
    const T p_inf;
    const T n_inf;
    Node<T> * head;
    Node<T> * tail;
    int levels;
    int n_elem;
    float p_up;
    SkipList();
    SkipList(const SkipList<T>& list_to_copy);
    SkipList<T>& operator=(const SkipList<T>& rhs);
    ~SkipList();
    void set_seed();
    void insert(const T value);
    bool remove(T value);
    Node<T> * find(T value);
    void reset();
};


I’ve made the classes slightly Pythonic in the sense that almost all of the members of the two classes are public. The only things that are private are those that would probably destroy the structure of the lists if invoked by an outside caller. The code is definitely more verbose (I think I typed the line template <class T> approximately 100 times), but the only real difference in code complexity is the C++ object-oriented bookkeeping. In particular, because we have dynamic memory allocation, we need a custom destructor, copy constructor and assignment operator. Those three functions make up another ~50 lines of code. The member functions are basically equivalent to the Python methods. For example, here is the implementation of insert:

template <class T> void
SkipList<T>::insert(const T value) {
    if (value == n_inf || value == p_inf)
        return;
    rand_check = distribution(generator);
    Node<T> * node = head;
    while (value >= node->get_value()) {
        if (value < node->right->get_value() && node->down)
            node = node->down;
        else
            node = node->right;
    }
    n_elem++;
    list_insert(node->left, value);
    if (raise_check())
        raise_level(node->left, value);
}

template <class T> void
SkipList<T>::raise_level(Node<T>* node, T value) {
    Node<T> * upnode = node->left;
    while (!(upnode->up) && upnode->left)
        upnode = upnode->left;
    if (upnode->up)
        upnode = upnode->up;
    else {
        insert_level();
    }
    list_insert(upnode, value);
    upnode->right->down = node;
    node->up = upnode->right;
    if (raise_check()) raise_level(upnode->right, value);
}

template <class T> void
SkipList<T>::insert_level() {
    head->up = new Node<T>(n_inf);
    tail->up = new Node<T>(p_inf);
    tail->up->down = tail;
    tail = tail->up;
    levels++;
}


I’m using some bit-array trickery to allow me to only draw one random number on an insert, but other than that, the most significant difference in implementation is the fact that integer types don’t have a built-in infinity value. The right way to do this would be to define our own infinity template and overload the comparison operator. For now, I’m just using std::numeric_limits to get +/- infinity if we can, or just the largest/smallest representable value. That approach limits us to numeric types, but refactoring the code to be more general would not be difficult.
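For concreteness, here is the “roll your own infinity” idea in Python terms; this is a sketch of mine, not code from the repository, and the C++ analogue would overload operator< on a wrapper template the same way:

```python
import functools

@functools.total_ordering
class PosInf:
    """Sentinel value that compares greater than anything except itself."""
    def __eq__(self, other):
        return isinstance(other, PosInf)
    def __gt__(self, other):
        return not isinstance(other, PosInf)
```

A mirror-image NegInf gives the left sentinel, and neither cares whether the payload type is numeric.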

Since we are good developers committed to test-driven development, we are going to write up some good, consistent unit tests. We start writing unit tests for the private member functions before moving on to higher-level tests, and we suddenly realize: we don’t know what our skip list is going to look like. In fact, the structure of our skip lists will likely change from run to run. When I tested the Python implementation, I simply set the random number generator seed to a fixed number and worked out the skip list structure by hand. I have absolutely no desire to do that again. Also, that approach would mean that changing the (pseudo-)random number generator would break our tests, and we write unit tests in part so that we can confirm we didn’t break anything when we tinker. We can start with some obvious invariants. First, let’s set up our test fixture so that we don’t have to recreate our skip lists over and over:

struct RandomSList {
    SkipList<int> s_list;
    RandomSList() {
        default_random_engine generator;
        uniform_int_distribution<int> rand_int(0, MAX_INSERT);

        for (int i = 0; i < MAX_INSERT/10; i++)
            s_list.insert(rand_int(generator));
    }
};

BOOST_FIXTURE_TEST_SUITE(test_list_function, RandomSList)

/*
...
Our tests here
...
*/

BOOST_AUTO_TEST_SUITE_END()


We can break tests of random data structures into two categories: deterministic invariants and probabilistic tests. Unfortunately, a judgment call needs to be made about how frequently you want your tests to fail on a correct implementation. Starting out, we know every element of every list had better be less than or equal to its right neighbor, so we can start by testing that. A whole bunch of BOOST_CHECK_LE(node->get_value(), node->right->get_value()); later, we know that our skip list is properly ordered:

BOOST_AUTO_TEST_CASE(test_ordering)
{
    s_list.insert(1003);
    s_list.insert(1003); // Ensure a duplicate insert was attempted

    Node<int> * row, * node;
    row = node = s_list.head;

    while (node->down) {
        node = row;
        while (node->right) {
            BOOST_CHECK_LE(node->get_value(), node->right->get_value());
            if (node->up)
                BOOST_CHECK_EQUAL(node->get_value(), node->up->get_value());
            if (node->down)
                BOOST_CHECK_EQUAL(node->get_value(), node->down->get_value());
            node = node->right;
        }
        row = row->down;
    }
}


We should also check that every connection is reciprocal, and versions of BOOST_CHECK_EQUAL(node, node->right->left); will do the trick. We can add more tests to ensure that insert, delete, find and our various constructors are behaving as expected. However, we still don’t know if levels are being created properly.
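In Python terms, the reciprocity walk for one row looks like the following; this is a sketch with a minimal node class of my own, since the real checks are BOOST_CHECK_EQUAL calls inside the C++ traversal:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = self.right = None

def link(a, b):
    """Doubly link a -> b."""
    a.right, b.left = b, a

def is_reciprocal(head):
    """True if every rightward link is mirrored by the neighbor's left link."""
    node = head
    while node.right is not None:
        if node.right.left is not node:
            return False
        node = node.right
    return True
```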

In order to check the skip list’s levels, we really need to run a statistical test. If ten inserts in a row lead to elements in the second level of our skip list, we can never know if our implementation is broken or if we are simply one in a thousand unlucky. Thankfully, we can just run our test again and find out. The one statistical test I wrote tests for the rate at which nodes are promoted up a level. If this were deployed with large data sets, you would also want to test for correlations, but we’re not going that far. Our promotion decisions should be independent binomial trials, so we can use a normal approximation for the number of elements promoted to the next level given a number of total nodes. The test I wrote throws a warning if the number of promoted nodes is off by three standard deviations:

BOOST_AUTO_TEST_CASE(test_new_levels)
{
    Node<int> * row, * node;
    row = s_list.head;
    while (row->down)
        row = row->down;
    while (1) {
        node = row;
        // Node count starts at -1 to avoid counting sentinels
        int nodes = -1, upnodes = 0;
        while (node->right) {
            if (node->up)
                upnodes++;
            nodes++;
            node = node->right;
        }
        if (nodes < 10) break;
        BOOST_WARN_MESSAGE(abs((float) s_list.p_up*nodes - upnodes) <
                           3*std_dev(nodes),
                           "Level increase rate > 3*sigma away from expected.\n"
                           << "Total nodes = " << nodes << "; Nodes pointing up = "
                           << upnodes << "\n");
        if (row->up)
            row = row->up;
        else
            break;
    }
}


For the bottom list with 1000 elements, that test is going to fail >50% of the time if the probability of being promoted is >0.55 or <0.45. That is not incredibly sensitive, but it is always going to catch catastrophic failures.
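That sensitivity figure falls out of the normal approximation to a binomial. Here is a quick Python sketch of the bound; the function name and the inlined std_dev are mine, not the test file’s:

```python
import math

def promotions_within_bounds(nodes, upnodes, p_up, n_sigma=3):
    """Normal approximation to Binomial(nodes, p_up): accept the level when
    the observed promotion count is within n_sigma standard deviations of
    the expected count."""
    std_dev = math.sqrt(nodes * p_up * (1 - p_up))
    return abs(p_up * nodes - upnodes) < n_sigma * std_dev
```

With 1000 nodes and p_up = 0.5, the 3-sigma window is about ±47 promotions, which is where the 0.45-0.55 range above comes from.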

As usual, all the code can be found at GitHub. Before you spend too much time twisting your brain around test_skip.cpp, I will forewarn you that file needs a lot of work. I’ve only been able to get Boost.Test to report the test and line number where the test failed. Because of that, the test program includes a ton of repeated code. There is also no testing of private members yet. In short, I’m relatively confident the skip list implementation is safe, but the tests aren’t strong enough to ensure that future versions are that way.

I ran some quick tests after wrapping the C++ code with Boost.Python, and I saw a 10-20x speedup in inserts/deletes and around a factor of 10 in memory savings. I was very impressed with Boost.Python, but it is not very portable, since a user needs Boost.Python installed to compile your extension. In my next post, I will also wrap the C++ code in Cython and compare both to straight Cython and pure Python implementations. As a warning, the code presented here requires C++11, but I think the only real dependency is the random number generator. I’ve also done all the memory handling manually to get things up and running quickly; don’t worry, the code has been properly valgrind’ed and comes up clean. There are some significant object safety questions in the implementation (you can traverse the list yourself since the head and tail are exposed), but as they say, we’re all adults here. In a future post, I may try Boost and/or C++11 smart pointers and compare performance. It just seemed simpler to me to write the memory handling myself for a first pass.

EDIT: Adding the -O3 flag to g++ brings the speed up to about a factor of 100. More details soon.

We’re going to continue playing around with our simple classification function. For those of you arriving late to the party, the function in question is:

$f(x) = \begin{cases} 0 & \mbox{if } x < c-b \\ \frac{x-c+b}{2b} & \mbox{if } c-b \le x \le c+b \\ 1 & \mbox{if } x > c+b \end{cases}$

We’re trying to apply that function to arrays of data as quickly as possible. In my last post, I focused on using Cython and C extensions to speed up the calculations. This post focuses on an algorithmic optimization. I’m glad to report that it appears not going down this road at my job was a good call.

%timeit [p_classify(x, c, b) for x in DATA]
100 loops, best of 3: 1.59 ms per loop

%timeit s_classify(s_list, c, b)
100 loops, best of 3: 1.01 ms per loop


Close to 40% speedup? Not bad for… more than a hundred lines of code, a few hours of work and a significant increase in code complexity. Like I said, vindicated. With p_classify, we are simply looping over all the data and applying our function, and s_classify is a function optimized to work with our skip list data structure. However, we’re not giving up after this result. More details at the end of the post, but you could soon be seeing a C++ compiled skip list class in Python. Yeah, I’m excited, too.

First, let’s run a quick line profile of our classify function to make sure that reducing the number of comparisons will do us some good. To bias the results in our favor, we’ll take some randomized data distributed so that observations are classified as 0 or 1 80% of the time. The profile looks like:

Line #      Hits  Time  Per Hit   % Time  Line Contents
==============================================================
2                                    @profile
3                                    def classify(x, c, b):
8      1000  4287      4.3     40.7    if x<c-b:
9       405   850      2.1      8.1      return 0
10       595  2233      3.8     21.2    elif x>c+b:
11       406   856      2.1      8.1      return 1
12                                      else:
13       189   422      2.2      4.0      if b>10**-7:
14       189  1877      9.9     17.8        return (x-c+b)/2/b
15                                        else:
16                                          return 0.5


The comparisons end up being 60% of the total run time, so we’ll be able to get a real return if we play our optimization right. We’ll also be able to cut down the number of times this function has to return a value, since we will have big blocks of ones and zeros. Somewhat surprisingly, the floating point operations don’t take a disproportionate amount of time: the floating point section is hit about 19% of the time, and it takes up about 22% of the actual time used. When we’re done, 90% of the time is going to be spent on the floating point section.

For more detail about skip lists, see the previous post and the links therein. We’re going to jump straight into our algorithm. Due to time constraints, I’m going to keep the discussion of the algorithm at a pretty high level. The first thing we need to do is to augment skip lists to allow us to calculate the number of elements less than or equal to a given value. Basically, what we are going to do is store the number of elements we skip over at each jump and add those up on our way down. Here is some quality ASCII art to illustrate my point:

x --------- 3 ----- 2 - x
|           |       |   |
x - 1 ----- 2 - 1 - 1 - x
|   |       |   |   |   |
x - 1 - 1 - 1 - 1 - 1 - x


At each node, we count how many nodes over we are after each jump to the right. That number is stored in the variable accum at each node, and with it we can find the rank of a number in our skip list:

def rank(self, value):
    node = self.head  # start from the top-left sentinel
    rank = 0
    while value >= node.value:
        if value < node.right.value:
            if node.down:
                node = node.down
            else:
                break
        else:
            node = node.right
            rank += node.accum
    return rank


All we are doing is moving along each list until we encounter a node containing a value greater than our argument and moving down. Once we can’t move down any more, we return our count of elements less than our argument. The hard part is maintaining the variable accum, but that doesn’t turn out to be all that hard:

def insert(self, value):
    self.n_items += 1
    node = self.head  # start from the top-left sentinel
    while node.down:
        if value < node.right.value:
            node.right.accum += 1
            node = node.down
        else:
            node = node.right
    while value > node.right.value:
        node = node.right
    node(value)  # calling a node inserts a new node to its right
    node.right.accum = 1
    self._raise_level(node.right)

def _raise_level(self, node, accum=None):
    if accum is None:
        accum = 1
    if self._raise_check():
        up_check = node.left
        while up_check.up is None and up_check.value > float("-inf"):
            accum += up_check.accum
            up_check = up_check.left
        if up_check.up:
            place_node = up_check.up
            place_node(node.value)
            place_node.right.accum = accum
            place_node.right.right.accum -= accum
            node.up = place_node.right
            place_node.right.down = node
            self._raise_level(place_node.right, accum=accum)
        else:
            self._init_level()


If you want to go through the code line by line, I recommend heading over to GitHub. At a high level, we find where to insert the node exactly as before and then perform some indexing to make sure all the accum values continue to add up. Delete does basically the same thing in reverse. Test code is also included at GitHub to confirm everything is working properly. Finally, the method find_rank(m) returns a reference to the node with rank m.

That is all we need to implement our skip list classifier function:

def classify_list(rslist, c, b):
    N = rslist.count()
    zeros = rslist.rank(c-b)
    ones = N - rslist.rank(c+b)

    start = rslist.find_rank(zeros+1) if zeros > 0 else None
    end = rslist.find_rank(N-ones+1) if ones > 0 else None

    output = [_classify(val, c, b) for val in
              rslist.iterate(start=start, stop=end)]
    return [0]*zeros + output + [1]*ones


Because of the skip list structure, the two rank() and two find_rank() calls are asymptotically as efficient as binary search, which is O(log N). The iterate() method returns a generator which loops over the base linked list in the usual way. Now, let’s look at how this performs line by line:

Line #      Hits     Time  Per Hit   % Time  Line Contents
==============================================================
12                                       #@profile
13                                       def classify_list(rslist, c, b):
14      1000     1694      1.7      0.1    N = rslist.count()
15      1000    79401     79.4      4.2    zeros = rslist.rank(c-b)
16      1000    81666     81.7      4.3    ones = N-rslist.rank(c+b)
17
18      1000    23593     23.6      1.3    start = rslist.find_rank(zeros+1) if zeros > 0 else None
19      1000    24351     24.4      1.3    end = rslist.find_rank(N-ones+1) if ones > 0 else None
20
21      1000      782      0.8      0.0    output = [_classify(val, c, b) for val in
22    181000  1654586      9.1     88.1    rslist.iterate(start=start, stop=end)]
23      1000    12238     12.2      0.7    return [0]*zeros + output + [1]*ones


We are now spending almost all of our time dealing with the middle section of the classifier. The generator we return is significantly slower than iterating over a list, but we have greatly reduced the amount of time spent classifying the 0s and 1s. Recalling our comparison with the Pythonic classifier:

%timeit [p_classify(x, c, b) for x in DATA]
100 loops, best of 3: 1.59 ms per loop

%timeit s_classify(s_list, c, b)
100 loops, best of 3: 1.01 ms per loop


The 40% reduction in run time is not much compared to the speedup we saw from converting to a compiled version, but it is nothing to scoff at. It would also make sense as an optimization to combine with tools like Cython.
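Before moving on, the generator overhead that dominated the profile above is easy to see in isolation. This is a toy measurement, not the author's benchmark; absolute numbers vary by machine, but the ordering should not:

```python
import timeit

data = list(range(100000))

def follow(seq):
    # A trivial generator standing in for rslist.iterate():
    # each item costs an extra generator-frame resume compared
    # to iterating over the list directly.
    for x in seq:
        yield x

list_time = timeit.timeit(lambda: sum(data), number=10)
gen_time = timeit.timeit(lambda: sum(follow(data)), number=10)
print(gen_time > list_time)  # generator iteration carries per-item overhead
```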

Now, for the teaser: What we really want is to compare our new algorithm with a Cython and a C++ implementation. The Cython implementation should be pretty straightforward, so look for that to come out first. However, I’ve been wanting to create a compiled class for quite a while. Why haven’t I? This. That’s a pretty steep learning curve for something I want to do for the hell of it. I recently learned about Boost.Python and how it can expose C++ classes to Python, and I’m pretty excited to try it out.

Along with learning more Python, I’m also attempting to teach myself some algorithms and data structures. As I mentioned in my previous post, an algorithmic optimization using a sorted data structure would save comparisons for our classification algorithm. This post goes over skip lists, a very simple sorted data structure, and gives me an excuse to talk about Python’s data model. In the next post, we’ll see how the skip list performs with Python versus Cython. We may even run a comparison using PyPy, an optimized Python implementation.

When talking about Python’s data model, I’m going to use C++ memory handling as a reference in part because I intend to also code up the classification method in C++. With any luck, I’ll post a comparison between the C++ and Python code in terms of both speed and code complexity.

Let’s get started talking about Python’s data model. This discussion is going to get a little pedantic, but I think it is valuable for understanding both Python in general and the code presented here. Python never binds memory to variables in the way that lower-level languages do. In particular, there is no way to do this:

int x = 5;


Every variable in Python is simply a reference to memory that is controlled by the Python interpreter itself. When we write x = 5, we are writing something closer to:

int Py_5 = 5;
const int& x = Py_5;


Python controls Py_5, and we simply reference the memory with our variable x. x is a constant reference because integers are immutable, and we can’t change x without creating a new reference. For mutable objects, we have non-constant references through which we can manipulate the memory owned by Python.
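The distinction is easy to demonstrate (variable names here are arbitrary):

```python
# Two names bound to the same mutable object: mutating through one
# is visible through the other, because both reference the same memory.
a = [1, 2, 3]
b = a
b.append(4)
print(a)  # [1, 2, 3, 4]

# With an immutable object, "changing" it really rebinds the name
# to a brand-new object; the original is untouched.
s = "abc"
t = s
t = t + "d"
print(s)  # abc
print(t)  # abcd
```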

As a reminder, everything in Python is an object, so x is really an integer object. Also, unlike in C++, we can redefine a reference as many times as we please:

x = 5
x = 'apple'
def x(y):
    return x*y


The above is kind of silly, but it is perfectly valid Python code.

Python handles memory by keeping a count of all the references to an object and freeing the memory whenever that refcount reaches zero. We can use sys.getrefcount to check how many references an object has:

>>> import sys
>>>
>>> x = 5
>>> sys.getrefcount(x)
129
>>> x = [5]
>>> sys.getrefcount(x)
2
>>> y = x
>>> sys.getrefcount(x)
3
>>> y = 3
>>> sys.getrefcount(x)
2


sys.getrefcount adds a reference to the object it is called on, so you should subtract one from the output of that function. As you can see, Python has the number 5 stored in its memory, and a lot of things point to 5 outside of x. However, when x is a list containing 5, x is the only reference to that list until we assign y to x.

All of this discussion is important for two reasons: One, we need to make sure we don’t leave any memory allocated when our skip list is deleted, and two, we need a workaround for the fact that Python doesn’t implement pointers. Reference counting can run into problems when you have circular references. There is no way to delete all of the references in a cycle simultaneously. We can see this in action with two classes:

>>> class Mine(object):
...     pass
...
>>> x = Mine()
>>> y = Mine()
>>> x.y = y
>>> y.x = x
>>> sys.getrefcount(x)
3
>>> test = x
>>> del x
>>> del y
>>> sys.getrefcount(test)
3


That last call to sys.getrefcount should be pretty surprising to you. y should be gone. How can it still be adding a reference to our variable test? The reference cycle between x and y has confused Python’s reference counter. Fortunately, CPython includes a cyclic reference detector that extends the reference-counting garbage collector. Other implementations of Python handle garbage collection differently, so we are going to write a __del__() method to make sure everything gets cleaned up properly.
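We can watch CPython's cyclic collector reclaim the pair using the standard library's gc module (the exact count returned varies by interpreter version, but it includes our two instances):

```python
import gc

class Mine(object):
    pass

gc.collect()          # start from a clean slate
x = Mine()
y = Mine()
x.y = y
y.x = x
del x
del y
# Both instances are now unreachable, but each keeps the other's
# refcount above zero, so only the cycle detector can free them.
found = gc.collect()  # returns the number of unreachable objects found
print(found >= 2)     # True: at least our two cyclic instances were collected
```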

Notice that, in the above code, we were able to use the attributes .x and .y similar to how we could use pointers in C++. In fact, we will use the above type of references to implement our nodes in the skip list.

And so ends the pedantry.

At long last, it is time for some code. Skip lists are a probabilistic data structure that function similarly to a balanced search tree. You can think of a skip list as a linked list of linked lists. At the very bottom level is a sorted linked list with all the elements. Each level up has fewer and fewer nodes, with the expected number of nodes decreasing by a factor of two at each level. The standard image (care of Wikipedia) gives you an idea of how the structure looks:

Searches, inserts and deletes can all be done in O(log n) time with high probability, so the asymptotic scaling is similar to more common balanced search trees. If you want to know more, the Wikipedia article on skip lists is a good place to start.

My implementation is slightly different. I attempted to avoid Python lists at all costs for this project. Is there a good reason for that? Probably not. I honestly have no idea what the speed difference is between dereferencing Python lists and following some additional references to run through, but I just wanted the challenge. In the end, my method was to base the skip list implementation on doubly linked lists. Line profiling would allow us to compare the two implementations.

First, the node class is pretty straightforward:

class SLNode(object):
    def __init__(self, value):
        self._value = value
        self.left = None
        self.right = None
        self.up = None
        self.down = None

    value = property(lambda self: self._value)

    def __call__(self, value):
        """
        Use __call__ to insert a node into a skip list.

        UNSAFE OUTSIDE OF SKIP LIST!
        """
        self.right.left = self.__class__(value)
        self.right.left.right = self.right
        self.right = self.right.left
        self.right.left = self

    def __str__(self):
        return "A skip list node with value {}".format(self.value)


For our __init__, we simply assign a value to our node and set all of our directional references to None. The property function allows us to define the attribute value as a read-only accessor of _value. Our __call__ is how we insert a new node into our doubly linked list. I will admit that putting a method only relevant for skip lists into a class which could be used for other data structures is poor object-oriented form, but we’re just going to roll with it for today. The behavior of the skip list is solidly integrated into the behavior of the nodes, so I don’t think we can separate them too easily.
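To see __call__ in action on its own, we can wire up two sentinel nodes by hand. This snippet restates the class compactly (same insertion logic as above) so that it runs standalone:

```python
class SLNode(object):
    def __init__(self, value):
        self._value = value
        self.left = self.right = self.up = self.down = None

    value = property(lambda self: self._value)

    def __call__(self, value):
        # Splice a new node in between self and self.right
        # (unsafe if self.right is None).
        self.right.left = self.__class__(value)
        self.right.left.right = self.right
        self.right = self.right.left
        self.right.left = self

head = SLNode(float("-inf"))
tail = SLNode(float("inf"))
head.right = tail
tail.left = head

head(5)  # insert a node carrying 5 between head and tail
print(head.right.value)          # 5
print(head.right.right is tail)  # True
print(tail.left is head.right)   # True
```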

Now, the skip list methods are significantly more complicated, so I’m going to go through them piece by piece. Just remember that any function taking self as its first argument is a method of our skip list class. The initializer looks a lot like the one for our node:

class SkipList(object):
    def __init__(self):
        self.head = SLNode(float("-inf"))
        self._tail = SLNode(float("inf"))
        self.head.right = self._tail


self.head is the node through which we access the list, in the upper left-hand corner of the diagram above. self._tail is a sentinel used to avoid having to repeatedly check for None. I’m also using the fact that Python has a floating point value for infinity, so comparisons against both end points come out exactly as we expect. For safety, we really should have a data type attribute so that we can ensure future inserts are comparable with the current elements, but we don’t need values other than numbers for the proposed algorithm.
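A quick check that the infinite sentinels compare the way we need them to:

```python
inf = float("inf")

# Any finite value sorts strictly between the two sentinels, so
# insert() and find() never need a None check at either end of a level.
print(-inf < -1e300 < 1e300 < inf)  # True
print(5 > -inf and 5 < inf)         # True
```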

The first method we need is insert():

    def insert(self, value):
        node = self.head
        while node.down:
            if value < node.right.value:
                node = node.down
            else:
                node = node.right
        while value > node.right.value:
            node = node.right
        node(value)
        self._raise_level(node.right)

    def _raise_level(self, node):
        if self._raise_check():
            up_check = node.left
            while up_check.up is None and up_check.value > float("-inf"):
                up_check = up_check.left
            if up_check.up is not None:
                place_node = up_check.up
                place_node(node.value)
                node.up = place_node.right
                place_node.right.down = node
                self._raise_level(place_node.right)
            else:
                self._init_level()

    def _init_level(self):
        # Add a new, empty level: a fresh head above the old one,
        # linked directly to the shared tail sentinel.
        new_head = SLNode(float("-inf"))
        new_head.down = self.head
        self.head.up = new_head
        new_head.right = self._tail
        self.head = new_head

    def _raise_check(self):
        # Flip a fair coin (assumes numpy was imported as np).
        return np.random.rand() > 0.5


In insert, we first travel along the higher lists, moving down when we have to, and then find the place to insert the new node. Once the node is in place, we place another node with the same value in the list above with probability one half. We use the doubly linked list to traverse back to the previous element that has a copy in the level above our current one. Finally, if we need to add another level to our skip list, we move self.head up one level and link it with self._tail.

The other possible implementation is to keep a stack of references to the nodes where we moved down. Each time we raise the level of a node, we would pop a reference from the stack and use it to insert the node at the higher level. However, the implementation I used should be much more efficient. With a stack, you make an average of O(log n) pushes every time you insert a new element. With my method, you only make an average of three moves on the lattice… when you raise the level. And at each insert, you raise the level an average of one time. That is, you do an expected O(1) work finding the nodes used to raise the level. The trade-off is that we double the number of references we need to store.

It is important to note that, since there is only one tail sentinel, self._tail.left is completely dependent upon previous operations on the list. Don’t trust it to actually mean anything.

Before we can delete an element, we should probably be able to find it:

    def find(self, value):
        node = self.head
        while node.down:
            if value < node.right.value:
                node = node.down
            else:
                node = node.right
        while value >= node.value:
            if value == node.value:
                return node
            else:
                node = node.right
        return None


The find() algorithm is basically the same thing as the insert() algorithm with equality testing added. We finally delete our node:

    def delete(self, value):
        node = self.find(value)
        if node is None:
            raise ValueError('{} not in skip list'.format(value))
        else:
            while node.up:
                self._delete(node)
                upnode = node.up
                del node.up
                node = upnode
                del node.down
                del upnode
            self._delete(node)

    def _delete(self, node):
        node.right.left = node.left
        node.left.right = node.right
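One detail worth pausing on: del node.up does not set the attribute to None, it removes the instance attribute altogether, which is why the cleanup code later guards every access with hasattr(). A minimal illustration, using a hypothetical Node class:

```python
class Node(object):
    def __init__(self):
        self.up = None    # the attribute exists, with value None

n = Node()
print(hasattr(n, 'up'))   # True: present even though it is None
del n.up                  # removes the instance attribute entirely
print(hasattr(n, 'up'))   # False: n.up would now raise AttributeError
```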


This is where the pedantry comes in handy. We want to make sure we get rid of all of the references to our old node as we work our way up through the levels. To do that, we hold a temporary reference (upnode) before deleting the lower node’s reference to the node above it. Again, in CPython, this is all a little excessive: the CPython garbage collector will detect the cycles and free the memory as appropriate. However, the machinations behind deleting those references are good practice for working with Python references. Finally, let’s add a reset() and clean up our list with a __del__:

    def reset(self):
        del_queue = [self.head]
        while del_queue:
            node = del_queue.pop(0)
            if hasattr(node, 'up'):
                self._que_append(node.up, del_queue)
                del node.up
            if hasattr(node, 'down'):
                self._que_append(node.down, del_queue)
                del node.down
            if hasattr(node, 'right'):
                self._que_append(node.right, del_queue)
                del node.right
            if hasattr(node, 'left'):
                self._que_append(node.left, del_queue)
                del node.left
        self.__init__()

    def _que_append(self, node, queue):
        # Only queue real nodes; the directional references default to None.
        if node is not None:
            queue.append(node)

    def __del__(self):
        self.reset()