Skip to content

Skip Lists and the Python Data Model

March 18, 2013

Along with learning more Python, I’m also attempting to teach myself some algorithms and data structures. As I mentioned in my previous post, an algorithmic optimization using a sorted data structure would save comparisons for our classification algorithm. This post goes over skip lists, a very simple sorted data structure, and gives me an excuse to talk about Python’s data model. In the next post, we’ll see how the skip list performs with Python versus Cython. We may even run a comparison using PyPy, an optimized Python implementation.

When talking about Python’s data model, I’m going to use C++ memory handling as a reference in part because I intend to also code up the classification method in C++. With any luck, I’ll post a comparison between the C++ and Python code in terms of both speed and code complexity.

Let’s get started talking about Python’s data model. This discussion is going to get a little pedantic, but I think it is valuable to understanding both Python in general and the code presented here. Just in case, here is a link to skip past the pedantry: skip. Python never binds memory to variables in the way that lower-level languages do. In particular, there is no way to do this:

int x = 5;

Every variable in Python is simply a reference to memory that is controlled by the Python interpreter itself. When we write x = 5, we are writing something closer to:

int Py_5 = 5;
const int& x = Py_5;

Python controls Py_5, and we simply reference the memory with our variable x. x is a constant reference because integers are immutable, and we can’t change x without creating a new reference. For mutable objects, we have non-constant references where can manipulate the memory owned by Python.

As a reminder, everything in Python is an object, so x is really an integer object. Also, unlike in C++, we can redefine a reference as many times as we please:

x=5
x='apple'
def x(y):
  return x*y

The above is kind of silly, but it is perfectly valid Python code.

Python handles memory by keeping a count of all the references to an object and freeing the memory whenever that refcount reaches zero. We can use sys.getrefcount to check how many references an object has:

>>>import sys
>>>
>>>x = 5
>>>sys.getrefcount(x)
129
>>>x = [5]
>>>sys.getrefcount(x)
2
>>>y = x
>>>sys.getrefcount(x)
3
>>>y = 3
>>>sys.getrefcount(x)
2

sys.getrefcount adds a reference to the object it is called on, so you should subtract one from the output of that function. As you can see, Python has the number 5 stored in its memory, and a lot of things point to 5 outside of x. However, when x is a list containing 5, x is the only reference to that list until we assign y to x.

All of this discussion is important for two reasons: One, we need to make sure we don’t leave any memory allocated when our skip list is deleted, and two, we need a workaround for the fact that Python doesn’t implement pointers. Reference counting can run into problems when you have circular references. There is no way to delete all of the references in a cycle simultaneously. We can see this in action with two classes:

>>> class Mine(object):
>>>   pass
>>> x = Mine()
>>> y = Mine()
>>> x.y = y
>>> y.x = x
>>> sys.getrefcount(x)
3
>>> test = x
>>> del x
>>> del y
>>> sys.getrefcount(test)
3

That last call to sys.getrefcount should be pretty surprising to you. y should be gone. How can it still be adding a reference to our variable test? The reference cycle between x and y has confused Python’s reference counter. Fortunately, cPython includes a cyclic reference detector that extends the reference counting garbage collector. Other implementations of Python handle garbage collection differently, so we are going to write a __del__() method to make sure everything gets cleaned up properly.

Notice that, in the above code, we were able to use the attributes .x and .y similar to how we could use pointers in C++. In fact, we will use the above type of references to implement our nodes in the skip list.

And so ends the pedantry.

At long last, it is time for some code. Skip lists are a probabilistic data structure that function similarly to a balanced search tree. You can think of a skip list as a linked list of linked lists. At the very bottom level is a sorted linked list with all the elements. Each level up has fewer and fewer nodes, with the expected number of nodes decreasing by a factor of two at each level. The standard image (care of Wikipedia) gives you an idea of how the structure looks:

500px-Skip_list

Searches, inserts and deletes can all be done in O(log n) with high probability, so the asymptotic scaling is similar to more common balanced search trees. If you want to know more, go there:

Wikipedia
MIT Lecture

My implementation is slightly different. I attempted to avoid Python lists at all costs for this project. Is there a good reason for that? Probably not. I honestly have no idea what the speed difference between dereferencing Python lists vs. having some additional references to run though, but I just wanted the challenge. In the end, my method was to base a skip list implementation on doubly linked lists. Line profiling would allow us to compare two implementations.

First, the node class is pretty straightforward:

class SLNode(object):
  def __init__(self, value):
    self._value = value
    self.left = None
    self.right = None
    self.up = None
    self.down = None
    
  value = property(lambda self: self._value) 
  def __call__(self, value):
    """
    Use __call__ to insert a node into a skip list.

    UNSAFE OUTSIDE OF SKIP LIST!
    """
    self.right.left = self.__class__(value)
    self.right.left.right = self.right
    self.right = self.right.left
    self.right.left = self
    
  def __str__(self):
    return "A skip list node with value {}".format(self.value)

For our __init__, we simply assign a value to our node and put all of our directional references to None. The function property allows us to define the attribute value as a read-only accessor of _value. Our __call__ is how we insert a new node into our doubly linked list. I will admit that putting a method only relevant for skip lists into a class which could be used for other data structures is poor object oriented form, but we’re just going to roll with it for today. The behavior of the skip list is solidly integrated into the behavior of the nodes, so I don’t think we can separate them too easily.

Now, the skip list methods are significantly more complicated, so I’m going to go through them piece by piece. Indentation is going to get destroyed across code blocks, so just remember that any function starting with self is a method for our skip list class. The initializer looks a lot like the one for our node:

class SkipList(object):
  def __init__(self):
    self.head = SLNode(float("-inf"))
    self._tail = SLNode(float("inf"))
    self.head.right = self._tail
    self._tail.left = self.head

self.head is the node through which we access the list, in the upper right-hand corner of the diagram above. self._tail is a sentinel used to avoid having to repeatedly check for None. I’m also using the fact that Python has a floating point number equivalent to infinity. We can now do comparisons with both end points and have them come out exactly as we expect. For safety, we really should have a data type attribute so that we can ensure future inserts are comparable with the current elements, but we don’t need values outside of numbers for the proposed algorithm.

The first method we need is insert():

  def insert(self, value):
    node = self.head
    while node.down:
      if value < node.right.value:
        node = node.down
      else:
        node = node.right
    while value > node.right.value:
      node = node.right
    node(value)
    self._raise_level(node.right)

  def _raise_level(self, node):
    if self._raise_check():
      up_check = node.left
      while up_check.up is None and up_check.value>float("-inf"):
        up_check = up_check.left
      if up_check.up is not None:
        place_node = up_check.up
        place_node(node.value)
        node.up = place_node.right
        place_node.right.down = node
        self._raise_level(place_node.right)
      else:
        self._init_level()
        self.head(node.value)
        node.up = self.head.right
        self.head.right.down = node
        self._raise_level(self.head.right)

  def _init_level(self):
    self.head.up = SLNode(-np.inf)
    self.head.up.down = self.head
    self.head = self.head.up
    self.head.right = self._tail
    self._tail.left = self.head

  def _raise_check(self):
    return (np.random.rand()>0.5)

In insert, we first go along the higher lists moving down when we have to, and then find the place to insert the new node. Once the node is in place, we place another node with the same value in the above list with probability one half. We use the doubly linked list to traverse back to find the previous element where we want to insert a node in the list above our current list. Finally, if we need to add another level to our skip list, we move self.head up one level and link it with self._tail. The other possible implementation for this is to create a stack of references to the nodes where we moved down. Each time we raise the level of a node, we would pop a reference from our stack and use it to insert the node at a higher level. However, the implementation I used should be much more efficient. If you use a stack, you will have an average of O(log n) inserts to the stack every time you insert a new element. For my method, you only have an average of three moves on the lattice… when you raise the level. And at each insert, you raise the level an average of one time. That is, you have an expected O(1) work for finding nodes to raise the level. The trade off is that we double the number of references we need to store.

It is important to note that since there is only one sentinel self._tail.left is completely dependent upon previous operations on the list. Don’t trust it to actually mean anything.

Before we can delete an element, we should probably be able to find it:

  def find(self, value):
    node = self.head
    while node.down:
      if value < node.right.value:
        node = node.down
      else:
        node = node.right
    while value >= node.value:
      if value==node.value:
        return node
      else:
        node = node.right
    return None

The find() algorithm is basically the same thing as the insert() algorithm with equality testing added. We finally delete our node:

  def delete(self, value):
    node = self.find(value)
    if node is None:
      raise ValueError('{} not in skip list'.format(value))
    else:
      while node.up:
        self._delete(node)
        upnode = node.up
        del node.up
        node = upnode
        del node.down
        del upnode
      self._delete(node)

  def _delete(self, node):
    node.right.left = node.left
    node.left.right = node.right

This is where the pedantry comes in handy. We want to make sure we get rid of all of the references to our old node as we work our way up through the levels. To do that, we create a dummy node before deleting the reference to the node above our node. Again, in CPython, this is all a little excessive. The CPython garbage collector will detect the cycles and free the memory as appropriate. However, the machinations behind deleting those references are good practice for working with Python references. Finally, let’s add a reset() and clean up our list with a __del__:

  def reset(self):
    del_queue = [self.head]
    while del_queue:
      node = del_queue.pop(0)
      if hasattr(node, 'up'):
        self._que_append(node.up, del_queue)
        del node.up
      if hasattr(node, 'down'):
        self._que_append(node.down, del_queue)
        del node.down
      if hasattr(node, 'right'):
        self._que_append(node.right, del_queue)
        del node.right
      if hasattr(node, 'left'):
        self._que_append(node.left, del_queue)
        del node.left
    self.__init__()

  def __del__(self):
    self.reset()
    del self.head.right
    del self.head
    del self._tail.left
    del self._tail

We perform a breadth first search of our linked lists deleting all the references we can get our hands on. The hasattr checks are done to prevent us trying to delete a reference that we already deleted. Those attempts could also be put inside a try: … except: … block.

Alright, after much work and consternation, we have a working skip list data structure. We’re going to need to modify it to get ranks in a manner similar to binary search trees. I haven’t benchmarked any of this code, so we’ll also have to do that before throwing PyPy and Cython at the problem. As usual, all of the code (along with a test suite, which is not comprehensive) is at my GitHub.

Advertisements
One Comment
  1. Good stuff. Looking forward to the next installment.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: