# Bit fields and Sets in Python

I finally learned how to use the bitwise & operator (in the context of C, but still), so I thought I would talk about how you can use some low-level, C-Style bit field action. Python has one built-in advantage over C for this: the long integer type. Because you can represent (almost) arbitrarily large integers in Python, you can create really large and useful bit fields/bit arrays without having to chain integers together.

I’m going to use bit fields for fast (O(1)) inclusion testing and (not O(1), but still fast) pattern detection applied to things like genomics data. Now, there is a canonical way to do inclusion testing in Python: sets. As a quick example, let’s set up two objects:

our_list = ['red', 'green', 'blue', 'green', 'purple'] our_set = set(our_list)

`our_set`

is now a set of four elements and prints out: `set(['red', 'blue', 'green', 'purple'])`

. Your mileage may vary for the ordering you get. The set object type is based on dictionaries, which have no ordering. Now, the boolean test `'red' in our_set`

will run nice and fast. In fact, it will run significantly faster than `'red' in our_list`

. It will run O(1), in fact.

The problem with using sets for what we want to do is that lack of ordering. If I have the sequence ‘GCATAGCCA’ and want to search for ‘GCC’, because I’m in a C kind of mood, sets are not very useful. We could set up a set with all the three letter combinations, but that won’t be useful. A bit field stores data by turning on a particular bit in our binary integer. A quick reminder of binary:

1 = 001 2 = 010 3 = 011 4 = 100 5 = 101

We can now execute our binary & (and) and | (or) on some integers:

1 & 5 = 001 & 101 = 001 = 1 1 | 5 = 001 | 101 = 101 = 5

It’s pretty intuitive once you get used to the concept. Now, we need to introduce the binary shift operators >> and <<. All those two operators do is shift the binary representation to the right or left, i.e.:

1 << 1 = 001 << 1 = 010 = 2 3 << 2 = 0011 << 2 = 1100 = 12 5 >> 1 = 101 >> 1 = 010 = 2

Those two operators are all we need. Now, let's represent our string by four bit fields. Each bit field will represent a letter, and each bit will be turned on if that letter is in our string at that position. For example:

str = 'GCATAGCCA' A = 001010001 = 81

Above, I've written the bit field reading from our string right to left. I will continue with this convention, but the opposite convention could easily be used. We can test whether there is an 'A' in the first position by checking `A&2**0`

which returns 1. For the second position, `A&2**1`

returns 0. Since non-zero integers are interpreted as True and zero integers as false, we can use the & operator in expressions for if statements. Now, let’s create our dictionary of bit fields.

d = {'A' : 0, 'C' : 0, 'G' : 0, 'T' : 0} str = 'GCATAGCCA' for i, char in enumerate(str[::-1]): d[char] += 2**i

In that code, we are simply looping over our string backwards (hence the fancy slice indexing) and turning on the relevant bit in our bit field by adding a power of two to the dictionary element we care about. We can test to see if ‘GCC’ is in our string with one line of code:

if d['G'] & (d['C'] << 1) & (d['C'] << 2): print("I found GCC!")

We are taking all our bits for ‘G’ and comparing them to the bits in ‘C’ shifted left once and the bits in ‘C’ shifted left twice. Since ‘GCC’ is in our code, the expression will return a non-zero integer, and our code will print out that we found GCC. If we tested for a sequence that doesn’t occur, e.g. ‘GAC’, the expression would return a zero integer, and nothing would print.

Looking into IPython’s %timeit output, using the code ` 'GCC' in str`

is faster than our bitwise operations, but I would assume there is a use case for bit fields somewhere out there (perhaps on some data less structured than a string).