
Protobufs just reinvented JSON

I’m not a super big fan of protocol buffers, but that is neither here nor there. With the new protocol buffer release, you can now completely reproduce JSON in protocol buffers with proto maps:

message JSON {
  oneof value {
    int64 i = 1;
    double f = 2;
    string s = 3;
    JSON_List l = 4;
    JSON_Map m = 5;
  }
}

message JSON_Map {
  map<string, JSON> m = 1;
}

message JSON_List {
  repeated JSON val = 1;
}

I can’t imagine why on earth you would want to do that, but I’m amused that you can.
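For the morbidly curious, here is a minimal sketch of what building the JSON value {"a": [1, 2.5]} would look like, assuming the messages above were compiled and the generated Python module is named json_pb2 (a name I’m making up):

from json_pb2 import JSON  # hypothetical generated module

doc = JSON()
inner = doc.m.m["a"]       # doc.m is the JSON_Map; its map field is also named m
entry = inner.l.val.add()  # the key "a" maps to a JSON_List
entry.i = 1
entry = inner.l.val.add()
entry.f = 2.5
# doc now encodes {"a": [1, 2.5]}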

CS Outreach Mumblings

I think we can all agree that Computer Science has trouble recruiting and retaining people, especially those from less privileged backgrounds. I wonder how much of that is the fact that we put steep learning curves in front of people before they can do anything interesting. Here’s a short story of my experience trying to test out a script a friend of mine sent me. For a little background, I’m running Ubuntu 14.04 and trying to run a Python script.

I get the script in the email and try to execute it. Sadly, my friend is a Windows user, which means I have to find a program to remove all the superfluous carriage returns [1]. Once the file is cleaned up, I try again to run the script. Unsurprisingly, I don’t have all of the dependencies installed on my system. No big deal, I’ll just add them. Of course, one of the dependencies isn’t available through Ubuntu’s package manager and has to be installed from the Python Package Index (PyPI). Being the sort who heeds sys-admin warnings not to install PyPI packages into my system’s root [2], I need to set up a virtualenv, which lets me install Python packages without installing them for my entire system. Upon trying to install the first package dependency, pip fails, and I’m forced to rely on Google. Apparently, I have to upgrade pip, using pip, just to install anything else with pip. Long story short, this kind of mundanity continued for close to an hour.

Trying to put myself in the shoes of someone who is just learning to program, I’m not sure why I would want to keep banging my head on the wall before I could even start playing with the code. The Python community is generally good about being welcoming and encouraging to newcomers. That’s certainly part of why I have a career as a programmer at all. However, this is something we need to be better at. When core tools require non-trivial expertise to use and no one fixes them, that sends the signal that we don’t care whether people without that expertise can use our tools. We can do better.

1. For whatever reason, Windows uses two characters to represent a new line in a file: the carriage return character ‘\r’ followed by the newline character ‘\n’. Unix-based systems (including Mac OS X) use only the single newline character. Unfortunately, the extra carriage return characters Windows adds to files cause Unix-based systems like my Ubuntu OS to choke and be unable to properly read the file. (A sketch of that cleanup follows these footnotes.)

2. Ubuntu uses Python for a lot of system management tasks. If you install a package to your system’s root, you run the risk of version conflicts that can break your OS. In short, never `sudo pip install` anything.
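As promised in [1], a minimal sketch of stripping the carriage returns by hand (dos2unix does the same job; "script.py" is a stand-in for whatever file you received):

# read the file as raw bytes and collapse Windows line endings
with open("script.py", "rb") as f:
  data = f.read()
with open("script.py", "wb") as f:
  f.write(data.replace(b"\r\n", b"\n"))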

C++ co-routines have nothing to do with concurrency

Yes, the title is a little bit link-baity. It’s OK.

I just watched this excellent talk about the proposal to add co-routines to the C++17 standard. It’s a really interesting talk and gives you a sense of how to efficiently implement the future monad in an imperative language that compiles to machine code.

Now, here’s the rub: the core of the proposal (N4402) is adding a few keywords to C++ that are named after a concurrency pattern (co-routines) in spite of having nothing specific to do with concurrency. What is the proposal really doing? Bringing Haskell do-notation to C++!!! Haskell’s do-notation is easily my favorite syntactic sugar in the history of all syntactic sugar. Basically, do-notation makes working with monads palatable. Unfortunately, I do not have the time/space/sense of self-loathing to be able to try to describe monads and why they matter, so I’m going to dedicate the rest of this post to pointing out the obvious similarities.

Read more…

A Simple Graph Algorithm

I wrote a simple graph algorithm to solve an NPR word puzzle and put the solution up on my GitHub.

An update about me: I’m still alive, working at Google, and mostly writing C++ these days. It seems I updated my ‘About’ section already in spite of not posting anything new. I do have a few mostly-written blog posts that I intend to finish. Unfortunately, almost all of my coding right now is either for work and confidential, or too trivial to be worth writing about (dear god does the world not need another blog post on a prime number filter in Haskell; it’s an awesome language, but you can thank me for not writing that).

You will be hunted down for your getattr tricks!!!

I recently inherited a codebase with… issues. One of my favorite pastimes for improving a codebase is to scroll through log files and fix bugs. Today, I found this error:

  File ".../my_module.py", line 1337, in foo_method
    bar = foo.title,
AttributeError: 'Foo' object has no attribute 'title'

No worries, grep -r will save me! Or wait… it won’t. Because someone wrote this code:

def call_obliquely(self, method_name, args):
  # builds the attribute name at runtime, so no search for the real
  # method name will ever match this line
  return getattr(self, 'method_' + method_name)(args)

Please don’t write that. Someday, someone somewhere is going to need to refactor that function somehow. If they can’t find where you call that function, they can’t take your usage into account. Passing the full name is slightly better, because at least grep -r will find it. However, our IDE-using brethren will still be unable to use their automated refactoring tools to rename the method, and really, all their silly mouse-clicks have to be for something.
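For contrast, a sketch of the "full name" version (call_directly and method_foo are names I’m inventing for illustration):

def call_directly(self, method_name, args):
  # still dynamic dispatch, but the caller spells out the whole name...
  return getattr(self, method_name)(args)

# ...so grep -r 'method_foo' at least finds this call site:
obj.call_directly('method_foo', args)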

Generators generate modularity

Work has limited my time to post and do in-depth work, but I wanted to write something about one of my favorite features of Python: generators. My secret goal in our current code base is to slowly turn everything into generators, up and down the stack. Maybe that is a touch facetious, but I do think generators are a great way to hide state and generate modularity.

Let’s start with an example. Can anyone tell me what is wrong with this code?

output_list = []
current_node = get_my_first_node()
while current_node:
  output_list.append(do_stuff(current_node.data))
  current_node = current_node.next_node

That’s a pretty straightforward example of iterating through a linked list and processing the data somehow. And you are correct, my dear reader, we are doing three logically distinct things with intermingled code. We are iterating over a linked list, processing the data, and accumulating the results of do_stuff() all together. We’ve written do_stuff to pull out a little bit of the complexity, but we’ve still kind of coded ourselves into a corner. What if we wanted to make this lazy? Why can’t I use my beloved list comprehension? What if I only wanted the first 5 items?

Ok, I went over the top a little there for a moment, but such is life. The solution is to abstract away the linked list with a generator:

def linked_list_generator(first_node):
  current_node = first_node
  while current_node:
    yield current_node
    current_node = current_node.next_node

Look at that! All of our linked list logic is hidden behind this interface. The yield statement means that linked_list_generator(some_node) returns a generator, which is iterable. More concretely, it means that we can write for node in linked_list_generator(some_node) and have that behave exactly as other iterables do. Now, we can replace our bad code above with a beautiful list comprehension:

first_node = get_my_first_node()
output_list = [do_stuff(node.data)
               for node in linked_list_generator(first_node)]

Isn’t that so much cleaner? We’ve now separated out our iterating logic, our processing logic and our accumulating logic. All without creating unnecessary mutable state.
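As a bonus, the "first 5 items" question from earlier now has a lazy one-liner of an answer; a sketch using the standard library:

from itertools import islice

# pulls only the first five nodes off the linked list, lazily
first_five = [do_stuff(node.data)
              for node in islice(linked_list_generator(first_node), 5)]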

If you’re looking to learn more about generators, you should probably look here.

Never leave the house with a bare exception

Sometimes best practices end up being just for aesthetics, and sometimes best practices exist so you aren’t stuck in an infinite loop that keeps catching your KeyboardInterrupts. There is basically no excuse to ever use code that looks like this:

try:
  do_something()
except:
  do_something_else()

If you don’t believe me, try this code:

while True:
  try:
    pass
  except:
    continue

Helpful hint: ctrl-\ will crash your interpreter, which is the only way to get out of that loop. Today, I interacted with some code that was the moral equivalent of the above, except that the infinite loop seemed to involve looping through a series of about five files. Those files were littered with bare except clauses. Finding the block catching my KeyboardInterrupts was effectively impossible. Just don’t do it.

If you insist on catching a plethora of exceptions, try:

while True:
  try:
    pass
  except Exception:
    continue

That code will at least let you escape with a KeyboardInterrupt, because KeyboardInterrupt (like SystemExit) inherits from BaseException rather than Exception, so except Exception lets it through. If you are using a package which throws an exception that doesn’t inherit from Exception, I suggest you either write wrappers to catch the package-defined exceptions, or just use a better package.
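Here is a sketch of the wrapper idea, with every name (some_package, WeirdError, do_thing) invented for illustration:

import some_package  # hypothetical package whose errors don't subclass Exception

class SomePackageError(Exception):
  pass

def safe_do_thing(*args, **kwargs):
  try:
    return some_package.do_thing(*args, **kwargs)
  except some_package.WeirdError as e:  # the package's non-Exception error
    raise SomePackageError(str(e))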

Polymorphism in your ORM

Whenever you try to interface two incompatible technologies, say a relational database and an object-oriented typing system, you are going to run into difficulties. One problem I ran into recently was dealing with polymorphic classes in an ORM. The syntax here will be specific to Django, but the ideas should be applicable to many ORMs.

We have two Django models, which do basically the same thing:

from django.db import models

class Foo(models.Model):
  foo_field = models.CharField(max_length=32)

  def run(self):
    ...implementation based on foo_field...

class Bar(models.Model):
  bar_field = models.CommaSeparatedIntegerField(max_length=32)

  def run(self):
    ...implementation based on bar_field...

Let’s say we have another Django model which attaches to our Foo and Bar models. What we really want is to write:

class Baz(models.Model):
  ...stuff...
  foo_bar = models.ForeignKey(Foo or Bar)

  ...methods...

In particular, I want to make sure that Baz, and all of its baz-like cousins, get deleted whenever I call delete() on the relevant Foo or Bar instance. Also, if I were to write a new bazzy class, I shouldn’t have to manipulate Foo and Bar to make everyone play nice. That is what the ORM is for. So, in my first GoF mention in this space, we need an adapter pattern, or a wrapper, as the Pythonic part of the engineering universe calls it.

Here is what my FooBarWrapper looks like:

class FooBarWrapper(models.Model):
  foo_bar_type = models.CharField(max_length=32)
  foo_bar_id = models.IntegerField()

  class Meta:
    unique_together = ('foo_bar_type', 'foo_bar_id')

Now, each concrete class only needs to override its save() and delete() methods to keep its wrapper in sync:

class Foo(models.Model):
  ...foo-y stuff...

  def save(self, *args, **kwargs):
    # a brand-new instance has no id until it has been saved once
    create_wrapper = not self.id
    super(Foo, self).save(*args, **kwargs)
    if create_wrapper:
      foo_wrapper = FooBarWrapper(
        foo_bar_type=self.__class__.__name__,
        foo_bar_id=self.id)
      foo_wrapper.save()

The delete() method would be pretty similar. I implemented the delete() override on the wrapper class; we could implement it on Foo and Bar as well just to be safe.
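For concreteness, here is a sketch of what that delete() override might look like on Foo (assumed, not the code I shipped); deleting the wrapper lets the ORM cascade to everything hanging off of it:

class Foo(models.Model):
  ...foo-y stuff...

  def delete(self, *args, **kwargs):
    # remove our wrapper first so everything ForeignKeyed to it cascades
    FooBarWrapper.objects.filter(
      foo_bar_type=self.__class__.__name__,
      foo_bar_id=self.id).delete()
    super(Foo, self).delete(*args, **kwargs)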

Outside of Django model forms, I tried to avoid using the original Foo and Bar classes directly as much as possible. When we do need the underlying Foo or Bar object, we go through the wrapper:

from django.db.models.loading import get_model

class FooBarWrapper(models.Model):
  ...stuff...

@property
  def foo_bar(self):
    # 'myapp' stands in for the app label where Foo and Bar live
    return get_model('myapp', self.foo_bar_type).objects.get(
      id=self.foo_bar_id)

An initial draft used a __getattr__ trick, but I believe this implementation makes it very clear that you are accessing the Foo or Bar object indirectly through the wrapper. At this point, you can either write your own getter to obtain a FooBarWrapper from a Foo or Bar object, or you can hack foobarwrapper_set to get the object.
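The getter version is short enough to sketch here (get_wrapper is a name I’m making up):

def get_wrapper(instance):
  # look up the wrapper by the same (type, id) pair that save() recorded
  return FooBarWrapper.objects.get(
    foo_bar_type=instance.__class__.__name__,
    foo_bar_id=instance.id)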

We’re even able to safely manipulate the universe attached to FooBarWrapper:

class FooBarWrapper(models.Model):
  ...

def manipulate_universe(self):
    for uni in self.universe_set.all():
      uni.manipulate()

We have to be concerned about grep -r safety here, so I would love any input about universe_set.all() versus Universe.objects.filter(foo_bar_wrapper=self). My bias is towards the latter because of the aforementioned grep issues, but the former seems to be more Django idiomatic.

We’ve ended up with an ugly one-to-one database relationship, which is often a very bad call. However, by putting FooBarWrapper in the middle, we have one class which is connected to the rest of the universe. Any and all new dependencies can ForeignKey off of FooBarWrapper with impunity, and a new Foo-ish class can be easily hooked into FooBarWrapper without modifying the rest of the universe.

The Announcement

I’m finally writing the announcement I promised. At the beginning of June, I started at Schrodinger, Inc. as a software engineer. I work on the enterprise services team, building tools that enable chemists to extract insights from data they already have.

This post was delayed by the headaches and hassles associated with moving to New York City and my first month at Schrodinger. Unfortunately, I don’t think I will have time for the detailed tests and comparisons I’ve been posting here, but that might change when our product is more developed and we shift gears towards optimization. In the meantime, I’ll be posting some ideas on getting disparate packages and libraries to play nice with one another.

Cython Objects… Fools Rush In

At long last, I am writing up a comparison between a Cython skip list class and a similar C++ class using Boost.Python as a wrapper. There are other tools to wrap C++ classes, but I don’t have time at the moment to compare them. In a previous post, I suggested that you should use Cython as a first go-to whenever you need to speed up your functions. My recommendation for classes is going to be similar, but more tempered. As a first pass, directly compiling your Python code with Cython buys you some speed for almost no effort. However, if you are comfortable with C++, Cython extension types don’t save you much in the way of headaches and don’t speed your code up that much.

Read more…