
Optimization lane… Once again

April 21, 2013

I have come to restore the good name of C. In my previous post, I compared the implementation of a classification function with the C API and Cython:

f(x) = \begin{cases} 0 & \mbox{if } x < c-b \\ \frac{x-c+b}{2b} & \mbox{if } c-b \leq x \leq c+b \\ 1 & \mbox{if } x > c+b \end{cases}
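
For example, with c = 0.5 and b = 0.1, the input x = 0.55 falls in the middle branch, so f(0.55) = (0.55 - 0.4)/0.2 = 0.75.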

The C API ended up being faster, but only slightly so. After learning a little more about optimizing C programs, my newest C API implementation is three times as fast as the Cython implementation. This is just going to be a quick post outlining the C optimization. The optimization was done on a machine running Ubuntu on the x86_64 architecture; on other machines, YMMV.

I did attempt some optimizations on the Cython code with unspectacular results. Applying the same optimizations I used for C to the Cython code speeds things up by less than 10%. Running the classifiers on a NumPy array with 1k floats:

%timeit cyu_class(data, 0.5, 0.1)
100000 loops, best of 3: 6.04 us per loop

%timeit cy_class(data, 0.5, 0.1)
100000 loops, best of 3: 6.46 us per loop

I also worked through Jake VanderPlas’s various Cython optimizations with no particular luck.

For reference, here is the timing result for our old C API code:

%timeit c_class(data, 0.5, 0.1)
100000 loops, best of 3: 4.21 us per loop

Let’s start by running our optimizations in straight C and then port them to the Python API. All of these programs are compiled with gcc 4.7.2 at the -O3 optimization level. Our first test program looks like:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>   /* for clock_t */

double classify(double x, double c, double b);

int main(int argc, char* argv[]) {
  clock_t t;  /* timing code elided in this listing */
  int iN, N = 10000;
  int iM, M = atoi(argv[1]);

  double c = 0.5, b = 0.1;
  double x[N];
  double out[N];
						
  // Make runs repeatable for timing
  srand(1234);
  for (iN=0; iN<N; iN++)
    x[iN] = rand() / (double)RAND_MAX;

  for (iM=0; iM<M; iM++) {
    for (iN=0; iN<N; iN++) {
      out[iN] = classify(x[iN], c, b);
    } 
  }
  return 0;
}

double classify(double x, double c, double b) {
  if (x<(c-b))
    return 0;
  else if (x>(c+b))
    return 1;
  else {
    if (b>1E-7)
      return (x-c+b)/2/b;
    else
      return 0.5;
  } 
}
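
For completeness, the build and run steps look something like this (my reconstruction; the post only specifies gcc 4.7.2 and -O3):

gcc -O3 -o test test.c
./test 100000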

Those two if statements inside classify() get called a whole bunch of times, so let’s look at how well branch prediction is working and how we’re doing on instructions per cycle. Running perf on our program gives the following output:

perf stat -e instructions -e cycles -e branches -e branch-misses ./test 100000

 Performance counter stats for './test 100000':

    10,765,154,806 instructions              #    0.69  insns per cycle        
    15,637,304,307 cycles                    #    0.000 GHz                    
     2,597,977,423 branches                                                    
       472,634,207 branch-misses             #   18.19% of all branches        

       5.138279674 seconds time elapsed
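
A quick sanity check on those numbers: the loops evaluate classify() N × M = 10,000 × 100,000 = 10^9 times, so we’re paying roughly 2.6 branches and about 0.47 branch misses per classified value.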

In a word, that says our program is performing badly. Very badly. We’ll try to get that branch miss rate down in a moment, but first, let’s see if we can get our processor to vectorize some of our operations. We can unroll the loop a little bit (details here) to give the processor a hand. Our new inner loop looks like:

#define ROUND_DOWN(x, s) ((x) & ~((s)-1))
/* Other code */
  const int stepsize = 8;
  for (iM=0; iM<M; iM++) {
    for (iN=0; iN<ROUND_DOWN(N, stepsize); iN+=stepsize) {
      out[iN] = classify(x[iN], c, b);
      out[iN+1] = classify(x[iN+1], c, b);
      out[iN+2] = classify(x[iN+2], c, b);
      out[iN+3] = classify(x[iN+3], c, b);
      out[iN+4] = classify(x[iN+4], c, b);
      out[iN+5] = classify(x[iN+5], c, b);
      out[iN+6] = classify(x[iN+6], c, b);
      out[iN+7] = classify(x[iN+7], c, b);
    } 
    for (; iN<N; iN++)
      out[iN] = classify(x[iN], c, b);
  }
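
A quick aside on ROUND_DOWN: the bit trick clears the low bits of x, so it only works when the step size is a power of two. A self-contained check (my own example, not from the post):

#include <assert.h>

#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

int main(void) {
  assert(ROUND_DOWN(10000, 8) == 10000);  /* our N is already a multiple of 8 */
  assert(ROUND_DOWN(10001, 8) == 10000);  /* anything left over falls to the cleanup loop */
  return 0;
}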

Let’s run our new code with perf:

perf stat -e instructions -e cycles -e branches -e branch-misses ./test 100000

 Performance counter stats for './test 100000':

     8,555,370,759 instructions              #    0.92  insns per cycle        
     9,336,312,035 cycles                    #    0.000 GHz                    
     1,722,405,624 branches                                                    
       286,243,754 branch-misses             #   16.62% of all branches        

       3.082999950 seconds time elapsed

Somehow or another, we saved a bunch of instructions, and as expected, our instructions per cycle went up a bunch. Together, those add up to a factor-of-two speedup. Our branch prediction is still failing pretty spectacularly, though. At least for the gcc 4.7 compiler on my machine, branch prediction works a lot better with the ternary ? operator than with if … else … blocks. Let’s try a new version of our classification function:

double m = c - b, w = 1/2.0/b;

double o_classify(double x, double m, double w) {
  double out = (x-m)*w;
  out = (out < 0) ? 0 : out;
  out = (out > 1) ? 1 : out;
  return out;
}
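
For context, the driver now hoists the precomputation and the b check out of the hot loop. A sketch of what that looks like (my reconstruction; the post doesn’t show this code):

  if (b > 1E-7) {
    double m = c - b, w = 1/2.0/b;
    for (iN = 0; iN < N; iN++)
      out[iN] = o_classify(x[iN], m, w);
  } else {
    /* degenerate width: fall back to a plain step function */
    for (iN = 0; iN < N; iN++)
      out[iN] = (x[iN] < c) ? 0.0 : (x[iN] > c) ? 1.0 : 0.5;
  }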

I’ve pulled the check on b out of the function, but I doubt that has any performance effect, since “always go the same way” is pretty easy for the branch predictor to figure out. We’ve added more floating-point operations for inputs classified as 0 or 1, but maybe we’ll get lucky with better branch performance. Let’s see:

perf stat -e instructions -e cycles -e branches -e branch-misses ./test 100000
Run took 1.380000 seconds.

 Performance counter stats for './test 100000':

     9,225,082,688 instructions              #    2.18  insns per cycle        
     4,227,577,833 cycles                    #    0.000 GHz                    
     1,529,437,617 branches                                                    
        22,990,447 branch-misses             #    1.50% of all branches        

       1.396654857 seconds time elapsed

That’s more than another factor of two, and now our branch prediction is working pretty well. The processor is vectorizing some of the calculations, and we’re getting more than two instructions per cycle. Nice. Our cache behavior is still crap (data not shown), but I’m going to leave that for another day. Let’s copy and paste our C code into our Python API file.
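
The wrapper itself looks roughly like this: a minimal sketch assuming Python 2 and a 1-D contiguous double array (the real f_classify module may differ, and error checking is pared down):

#include <Python.h>
#include <numpy/arrayobject.h>

static double o_classify(double x, double m, double w) {
  double out = (x - m) * w;
  out = (out < 0) ? 0 : out;
  out = (out > 1) ? 1 : out;
  return out;
}

static PyObject* classify(PyObject *self, PyObject *args) {
  PyArrayObject *in;
  double c, b;
  /* assumes a 1-D, contiguous array of doubles */
  if (!PyArg_ParseTuple(args, "O!dd", &PyArray_Type, &in, &c, &b))
    return NULL;

  npy_intp n = PyArray_DIM(in, 0);
  PyObject *out = PyArray_SimpleNew(1, &n, NPY_DOUBLE);
  if (out == NULL)
    return NULL;

  double *xd = (double*)PyArray_DATA(in);
  double *od = (double*)PyArray_DATA((PyArrayObject*)out);

  /* hoist the per-call constants; degenerate-b guard omitted here */
  double m = c - b, w = 1/2.0/b;
  npy_intp i;
  for (i = 0; i < n; i++)
    od[i] = o_classify(xd[i], m, w);
  return out;
}

static PyMethodDef methods[] = {
  {"classify", classify, METH_VARARGS, "Piecewise linear classifier."},
  {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC initf_classify(void) {
  Py_InitModule("f_classify", methods);
  import_array();
}

Now let’s see the speed: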

%timeit f_class(data, 0.5, 0.1)
100000 loops, best of 3: 2.96 us per loop

Wait, what? I know the Python overhead means we might not get our full factor of four, but only a 25% reduction in runtime? [1] I feel cheated. Let’s run perf on a Python test script:

#!/usr/bin/env python
import numpy as np
import sys
import f_classify
f_class = f_classify.classify

np.random.seed(1234)
data = np.random.rand(10000)
for i in xrange(int(sys.argv[1])):
    f_class(data, 0.5, 0.1)

Running that through perf:

perf stat -e instructions -e cycles -e branches -e branch-misses ./test.py 10000

 Performance counter stats for './test.py 10000':

     1,827,741,558 instructions              #    0.95  insns per cycle        
     1,917,740,093 cycles                    #    0.000 GHz                    
       369,266,029 branches                                                    
        35,360,923 branch-misses             #    9.58% of all branches        

       0.651237630 seconds time elapsed

We’ve lost our nice 2+ instructions per cycle, and our branch prediction performance has regressed. Apparently, with everything else going on, the compiler wasn’t able to figure out that our o_classify function should be inlined. Let’s change the declaration and see what happens:

inline double o_classify(double x, double m, double w);

perf stat -e instructions -e cycles -e branches -e branch-misses ./test.py 10000

 Performance counter stats for './test.py 10000':

     1,366,976,017 instructions              #    1.87  insns per cycle        
       730,149,538 cycles                    #    0.000 GHz                    
       208,851,791 branches                                                    
         5,135,634 branch-misses             #    2.46% of all branches        

       0.251113049 seconds time elapsed

That looks a lot better. I won’t dig into the Python internals just yet, but something on the Python side is still interfering with our nice pipelining; even so, this is pretty good. Finally, let’s see the timing inside Python:

%timeit f_class(data, 0.5, 0.1)
100000 loops, best of 3: 2.05 us per loop

Nice. Our C code is now 3x as fast as the Cython version, and all is right with the world. Obviously, I did not do nearly as good a job optimizing the Cython code, so I would love it if someone came along and found a better Cython implementation. As usual, the code can be found on GitHub. Feel free to play around with anything; just let me know what you find out!

UPDATE: I ran things real quick on my work machine, and the differences are even more stark. Interestingly enough, the branch prediction on that (older) machine is complete crap:

perf stat -e cycles -e instructions -e branch-misses -e branches ./test 10000
Run took 0.670000 seconds.

 Performance counter stats for './test 10000':

     1,608,587,494 cycles                    #    0.000 GHz                     [49.88%]
       922,742,676 instructions              #    0.57  insns per cycle         [75.25%]
        44,245,065 branch-misses             #   28.87% of all branches         [75.26%]
       153,251,625 branches                                                     [74.87%]

       0.679764354 seconds time elapsed

However, with gcc 4.7.3, the aforementioned cache issues completely disappear:

perf stat -e instructions -e cycles -e L1-dcache-loads -e L1-dcache-load-misses ./test 10000
Run took 0.670000 seconds.

 Performance counter stats for './test 10000':

       923,611,151 instructions              #    0.57  insns per cycle         [74.71%]
     1,611,770,402 cycles                    #    0.000 GHz                     [75.16%]
       266,594,868 L1-dcache-loads                                              [75.33%]
         1,592,489 L1-dcache-load-misses     #    0.60% of all L1-dcache hits   [49.52%]

       0.680825297 seconds time elapsed

The loop unrolling and if … else … classification optimizations in Cython now have a huge effect:

%timeit cy_class(data, 0.5, 0.1)
100000 loops, best of 3: 18.8 us per loop

%timeit cyu_class(data, 0.5, 0.1)
100000 loops, best of 3: 9.15 us per loop

And the C optimizations also have a much larger effect in our Python extension:

%timeit c_class(data, 0.5, 0.1)
100000 loops, best of 3: 13.8 us per loop

%timeit f_class(data, 0.5, 0.1)
100000 loops, best of 3: 3.13 us per loop

I’m going to have to play around at home and see if a newer compiler can get rid of the L1 cache problems while retaining the nice vectorization.

1. I also added the -O3 flag to the extension’s compilation, but that did not appear to have a significant effect.


