Speed up your NEURON code

I use NEURON for my neural simulations and I recommend it. Not only is it well documented, but it is consistently well supported – Michael Hines and Ted Carnevale (NEURON developers) are always quick to respond to questions on the NEURON forum or by email. NEURON has been used in well over a thousand publications.

Most every programmer is interested in having their program execute as quickly as possible. This concern is held by computational neuroscientists as well. Most important is your time: the faster you can get results, the faster your research progresses and the sooner you can get published. But computing time is also an issue – people often run their simulations on shared machines and can only get (or afford) a limited amount of computing time.

So I want to share a few things that have helped me speed up my NEURON code, as well as other tips suggested by Michael Hines. Some of these are more general, while others are specific to NEURON (but may have analogs in other programs). I write this from the perspective of a network modeler; there are other considerations for running a detailed single cell model, where parts of the cell may be split across different processors. I have organized the steps according to whether they are applicable to all NEURON code or just to parallel NEURON code.

All NEURON code

  1. Time each part of your code so you know what you are dealing with. Add a call to startsw() before and after each section of your code and compute the difference to calculate the total time spent on that section:
    begin_time = startsw()
    // ... the section of code you want to time goes here ...
    time_taken = startsw() - begin_time
    print "time for this section: ", time_taken

    You will likely want to calculate the amount of time spent setting up and creating the cells, the time spent connecting all the cells, the time spent executing the actual simulation, and the time spent writing the results.

  2. Where possible, compile frequently used functions. hoc is interpreted, and interpreted code executes much more slowly than compiled code. NEURON can access compiled code through NMODL. Even though NMODL is usually described as the language used for defining ion channels, synapses, and other subcellular mechanisms, you can use it for anything that you want to compile and have available in hoc. I will describe how to do this in a future post.
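
    As a quick taste of the idea, here is a hypothetical sketch (the file name, the SUFFIX, and the function are made up for illustration): a plain function placed in a mod file and compiled with nrnivmodl becomes callable from hoc.

    : sigm.mod -- hypothetical example; compile with nrnivmodl
    NEURON { SUFFIX nothing }
    FUNCTION sigm(x) {
        sigm = 1/(1 + exp(-x))
    }

    After compiling and restarting NEURON, the function should be available in hoc with the suffix appended to its name, e.g. sigm_nothing(2).
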
  3. Check your time step size (“dt”). If you are using a fixed time step, make sure you chose that particular time step for a reason. It needs to be small enough to ensure numerical stability, but it doesn’t need to be any smaller than that.
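
    One practical check (a sketch; the dt value is illustrative): run the simulation once at your chosen step and once at half that step, then compare the output spike times offline. If they agree within your tolerance, the larger step is adequate.

    dt = 0.025   // ms; first run at the candidate step
    run()
    // ... save the recorded spike times, then halve the step and rerun
    dt = dt/2
    run()
    // compare the two sets of spike times; agreement means dt was small enough
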
  4. Parallelize your code. Even if you only run your code on your personal computer, most every computer now has multiple processors, so you are likely to cut your simulation time at least in half. The NEURON developers have published a methods article demonstrating how to parallelize NEURON code and there is a whole section of the NEURON forum dedicated to parallel NEURON.
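
    For flavor, a minimal round-robin cell distribution looks something like this (a sketch; ncell and the cell-creation step are placeholders for your own code):

    objref pc
    pc = new ParallelContext()
    for (gid = pc.id; gid < ncell; gid += pc.nhost) {  // round-robin over ranks
        pc.set_gid2node(gid, pc.id)   // declare that this rank owns this gid
        // create the cell for this gid here, then associate its spike source:
        // pc.cell(gid, new NetCon(&cell.soma.v(0.5), nil))
    }
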

Parallel NEURON code

  1. Ensure that your code is really parallelized. Make sure that each processor is only performing computations for the cells that it owns. The best way to do this is to make liberal use of pcitr() functions instead of regular for loops. See the init.hoc code of this model in ModelDB for an example of a pcitr() function.
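
    The idea, as a sketch (assuming gids are distributed round-robin and ncell is the total cell count, both from your own setup code):

    iterator pcitr() { local gid
        for gid = 0, ncell - 1 {
            if (gid % pc.nhost == pc.id) {  // only yield gids this rank owns
                $&1 = gid
                iterator_statement
            }
        }
    }
    // usage: for pcitr(&gid) { /* work only on locally owned cells */ }
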
  2. Ensure that your jobs are load balanced. This is not likely to be a problem if the number of cells in your model >> the number of processors you are using. But you can check it by adding the following code to your program after the run() call:
    comptime = pc.step_time()
    avgcomp = pc.allreduce(comptime, 1)/pc.nhost
    maxcomp = pc.allreduce(comptime, 2)
    if (maxcomp > 0) {
        if (pc.id == 0) { printf("load_balance = %g\n", avgcomp/maxcomp) }
    }

    This will give you a number between 0 and 1. Numbers above 0.7 are relatively good, though ideally you want 0.95 or greater.

  3. If you have parallelized your code, you may also want to measure how much time your program takes to exchange information about identities and times of spikes across processors, aka the exchange time. You can do this by adding the following to your code:
    runstart = startsw()
    // run the simulation here, e.g. pc.psolve(tstop) or run()
    runtime = startsw() - runstart
    comptime = pc.step_time()
    avgcomp = pc.allreduce(comptime, 1)/pc.nhost
    maxcomp = pc.allreduce(comptime, 2)
    if (maxcomp > 0) {
        if (pc.id == 0) { printf("exchange_time = %g\n", runtime - maxcomp) }
    }
  4. Especially if you find that the model has a large spike exchange time, try out spike compression and gid compression to reduce the amount of memory required to send spike information between processors. This can drastically decrease spike exchange time. Note that gid compression is only available if each processor in your job owns fewer than 256 cells. See the NEURON documentation about spike_compress as well*:
    nspike = 2 // see notes on setting nspike
    gid_compress = 1
    nspike = pc.spike_compress(nspike, gid_compress)

    Set nspike, the first argument to the spike_compress function, based on how many spikes are usually transferred across processors at each time step in your model. To determine what this number is likely to be, you will want to run your model once and obtain a spike transfer histogram. The histogram will tell you how often 0, 1, 2, 3, etc. spikes were transferred. So if you found that 95% of the spike transfers involved 3 spikes or fewer, you would want to set nspike to 3 for future simulations, assuming they will have a similar spike transfer structure.

    Here is the code for producing a spike exchange histogram from your model:

    objref spkhist
    spkhist = new Vector(pc.nhost)
    if (pc.id == 0) { pc.max_histogram(spkhist) }
    // Run statements go here, pc.psolve(tstop) or run()
    objref f4
    if (pc.id == 0) {
        f4 = new File()
        f4.wopen("spkhist.dat")
        for i = 0, pc.nhost - 1 {
            f4.printf("%d\t%d\n", i, spkhist.x[i])  // Print the number of spikes per exchange and how many exchanges had that count
        }
        f4.close()
    }
  5. Try altering the mode for the spike queue if you have a lot of artificial cells in your model. Note that this method is more experimental; as with spike compression and the cache efficient mode described next, compare your code times before and after making this change to see if it results in a speedup*:
    use_fixed_step_bin_queue = 1
    use_self_queue = 1
    mode = cvode.queue_mode(use_fixed_step_bin_queue, use_self_queue)
  6. Try using the cache efficient option in your code. This method reallocates vectors and matrices in the memory space and can result in a substantial speedup of your code.*
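    In hoc this is a single call, made before the run:

    cvode.cache_efficient(1)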
  7. If you have a whole bunch of processors, you may not want them to write their results in serial. Have each processor write out results to its own file in parallel, then concatenate the files later on your personal computer. You can even parallelize the concatenation.
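
    A sketch of the per-processor write (the vectors spkt and spkgid, holding recorded spike times and gids, are assumptions standing in for your own recording code):

    objref fo
    strdef fname
    sprint(fname, "spikes_%04d.dat", pc.id)  // one file per rank
    fo = new File()
    fo.wopen(fname)
    for i = 0, spkt.size() - 1 {
        fo.printf("%g\t%d\n", spkt.x[i], spkgid.x[i])
    }
    fo.close()

    Afterwards, on your own machine, something like cat spikes_*.dat > spikes.dat stitches the results together.
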

* Steps 4-6 have the possibility of altering your model's results. Therefore, make sure to compare the spike raster generated by your model before and after trying each of these steps to ensure that they do not affect your model's results.
