Parallelism on Heroku
Parallel (multi-core) processing with Ruby on Heroku's Celadon Cedar stack

Heroku released their new stack, Celadon Cedar. This stack introduces the Procfile, a file in the root of your application that declares which processes to run, and how to run them.

Here’s an example of what a Procfile might look like:

web: bundle exec thin start -p $PORT
worker: bundle exec rake jobs:work

This configuration essentially replicates Heroku’s previous stack, “Badious Bamboo”. But rather than using Thin, we can switch to Unicorn:

web: bundle exec unicorn -p $PORT
worker: bundle exec rake jobs:work
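Switching servers also means swapping the gem: Unicorn isn’t bundled for you, so it needs to be declared in your Gemfile (and thin can be dropped at the same time):

```ruby
# ./Gemfile
gem "unicorn"
```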

Now that we’re using Unicorn, let’s see how we can increase our throughput. First of all, let’s see how many virtual cores a single dyno provides.

heroku run nproc

This returns 4, so we have 4 virtual cores to work with.

Update: It appears that the Free, Hobby, Standard-1X and Standard-2X dyno types now all provide 8 virtual cores instead of 4.
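The same number is also available from inside Ruby itself on Ruby 2.2 and later, via the standard library:

```ruby
require "etc"

# Etc.nprocessors (Ruby 2.2+) reports the number of online
# processors, much like running nproc on the dyno.
puts Etc.nprocessors
```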

Since MRI Ruby has a global interpreter lock (GIL), it can’t take advantage of all of our available virtual cores using threads. So instead of threads, we’ll have to rely on spawning multiple Unix processes. Note that Rubinius and JRuby don’t have a GIL, and can therefore simply use threads to utilize all of the available virtual cores.
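A minimal sketch of the difference (the burn method and the counts are purely illustrative): with threads, MRI computes the right answers but only one thread executes Ruby code at a time; with forked processes, each worker has its own interpreter and can run on its own core.

```ruby
# CPU-bound work: sum the integers below n in a plain Ruby loop.
def burn(n)
  total = 0
  n.times { |i| total += i }
  total
end

# Threads: correct results, but under MRI's GIL only one thread
# runs Ruby code at a time, so extra cores go unused.
thread_results = 4.times.map { Thread.new { burn(1_000_000) } }.map(&:value)

# Processes: each fork gets its own interpreter (and its own GIL),
# so the four workers can genuinely run on four separate cores.
# Each child reports its result back to the parent over a pipe.
readers = 4.times.map do
  reader, writer = IO.pipe
  fork do
    reader.close
    writer.puts burn(1_000_000)
  end
  writer.close
  reader
end
process_results = readers.map { |r| r.read.to_i }
Process.waitall

puts thread_results == process_results # prints "true"
```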

Unicorn has the ability to run in cluster mode: it spawns a master process that loads your application into memory, which then forks one or more child processes that actually handle the requests. Among other benefits, you only need a single port on your machine to run multiple instances (processes) of your application, and you can take advantage of all of the available virtual cores. This works perfectly on Heroku, since Heroku only exposes a single port per dyno.
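A toy version of this pre-forking model shows how several processes can share one listening socket. (The port-0 bind and the hard-coded response are just for this sketch; Unicorn’s real implementation is far more involved and binds the port you pass with -p.)

```ruby
require "socket"

# The "master" opens one listening socket; port 0 asks the OS
# for any free port.
master = TCPServer.new("127.0.0.1", 0)
port = master.addr[1]

# Fork two "workers" that both accept connections from the
# same inherited socket, just like Unicorn's child processes.
workers = 2.times.map do
  fork do
    loop do
      client = master.accept
      client.write "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"
      client.close
    end
  end
end

# Exercise the shared socket from the parent, then clean up.
responses = 4.times.map do
  socket = TCPSocket.new("127.0.0.1", port)
  data = socket.read
  socket.close
  data
end

workers.each { |pid| Process.kill("TERM", pid) }
Process.waitall
puts responses.all? { |r| r.end_with?("ok") } # prints "true"
```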

Now you’ll want to configure Unicorn to spawn the master process along with 4 child processes (one for each virtual core). Note that at the time of writing you only have 512 MB of RAM per dyno, so you might hit the memory limit before being able to utilize all of your virtual cores. If you’re hitting the memory limit, reduce the number of child processes you spawn.

To configure Unicorn to spawn 4 child processes, create ./config/unicorn.rb with the following contents:

preload_app true
worker_processes 4

Then update the web entry in ./Procfile to:

web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb

Now Unicorn will spawn 1 master process and 4 child processes, taking advantage of all 4 of the available virtual cores.
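If you’d rather not hard-code the worker count, a common variant reads it from an environment variable. (The name WEB_CONCURRENCY is just a convention here, not something Heroku sets for you.)

```ruby
# ./config/unicorn.rb
preload_app true
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 4)
```

Then something like `heroku config:set WEB_CONCURRENCY=3` lets you tune the worker count without a deploy.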

On to some benchmarks. I’ve used the ab (ApacheBench) utility to measure how many requests per second a blank Rails application can handle on Heroku using Unicorn, at various concurrency levels.

1 child process (roughly equivalent to a single Thin instance):

Time taken for tests:   3.835 seconds
Requests per second:    260.77 [#/sec]

2 child processes:

Time taken for tests:   1.689 seconds
Requests per second:    592.00 [#/sec]

3 child processes:

Time taken for tests:   1.164 seconds
Requests per second:    833.91 [#/sec]

4 child processes:

Time taken for tests:   0.979 seconds
Requests per second:    918.90 [#/sec]

As you can see, we’ve significantly improved our throughput simply by taking advantage of multiple virtual cores.

If we spawn any more child processes for this particular application, however, we’ll hit the memory limit and go into swap. This is what that looks like:

Time taken for tests:   10.621 seconds
Requests per second:    83.32 [#/sec]

This is worse than running on a single virtual core.

Conclusion

We got roughly 3.5x more throughput (918.90 vs. 260.77 requests per second), for free, by actually utilizing all of the available resources of the dyno.
