
OpenMP slower on OSX

Posted: Tue May 28, 2013 6:07 am
by hansg91

For an assignment I have been asked to implement an algorithm using OpenMP. I am working on OSX 10.8.3 and have installed g++-4.7 from Homebrew. Everything compiles and works fine, except for the speed: the serial version of the algorithm runs in 0.98 seconds, while the OpenMP version takes 4.8 seconds. Running the same algorithm on the same laptop, but on Ubuntu 12.04 in Parallels (also compiled with g++-4.7), both versions take 1.55 seconds, which is somewhat closer to what I expected. On a remote server (also running Ubuntu) the sequential code runs in 1.39 seconds and the OpenMP code in 0.78 seconds, which is more what I was expecting.

So I don't think the problem is in how I programmed the algorithm. Adding num_threads(2) to the pragma improves the result slightly, down to 1.67 seconds, but this is still slower than what I see in my Parallels session with Ubuntu, and also marginally slower than the sequential code. I would expect at least the same speedup as in my Parallels Ubuntu session, since it runs on the same machine, just with a different OS. Is something wrong on OSX?

Best regards,

Re: OpenMP slower on OSX

Posted: Tue May 28, 2013 6:45 am
by MarkB
Hi there,

A few questions for you to help us work out what is going on:

How many OpenMP threads are you using for the parallel runs?
How many cores does your laptop have?
What are you using to measure the execution time?
How repeatable are the execution times?


Re: OpenMP slower on OSX

Posted: Wed May 29, 2013 2:13 am
by hansg91
Hello Mark,

Sure, no problem.

I am not forcing any number of threads, but omp_get_num_threads() returns 8. On Ubuntu in Parallels it returns 4 (because it only has access to 4 cores through the settings).
I have an Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz (4 cores, running up to 8 threads).
I will try to post a minimal example of my assignment. The execution times are very consistent, and I am using gettimeofday to measure time.

Here is the code :

On OSX I get the following result :

Where the first one is the serial execution and the second is the omp version.

On Ubuntu in Parallels I get :

Interestingly, if I enable the pragma on line 99, both OSes get a nice speedup. On OSX the time then becomes:

And on Ubuntu the time becomes:

Which is basically the same. I don't quite understand why this is. I am supposed to run the algorithm a number of times, but not in parallel, to calculate the time it takes. I can understand that there might be a speedup from 0.44 to 0.17, since it might start a new execution of the algorithm before the previous one is done and therefore 'cheat' a little, but a speedup from 40.81 seconds to 0.15 seconds? Where does that come from?

Best regards,

Re: OpenMP slower on OSX

Posted: Wed May 29, 2013 2:52 am
by MarkB
Hi Hans,

The reason your code gets very poor speedup is that the parallelism is too fine-grained: the overhead of setting up the parallel regions and synchronising threads at the end of them outweighs any benefit of splitting the computation over multiple threads. Most likely this is much worse on 8 threads than on 4 because there will be other processes (e.g. from the OS, or other applications which are running) competing for the CPUs, so your 8 threads are time-sharing instead of all running continuously. This means that at the end of every parallel region the code has to wait until all 8 threads have been scheduled on the CPU, which can typically take tens of milliseconds. I expect that if you run on OSX with 4 threads you will see performance similar to that on Ubuntu.

Enabling the pragma on line 99 is definitely cheating! In this case the other parallel constructs will be ignored, since nested parallelism is disabled by default. The code now has a bug, because multiple threads are reading and writing all of the shared arrays, and the results will be incorrect. Performance is much better, because you are only synchronising the threads once instead of tens of thousands of times, but you still do not see good speedup: this is most likely because of contention for cache lines in the shared arrays.
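To illustrate the granularity point, here is a minimal sketch. The code from this thread is not shown, so the workload, the function names relax_fine and relax_coarse, and the update formula are all hypothetical. The first version opens and closes a parallel region on every step; the second opens one region around the whole step loop and only shares out the inner loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical workload: apply a cheap update to a small array `steps` times.

// Fine-grained: a parallel region is opened and joined on every step,
// so the fork/join cost is paid `steps` times.
void relax_fine(std::vector<double>& a, int steps) {
    assert(steps >= 0);
    const long n = static_cast<long>(a.size());
    for (int s = 0; s < steps; ++s) {
        #pragma omp parallel for
        for (long i = 0; i < n; ++i) {
            a[i] = 0.5 * a[i] + 1.0;
        }
    }
}

// Coarser: one parallel region around the step loop. Every thread runs the
// step loop; only the inner loop is divided among the threads.
void relax_coarse(std::vector<double>& a, int steps) {
    assert(steps >= 0);
    const long n = static_cast<long>(a.size());
    #pragma omp parallel
    {
        for (int s = 0; s < steps; ++s) {
            #pragma omp for
            for (long i = 0; i < n; ++i) {
                a[i] = 0.5 * a[i] + 1.0;
            }
            // Implicit barrier here: the threads still synchronise each step.
        }
    }
}
```

Both functions compute the same result, and both still compile and run serially if OpenMP is not enabled. Note that the coarser version still has one barrier per step, so it removes only part of the overhead; when each step does very little work, the sequential code may well remain the fastest.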

Hope that helps,

Re: OpenMP slower on OSX

Posted: Wed May 29, 2013 3:19 am
by hansg91
Hello Mark,

Thank you for your fast reply, it cleared up some things.

I did indeed notice an improvement if I forced the number of threads to 4, but it is still considerably slower than on Ubuntu.

Code:
#pragma omp parallel for num_threads(4)

instead of
Code:
#pragma omp parallel for

now gives me times of

(this is of course with the pragma on line 99 disabled again). I don't mind a slowdown (in fact with this input size I expect it), but this seems too much. Does this make sense?

Re: OpenMP slower on OSX

Posted: Wed May 29, 2013 3:36 am
by MarkB
hansg91 wrote: Thank you for your fast reply, it cleared up some things.

You're very welcome!

How much slowdown you observe may depend on the OS's scheduling policy: it would not surprise me if this is different between OSX and Ubuntu.
It may also depend on what the OpenMP runtime does with threads between parallel regions (spin/yield/sleep), which again might vary between OSs.
Do the settings of OMP_WAIT_POLICY and OMP_PROC_BIND make any difference?

Re: OpenMP slower on OSX

Posted: Wed May 29, 2013 3:58 am
by hansg91
Hey Mark,

Why would they have chosen these scheduling policies for OSX then? Or is my example just a bad example? The environment variables you mentioned did not make a noticeable difference. I tried lowering the number of threads through the environment variable OMP_NUM_THREADS instead of the pragma (which does the same thing of course, but it is easier to manipulate). These are my results:





And it goes on like this. The only time I would have expected is the one where it is forced to one thread, causing a little overhead. I wouldn't expect a slowdown of 24x for 4 threads. It frustrates me a little that I can't seem to get the expected results on OSX :(

Best regards,

Edit: I was also looking into the nowait clause, but I am not sure if I am using it correctly. I tried:
Code:
#pragma omp parallel for nowait

But this gives me an error:
Code:
omptest.cpp: In function 'void omp_function(int)':
omptest.cpp:19:28: error: 'nowait' is not valid for '#pragma omp parallel for'

If I do it without the parallel:
Code:
#pragma omp for nowait

it does compile and run, but its time is almost exactly that of the sequential code, so I am guessing it is disabled, or running in one thread?

Re: OpenMP slower on OSX

Posted: Wed May 29, 2013 4:38 am
by MarkB
hansg91 wrote: Why would they have chosen these scheduling policies for OSX then?

OS scheduling policies are chosen for general purpose workloads and are rarely optimised for multi-threaded codes, let alone code that does crazy things like trying to synchronise its threads every few microseconds!

You can't suppress the barrier at the end of a parallel region with nowait (and in any case the barriers are needed for correctness of your program). An omp for directive without an enclosing parallel region is essentially ignored: the loop is executed by a single thread, which is why you see the sequential time.
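For reference, a minimal sketch of where nowait is accepted: on an omp for that sits inside a parallel region. The function name fill_both and the two independent loops are hypothetical, chosen only so that skipping the barrier between them is safe; this does not apply to loops that depend on each other's results.

```cpp
#include <cassert>
#include <vector>

// Fills two unrelated arrays inside a single parallel region.
void fill_both(std::vector<int>& a, std::vector<int>& b) {
    assert(!a.empty() && !b.empty());
    const long na = static_cast<long>(a.size());
    const long nb = static_cast<long>(b.size());
    #pragma omp parallel
    {
        // nowait removes the barrier at the end of this loop only, so a
        // thread that finishes early can start on the second loop at once.
        #pragma omp for nowait
        for (long i = 0; i < na; ++i) {
            a[i] = static_cast<int>(2 * i);
        }

        // Safe without a barrier in between: this loop never touches `a`.
        #pragma omp for
        for (long j = 0; j < nb; ++j) {
            b[j] = static_cast<int>(3 * j);
        }
    } // The barrier at the end of the parallel region always remains.
}
```

The combined form omp parallel for has no place to put nowait, because the only barrier left is the one that ends the parallel region itself.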

Re: OpenMP slower on OSX

Posted: Wed May 29, 2013 4:59 am
by hansg91
Thank you for your reply once more :) Seems I am just stuck with this slowdown in this algorithm for now then ;)