doacross loops - implementations and performance

bda
Posts: 6
Joined: Wed Nov 07, 2007 12:48 pm

doacross loops - implementations and performance

Post by bda »

Hi,
I have implemented the 3D Gauss-Seidel code from the OpenMP tutorial (Terboven & Klemm), and tested it with different compilers:
  • gcc 10.2
  • clang 11
  • icc 2020 upd 4
While GCC gives the best single-thread performance, it shows no speed-up with more threads. A quick look during execution suggests that the threads are not working simultaneously, which can also be seen in the wall times (they increase almost linearly with the number of threads).
clang and icc are slower with a single thread (this can probably be tweaked with some optimization flags), and their scaling shows the behaviour expected for this kind of loop, i.e. no speed-up for 2 threads, but then good speed-up for more threads compared to the 2-thread case.
Does anybody have similar experiences, and maybe some knowledge of how the different runtimes implement doacross loops?

Regards,
Bernd

JeffHammond
Posts: 2
Joined: Sat Feb 21, 2015 2:20 pm
Location: Portland, OR

Re: doacross loops - implementations and performance

Post by JeffHammond »

I haven't looked at this pattern in a while, but when I did a study about two years ago, I found that one needed to block loops manually for doacross to produce useful performance results.

https://github.com/ParRes/Kernels/blob/ ... mp.cc#L162

This causes the code to perform similarly to the task-based version.

https://github.com/ParRes/Kernels/blob/ ... mp.cc#L153

If you prefer Fortran, see the following, although the doacross version in Fortran is not blocked. Fixing that would be trivial.

https://github.com/ParRes/Kernels/blob/ ... openmp.F90
https://github.com/ParRes/Kernels/blob/ ... openmp.F90
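
To make the manual blocking idea concrete, here is a minimal sketch. It is not the code from the linked repository: the recurrence, the grid size `N`, the tile size `B`, and the function names are all assumptions chosen for illustration. The doacross nest runs over tiles rather than single points, so each `ordered` synchronization covers `B*B` updates instead of one.

```c
#include <assert.h>

#define N 257   /* grid points per dimension (hypothetical size) */
#define B 32    /* tile edge length (hypothetical; tune per machine) */

static double grid[N][N];

/* Boundary values chosen so that the exact result is grid[i][j] == i + j. */
static void init_grid(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            grid[i][j] = (i == 0 || j == 0) ? (double)(i + j) : 0.0;
}

/* One blocked wavefront sweep; returns the value in the far corner. */
double blocked_sweep(void) {
    const int nb = (N - 1 + B - 1) / B;  /* tiles per dimension over 1..N-1 */
    assert(nb * B >= N - 1);             /* tiles cover the whole interior */
    init_grid();

    /* The two tile loops form the doacross nest; only ib is distributed. */
    #pragma omp parallel for ordered(2) schedule(static, 1)
    for (int ib = 0; ib < nb; ib++) {
        for (int jb = 0; jb < nb; jb++) {
            /* Wait for the tile above and the tile to the left. */
            #pragma omp ordered depend(sink: ib - 1, jb) depend(sink: ib, jb - 1)
            const int i0 = 1 + ib * B;
            const int j0 = 1 + jb * B;
            const int i1 = (i0 + B > N) ? N : i0 + B;
            const int j1 = (j0 + B > N) ? N : j0 + B;
            /* Plain serial sweep inside the tile. */
            for (int i = i0; i < i1; i++)
                for (int j = j0; j < j1; j++)
                    grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
                               - grid[i - 1][j - 1];
            /* Signal that this tile is finished. */
            #pragma omp ordered depend(source)
        }
    }
    return grid[N - 1][N - 1];
}
```

Calling `blocked_sweep()` should give `2*(N-1)` in the corner for these boundary values, whether or not the file is compiled with OpenMP enabled. The diagonal dependence on tile `(ib-1, jb-1)` is not listed explicitly because it is already implied by the other two.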

bda
Posts: 6
Joined: Wed Nov 07, 2007 12:48 pm

Re: doacross loops - implementations and performance

Post by bda »

Thanks! Interesting collection of benchmarks.

I am aware that blocking can help performance here. My main question/concern is: why do gcc, clang and icc behave so differently? clang and icc behave very much alike, but they also use the same runtime, AFAIK.

ProtzeJoachim
Posts: 3
Joined: Tue Jun 30, 2015 7:40 am

Re: doacross loops - implementations and performance

Post by ProtzeJoachim »

For doacross loops, the possible concurrency depends on the chosen schedule. AFAIK, the implementation-specific default schedule for gcc is static, blocked (one contiguous block of iterations per thread). In that case, the second thread can only start executing once the first thread has reached the last iteration of its block. In the end, you see almost sequential execution with such a schedule.
Since you cannot rely on the implementation to always choose a reasonable schedule, you should specify a reasonable schedule.
As long as the implementation does not perform "fancy" (e.g., polyhedral) loop optimizations, choosing chunk size <= dependency distance in the distributed dimension seems reasonable.

I also think that, in many cases, collapse will prevent concurrency unless the chunk size is well chosen.
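
The effect of the schedule can be illustrated with a small idealized model. This is only a sketch under stated assumptions: every inner iteration costs one time unit, synchronization is free, iteration `(i, j)` depends on `(i-1, j)` and `(i, j-1)`, and outer iteration `i` runs on thread `(i / c) % p` for chunk size `c`. The function name `doacross_makespan` is hypothetical and does not come from any runtime.

```c
#include <assert.h>
#include <stdlib.h>

/* Idealized completion time of an n x m doacross nest on p threads with a
 * static schedule of chunk size c (outer loop distributed).
 * Model assumptions: unit cost per iteration, zero synchronization cost. */
long doacross_makespan(int n, int m, int p, int c) {
    long *prev = calloc((size_t)m, sizeof *prev);       /* finish times, row i-1 */
    long *cur = calloc((size_t)m, sizeof *cur);         /* finish times, row i   */
    long *free_at = calloc((size_t)p, sizeof *free_at); /* when each thread is idle */
    assert(prev && cur && free_at);

    long makespan = 0;
    for (int i = 0; i < n; i++) {
        int t = (i / c) % p;        /* thread that owns outer iteration i */
        long left = free_at[t];     /* thread must finish its earlier rows first */
        for (int j = 0; j < m; j++) {
            long up = (i > 0) ? prev[j] : 0;   /* sink: (i-1, j) */
            long start = (up > left) ? up : left;
            cur[j] = start + 1;
            left = cur[j];                     /* sink: (i, j-1) */
        }
        free_at[t] = cur[m - 1];
        if (cur[m - 1] > makespan)
            makespan = cur[m - 1];
        long *tmp = prev;           /* row i becomes the "previous" row */
        prev = cur;
        cur = tmp;
    }
    free(prev);
    free(cur);
    free(free_at);
    return makespan;
}
```

Comparing `doacross_makespan(64, 64, 4, 16)` (one block per thread) with `doacross_makespan(64, 64, 4, 1)` shows the blocked schedule finishing close to the serial time, while chunk size 1 lets the threads form a pipeline. Real runtimes add synchronization overhead that this model ignores.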

bda
Posts: 6
Joined: Wed Nov 07, 2007 12:48 pm

Re: doacross loops - implementations and performance

Post by bda »

Thanks for the scheduling hint. Do you know what LLVM/Intel use as the default schedule for this type of loop?
My goal is to have portable code that works and performs reasonably well with all compilers, and if this can be achieved by adding a schedule() clause, that will be fine!

jdoerfert
Posts: 1
Joined: Tue Nov 24, 2020 4:08 pm

Re: doacross loops - implementations and performance

Post by jdoerfert »

Clang uses `static` schedule with chunk size of `1` by default *for doacross loops*. Other loops are scheduled differently.
Here is the source if you are interested:
https://github.com/llvm/llvm-project/bl ... .cpp#L2496

~ Johannes

bda
Posts: 6
Joined: Wed Nov 07, 2007 12:48 pm

Re: doacross loops - implementations and performance

Post by bda »

Thanks Joachim and Johannes, both for the hint and the clarification! I now get similar behavior with all compilers after adding 'static,1' scheduling.
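
For readers who want to try this, here is a minimal 2D sketch of a doacross loop with an explicit `schedule(static, 1)`. It is not the 3D tutorial code discussed in this thread: the grid size `N`, the initial values, and the helper names are assumptions made for illustration. The helper compares the doacross sweep against a plain serial sweep.

```c
#include <assert.h>

#define N 66   /* grid points per dimension (hypothetical size) */

/* Gauss-Seidel style update shared by both sweeps. */
static inline double relax(double a[N][N], int i, int j) {
    return 0.25 * (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1]);
}

static void sweep_serial(double a[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            a[i][j] = relax(a, i, j);
}

static void sweep_doacross(double a[N][N]) {
    /* Explicit schedule: do not rely on the implementation's default. */
    #pragma omp parallel for ordered(2) schedule(static, 1)
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            /* Wait until the upper and left neighbours have been updated. */
            #pragma omp ordered depend(sink: i - 1, j) depend(sink: i, j - 1)
            a[i][j] = relax(a, i, j);
            /* Release iterations that depend on (i, j). */
            #pragma omp ordered depend(source)
        }
    }
}

/* Runs both variants for the given number of sweeps and returns the
 * largest absolute difference between the two grids. */
double doacross_max_diff(int sweeps) {
    static double s[N][N], d[N][N];
    assert(sweeps >= 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s[i][j] = d[i][j] = (double)((i * N + j) % 7);
    for (int k = 0; k < sweeps; k++) {
        sweep_serial(s);
        sweep_doacross(d);
    }
    double max_diff = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double diff = s[i][j] - d[i][j];
            if (diff < 0.0) diff = -diff;
            if (diff > max_diff) max_diff = diff;
        }
    return max_diff;
}
```

The chunk size of 1 matches the dependence distance of 1 in the distributed `i` dimension. Newer OpenMP versions spell these dependences with a `doacross` clause instead; the `depend(sink/source)` form shown here matches the compiler versions listed at the top of the thread.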
