## Optimization flag for OpenMP in GCC compiler

General OpenMP discussion

### Optimization flag for OpenMP in GCC compiler

Hi,
Hi,
I am working on the parallelization of a Conjugate Gradient matrix solver using OpenMP. A piece of my code is attached below:

```c
#pragma omp parallel num_threads(NTt) default(none) private(j,k) \
        shared(STA, COEFF, RLL, p_sparse_s, coef_a_sparse, res_sparse_s, \
               normres_sparse, nx, ny, nz)
{
    #pragma omp for reduction(+:normres_sparse)
    for (i = 1; i <= nx; i++)
        for (j = 1; j <= ny; j++)
            for (k = 1; k <= nz; k++)
            {
                p_sparse_s[i][j][k] =  COEFF[0][i][j][k] * STA[i-1][j][k]
                                     + COEFF[2][i][j][k] * STA[i][j-1][k]
                                     + COEFF[4][i][j][k] * STA[i][j][k-1]
                                     + COEFF[6][i][j][k] * STA[i][j][k]
                                     + COEFF[5][i][j][k] * STA[i][j][k+1]
                                     + COEFF[3][i][j][k] * STA[i][j+1][k]
                                     + COEFF[1][i][j][k] * STA[i+1][j][k];

                res_sparse_s[i][j][k] = RLL[i][j][k]
                                      + coef_a_sparse * p_sparse_s[i][j][k];

                normres_sparse += (res_sparse_s[i][j][k] * res_sparse_s[i][j][k])
                                / (nx*ny*nz);
            }
}
```

Please note that here I have defined 3-D matrices of size 200 x 200 x 200, i.e. nx = ny = nz = 200.

I am using the GCC compiler on an Intel i7 quad-core processor, and without using any optimization flag I get around 95% parallel efficiency on 4 cores:
i.e. run time for the serial code = 4 h, run time for the parallel code = 1 h 3 min (on 4 cores).
But when I use the -O3 flag, the serial code takes 2 h 30 min and the parallel code around 55 min.
So although I am getting faster results, the parallel efficiency decreases.

So my questions are:
(1) For benchmarking, should one use an optimization flag? If the answer is YES, which optimization flag should I use with OpenMP?
(2) I have heard the term "false sharing" but don't know much about it. Is this a false-sharing problem? I need to share a very large number of arrays.

Any suggestion/resolution will be highly appreciated.

Regards,
Saurish
saurishdas


### Re: Optimization flag for OpenMP in GCC compiler

Hi,

Looking at your code, I don't understand why i is not annotated as private.

(1) For benchmarking, should one use an optimization flag? If the answer is YES, which optimization flag should I use with OpenMP?

I would say YES. Users will compile with optimisation, so you should do the same. Some optimisations are well known, like loop unrolling (and many others), and the compiler can apply them while you keep your code clean.
I think these optimisations are not specific to OpenMP (I am not sure about that), but it is well known that -O2 is sometimes faster than -O3. You can also try -Ofast, but you have to test each of them to know which one is better.

About efficiency: sometimes the CPU is not the bottleneck of your program. Your memory bandwidth may be slower than the computation, and that may explain why efficiency decreases.
Looking at your code, this happens because your data are not contiguous in memory. When you write COEFF[5][i][j][k] with nested pointer arrays, you make four jumps through memory, whereas with a single contiguous allocation you can write: COEFF[(((5 * COEFF.shape[1] + i) * COEFF.shape[2]) + j) * COEFF.shape[3] + k]. In that case the data are contiguous in memory, so your computer should do it faster (and you may be able to use SSE instructions).

(2) I have heard the term "false sharing" but don't know much about it. Is this a false-sharing problem? I need to share a very large number of arrays.

I don't know what "false sharing" is, so I may be saying something wrong, but I think that shared data avoids copies, so having all variables shared should not cause any performance issue.

Regards,
Pierrick
pierrick


### Re: Optimization flag for OpenMP in GCC compiler

I think the main bottleneck in this code is likely to be memory bandwidth. Turning optimisation on (which is clearly the right thing to do, because it reduces the wall-clock time) will reduce the number of instructions executed, but cannot really do anything about the number of loads/stores required. The memory system becomes saturated by 4 threads all demanding data at the same time.

Reordering the COEFF array from COEFF[7][nx][ny][nz] to COEFF[nx][ny][nz][7] might improve the cache locality a bit: this might be what Pierrick is trying to say, but I'm not sure!

False sharing occurs when multiple threads access addresses which are on the same cache line (and at least one of the threads is writing the data). This does not look like a problem in your code, as the data accessed by different threads are well separated in memory.

pierrick wrote: Looking at your code, I don't understand why i is not annotated as private.

i is the iteration variable of the parallel loop, so it is private by default.
MarkB


### Re: Optimization flag for OpenMP in GCC compiler

Yes, Mark and Pierrick, I completely agree with you about reordering the COEFF array. I made it COEFF[nx][ny][nz][7] and it runs faster, but the parallel efficiency remains the same.

After reading some documents, I understand that the problem is not false sharing.

Actually, I came across a bug in the GCC compiler: it inhibits the automatic vectorization available with -O3 when the -fopenmp flag is used:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

I will try it with the icc compiler and let you know my findings. In the meantime, if you have any thoughts please let me know.

regards,
saurish
saurishdas

