The OpenMP Forums are now closed to new posts. Please visit Stack Overflow if you are in need of help: https://stackoverflow.com/questions/tagged/openmp
I am trying to understand if there are any guidelines in the OpenMP specification for where PRIVATE copies of arrays are allocated and if they consider the status of the MASTER copy.
With large arrays, it is often preferable to have these arrays on the heap, as the stack can overflow, while the heap is extendable.
This applies to an operating system that has a fixed stack and extendable heap, such as I have with 64-bit gFortran on Windows 10.
For performance, it is preferable that private arrays are on separate memory pages, so performance can be improved if private arrays are on the thread stack. This is likely to matter most with multiple small arrays that would otherwise share the same memory page. However, large arrays can overflow the stack(s) and should go on the heap.
In Fortran, the best way I know to direct large arrays to the heap is to make them ALLOCATABLE and allocate them within the !$OMP region (via recursive/openmp subroutines).
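As a minimal sketch of that approach (the module and routine names here are hypothetical, and the array size is arbitrary), the array is declared ALLOCATABLE, left unallocated before the region, listed as PRIVATE, and allocated by each thread inside the region. Where the compiler actually places the storage is not guaranteed by this code; it only shows the pattern. It assumes compilation with something like gfortran -fopenmp.

```fortran
module private_heap_sketch
  use omp_lib
  implicit none
contains
  ! Each thread allocates its own copy of work inside the parallel region,
  ! fills it, and contributes one element to a reduction.
  ! Returns .true. when the reduction matches the expected closed form.
  function run_private_heap(n) result(ok)
    integer, intent(in) :: n
    logical :: ok
    integer, allocatable :: work(:)   ! not allocated before the region
    integer :: nthreads, total

    nthreads = 1
    total = 0
    !$omp parallel private(work) shared(nthreads) reduction(+:total)
    !$omp single
    nthreads = omp_get_num_threads()
    !$omp end single
    allocate (work(n))                ! one allocation per thread
    work = omp_get_thread_num() + 1
    total = total + work(n)
    deallocate (work)
    !$omp end parallel

    ! Threads 0..T-1 each add (thread number + 1), so the sum is T*(T+1)/2.
    ok = (total == nthreads*(nthreads + 1)/2)
  end function run_private_heap
end module private_heap_sketch

program demo_private_heap
  use private_heap_sketch
  implicit none
  if (.not. run_private_heap(1000000)) error stop "unexpected reduction result"
  print *, "private allocatable arrays: ok"
end program demo_private_heap
```

Because the private copy starts out unallocated, no storage is taken from the thread stack for the array data itself; only the ALLOCATE inside the region requests memory.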
Some other questions, which relate to how I manage private arrays in gFortran and whether this is described in the OpenMP specification, include:
Do private arrays go on the heap, if the master is on the heap ?
Do private arrays go on the thread stack, if the master is on the stack ?
gFortran provides options for local private arrays with -fmax-stack-var-size=n or -fstack-arrays, but local automatic arrays appear to not be as manageable ?
How are private ALLOCATABLE arrays allocated, and does this depend on whether the master is allocated or not ?
It would be preferable if private arrays allocated on the heap would have a separate memory page for each thread, although this is not discussed.
In Fortran, master arrays can be static, local, automatic, allocated or allocatable. Does the OpenMP specification give any guidance for providing private arrays or do compilers advise of their approach ?
gFortran compile options and environment variables for changing the stack size are vague as to whether they affect the thread stacks or only the master stack.
This may be because the gFortran and GCC OpenMP documentation applies to multiple operating systems, although stack vs heap is common to most O/S.
I have found that the following linking command appears to control all stacks:
gFortran @ofile_list_gf.txt -fopenmp -fstack-arrays -Wl,-stack,16777216,-Map=program.map -o program.exe >>%tce% 2>&1
I do not modify the Environment Variable OMP_STACKSIZE
Is there another forum where these issues relating to OpenMP performance are discussed ?
Thanks for your reply.
The three main performance issues I am attempting to address are:
1) Use of the thread stack to improve cache performance.
2) Use of the heap to avoid stack overflow when scaling up problem size.
3) Manage memory allocation to avoid cache/memory coherence delays.
Although I have identified gFortran on 64-bit Windows, most operating systems have a single extendable heap but small fixed-size stacks, one per thread. There are also faster memory cache(s), whose management is a black art, especially via Fortran. My expectation is that PRIVATE copies of arrays are placed on the thread stack, which can overflow, producing an error in most operating systems that I am aware of. Writing a robust program that handles a range of problem sizes becomes difficult if it relies on PRIVATE stack arrays.
I also expect that in Fortran, I can place private arrays on the heap if I declare the array as ALLOCATABLE, leave it unallocated outside the !$OMP region, declare it as PRIVATE and then ALLOCATE it inside the !$OMP region. It would be a good compiler response if large ALLOCATABLE PRIVATE arrays, when placed on the heap, were allocated on a new memory page, especially if they belong to a different thread. For large arrays that are already ALLOCATEd outside the !$OMP region, their PRIVATE copies possibly go on the stack, although I have no guidance for this.
I don’t see any guidance to this behaviour in the OpenMP 4.5 specification.
Memory Allocators : do they help ?
I have now read the section on Memory Allocators in the OpenMP 5.0 specification. The allocator omp_large_cap_mem_alloc may address my point about large arrays, but I cannot see an allocator that clearly corresponds to what I would expect for small arrays, which is to place them on the unique thread stack. (What are the default storage attributes? Other options refer to “close to all threads”, which would be more relevant for SHARED arrays that are defined outside the !$OMP region, but not for PRIVATE arrays.)
For good cache performance, I would like all private variables and small arrays to be allocated on the thread stack, which would provide for these to be on a single memory page. To avoid cache coherence problems, private arrays on the heap from different threads should have a separate memory page.
Are Memory Allocators intended to manage PRIVATE arrays? The examples I reviewed did not appear to include this.
Is the intention of the Memory Allocators to address the problems I have identified in these posts?
It looks like you have looked a little bit already into the memory allocators feature that was added in OpenMP 5.0. From my understanding of your problem, I think that offers the most promising solution.
You can initialize a custom allocator with requested traits. For your problem, I think you would want to set the "partition" trait to "nearest", which per the spec will mean you request that the allocated memory be "placed in the storage resource that is nearest to the thread that requests the allocation."
You may use the omp_init_allocator routine to initialize an allocator, and then use that allocator inside an ALLOCATE clause on your PARALLEL DO construct. Example:
use omp_lib
integer(omp_allocator_handle_kind) :: my_alloc
my_alloc = omp_init_allocator( omp_default_mem_space, 1, [omp_alloctrait(omp_atk_partition, omp_atv_nearest)] )
!$OMP PARALLEL DO PRIVATE(arr) ALLOCATE(my_alloc: arr)
(please excuse syntax errors, if any)
This perhaps doesn't provide the explicit control you are looking for, but gFortran might be able to do what you're asking for with it.
When using gFortran in a 64-bit Windows environment, PRIVATE arrays can go to one of two locations:
1) the local thread stack, or
2) the single shared heap.
I have written a program to test a number of different array types.
The array types I considered are:
1 Fixed size arrays declared in a MODULE ( array mod_array(mm) )
2 Fixed size arrays declared in a COMMON ( array com_array(mm) )
3 Local arrays declared in the OMP routine ( array local_array(mm) )
4 Automatic arrays declared in the OMP routine ( array auto_array(m) )
5 OMP routine argument dummy_array(m) that was a local array on the master stack
6 OMP routine argument alloc_array(m) that was allocated on the heap
Allocatable arrays declared in OMP routine as array_a(:) and array_b(:)
7 array_a has been previously allocated before/outside OMP region
8 array_b is allocated in the OMP region
For these INTEGER arrays, mm is a parameter and m is a routine argument
! integer, parameter :: mm = 10*1024-4 ! would work best if mm = x * 1024 - 4 for 16 byte gap on Heap
! integer :: m ! is a routine argument
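The choice mm = x * 1024 - 4 can be checked with a little arithmetic. The sketch below assumes a 4-byte default INTEGER, a 4096-byte memory page, and the 16-byte gap between heap arrays reported in the results; all names are hypothetical. With mm = 10*1024 - 4 the array occupies 40944 bytes, and 40944 + 16 = 40960 = 10 x 4096, so arrays placed back to back on the heap would start a whole number of pages apart.

```fortran
module heap_stride_sketch
  implicit none
  integer, parameter :: int_bytes  = 4      ! assumed size of a default INTEGER
  integer, parameter :: gap_bytes  = 16     ! gap between heap arrays, as observed
  integer, parameter :: page_bytes = 4096   ! assumed memory page size
contains
  ! Distance in bytes from the start of one heap array of mm integers
  ! to the start of the next, if they are placed back to back.
  pure integer function heap_stride(mm)
    integer, intent(in) :: mm
    heap_stride = mm*int_bytes + gap_bytes
  end function heap_stride
end module heap_stride_sketch

program demo_heap_stride
  use heap_stride_sketch
  implicit none
  integer, parameter :: mm = 10*1024 - 4
  print *, "stride in bytes     :", heap_stride(mm)
  print *, "whole pages         :", heap_stride(mm)/page_bytes
  print *, "bytes past last page:", mod(heap_stride(mm), page_bytes)
end program demo_heap_stride
```

This only shows that the stride is a multiple of the page size; it does not show that the first array starts on a page boundary.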
The key results for using gFortran 8.3 in Windows x86-64 environment are:
# most OMP PRIVATE arrays are allocated on the thread stack. There is a separate stack for each thread.
# this includes master thread=0, so duplicate PRIVATE copies of thread 0 arrays are generated.
# while automatic arrays can be placed on the heap, their private copies are placed on the stack.
# only private copies of arrays with allocatable status are placed on the heap.
# private arrays placed on the heap are separated by 16 bytes ( 2 x 8 byte size header/trailer record?).
# this included private arrays from different threads, being separated by only 16 bytes.
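Observations of this kind can be made by printing the address of each thread's private copy. The following is a cut-down sketch, not the test program referred to in this thread: it covers only a local array and an array allocated inside the region, and the module and routine names are hypothetical. It assumes compilation with something like gfortran -fopenmp. Whether a given address lies on a stack or on the heap still has to be judged by comparing it with the known stack and heap ranges.

```fortran
module private_address_sketch
  use omp_lib
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
  implicit none
  integer, parameter :: mm = 10*1024 - 4
contains
  ! Prints, for every thread, the address of its private copy of a local
  ! array and of an array allocated inside the parallel region.
  ! Returns the number of threads that reported.
  function report_private_addresses() result(nreports)
    integer :: nreports
    integer, target :: local_array(mm)
    integer, allocatable, target :: array_b(:)
    integer(c_intptr_t) :: addr_local, addr_alloc
    integer :: counted

    counted = 0
    !$omp parallel private(local_array, array_b, addr_local, addr_alloc) &
    !$omp&         reduction(+:counted)
    allocate (array_b(mm))
    local_array = omp_get_thread_num()   ! touch both arrays so they are used
    array_b = local_array
    ! Reinterpret the C addresses as integers so they can be printed in hex.
    addr_local = transfer(c_loc(local_array), addr_local)
    addr_alloc = transfer(c_loc(array_b), addr_alloc)
    !$omp critical
    write (*, '(a,i3,2(a,z16.16))') 'thread', omp_get_thread_num(), &
          '  local_array @ ', addr_local, '  array_b @ ', addr_alloc
    !$omp end critical
    counted = counted + 1
    deallocate (array_b)
    !$omp end parallel
    nreports = counted
  end function report_private_addresses
end module private_address_sketch

program demo_private_address
  use private_address_sketch
  implicit none
  print *, "threads reporting:", report_private_addresses()
end program demo_private_address
```

The difference between the array_b addresses of two threads gives the separation between their heap copies, which can be compared with the array size in bytes.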
The test program and other discussion is on the Intel Fortran Compiler discussion forum at:
https://software.intel.com/en-us/forums ... pic/781199
The linked program provided identifies a way of reporting the size and location of all stacks in 64-bit Windows environment. This can be a useful way to confirm changing the stack sizes.
( In gFortran, the stack size can be changed using -Wl,-stack,16777216,-Map=omp_alloc.map. I have not used the environment variable OMP_STACKSIZE, although the thread 0 stack is most likely to overflow. )
In summary, the following points of interest are:
# Stack size : Master thread 0 private arrays are duplicated on the master stack, which can be a problem for stack overflow.
# Only arrays that are identified as ALLOCATABLE in the !$OMP routine have their PRIVATE copies placed on the HEAP. Allocated arrays supplied as routine arguments do not, unless they are identified as ALLOCATABLE in the omp routine (not tested).
# Private heap arrays for different threads can share the same memory page. A useful option for gFortran could be when adding a private array to the heap that is for a different thread from the previous array, to start it on a new memory page. The thread associated with each heap array may not be available at present.
# Management of AUTOMATIC arrays (based on size) between the stack and heap is done poorly using -fmax-stack-var-size=, although doing it well would require the allocation location to be selected at run time. It is only a memory address !
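Until a compiler offers such an option, page separation of private heap arrays can be approximated in user code. The sketch below is one possible workaround, not a gFortran feature: each thread over-allocates its private buffer by two pages and uses only the section that starts on a page boundary, so the used section cannot share a page with another allocation. The 4096-byte page size and all names are assumptions.

```fortran
module page_pad_sketch
  use omp_lib
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
  implicit none
  integer(c_intptr_t), parameter :: page_bytes = 4096   ! assumed page size
contains
  ! Each thread allocates n elements plus two pages of padding and works on
  ! the page-aligned section only. Returns .true. if every thread's working
  ! section starts on a page boundary.
  function padded_private_ok(n) result(ok)
    integer, intent(in) :: n
    logical :: ok
    integer, allocatable, target :: buf(:)
    integer, pointer :: work(:)
    integer(c_intptr_t) :: addr, skip_bytes
    integer :: elem_bytes, pad_elems, first, misaligned

    elem_bytes = storage_size(0)/8              ! bytes per default INTEGER
    pad_elems  = int(page_bytes)/elem_bytes     ! elements per page
    misaligned = 0
    !$omp parallel private(buf, work, addr, skip_bytes, first) &
    !$omp&         reduction(+:misaligned)
    allocate (buf(n + 2*pad_elems))
    addr = transfer(c_loc(buf), addr)
    ! Bytes to skip so that the working section starts on a page boundary.
    skip_bytes = mod(page_bytes - mod(addr, page_bytes), page_bytes)
    first = 1 + int(skip_bytes)/elem_bytes
    work => buf(first:first + n - 1)
    work = omp_get_thread_num()
    addr = transfer(c_loc(work(1)), addr)
    if (mod(addr, page_bytes) /= 0) misaligned = misaligned + 1
    nullify (work)
    deallocate (buf)
    !$omp end parallel
    ok = (misaligned == 0)
  end function padded_private_ok
end module page_pad_sketch

program demo_page_pad
  use page_pad_sketch
  implicit none
  if (.not. padded_private_ok(100000)) error stop "working section not page aligned"
  print *, "page-aligned private sections: ok"
end program demo_page_pad
```

Two pages of padding are used rather than one because up to one page may be skipped at the front, and the last page touched by the working section must also lie wholly inside the buffer. The cost is a fixed overhead per array, which is small for large arrays but wasteful for many small ones.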
The main reason for investigating this issue is that the Windows Stack (unique for each thread) once defined, is not extendable while the extendable Heap is shared between all threads. The default stack size is very small, which can lead to "Stack Overflow" errors, which should not occur in a modern 64-bit environment.
Managing arrays between the stack and heap may be addressed with Version 5.0 OpenMP Memory Model, although the options are very cryptic. I am not sure of the status for implementation of Version 5.0 or 5.1 of OpenMP.
I hope these findings are of use to others and would welcome any advice on corrections or omissions.
"duplicate PRIVATE copies of thread 0 arrays are generated."
This is mandated by the specification (wasn't always the case - before 3.0 it was permissible to reuse the original storage for the master's private copy).
"private arrays placed on the heap are separated by 16 bytes ( 2 x 8 byte size header/trailer record?)."
This likely holds the array descriptor (a.k.a. dope vector) - see https://thinkingeek.com/2017/01/14/gfor ... escriptor/ for details.