The problem I am having is related to "Each barrier must be encountered by all threads in a team" (Using OpenMP by Chapman et al, page 84)
I am using an i7-8700K with 12 possible threads.
I first reduce the skyline storage matrix using an OMP solver, with the default 12 threads.
I then solve 10 different solution cases for multiple time steps in a !$OMP PARALLEL DO loop: do i = 1,10 (it could also be 9 or 11).
The solution showed significant variation in the elapsed time of each thread, with each time step averaging between 2.9 and 6.5 seconds, depending on the run and the thread.
I assumed this might be related to the memory-to-cache transfer capacity, so I tried applying a barrier at the start of each time step so that each thread is using a similar part of the 16 GB of memory. This appears to work and makes run times more uniform, averaging an improved 2.5 seconds per time step.
In all cases the 10 or 11 threads have very similar calculation loads, but they appear to get delayed if their memory demand is out of step with the other threads.
(Running 11 threads was much slower than 10 threads, averaging 8+ seconds per step, which appears to indicate a memory transfer problem.)
My problem is that the first time I ran this BARRIER approach with 10 "active threads" (for DO i = 1,10), the barrier failed and the program crashed at the first time step.
Including a "call omp_set_num_threads (10)" before this OMP region apparently overcame the problem (by adjusting the team size to the loop size).
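As a minimal sketch of this workaround (solve_case is a placeholder name, not a routine from the actual program):

```fortran
! Sketch only: shrink the team to match the loop trip count before
! entering the parallel region, so every team thread gets one case.
call omp_set_num_threads (10)       ! team size = number of cases
!$omp parallel do schedule(dynamic)
do i = 1, 10
   call solve_case (i)              ! placeholder for one solution case
end do
!$omp end parallel do
```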
I can see from Task Manager that the number of threads associated with the program reaches 15, then declines to 12 during the reduction, then 10 after a few iterations of the successful runs. (I am not sure about the 15 threads early on, but the decline to 12 during the reduction and then 10 during the time steps looks right.)
Assuming I have understood the problem correctly, I have a number of questions.
1) Why can't BARRIER identify the number of active threads in this !$OMP region (10), rather than all threads in the team (12, as available and used in previous !$OMP regions)?
2) What did these 2 idle threads do? I thought they would not be involved.
3) If I use, say, my i7-4790K, with an 8-thread limit, and require a DO i = 1,11, which will have 8 threads in the first pass and then only 3 threads in the second pass, will the BARRIER cope with synchronising first the 8 threads and then the 3 threads? (Or 6 threads then 5 threads, or possibly 6 + 6 by having an extra 12th phantom run that is not needed but keeps all 6 selected threads active.)
Importantly, what will happen to BARRIER in the second pass if not all threads are active?
Essentially I am seeing a difference between "all threads in a team" and "active threads in a !$OMP region". Have I got this wrong ?
I am using gFortran Ver 8.3 on a Windows 10 system: i7-8700K + 32GB memory
I was hoping that BARRIER would identify the active threads and respond when all active threads have reached the barrier.
ie call omp_set_num_threads (4) ; do loop = 1,7 ! should identify 4 threads in the 1st pass then 3 threads in the second pass of the parallel loop.
Is there an alternative construct for this ?
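One standard-conforming alternative, assuming the synchronisation point can be expressed as the boundary between two loops: put two consecutive !$OMP DO worksharing loops inside a single parallel region. Each worksharing loop has an implied barrier at its end, so all team threads synchronise there regardless of how the iterations divide among the threads. (forward_pass and backward_pass are placeholder names for the two phases of each case.)

```fortran
!$omp parallel private(i)
!$omp do schedule(dynamic)
do i = 1, nloop
   call forward_pass (i)    ! placeholder: first phase of each case
end do
!$omp end do                ! implied barrier: all team threads wait here
!$omp do schedule(dynamic)
do i = 1, nloop
   call backward_pass (i)   ! placeholder: second phase of each case
end do
!$omp end do
!$omp end parallel
```

Unlike an explicit !$OMP BARRIER, this is legal whatever the relationship between nloop and the team size, because the barrier sits between worksharing constructs rather than inside one.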
( call omp_set_num_threads (num_threads) ; do loop = 1, n*num_threads appears to work with BARRIER, if legal )
I am thinking my use of BARRIER must not be allowed in a !$OMP PARALLEL DO region ?
Code: Select all
program test
   use omp_lib
   implicit none
   integer :: i, n, id, nloop, jj
   integer*4, parameter :: ni=16, nj=14, nk=27
   real*8 a(ni,nj), b(nj,nk), c(ni,nk)
!
   n  = omp_get_num_threads()
   id = omp_get_thread_num()
   write (*,11) 'outside parallel', n, id
11 format (a,' : threads = ',i0,' : thread = ',i0)
   a = 1
   b = 2

!$omp parallel default(none) private(i,n,id)
   n  = omp_get_num_threads()
   id = omp_get_thread_num()
   write (*,11) 'in parallel', n, id
!$omp do
   do i = 1, omp_get_num_threads()
      n  = omp_get_num_threads()
      id = omp_get_thread_num()
      write (*,12) 'in loop', n, id, i
12 format (a,' : threads = ',i0,' : thread = ',i0,' : loop = ',i0)
   end do
!$omp end do
!$omp end parallel
!
   nloop = 12
   write (*,*) 'give number of loops for work ( threads=4 > 8=good, 7=bad) ?'
   read (*,*) nloop
!
!$omp parallel do             &
!$OMP& private(i,n,id,c,jj)   &
!$OMP& shared (nloop,a,b)     &
!$OMP& SCHEDULE (DYNAMIC)
   do i = 1, nloop
!
      id = omp_get_thread_num()
      jj = nj+1-id
      call do_some_work (i, c,a,b, ni,jj,nk)
!
   end do
!$omp end parallel do

   write (*,*) 'end of test'
end program test

subroutine matmul_omp (c,a,b, ni,nj,nk)
   integer*4 ni,nj,nk
   real*8 :: c(ni,nk), a(ni,nj), b(nj,nk)
!
   integer*4 ii,jj,kk
   real*8 s

   do ii = 1,ni
      do kk = 1,nk
         s = 0
         do jj = 1,nj
            s = s + a(ii,jj)*b(jj,kk)
         end do
         c(ii,kk) = s
      end do
   end do
end subroutine matmul_omp

subroutine do_some_work (li, c,a,b, ni,nj,nk)
   use omp_lib
   integer li, ni,nj,nk
   real*8 c(ni,nk), a(ni,nj), b(nj,nk)
!
   integer*4 n, id
!
   n  = omp_get_num_threads()
   id = omp_get_thread_num()
   write (*,12) 'in work loop', n, id, li
12 format (a,' : threads = ',i0,' : thread = ',i0,' : loop = ',i0)
!
   write (*,12) 'before barrier', n, id, li
   call matmul_omp (c,a,b, ni,nj,nk)
   write (*,13) id, li, c(1,1)
13 format (10x,'thread = ',i0,' : loop = ',i0,' c(1,1) = ',f0.2)
!
!$omp barrier
   write (*,12) 'past barrier', n, id, li
   call matmul_omp (c,a,b, ni,nj,nk)
   write (*,13) id, li, c(1,1)
!
end subroutine do_some_work
johncampbell wrote:I am thinking my use of BARRIER must not be allowed in a !$OMP PARALLEL DO region ?

That's correct - you cannot have a barrier inside a worksharing loop region (see Section 2.20 on page 328 of the Version 5.0 spec).
I think you might be able to solve the problem with this pattern:
Code: Select all
call omp_set_num_threads(nthreads)
do ii = 1, nloop, nthreads
   ! note: nloop - ii + 1 is the number of iterations remaining
   if (nloop - ii + 1 .lt. nthreads) call omp_set_num_threads(nloop - ii + 1)
!$omp parallel private(i)
   i = ii + omp_get_thread_num()
   call sub1(i)
!$omp barrier
   call sub2(i)
!$omp end parallel
end do
However, if you simply restrict the number of threads to the number of physical cores (i.e. 6 on the i7-8700K), then you might find the problem goes away in any case.
I have overcome the problem by setting the number of runs as a multiple of the number of threads.
This requires adjusting both the loop iterations and the active threads.
However (typically in my case), if loops <= max_threads then the fix has minimal effect, although num_threads must still be reset.
Code: Select all
max_threads = omp_get_max_threads ()
num_pass    = (num_loops-1)/max_threads + 1
num_threads = (num_loops-1)/num_pass + 1
extra_loops = num_pass * num_threads - num_loops
num_loops   = num_pass * num_threads
!
call omp_set_num_threads (num_threads)
!
!$OMP PARALLEL DO                &
!$OMP&  PRIVATE (i, ...)         &
!$OMP&  SHARED (num_loops, ...)  &
!$OMP&  SCHEDULE (DYNAMIC)
!
do i = 1, num_loops
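As a worked example of the padding arithmetic above (the values 7 and 4 are chosen purely for illustration): with num_loops = 7 and max_threads = 4, the loop count rounds up to two passes of 4 threads each, with one phantom iteration.

```fortran
! Illustrative check of the pass/thread/padding calculation
program pad_example
   implicit none
   integer :: num_loops, max_threads, num_pass, num_threads, extra_loops
   num_loops   = 7
   max_threads = 4
   num_pass    = (num_loops-1)/max_threads + 1     ! (6)/4 + 1 = 2 passes
   num_threads = (num_loops-1)/num_pass + 1        ! (6)/2 + 1 = 4 threads
   extra_loops = num_pass*num_threads - num_loops  ! 8 - 7    = 1 phantom loop
   num_loops   = num_pass*num_threads              ! padded to 8 iterations
   write (*,'(4(i0,1x))') num_pass, num_threads, extra_loops, num_loops
end program pad_example
```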
Did you mean 12 (max_threads), rather than 6 (cores) as the 8700K allows hyper-threading ?
I do find for this processor that num_loops = num_threads = 10 performs much better than num_loops = 11, for the case where I can cope with 10 solution sets.
For this processor, I also find there is minimal penalty in setting num_threads = max_threads-1 when performing the reduction of a large set of linear equations using my OMP skyline solver, which suggests there is more potential for tuning (or understanding) of multi-threaded calculations. It is not always clear where the performance bottlenecks are when using many threads, and I suspect understanding the performance will become more difficult as the number of available threads increases.
Including !$OMP BARRIER has been an effective approach to reducing the run times (initially 5 hours, now down to near 3 hours).
I do not understand why BARRIER could not cope with number of active threads in a !$OMP PARALLEL DO ?
Thanks for your advice,
johncampbell wrote:Did you mean 12 (max_threads), rather than 6 (cores) as the 8700K allows hyper-threading ?

No, I really mean 6 - if your code is contending for memory bandwidth, L3 capacity and/or bandwidth, then using hyper-threading may be counter-productive.
johncampbell wrote:I do not understand why BARRIER could not cope with number of active threads in a !$OMP PARALLEL DO ?

The answer is partly performance - typically the runtime will allocate a lock-free data structure for the parallel region, which requires a fixed number of threads to work. I think it might also be difficult to specify the behaviour in all cases.
Thanks for your advice on reducing the number of threads to the number of cores; however, the optimised solution appears to be more complex.
I thought I would summarise my results, as this may assist others.
The basic problem is that I am now running 10 calculation threads in parallel on a 6-core, 12-thread (hyper-threaded) i7-8700K.
Each calculation runs 5,000 time steps, solving a set of linear equations for a pre-reduced symmetric matrix stored in linear skyline format, with a storage requirement of 16 GBytes as a SHARED array.
Each thread basically reads the 16 GB matrix 10,000 times.
By placing a !$OMP BARRIER at the start of the forward reduction and of the backward substitution, the time steps appear to stay better synchronised, so that multiple threads can share the same available cached memory.
In a previous worst case, without BARRIER, each time step averaged about 6 seconds (up to 8 seconds for 11 threads), with threads showing significant variability in run-time performance.
With BARRIERx2 and 10 threads, the run time is now 2.5 seconds per time step (3.0 seconds with BARRIERx1).
Reducing the number of threads to 6, the run time is 1.7 seconds per time step, although there are 2 passes (1.5 seconds for 5 threads and 1.4 seconds for 4 threads).
So a single pass of 10 threads and BARRIERx2 provides the quickest solution.
My conclusion is that the bottleneck is the memory transfer bandwidth required to support the 10 threads, which is reduced when multiple threads share the same memory pages in cache.
My next target is to look at something like the AMD EPYC 7402p, which has 24 cores and importantly 8 memory channels and see if a processor like this may address the bottleneck.
If anyone has a suggestion as to an alternative approach I would be interested.