Help with !$OMP BARRIER

Post by johncampbell » Mon Oct 14, 2019 6:36 pm

I have tried to use BARRIER to synchronise multiple threads that are doing a long computation, utilising the same large (16GByte) matrix.
The problem I am having is related to "Each barrier must be encountered by all threads in a team" (Using OpenMP by Chapman et al., page 84).

To explain:
I am using an i7-8700K with 12 possible threads.
I first reduce the skyline storage matrix using an OMP solver, using the default 12 threads.
I then solve 10 different solution cases for multiple time steps in a !$OMP PARALLEL DO loop ; do i = 1,10 ( it could also be 9 or 11)

The solution showed significant variation in the elapsed time of each thread, with each time step averaging between 2.9 and 6.5 seconds, depending on the run and the thread.
I assumed this may be related to the memory-to-cache transfer capacity, so I tried applying a barrier at the start of each time step so that every thread is using a similar part of the 16 GB of memory. This appears to work: run times are more uniform and average an improved 2.5 seconds per time step.
In all cases the 10 or 11 threads have very similar calculation loads, but they appear to get delayed if their memory demand is out of step with the other threads.
(11 threads was much slower than 10 threads, averaging 8+ seconds per step, which appears to point to a memory transfer problem.)

My problem is that the first time I ran this BARRIER approach with 10 "active threads" (for DO i = 1,10), the barrier failed and the program crashed at the first time step.
Including "call omp_set_num_threads (10)" before this OMP region apparently overcame the problem (by adjusting the team size to the loop size).

From Task Manager I can see that the number of threads associated with the program reaches 15, then declines to 12 during the reduction, then to 10 after a few iterations of the successful runs. (I am not sure about the 15 threads early on, but the decline to 12 during the reduction and then 10 during the time steps looks OK.)

Assuming I have understood the problem correctly, I have a number of questions.
1) Why can't BARRIER identify the number of active threads in this !$OMP region (10), rather than all threads in the team (12 available, and used in the previous !$OMP region)?
2) What did the 2 idle threads do, as I thought they would not be involved?
3) If I use, say, my i7-4790K with its 8-thread limit and need DO i = 1,11, which will have 8 threads in the first pass and then only 3 threads in the second pass, will BARRIER cope with synchronising first the 8 threads and then the 3 threads? (Or 6 threads then 5 threads, or possibly 6 + 6 by having an extra 12th phantom run that is not needed but keeps all 6 selected threads active.)
Importantly, what will happen to BARRIER in the second pass if not all threads are active?

Essentially I am seeing a difference between "all threads in a team" and "active threads in a !$OMP region". Have I got this wrong ?

John

I am using gFortran Ver 8.3 on a Windows 10 system: i7-8700K + 32GB memory
johncampbell
 
Posts: 14
Joined: Tue Aug 27, 2013 6:48 pm

Re: Help with !$OMP BARRIER

Post by johncampbell » Tue Oct 15, 2019 2:32 am

The following example appears to demonstrate my problem on my 4-thread i5-2300: it works for loops=8 but fails/hangs for loops=7.

I was hoping that BARRIER would identify the active threads and respond when all active threads have reached the barrier.
i.e. call omp_set_num_threads (4) ; do loop=1,7 ! should identify 4 threads in the 1st pass, then 3 threads in the second pass of the parallel loop.
Is there an alternative construct for this?
( call omp_set_num_threads (num_threads) ; do loop = 1,n*num_threads appears to work with BARRIER, if legal )

I am thinking my use of BARRIER must not be allowed in a !$OMP PARALLEL DO region ?
Code: Select all
program test

   use omp_lib
   implicit none

   integer :: i, n, id, nloop, jj
   integer*4, parameter :: ni=16, nj=14, nk=27
   real*8 a(ni,nj), b(nj,nk), c(ni,nk)
!
   n  = omp_get_num_threads()
   id = omp_get_thread_num()
   write (*,11) 'outside parallel',n,id
11 format (a,' : threads = ',i0,' : thread = ',i0)

   a = 1
   b = 2
   
   !$omp parallel default(none) private(i,n,id)
   n  = omp_get_num_threads()
   id = omp_get_thread_num()
   write (*,11) 'in parallel',n,id
   !$omp do
   do i = 1, omp_get_num_threads()
     n  = omp_get_num_threads()
     id = omp_get_thread_num()
     write (*,12) 'in loop',n,id,i
12 format (a,' : threads = ',i0,' : thread = ',i0,' : loop = ',i0)
   end do
   !$omp end do

   !$omp end parallel

!   nloop = 12
   write (*,*) 'give number of loops for work ( threads=4 > 8=good, 7=bad) ?'
   read (*,*) nloop
!
   !$omp parallel do            &
   !$OMP& private(i,n,id,c,jj)  &
   !$OMP& shared (nloop,a,b)    &
   !$OMP& SCHEDULE (DYNAMIC)
   do i = 1, nloop
!
     id = omp_get_thread_num()
     jj = nj+1-id
     call do_some_work (i, c,a,b, ni,jj,nk)
!
   end do
   !$omp end parallel do
   write (*,*) 'end of test'
end program test

subroutine matmul_omp (c,a,b, ni,nj,nk)
   integer*4 ni,nj,nk
   real*8 :: c(ni,nk), a(ni,nj), b(nj,nk)
!
   integer*4 ii,jj,kk
   real*8    s

   do ii = 1,ni
     do kk = 1,nk
       s = 0
       do jj = 1,nj
         s = s + a(ii,jj)*b(jj,kk)
       end do
       c(ii,kk) = s
     end do
   end do

end subroutine matmul_omp   

subroutine do_some_work (li, c,a,b, ni,nj,nk)
   use omp_lib
    integer li, ni,nj,nk
    real*8  c(ni,nk), a(ni,nj), b(nj,nk)
!
    integer*4 n, id   
!
     n  = omp_get_num_threads()
     id = omp_get_thread_num()
     write (*,12) 'in work loop',n,id,li
12 format (a,' : threads = ',i0,' : thread = ',i0,' : loop = ',i0)
!
     write (*,12) 'before barrier',n,id,li
     call matmul_omp (c,a,b, ni,nj,nk)
     write (*,13) id,li, c(1,1)
13  format (10x,'thread = ',i0,' : loop = ',i0,'  c(1,1) = ',f0.2)
!
     !$omp barrier
     write (*,12) 'past barrier',n,id,li
     call matmul_omp (c,a,b, ni,nj,nk)
     write (*,13) id,li, c(1,1)
!
end subroutine do_some_work

Re: Help with !$OMP BARRIER

Post by MarkB » Mon Oct 28, 2019 1:01 pm

johncampbell wrote: I am thinking my use of BARRIER must not be allowed in a !$OMP PARALLEL DO region ?


That's correct - you cannot have a barrier inside a worksharing loop region (see Section 2.20, "Nesting of Regions", on page 328 of the Version 5.0 spec).
I think you might be able to solve the problem with this pattern:

Code: Select all
call omp_set_num_threads(nthreads)
do ii = 1, nloop, nthreads
   ! last batch: only nloop - ii + 1 iterations remain, so shrink the team
   if (nloop - ii + 1 .lt. nthreads) call omp_set_num_threads(nloop - ii + 1)
   !$omp parallel private(i)
         i = ii + omp_get_thread_num()   ! thread numbers start at 0
         call sub1(i)
   !$omp barrier
         call sub2(i)
   !$omp end parallel
end do


(I may have some +/-1 errors in there!)
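
A fuller sketch of the same idea, written as a self-contained module so it can be compiled and tried (e.g. with gfortran -fopenmp). The names batch_barrier_demo, run_batches and the done counter are made up for illustration, and the two "phases" only increment a counter where sub1 and sub2 would go. Unlike the fragment above, it requests the team size with a num_threads clause and lets each thread stride over the batch, so no case should be skipped if the runtime grants fewer threads than requested.

```fortran
module batch_barrier_demo
   use omp_lib
   implicit none
contains

   ! Process nloop cases in batches of at most nthreads cases.
   ! done(i) counts the phases completed for case i; the increments are
   ! placeholders for the real work (sub1 and sub2 in the fragment above).
   subroutine run_batches (nloop, nthreads, done)
      integer, intent(in)  :: nloop, nthreads
      integer, intent(out) :: done(nloop)
      integer :: ii, last, i, tid, nt

      done = 0
      do ii = 1, nloop, nthreads
         last = min (ii + nthreads - 1, nloop)   ! last case in this batch

         !$omp parallel num_threads(last - ii + 1) private(i, tid, nt)
         tid = omp_get_thread_num ()             ! 0-based thread number
         nt  = omp_get_num_threads ()            ! team size actually granted

         do i = ii + tid, last, nt               ! normally one case per thread
            done(i) = done(i) + 1                ! phase 1
         end do

         !$omp barrier                           ! reached once by every thread in the team

         do i = ii + tid, last, nt
            done(i) = done(i) + 1                ! phase 2
         end do
         !$omp end parallel
      end do
   end subroutine run_batches

end module batch_barrier_demo
```

Called as "call run_batches (7, 4, done)" from a main program, this runs one batch of 4 cases and one of 3, and every element of done ends up as 2. The barrier is legal here because it sits directly in the parallel region, not inside a worksharing loop.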

However, if you simply restrict the number of threads to the number of physical cores (i.e. 6 on the i7-8700K), then you might find the problem goes away in any case.
MarkB
 
Posts: 782
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Help with !$OMP BARRIER

Post by johncampbell » Sat Nov 02, 2019 11:09 pm

Thanks Mark,

I have overcome the problem by setting the number of runs as a multiple of the number of threads.
This requires adjusting both the loop iterations and the active threads.
However, (typically in my case) if loops <= max_threads then the fix has minimal effect, although num_threads must be reset.
Code: Select all
      max_threads = omp_get_max_threads ()
      num_pass    = (num_loops-1)/max_threads + 1
      num_threads = (num_loops-1)/num_pass + 1
      extra_loops = num_pass * num_threads - num_loops
      num_loops   = num_pass * num_threads
!
      call omp_set_num_threads (num_threads)
!
!$OMP  PARALLEL DO                &
!$OMP& PRIVATE  (i, ...)          &
!$OMP& SHARED   (num_loops, ...)  &
!$OMP& SCHEDULE (DYNAMIC)
!
         do i = 1, num_loops


You also mentioned "if you simply restrict the number of threads to the number of physical cores (i.e. 6 on the i7-8700K), then you might find the problem goes away in any case."
Did you mean 12 (max_threads), rather than 6 (cores) as the 8700K allows hyper-threading ?
I do find for this processor that num_loops = num_threads = 10 performs much better than num_loops = 11, for the case where I can cope with 10 solution sets.

For this processor, I also find there is minimal penalty setting num_threads=max_threads-1 when performing the reduction of a large set of linear equations using my omp skyline solver which shows there could be more potential for tuning (or understanding) of multi-threaded calculations. It is not always clear where the performance bottlenecks are when using many threads and I suspect the understanding of performance would become more difficult as the number of available threads increases.

Including !$OMP BARRIER has been an effective approach in reducing the run times (initially 5 hours, now down to near 3 hours).
I do not understand why BARRIER could not cope with number of active threads in a !$OMP PARALLEL DO ?

Thanks for your advice,

John

Re: Help with !$OMP BARRIER

Post by MarkB » Fri Nov 08, 2019 3:10 am

johncampbell wrote: Did you mean 12 (max_threads), rather than 6 (cores) as the 8700K allows hyper-threading ?


No, I really mean 6 - if your code is contending for memory bandwidth, L3 capacity and/or bandwidth, then using hyperthreading may be counter-productive.
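
A minimal sketch of capping the thread count from inside the program, assuming you pass in the physical core count yourself (6 for the i7-8700K). The names cap_threads_demo and cap_threads are hypothetical. Setting the environment variable OMP_NUM_THREADS=6 before running is an alternative that needs no code change.

```fortran
module cap_threads_demo
   use omp_lib
   implicit none
contains

   ! Limit subsequent parallel regions to at most n_cores threads.
   ! n_cores is supplied by the caller: omp_get_num_procs typically counts
   ! logical processors, so it is not used to guess the physical core count.
   subroutine cap_threads (n_cores)
      integer, intent(in) :: n_cores

      call omp_set_num_threads (max (1, min (n_cores, omp_get_max_threads ())))
   end subroutine cap_threads

end module cap_threads_demo
```

Call it once, e.g. "call cap_threads (6)", before the first parallel region; it never raises the thread count above the current maximum.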

johncampbell wrote: I do not understand why BARRIER could not cope with number of active threads in a !$OMP PARALLEL DO ?


The answer is partly performance - typically the runtime will allocate a lock-free data structure for the parallel region which requires a fixed number of threads to work. I think it might also be difficult to specify the behaviour in all cases.
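
A small sketch of the "team" versus "active threads" distinction: in a PARALLEL DO with fewer iterations than threads, the threads that receive no iterations are still members of the team and simply wait at the implied barrier at the end of the loop. The names team_size_demo and count_team are made up for illustration; it assumes gfortran -fopenmp or similar.

```fortran
module team_size_demo
   use omp_lib
   implicit none
contains

   ! Run n_iter iterations on a requested team of n_req threads and report
   ! the team size seen inside the loop (team_size) and how many distinct
   ! threads actually executed an iteration (n_busy).
   subroutine count_team (n_iter, n_req, team_size, n_busy)
      integer, intent(in)  :: n_iter, n_req
      integer, intent(out) :: team_size, n_busy
      logical :: busy(0:n_req-1)
      integer :: i

      busy      = .false.
      team_size = 0

      !$omp parallel do num_threads(n_req) schedule(static, 1) &
      !$omp& shared(busy, team_size)
      do i = 1, n_iter
         busy(omp_get_thread_num ()) = .true.             ! each thread marks its own slot
         if (i == 1) team_size = omp_get_num_threads ()   ! written by one thread only
      end do
      !$omp end parallel do

      n_busy = count (busy)
   end subroutine count_team

end module team_size_demo
```

With "call count_team (2, 4, team_size, n_busy)" on a runtime that grants the 4 threads asked for, I would expect team_size = 4 but n_busy = 2. A barrier waits for team_size threads, not n_busy.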