Why exec. time varies greatly for SIMD with different sched?

General OpenMP discussion

Why exec. time varies greatly for SIMD with different sched?

Postby ClaudiaWhite » Thu Sep 28, 2017 12:51 am

Hi,

I've tried running the code below with three different schedules: static, guided, and dynamic.
I found that regardless of which schedule is used, the execution time is approximately the same (the difference is only 14 seconds).
Code:
#pragma omp parallel shared(a, b, c) private(i, j, k)
{
    #pragma omp for schedule(static, 1) collapse(3)
    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < p; k++) {
                a[i][j] = a[i][j] + b[i][k] * c[k][j];
            }
        }
    }
}


However, when I run the code with an additional simd clause, the execution times for the different schedules vary greatly.
For m = 5000, n = 5000, and p = 5000, the execution times measured with clock_gettime() on Ubuntu 14.04 (Intel Core i5) are as follows:
static - 2 sec
guided - 1 min
dynamic - 2 hrs

Code:
#pragma omp parallel shared(a, b, c) private(i, j, k)
{
    #pragma omp for simd schedule(static, 1) collapse(3)
    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < p; k++) {
                a[i][j] = a[i][j] + b[i][k] * c[k][j];
            }
        }
    }
}


What is the cause of that vast difference, and how can I investigate it?
Can you suggest any documents or pages I could refer to?

Thank you.

Best Regards,
Claudia
ClaudiaWhite
 
Posts: 7
Joined: Tue Jul 04, 2017 7:44 am

Re: Why exec. time varies greatly for SIMD with different sched?

Postby MarkB » Thu Sep 28, 2017 8:52 am

Collapsing all three loops and using a schedule (such as static,1 or dynamic) that assigns consecutive iterations of the innermost loop to different threads isn't going to combine well with vectorisation, since no thread ever owns a contiguous run of innermost iterations to put into vector lanes, and it may inhibit a lot of other compiler optimisations as well.

It will likely be better to use separate omp for and omp simd directives: parallelise only the outer loop (or the outer two loops) across threads, and vectorise the innermost one.
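
For example, here is a rough, untested sketch of that restructuring, assuming the same variables (a, b, c, i, j, k, m, n, p) as in your code; the scalar accumulator sum is my own addition, to make the innermost loop an explicit simd reduction:

Code:
#pragma omp parallel for schedule(static) shared(a, b, c) private(j, k)
for (i = 0; i < m; i++) {                /* threads split the rows of a */
    for (j = 0; j < n; j++) {
        double sum = a[i][j];            /* private scalar accumulator */
        #pragma omp simd reduction(+:sum)
        for (k = 0; k < p; k++) {        /* only this loop is vectorised */
            sum += b[i][k] * c[k][j];
        }
        a[i][j] = sum;
    }
}

With a static schedule on the outer loop, each thread also works on a contiguous block of rows, which is friendlier to the cache than interleaving single iterations across threads.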
MarkB
 
Posts: 768
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Why exec. time varies greatly for SIMD with different sched?

Postby ClaudiaWhite » Mon Oct 02, 2017 6:36 am

Hi,

Is there any way to investigate the cause of the varying times?
I mean, why is the difference so big when the dynamic schedule is used?

Thank you.

Best Regards,
Claudia
ClaudiaWhite
 
Posts: 7
Joined: Tue Jul 04, 2017 7:44 am

Re: Why exec. time varies greatly for SIMD with different sched?

Postby MarkB » Mon Oct 02, 2017 7:16 am

Using a dynamic schedule for this case will always have a big overhead: the threads need to synchronise internally (via a lock or an atomic counter) for every one of the 125 billion iterations of the innermost loop, to determine which thread will execute the next iteration. If the overhead of this internal synchronisation is of the order of 100 CPU clock cycles per iteration, which is not unreasonable, then that would explain the 2 hour runtime.
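
As a rough sanity check (assuming a clock rate of about 2.5 GHz, which is plausible for a Core i5 but isn't stated in this thread): 5000^3 = 1.25 × 10^11 iterations at 100 cycles each is 1.25 × 10^13 cycles, and 1.25 × 10^13 cycles / 2.5 × 10^9 cycles per second is about 5000 seconds, i.e. roughly an hour and a half. Because the threads serialise on the shared counter, adding more threads doesn't reduce this, so it is the same order of magnitude as the 2 hours you observed.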
MarkB
 
Posts: 768
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Why exec. time varies greatly for SIMD with different sched?

Postby ClaudiaWhite » Tue Oct 10, 2017 7:37 pm

Hi,

I need a clue here.

MarkB wrote:the threads need to synchronise internally (via a lock or an atomic counter) for every one of the 125 billion iterations of the innermost loop, to determine which thread will execute the next iteration.


How do you know that it is "every one of the 125 billion iterations"?

MarkB wrote:If the overhead of this internal synchronisation is of the order of 100 CPU clock cycles per iteration, which is not unreasonable, then that would explain the 2 hour runtime.


Is this "CPU clock cycles per iteration" determined by CPU clock frequency? Is there a way to find the accurate figure?

P.S. It would be great if there were a guide for studying this kind of thing, I mean the interaction between CPU architecture and OpenMP. :)

Thank you.

Best Regards,
Claudia
ClaudiaWhite
 
Posts: 7
Joined: Tue Jul 04, 2017 7:44 am

Re: Why exec. time varies greatly for SIMD with different sched?

Postby MarkB » Wed Oct 11, 2017 10:28 am

Hi Claudia,

ClaudiaWhite wrote:How do you know that it is "every one of the 125 billion iterations"?


That's what the collapse(3) clause does: it forms a single loop of length m*n*p and then applies the schedule clause to that implied loop. If the schedule kind is dynamic, the default chunk size is one, so the runtime will implement a first-come-first-served scheme where each thread just takes a single iteration at a time.
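
To make that concrete, here is a conceptual sketch (my own illustration, not the actual code of any particular runtime) of what the collapsed loop with schedule(dynamic,1) boils down to, assuming the same a, b, c, m, n, p as in your code:

Code:
/* conceptual model of "omp for schedule(dynamic, 1) collapse(3)" */
long next = 0;                         /* shared iteration counter */
#pragma omp parallel shared(a, b, c, next)
{
    long iter;
    for (;;) {
        #pragma omp atomic capture     /* one synchronised fetch per iteration */
        iter = next++;
        if (iter >= (long)m * n * p)   /* all 125 billion iterations handed out */
            break;
        int i = iter / ((long)n * p);  /* unpack (i, j, k) from the */
        int j = (iter / p) % n;        /* collapsed iteration index */
        int k = iter % p;
        a[i][j] += b[i][k] * c[k][j];  /* original loop body */
    }
}

The atomic capture is cheaper than a full lock, but it still forces every thread to serialise on a single shared memory location once per innermost iteration, and that is where the time goes.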

ClaudiaWhite wrote:Is this "CPU clock cycles per iteration" determined by CPU clock frequency? Is there a way to find the accurate figure?


Yes, but the cost of a lock (or other atomic operation) is typically related to the memory latency, i.e. the number of CPU clock cycles it takes to load a piece of data from memory into the CPU, which is typically 50-100 cycles on modern CPUs. It's hard to get an accurate figure other than by benchmarking it (which is effectively what you just did!).
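
If you want to measure it on your own machine, a small micro-benchmark along these lines would do (this is my own sketch, not code from elsewhere in the thread; compile with something like gcc -O2 -fopenmp):

Code:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const long iters = 100000000L;       /* 10^8 atomic increments in total */
    long next = 0;                       /* shared counter, as in dynamic,1 */
    double t0 = omp_get_wtime();
    #pragma omp parallel shared(next)
    {
        long iter;
        do {
            #pragma omp atomic capture   /* the operation being measured */
            iter = next++;
        } while (iter < iters);
    }
    double t1 = omp_get_wtime();
    printf("%.2f ns per atomic increment\n", (t1 - t0) * 1e9 / iters);
    return 0;
}

Multiplying the nanoseconds per increment by your clock rate in GHz converts it to CPU cycles per increment.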

ClaudiaWhite wrote:p/s: Would be great if there is a guideline to study about this kind of stuff. I mean internal architecture vs. OpenMP.


Some of my lecture slides might be helpful: have a look at the Architecture and Performance talks here: http://www.archer.ac.uk/training/course-material/2016/08/160802_AdvOpenMP_Bristol/index.php
MarkB
 
Posts: 768
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

