
Posted: Wed Oct 07, 2009 10:53 pm
Hello. I have a question concerning the use of task constructs inside work-sharing constructs, such as parallel for.
Consider the code below. It's completely useless, but it demonstrates what I'm trying to accomplish.
We want to compute the arithmetic series 1 + 2 + ... + MAXVAL, repeating it for each element of the arrays sums1 and sums2. sums1 is bigger than sums2 and thus requires more time to compute the series for each of its elements.

I'm using OpenMP to parallelize the for loop, breaking the summation 1 + 2 + ... + MAXVAL into chunks and using parallel for to assign the chunks to NUM_THREADS threads. Each thread is given its own slice of the sums1 and sums2 arrays, which it incrementally fills; afterwards we add up the threads' partial results to form the total sums. So one level of parallelism is the chunking of the summation loop. The second level is inside each iteration: the two inner loops over sums1 and sums2 are independent, and since the sums1 loop takes much longer, I'd like to run them concurrently (via either a task or nested parallel sections) rather than back to back.

Another problem: with the task-based approach I can't set NUM_THREADS greater than 8; otherwise I get a segmentation fault. With parallel sections I can create as big a team as I want. Also, with the task-based approach, increasing the number of threads seems to decrease performance rather than increase it. Strange goings-on.

Does anyone have any information about using tasks together with work-sharing constructs, and what might be the best way to handle such cases (where one section completes sooner than the other) inside other parallelization constructs (work-sharing in this case)?

Thanks!


``````
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#define Malloc(type,n) (type *)malloc((n)*sizeof(type))
#define CHUNK 10        // size of parallel-for chunks
#define NUM_THREADS 4   // size of the outer thread team

#define SECTIONBASED    // turns on the parallel-sections-based method; comment out to use tasks

#define MAXVAL 1000000  // max value for the arithmetic series
#define SIZES 10000     // size of first array
#define SIZES_SM 10     // size of second, smaller array

int main()
{
    // allocate one slice per thread, since each thread adds its part of the series
    double *sums1 = Malloc(double, NUM_THREADS*SIZES);
    double *sums2 = Malloc(double, NUM_THREADS*SIZES_SM);  // same, for the smaller array
    int i, j, k, tid;

    #pragma omp parallel private(i,j,k,tid) firstprivate(sums1,sums2) num_threads(NUM_THREADS)
    {
        tid = omp_get_thread_num();
        printf("%d\n", tid);

        sums1 += tid*SIZES;     // the part of the sums1 array that this thread adds to
        sums2 += tid*SIZES_SM;  // same as above but for the smaller array

        for (i = 0; i < SIZES; i++)
            sums1[i] = 0;       // initialize this thread's slice of sums1

        for (i = 0; i < SIZES_SM; i++)
            sums2[i] = 0;       // initialize this thread's slice of sums2

        omp_set_nested(1);

        #pragma omp for schedule(dynamic,CHUNK) nowait
        for (j = 1; j <= MAXVAL; j++)
        {
#ifndef SECTIONBASED
            #pragma omp task shared(j) private(k) firstprivate(sums2)
            {
                for (k = 0; k < SIZES_SM; k++)
                    sums2[k] += j;
            }

            for (k = 0; k < SIZES; k++)
                sums1[k] += j;
#else
            #pragma omp parallel sections shared(j) private(k) firstprivate(sums1,sums2) num_threads(2)
            {
                #pragma omp section
                {
                    for (k = 0; k < SIZES; k++)
                        sums1[k] += j;
                }
                #pragma omp section
                {
                    for (k = 0; k < SIZES_SM; k++)
                        sums2[k] += j;
                }
            }
#endif
        }
    }

    // add up the threads' partial sums to form the totals,
    // storing the results in the slice computed by the master thread
    for (i = 1; i < NUM_THREADS; i++)
    {
        for (j = 0; j < SIZES; j++)
            sums1[j] += sums1[i*SIZES+j];

        for (j = 0; j < SIZES_SM; j++)
            sums2[j] += sums2[i*SIZES_SM+j];
    }

    printf("sums bigger\n----------------------------------------\n");
    for (j = 0; j < SIZES; j++)
        printf("%lf\n", sums1[j]);

    printf("sums smaller\n----------------------------------\n");
    for (j = 0; j < SIZES_SM; j++)
        printf("%lf\n", sums2[j]);

    return 0;
}
``````

### Re: tasks inside parallel loops

Posted: Thu Oct 08, 2009 8:48 am
I guess I've been thinking about my problem wrong. My actual problem, and the illustrative code I presented, are trying to do the wrong thing: nesting concurrency inside a work-sharing construct like a parallel loop is guaranteed to be sub-optimal in the sense of achieving the highest utilization of the CPUs. It's a classic case of thinking of concurrency as utilization of threads, when in reality it is utilization of CPUs. So in my previous code, whether I use parallel sections or the task method, I am forcing some thread, which already is (or could be) doing something, to handle the summation of sums1 or sums2. In other words, I'm not getting any increased efficiency, because the CPUs are already busy on part of the problem (the chunks), and all I'm doing is adding unnecessary task/thread creation, i.e. overhead. Looking at the activity monitor, the highest user (and lowest system) utilization of the CPUs happens when I simply compute the sums1 and sums2 arrays serially within each iteration, rather than concurrently.

So, in my previous code, maximal efficiency is achieved with simply:


``````
#pragma omp for schedule(dynamic,CHUNK) nowait
for (j = 1; j <= MAXVAL; j++)
{
    for (k = 0; k < SIZES_SM; k++)
        sums2[k] += j;

    for (k = 0; k < SIZES; k++)
        sums1[k] += j;
}
``````