Question regarding taskloop

Postby medawsonjr » Thu Sep 27, 2018 12:01 pm

OS: CentOS Linux 7.3.1611 (Core)
ARCH: x86_64 (Intel Xeon CPU E5-2687W v2)
COMPILER: Clang v5.0.0

I'm trying to wrap my head around the inner workings of "taskloop", but for the life of me I can't understand why this code:

Code:
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <syscall.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <pthread.h>

#define N 1000

pid_t main_pid;

void threadCount(int pid, char * msg){
   
    int total_threads;
    FILE *fp;
    char ps[256];

    sprintf(ps, "ps h -o nlwp -p %d", pid);
    fp = popen(ps, "r");
    fscanf(fp, "%d", &total_threads);
    pclose(fp);
    printf("%s -- total_threads = %d\n", msg, total_threads);
   
}

void *do_work(void *arg) {

    (void) arg;   /* unused; pthread start routines take and return void * */

    int num_cores = omp_get_num_procs();
    double *a = malloc(N*N*sizeof(double));
    double *b = malloc(N*N*sizeof(double));
    double *c = malloc(N*N*sizeof(double));

    int i,j,k;

    double start = omp_get_wtime();

    # pragma omp taskloop simd shared (a, b, c) private (j, k) num_tasks(num_cores) nogroup
    for ( i = 0; i < N; i++ ) {
      for ( j = 0; j < N; j++ ) {
        a[i * N + j] = 1;
        b[i * N + j] = 2;
      }
    }

    for ( i = 0; i < N; i++ ) {
      for ( j = 0; j < N; j++ ) {
        c[i * N + j] = 0;
        for ( k = 0; k < N; k++ ) {
          c[i * N + j] = c[i * N + j] + a[i * N + k] * b[k * N + j];
        }
      }
    }

    printf("C[100] = %f -- Time: %f\n", c[100], omp_get_wtime() - start);

    free(a); free(b); free(c);

    return NULL;
}

int main() {
   
    int n_pthreads = 10;
    pthread_t threads[n_pthreads];
    main_pid = getpid();
    printf("main pid = %d\n", main_pid);
    double start;
    int i;
     
    // --------------
    // in main thread
    // --------------

    // ------------------------------------------------
    // pthreads created below run (mostly) simultaneously
    // ------------------------------------------------
    threadCount(main_pid, "before parallel threaded");

    start = omp_get_wtime();

# pragma omp parallel
# pragma omp single
    for (i = 0; i < n_pthreads; i++){
        pthread_create(&threads[i], NULL, (void *(*)(void *)) do_work, NULL);
    }       
    for (i = 0; i < n_pthreads; i++){
        pthread_join(threads[i], NULL);
    }
    printf("(parallel threaded) time = %f\n", omp_get_wtime() - start);

    threadCount(main_pid, "after parallel threaded");
    printf("\n");
   
    return 0;
   
}



runs twice as fast as this code:

Code:
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <syscall.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <pthread.h>

#define N 1000

pid_t main_pid;

void threadCount(int pid, char * msg){
   
    int total_threads;
    FILE *fp;
    char ps[256];

    sprintf(ps, "ps h -o nlwp -p %d", pid);
    fp = popen(ps, "r");
    fscanf(fp, "%d", &total_threads);
    pclose(fp);
    printf("%s -- total_threads = %d\n", msg, total_threads);
   
}

void *do_work(void *arg) {

    (void) arg;   /* unused; pthread start routines take and return void * */

    int num_cores = omp_get_num_procs();
    double *a = malloc(N*N*sizeof(double));
    double *b = malloc(N*N*sizeof(double));
    double *c = malloc(N*N*sizeof(double));

    int i,j,k;

    double start = omp_get_wtime();

    # pragma omp taskloop simd shared (a, b, c) private (j, k) num_tasks(num_cores) nogroup
    for ( i = 0; i < N; i++ ) {
      for ( j = 0; j < N; j++ ) {
        a[i * N + j] = 1;
        b[i * N + j] = 2;
      }
    }

    # pragma omp taskloop simd shared (a, b, c) private (j, k) num_tasks(num_cores) nogroup
    for ( i = 0; i < N; i++ ) {
      for ( j = 0; j < N; j++ ) {
        c[i * N + j] = 0;
        for ( k = 0; k < N; k++ ) {
          c[i * N + j] = c[i * N + j] + a[i * N + k] * b[k * N + j];
        }
      }
    }

    printf("C[100] = %f -- Time: %f\n", c[100], omp_get_wtime() - start);

    free(a); free(b); free(c);

    return NULL;
}

int main() {
   
    int n_pthreads = 10;
    pthread_t threads[n_pthreads];
    main_pid = getpid();
    printf("main pid = %d\n", main_pid);
    double start;
    int i;
     
    // --------------
    // in main thread
    // --------------

    // ------------------------------------------------
    // pthreads created below run (mostly) simultaneously
    // ------------------------------------------------
    threadCount(main_pid, "before parallel threaded");

    start = omp_get_wtime();

# pragma omp parallel
# pragma omp single
    for (i = 0; i < n_pthreads; i++){
        pthread_create(&threads[i], NULL, (void *(*)(void *)) do_work, NULL);
    }       
    for (i = 0; i < n_pthreads; i++){
        pthread_join(threads[i], NULL);
    }
    printf("(parallel threaded) time = %f\n", omp_get_wtime() - start);

    threadCount(main_pid, "after parallel threaded");
    printf("\n");
   
    return 0;
   
}


In the first case only one of the for loops is parallelized, while in the second case both are, yet the first runs roughly twice as fast. Am I misunderstanding something here?
medawsonjr
 
Posts: 3
Joined: Wed Aug 29, 2018 10:42 am

Re: Question regarding taskloop

Postby MarkB » Fri Sep 28, 2018 10:55 am

Can you explain why you are creating pthreads as well as OpenMP threads, please?
MarkB
 
Posts: 768
Joined: Thu Jan 08, 2009 10:12 am
Location: EPCC, University of Edinburgh

Re: Question regarding taskloop

Postby medawsonjr » Mon Oct 01, 2018 1:10 pm

Yes, MarkB! Therein lies the problem!

This code, as I would guess is the case for many of the questions posted on this forum, is inherited. The way it works is that requests for a model calculation on a specific exchange-traded options derivative (think Monte Carlo type calculation) come into a server. Based on the type of option, each request is routed to a specific pthread, which then runs the model calc as an OpenMP worksharing parallel-for code chunk. There are quite a few of these pthreads, each designated for a different options derivative. This means that, by the end of the day, *hundreds* of OpenMP threads have been forked and are running or idle on this server, with all the scheduling overhead that entails.

What I'd like to do is change things so that only a fixed number of OpenMP threads are ever in play - equal to the number of cores on the box. From what I can gather, I can do that in either one of two ways:

    1. Create all the OpenMP threads at the outset with "#pragma omp parallel" before spawning the separate pthreads (like in the example code included in the original post), and then only use 'taskloop' within each pthread so that no additional OpenMP threads are created but task chunks are simply doled out to existing OpenMP threads that are bound to cores via OMP_PROC_BIND/OMP_PLACES

    2. Get rid of the pthread creation part completely and just funnel incoming requests into a queue that the main thread reads from inside the "#pragma omp single" section. Those jobs are then sent to a 'taskloop' section within the same main thread, which will dole out the work to the preexisting OpenMP threads from the aforementioned "#pragma omp parallel".

My guess is that the performance issue in the code in my original post comes from a pthread trying to dole out tasks to OpenMP threads that were created in another pthread, which incurs a lot of locking overhead in the Clang v5 OpenMP implementation; hence my original question. There might also be a third pattern for tackling this that I haven't thought of, which you may be able to illuminate for me.

Hope this explanation helps, MarkB.

Re: Question regarding taskloop

Postby MarkB » Wed Oct 03, 2018 3:46 am

I think that getting rid of the pthreads is the right thing to do in the long run, as otherwise you are relying on implementation-specific (and likely undocumented) behaviour with respect to which thread does what. Option 2 sounds reasonable, unless there is a risk that one thread processing all the incoming requests becomes a bottleneck, in which case you might want to try to let any idle thread process them.

Re: Question regarding taskloop

Postby medawsonjr » Wed Oct 03, 2018 8:43 am

Regarding your comment "Option 2 sounds reasonable, unless there is a risk that one thread processing all the incoming requests becomes a bottleneck, in which case you might want to try to let any idle thread process them":

Is there an OpenMP construct that will provide that?

Re: Question regarding taskloop

Postby MarkB » Thu Oct 04, 2018 2:26 am

medawsonjr wrote: Is there an OpenMP construct that will provide that?


One option might be to poll for incoming work at the end of a task, and generate any new tasks at that point (i.e. before exiting the original task).