task pragma w/ 4 threads slower than single-thread

Post by five4time » Thu Mar 01, 2012 3:27 pm

Hi -

I'm new to OpenMP and see that others have also run into slower performance with multiple threads than with a single thread. In this case I'm using the task pragma. I've set the number of threads via an environment variable: "export OMP_NUM_THREADS=8". I'm running the program on an HP DL380G7 with two Intel® Xeon® X5680 processors, 6 cores each. My application does a lot of memory manipulation that can be done in parallel, so process() calls memset many times on a large buffer.

The app calls "process()" 4 times. As expected, the OpenMP version creates 4 threads right away. The function takes a couple of seconds to complete, so there should be a significant boost from parallel processing. There's no synchronization, shared memory, etc. Is the issue with my use of OpenMP directives or with the nature of the benchmark?

Results (seconds/nanoseconds):
openMP, 4 threads: 5:348410265
single thread: 5:28359497

Any suggestions will be greatly appreciated!

Contents of task_benchmark.h:

Code:
#define BUFFER_SIZE 1000000
#define ITERATIONS 10000


Code:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#include <omp.h>
#include "task_benchmark.h"

/* Clear a large local buffer ITERATIONS times. */
void process()
{
  char buf[BUFFER_SIZE];
  char * pBuf = &buf[0];
  int i;

  for (i = 0; i < ITERATIONS; i++)
    memset(pBuf, '\0', BUFFER_SIZE);
}

/* Return end - start, borrowing a second when the nanosecond part is negative. */
struct timespec diff(struct timespec start, struct timespec end)
{
  struct timespec temp;
  if ((end.tv_nsec-start.tv_nsec)<0) {
    temp.tv_sec = end.tv_sec-start.tv_sec-1;
    temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
  } else {
    temp.tv_sec = end.tv_sec-start.tv_sec;
    temp.tv_nsec = end.tv_nsec-start.tv_nsec;
  }
  return temp;
}

int main(int argc, char * argv[])
{
    int i;
    struct timespec time0;
    struct timespec time1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time0);

    /* One thread creates 4 tasks; the other threads in the team execute them. */
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            for(i = 0; i < 4; i++)
            {
                #pragma omp task
                process();
            }
        }
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time1);
    printf("\t\tin-line processing done :: %li:%li\n", diff(time0,time1).tv_sec, diff(time0,time1).tv_nsec);

    return 0;
}

Re: task pragma w/ 4 threads slower than single-thread

Post by MarkB » Fri Mar 02, 2012 9:35 am

The problem may be that you are recording CPU time, which could be accumulated across all threads, instead of wall clock time.
Can you try using omp_get_wtime() to do the timing instead?
Location: EPCC, University of Edinburgh
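
For readers who want to try the suggestion above, here is a minimal sketch of timing a task region with omp_get_wtime(). It is not the original poster's program: work(), time_tasks(), wall_now(), BUF_SIZE and REPS are hypothetical names and values chosen for illustration, the buffer is heap-allocated rather than on the stack, and the timespec_get() fallback is only there so the file still builds (serially) when OpenMP is not enabled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifdef _OPENMP
#include <omp.h>
/* Wall-clock seconds from the OpenMP runtime. */
static double wall_now(void) { return omp_get_wtime(); }
#else
#include <time.h>
/* Assumed fallback for builds without -fopenmp: C11 calendar time. */
static double wall_now(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
#endif

#define BUF_SIZE 1000000  /* same size as BUFFER_SIZE in the post */
#define REPS     200      /* hypothetical, much smaller than ITERATIONS */

/* Stand-in for process(): fill a private buffer REPS times. */
static unsigned work(void)
{
    unsigned char *buf = malloc(BUF_SIZE);
    unsigned last = 0;
    if (buf == NULL)
        return 0;
    for (int i = 0; i < REPS; i++) {
        memset(buf, i & 0xff, BUF_SIZE);
        last ^= buf[i % BUF_SIZE];  /* read back so the stores are used */
    }
    free(buf);
    return last;
}

/* Run ntasks tasks and return the elapsed wall-clock time in seconds. */
double time_tasks(int ntasks, unsigned *checksum)
{
    unsigned sum = 0;
    double t0 = wall_now();

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < ntasks; i++) {
                #pragma omp task shared(sum)
                {
                    unsigned r = work();
                    #pragma omp atomic
                    sum += r;
                }
            }
        }
        /* The implicit barriers here guarantee all tasks have finished. */
    }

    double t1 = wall_now();
    if (checksum != NULL)
        *checksum = sum;
    return t1 - t0;
}
```

Calling time_tasks(4, NULL) from main() and printing the result with "%f" gives the elapsed time in seconds. Because the timer reads wall-clock time, the value should shrink as more threads pick up tasks, which a process-wide CPU-time clock would not show.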

Re: task pragma w/ 4 threads slower than single-thread

Post by five4time » Fri Mar 02, 2012 11:11 am

You are correct - CLOCK_PROCESS_CPUTIME_ID was the wrong clock. I tried CLOCK_REALTIME and got the expected result, a 4x speedup, the same as with omp_get_wtime(). Thanks!
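
To see the difference between the two kinds of clock side by side, a small sketch along these lines may help. It reads CLOCK_PROCESS_CPUTIME_ID and a wall clock around the same parallel region. Note the swap: it uses CLOCK_MONOTONIC rather than the CLOCK_REALTIME mentioned above, since a monotonic clock is not affected by system time adjustments. spin(), seconds_between() and compare_clocks() are hypothetical helpers, not part of the original benchmark.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Seconds from reading a to reading b of the same clock. */
static double seconds_between(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec)
         + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Busy loop standing in for process(); volatile keeps the loop alive. */
static void spin(long n)
{
    volatile long sink = 0;
    for (long i = 0; i < n; i++)
        sink += i;
}

/* Run spin() once per thread and report CPU time and wall time. */
void compare_clocks(double *cpu_s, double *wall_s)
{
    struct timespec c0, c1, w0, w1;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);
    clock_gettime(CLOCK_MONOTONIC, &w0);

    /* Every thread in the team executes spin() once. */
    #pragma omp parallel
    spin(50000000L);

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);
    clock_gettime(CLOCK_MONOTONIC, &w1);

    *cpu_s  = seconds_between(c0, c1);  /* summed over all threads */
    *wall_s = seconds_between(w0, w1);  /* elapsed real time */
}
```

When built with OpenMP enabled and run with several threads, the CPU-time figure would be expected to come out at roughly the wall-time figure multiplied by the number of busy threads. That is why a parallel run can look no faster than a serial one when measured with CLOCK_PROCESS_CPUTIME_ID.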
