OMP and Deletion/Insertion in STL containers

General OpenMP discussion

OMP and Deletion/Insertion in STL containers

Postby michael_ims » Fri Sep 12, 2008 1:13 am

Dear OpenMP community,

I am working in image processing and want to use OpenMP to speed up my application. I am running Windows XP and my machine has 4 CPUS (2 * Dual Core).

My application uses C++ STL. More precisely, it makes use of the types list and vector. I am using OpenMP to manipulate these list and vectors in parallel. (insert and delete elements). To give you a rough idea about the program, I give some code in the following. This should be understood as pseudo code:

Code: Select all
#define NUMCPUS 4

typedef struct {
   char colours[3];
} Pixel;

void process_buffer (vector<list<Pixel>>> &PixelBuffer)
   insert_pixels (PixelBuffer);
   delete_pixels (PixelBuffer);

void main ()
   vector<list<Pixel>>> PixelBuffer;

   // replicate the PixelBuffer to allow for parallel processing
   vector<list<Pixel>>> ReplicatedPixelBuffers[NUMCPUS];
   for (int i = 0; i < NUMCPUS; i++)
      ReplicatedPixelBuffers[i] = PixelBuffer;

   #pragma omp parallel for
   for (int i = 0; i < NUMCPUS; i++)
      // this is the computational expensive part
      process_buffer (ReplicatedPixelBuffers[i]);
   #pragma omp barrier

   for (int i = 0; i < NUMCPUS; i++)
      // merge the results of the ReplicatedPixelBuffers
      merge (PixelBuffer, ReplicatedPixelBuffers[i];

My problem is relatively simple. The parallelization almost does not give any speedup. If I run my program just using one CPU (#define NUMCPUS 1) it takes around 100 seconds, while 4 CPUs do the same thing in 80 seconds. (I am working on images. The high processing time are expected.)

I know that my program is memory intensive. So my first explanation was that the problem is memory saturation. Now I did the following test. I used the command line to start 4 instances of my program simultaneously. Since the single-core implementation consumes 100 seconds, I assumed that the 4 instances would finish after 400 seconds (if the problem was memory saturation). However, it took only 120 seconds until all of them have finished. So the question was if the operating system manages to parallelize my program why does OpenMP fail to do this efficiently?

My current explanation for this behaviour is as follows. All threads generate new entries in the lists and vectors. This may cause that the entries are cluttered in memory. They probably look like this: (entry_cpu1 - entry_cpu4 - entry_cpu2 ...). Obviously better cache performance could be achieved if they would look like this: (entry_cpu1 - entry_cpu1 - entry_cpu1 ...). But how can I achieve this? Can I tell the 1st thread to use the first quarter of RAM, the 2nd to use the 2nd quarter and so on? I assume that the operating system does something like this.

Thanks a lot for your time.


Re: OMP and Deletion/Insertion in STL containers

Postby ejd » Mon Sep 15, 2008 8:54 am

It could be a memory problem as you describe. However, it could also be a problem with the insertion and deletion of items from the lists or from the memory allocation routines being used. The code you are showing is also confusing. You have a barrier outside of the parallel region. Could you please check to see whether your code example is correct and give me some more information about what these routines (insert_pixel, etc) are doing. Thanks.
Posts: 1025
Joined: Wed Jan 16, 2008 7:21 am

Return to Using OpenMP

Who is online

Users browsing this forum: No registered users and 5 guests