large private arrays


Postby crjensen » Mon May 19, 2008 2:04 pm

I'm having trouble making private arrays for a simulation program. Each thread gets its own piece of buffer memory where intermediate results are stored, but I'm having trouble doing this efficiently. At first I used automatic (stack) storage, declaring the buffer as buffer[x][y][z], and hoped that it would be kept between chunks of the parallel for loop it was in, so that each thread had a constantly allocated piece of memory for the buffer.

However, scaling up the program meant that these automatic buffers overran the stack. The only solution I could think of was to use malloc to create the buffer in each block and then free it at the end, so for each chunk of the for loop a large array is allocated and destroyed. To make matters worse, I read that malloc and free are made thread-safe by having their own built-in locks. So at each time step, a thread will go over several different blocks and have to do this each time, which is far from ideal.

I tried allocating the memory outside the parallel region and using firstprivate, but that just copies the pointer. The parallel region is divided into two single sections and the parallel for, so I can't think of a way to allocate a buffer for each thread and have it persist from chunk to chunk and timestep to timestep. Any help would be greatly appreciated.

Re: large private arrays

Postby ejd » Mon May 19, 2008 3:13 pm

What OS are you using? If you are using Unix (Solaris, Linux, etc.), then you can increase the stack size. A lot of people forget about doing this. Depending on your real resources, though, you might run into problems. You can declare the pointer private and then malloc storage within the parallel region, setting each pointer to a different malloc'ed region, before you enter the loop. There are many different versions of malloc, some more efficient than others when running multi-threaded. Again, it depends on the OS as to what I can suggest. You shouldn't have to malloc/free the space for each iteration. This would cause a lot of overhead.
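A minimal sketch of the private-pointer suggestion above, assuming a buffer of doubles. The names run_blocks and process_block are hypothetical and only stand in for the real per-block work; each thread allocates its scratch space once, before the work-shared loop, and frees it once afterwards.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-block work: fills the scratch buffer
   for block i and returns one value from it. */
static double process_block(int i, double *scratch, size_t n)
{
    for (size_t k = 0; k < n; k++)
        scratch[k] = (double)i;
    return scratch[n - 1];
}

/* Runs process_block over all blocks with one scratch buffer per thread.
   Returns 0 on success, -1 if any thread failed to allocate its buffer. */
int run_blocks(int num_blocks, size_t n, double *result)
{
    int failed = 0;
    assert(n > 0 && result != NULL);

    #pragma omp parallel reduction(|:failed)
    {
        /* Declared inside the region, so every thread has its own pointer. */
        double *scratch = malloc(n * sizeof *scratch);
        if (scratch == NULL)
            failed = 1;

        /* Every thread must reach the loop, even one whose malloc failed. */
        #pragma omp for
        for (int i = 0; i < num_blocks; i++) {
            if (scratch != NULL)
                result[i] = process_block(i, scratch, n);
        }

        free(scratch);  /* one free per thread, not one per iteration */
    }
    return failed ? -1 : 0;
}
```

With this layout the cost of malloc and free is paid once per thread per parallel region rather than once per chunk. Built without OpenMP, the pragmas are ignored and the code runs serially with a single buffer.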
Posts: 1025
Joined: Wed Jan 16, 2008 7:21 am

Re: large private arrays

Postby crjensen » Mon May 19, 2008 4:36 pm

Thanks for your comments. Making sure each thread got its own pointer was the solution. I declared a static pointer, marked it threadprivate, and allocated the memory in a parallel for inside the parallel region. As long as dynamic thread adjustment is turned off, the pointer value persists through all of the parallel regions, so that works and removes the unnecessary malloc and free.
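The persistence described here can be sketched roughly as follows. This is an illustration under assumptions, not the actual simulation code: the names scratch, buffers_init, buffers_in_use and buffers_free are hypothetical, and the buffer is assumed to hold doubles. The pointer values are only expected to survive between regions when the thread count stays the same, which is why dynamic adjustment is switched off first.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* File-scope pointer, one copy per thread once marked threadprivate. */
static double *scratch = NULL;
#pragma omp threadprivate(scratch)

/* Allocate every thread's buffer once, before the time-stepping loop. */
void buffers_init(size_t n)
{
    assert(n > 0);
#ifdef _OPENMP
    omp_set_dynamic(0);  /* fixed team size, so threadprivate values persist */
#endif
    #pragma omp parallel
    {
        scratch = malloc(n * sizeof *scratch);
    }
}

/* A later parallel region: counts the threads whose pointer is still set. */
int buffers_in_use(void)
{
    int count = 0;
    #pragma omp parallel reduction(+:count)
    {
        if (scratch != NULL)
            count += 1;
    }
    return count;
}

/* Release every thread's buffer at the end of the run. */
void buffers_free(void)
{
    #pragma omp parallel
    {
        free(scratch);
        scratch = NULL;
    }
}
```

A typical run would call buffers_init once, execute all time steps, and call buffers_free at the end. The guard on omp.h only lets the sketch build serially when OpenMP is disabled.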

Doing this didn't really help my scaling problems though. I'm wondering if I'm running into the NUMA bottlenecks of keeping my fluid field data in a giant shared memory pool. The threads themselves work on their local buffer data, but I still seem constrained by the slower non-local memory accesses. I've disabled all of the locks and synchronization and it still doesn't seem to scale past about 4 processors. Is there any advice for this sort of thing because I REALLY don't have time to rewrite this in MPI.

Thanks again for any help.

Re: large private arrays

Postby ejd » Mon May 19, 2008 5:02 pm

You haven't given me much information to go on. What is the hardware? What is the OS? How big is the shared memory pool? Are you only reading from it or writing as well? How are you synchronizing writes? How big are the local buffer areas? How many iterations are in the loop? When you allocated the memory, did you have the threads allocate it and initialize it or is one thread doing it all? Some example code would really help!
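The question about which thread initializes the memory matters on NUMA machines because Linux, by default, usually places a page on the node of the thread that first writes to it. A minimal sketch of parallel "first touch" initialization, assuming a field of doubles; alloc_first_touch and scale_and_sum are hypothetical names, and the actual benefit depends on the system's memory policy and thread placement.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates n doubles and zeroes them in parallel, so each thread is the
   first to write to the part of the array it will work on later. */
double *alloc_first_touch(size_t n)
{
    assert(n > 0);
    /* malloc rather than calloc: the pages should not be written here. */
    double *field = malloc(n * sizeof *field);
    if (field == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        field[i] = 0.0;

    return field;
}

/* Compute loop with the same static schedule, so each thread mostly
   revisits the elements it touched during initialization. */
double scale_and_sum(double *field, size_t n, double add)
{
    double sum = 0.0;

    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < (long)n; i++) {
        field[i] += add;
        sum += field[i];
    }
    return sum;
}
```

If a single thread initializes the whole array instead, all of its pages may end up on that thread's node, and every other thread then pays for remote accesses.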

You say that you are constrained by the slower non-local memory accesses. Have you run a profiler on this and this is what you are seeing or is this just your impression?

The good thing about OpenMP is that you can spend a small amount of time and get reasonable speedups. The bad thing about it is that to get large speedups you may have to spend some time looking at the code to see where the bottlenecks are.

Re: large private arrays

Postby crjensen » Tue May 20, 2008 12:36 pm

ejd wrote: You haven't given me much information to go on.

Sorry about that, I was sort of straightening things out in my head at the time. The server is running SUSE Linux and each processor is an Itanium 2 at 1.3 GHz with 1 GB memory. I believe each node has two processors and that the rest is tied together with SGI's NUMAlink. The shared memory pool is multiple gigabytes; as my code is written, it represents the main dataset. I read from that pool into temporary buffers that are tens of megabytes, update the fluid data, and write it back to the shared memory pool. For synchronization, the domain is decomposed along one dimension, so each thread updates a block of data, with each block having a lock at either end to ensure that the threads don't read and write to the same area at the same time.

The main shared memory pool is allocated and initialized in the master thread, and this is the part I'm working on now. Realizing how easy it was to use threadprivate for the temporary buffers, I'm going to try partitioning the field data into private memory for each thread and then use regions of shared memory for communication between threads, almost like an MPI implementation. The funny thing is that this won't actually mean much rewriting. Here's an example of what I wrote for the buffer memory and what I intend to do with most of the global data:

static buffer *tempBuffer=NULL;
static buffer *leftBuffer=NULL;
#pragma omp threadprivate(tempBuffer, leftBuffer)
//parallel region

#pragma omp for schedule(static,1)
      for (int i=0;i<numThreads;i++) {
         // braces keep both allocations inside the loop: one iteration per thread
         tempBuffer = (buffer*) malloc(sx*sy*2*sizeof(buffer));
         leftBuffer = (buffer*) malloc(sx*sy*sizeof(buffer));
         int threadNum = omp_get_thread_num();
         printf("threadNum %d address = %p\n", threadNum, (void*) tempBuffer);
      }

This is all based on my impressions of things, though. I don't know how to profile the OpenMP code. The people that run the server make no mention of any profiling software, and I can't even find any kind of profiler at my desktop in Visual Studio.
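When no profiler is available, a crude substitute is to time individual phases by hand with omp_get_wtime, which is part of the OpenMP runtime library. The sketch below is only an illustration: busy_phase and time_phase are hypothetical names, busy_phase stands in for a real phase of the solver, and the clock()-based fallback exists only so the code still builds without OpenMP.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
/* Serial fallback (CPU time) so the sketch still builds without OpenMP. */
static double omp_get_wtime(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
#endif

/* Hypothetical stand-in for one phase of the solver. */
static double busy_phase(long n)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += 1.0;

    return sum;
}

/* Runs one phase and returns the wall-clock seconds it took. */
double time_phase(long n, double *result)
{
    assert(result != NULL);
    double t0 = omp_get_wtime();
    *result = busy_phase(n);
    return omp_get_wtime() - t0;
}
```

Timing the read, update and write-back phases separately, at several thread counts, gives a first indication of which phase stops scaling.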

I also have an unrelated question that's been bothering me. For my main data set, I currently allocate it as one large block of linear memory and then address it with *(address + x + (y*sizeX) + (z*sizeX*sizeY)). I did this because I didn't want to use a triple indirection, but I've been wondering when it gets down to it which is the fastest way to manage large sets of data.
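For the addressing question, a common arrangement is to keep the single contiguous block and hide the offset arithmetic in a small inline function. The sketch below assumes a field of doubles; idx3 and fill_offsets are hypothetical names. Whether this is the fastest option depends on the compiler and the access pattern, but it avoids the extra pointer loads of a triple indirection and keeps the x direction contiguous in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of (x, y, z) with x varying fastest, matching
   address + x + (y*sizeX) + (z*sizeX*sizeY) from the post. */
static inline size_t idx3(size_t x, size_t y, size_t z,
                          size_t sizeX, size_t sizeY)
{
    return x + sizeX * (y + sizeY * z);
}

/* Stores each cell's own flat offset in it. The loops run z, y, x so the
   innermost loop walks consecutive memory locations. */
void fill_offsets(double *field, size_t sizeX, size_t sizeY, size_t sizeZ)
{
    assert(field != NULL);
    for (size_t z = 0; z < sizeZ; z++)
        for (size_t y = 0; y < sizeY; y++)
            for (size_t x = 0; x < sizeX; x++)
                field[idx3(x, y, z, sizeX, sizeY)] =
                    (double)idx3(x, y, z, sizeX, sizeY);
}
```

The loop order usually matters more than the indexing style: iterating with x innermost touches memory sequentially, while iterating with z innermost strides by sizeX*sizeY elements on every step.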

Thanks again for the help.
