False sharing


Post by flaggy » Fri Nov 28, 2008 12:13 am

I wanted to make a pi calculation method parallel, so I used the following code (it's an exercise from http://openmp.org/mp-documents/omp-hands-on-SC08.pdf):
Code:
#include <stdio.h>
#include <omp.h>
#define THREADS 2
static long num_steps = 10000000;
double step;
int main(void)
{
   int k;
   double sums[17*THREADS];
   for (k = 0; k < 17*THREADS; ++k)
      sums[k] = 0.0;

   int slice_size = num_steps/THREADS;
#pragma omp parallel num_threads(THREADS)
   {
      double x;
      int i;
      int tnum = omp_get_thread_num();
      step = 1.0/(double) num_steps;
      for (i = tnum*slice_size; i < tnum*slice_size + slice_size; ++i) {
         x = (i + 0.5)*step;
         sums[17*tnum] = sums[17*tnum] + 4.0/(1.0 + x*x);
      }
   }
   double sum = 0.0, pi;
   for (k = 0; k < 17*THREADS; ++k)
      sum += sums[k];
   pi = step * sum;
   printf("pi: %lf\n", pi);
   return 0;
}

Note that the 17 wasn't there at first; I added it after reading that my original version led to false sharing. False sharing happens because the two threads write to memory in the same cache line, so I spread the array out so that the two elements the threads update land in different cache lines. That didn't help, though: sometimes this runs faster than the sequential pi program, but sometimes it runs slower.

I realise that the proper way to do it is to create a shared variable and add to it inside a critical section; that gives much better times than trying to avoid the false sharing by enlarging the array. But why is that? Why doesn't the solution above work?

Re: False sharing

Post by ejd » Sun Nov 30, 2008 10:24 pm

You didn't say what type of chipset and operating system you are using. That can make a big difference, depending on whether or not it is a NUMA machine. Your first loop now initializes the entire array, even though not all of it is used; you could change the loop stride and shave off a little time. The more important issue is that on a NUMA machine, the usual algorithm for memory placement is "first touch". Since the entire array is initialized by the initial (master) thread, the memory will start out placed on one processor.

That means that when you start running the parallel region, each thread running on a processor other than the one where the memory was initialized will have to have the memory it uses migrated to its own processor. That can take quite a bit of time, depending on the chipset and operating system (and on what else is running at the same time).

Then at the end, you have another loop that combines all of the values from the array. Again, it sums more values than necessary, and changing the stride would help a little. And again, depending on whether it is a NUMA machine, the master thread is the one doing the summation, so the storage may have to be migrated once more.

There are many "proper" ways to do this, if by "proper" you mean getting the correct result. Depending on the machine architecture, operating system, etc., though, some are faster than others. Unfortunately, the performance depends on a lot of factors.