my first openmp program

General OpenMP discussion

my first openmp program

Postby tch » Fri Dec 05, 2008 9:00 am

Hi
I just started with OpenMP in Fortran. As a first application I want to parallelize the computation of the scalar products of some vectors.
The number of vectors is ~200 and their dimension is ~(50,50,50,2), and my program runs on a quad-core machine. I tried two ways: parallelizing the outer loop (over the vectors) and parallelizing the inner loop (over the entries of the vectors). But to my surprise both ways take much more time than the sequential version...
Code: Select all
PROGRAM test
IMPLICIT NONE
INTEGER,PARAMETER :: db=SELECTED_REAL_KIND(12,100)
INTEGER,PARAMETER :: N = 150
COMPLEX(db) :: c,p(50,50,50,2,N)
REAL(db) ::     im(50,50,50,2,N)
INTEGER :: i,j
COMPLEX(db) :: tst(N,N)
REAL :: t0,t1,t2,t3
! INIT p
CALL init_random_seed()
CALL RANDOM_NUMBER(im)
p = im
CALL RANDOM_NUMBER(im)
p = p + (0.0,1.0)* im
! OUTER LOOP
call cpu_time(t0)
!$OMP  PARALLEL DO SHARED(tst,p) PRIVATE(i,j) SCHEDULE(DYNAMIC,10)
DO i = 1,N
   DO j = 1,N
      tst(i,j) = scalarpr(p(:,:,:,:,i),p(:,:,:,:,j))
   END DO
END DO
!$OMP  END PARALLEL DO
write(*,*) "**************************************************************"
! SEQUENTIAL
call cpu_time(t1)
DO i = 1,N
        DO j = 1,N
                tst(i,j) = scalarpr(p(:,:,:,:,i),p(:,:,:,:,j))
        END DO
END DO
write(*,*) "**************************************************************"
! INNER LOOP
call cpu_time(t2)
DO i = 1,N
        DO j = 1,N
                tst(i,j) = scalarpr_p(p(:,:,:,:,i),p(:,:,:,:,j))
        END DO
END DO
write(*,*) "**************************************************************"
call cpu_time(t3)
! OUTPUT
write(*,*) t1-t0
write(*,*) t2-t1
write(*,*) t3-t2
CONTAINS
  FUNCTION scalarpr(pl,pr)  RESULT(c)
    COMPLEX(db) :: c,pl(:,:,:,:),pr(:,:,:,:)
    INTENT(IN) :: pl,pr
    c=SUM(CONJG(pl)*pr)   
  END FUNCTION scalarpr
  FUNCTION scalarpr_p(pl,pr)  RESULT(c)
    COMPLEX(db) :: c,pl(:,:,:,:),pr(:,:,:,:)
    INTEGER :: i
    INTENT(IN) :: pl,pr
   c = 0
   !$OMP  PARALLEL DO SHARED(pl,pr) PRIVATE(i) SCHEDULE(STATIC,5) REDUCTION(+:c)
   DO i = 1,size(pl,dim=3)
      c = c+SUM(CONJG(pl(:,:,i,:))*pr(:,:,i,:))
   END DO
   !$OMP  END PARALLEL DO
  END FUNCTION scalarpr_p
END PROGRAM test

(I have omitted the init_random_seed subroutine)
I made sure that the loops are distributed across the 4 processors, but I removed that output again so it would not distort the timing.
Am I doing something essentially wrong? Or is the size of my input just too small?
Where can I find some information on OpenMP threading overhead?
tch
 

Re: my first openmp program

Postby ejd » Fri Dec 05, 2008 7:05 pm

You don't say what OS, compiler, or processor you are using, so it is hard to say much. I took your program and ran it on an old Sun system using the Sun Studio compiler and got the following timings:
Code: Select all
78.5494         parallel outer loop
252.47574       sequential
89.86261        parallel inner loop

While I didn't see a 4x speedup, I clearly didn't see either of the parallel regions taking longer than the sequential run. Can you give me some more information?
ejd
 
Posts: 1025
Joined: Wed Jan 16, 2008 7:21 am

Re: my first openmp program

Postby tch » Wed Dec 10, 2008 7:41 am

I am using the Intel ifort compiler (9.1.037) and the processor is an AMD Opteron 270 (dual-core, 2.0 GHz) on a dual-processor board.
I have changed the structure of the code a little so that it is more consistent with the way it will be used in the actual program:
the vectors are allocated according to parameters read from a file, and the subroutines work on these arrays.
Code: Select all
PROGRAM test

IMPLICIT NONE
INTEGER,PARAMETER :: db=SELECTED_REAL_KIND(12,100)
INTEGER :: N
INTEGER :: s
INTEGER :: N_min,N_max,N_step
INTEGER :: s_min,s_max,s_step
COMPLEX(db),DIMENSION(:,:,:,:,:),ALLOCATABLE :: p
REAL(db),DIMENSION(:,:,:,:,:),ALLOCATABLE    :: p_real

!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
!%    read parameters
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
OPEN(3,FILE='benchmark.par',ACCESS='SEQUENTIAL',STATUS='OLD')
READ(3,*) N_min
READ(3,*) N_max
READ(3,*) N_step
READ(3,*) s_min
READ(3,*) s_max
READ(3,*) s_step
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
!%    LOOP
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
DO N = N_min,N_max,N_step
   DO s = s_min,s_max,s_step
   !%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   !%    allocate & init vectors
   !%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      ALLOCATE(p(s,s,s,2,N))
      ALLOCATE(p_real(s,s,s,2,N))
      CALL init_random_seed()
      CALL RANDOM_NUMBER(p_real)
      p = p_real
      CALL RANDOM_NUMBER(p_real)
      p = p+(0.0,1.0)*p_real
      WRITE(*,*) "% sequential %%%%%%%%%%%%%%%%%%%%%%%"
      CALL CPU_TIME(t0)
      CALL run_seq
      CALL CPU_TIME(t)
      WRITE(*,*) "cputime: ",t-t0
      WRITE(*,*) "% parallel  %%%%%%%%%%%%%%%%%"
      CALL CPU_TIME(t0)
      CALL run_p1
      CALL CPU_TIME(t)
      WRITE(*,*) "cputime: ",t-t0
      WRITE(*,*) "% parallel II %%%%%%%%%%%%%%"
      CALL CPU_TIME(t0)
      CALL run_p2
      CALL CPU_TIME(t)
      WRITE(*,*) "cputime: ",t-t0
   !%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   !%    deallocate vectors
   !%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      DEALLOCATE(p,p_real)
   END DO
END DO

CONTAINS
   SUBROUTINE run_seq
    INTEGER :: i,j
    COMPLEX(db) :: sum
    sum = 0
    DO i = 1,N
      DO j = 1,i-1
         sum = sum + scalarpr(p(:,:,:,:,i),p(:,:,:,:,j))
      END DO
      sum = sum + scalarpr(p(:,:,:,:,i),p(:,:,:,:,i))
    END DO   
   END SUBROUTINE run_seq
   SUBROUTINE run_p1
    INTEGER :: i,j
    COMPLEX(db) :: sum
    sum = 0
    DO i = 1,N
      DO j = 1,i-1
         sum = sum + scalarpr_p(p(:,:,:,:,i),p(:,:,:,:,j))
      END DO
      sum = sum + scalarpr_p(p(:,:,:,:,i),p(:,:,:,:,i))
    END DO   
   END SUBROUTINE run_p1
   SUBROUTINE run_p2
    INTEGER :: i,j
    COMPLEX(db) :: sum
    sum = 0
    DO j = 1,N
      sum = sum + scalarpr(p(:,:,:,:,i),p(:,:,:,:,i))
      !$OMP  PARALLEL DO PRIVATE(i) SCHEDULE(STATIC,5) REDUCTION(+:sum)
      DO i = i+1,N
         sum = sum + scalarpr(p(:,:,:,:,i),p(:,:,:,:,j))
      END DO
      !$OMP  END PARALLEL DO
    END DO   
    write(*,*) "dummy output:",sum
   END SUBROUTINE run_p2
END PROGRAM test


For the same input size as before (N=100, s=50) I get an error
Code: Select all
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC        Routine            Line        Source

Stack trace terminated abnormally.

The stack size is already set to unlimited...

And for smaller input sizes the sequential part is always faster.
e.g. for N=30 and s=40:
seq: 0.69
p1 : 1.61
p2 : 3.7

Can you give me a hint how to solve this problem?
tch
 

Re: my first openmp program

Postby ejd » Wed Dec 10, 2008 9:47 am

I see a couple of problems. You have used "implicit none", which is good practice. However, you have not declared the variables "t" and "t0". There is also a problem with subroutine run_p2, listed here:
Code: Select all
       SUBROUTINE run_p2
        INTEGER :: i,j
        COMPLEX(db) :: sum
        sum = 0
        DO j = 1,N
          sum = sum + scalarpr(p(:,:,:,:,i),p(:,:,:,:,i))
          !$OMP  PARALLEL DO PRIVATE(i) SCHEDULE(STATIC,5) REDUCTION(+:sum)
          DO i = i+1,N
             sum = sum + scalarpr(p(:,:,:,:,i),p(:,:,:,:,j))
          END DO
          !$OMP  END PARALLEL DO
        END DO
        write(*,*) "dummy output:",sum
       END SUBROUTINE run_p2

You will see that the first do loop uses "j" as its index, but the next line uses "i". The second problem is the inner loop, which uses "i" as its index while "i" is also in the private clause. Since the loop starts at "i+1" and the private "i" is never initialized, its starting value is garbage.
ejd
 

Re: my first openmp program

Postby tch » Wed Dec 10, 2008 10:28 am

thx for the reply
t and t0 are not declared because I tried to post only the relevant part (the init_random_seed() and scalarpr() routines are not included either).
I just wonder why the code runs without error when I use smaller parameters for N and s;
instead I would rather expect the program to produce an error for small values of N as well, since the uninitialized "i+1" should cause an out-of-bounds access.
However, I will fix the mistake and try to get some reasonable results.
tch
 

Re: my first openmp program

Postby tch » Thu Dec 11, 2008 7:51 am

The time I was measuring was the total CPU time, not the wall time, so it is clear that the parallel versions show a larger total consumption than the sequential version. Which time CPU_TIME() measures depends on the run-time option cpu_time_type. I now use OMP_GET_WTIME() to measure the elapsed time, plus CPU_TIME() to get an estimate of the dead time.

The uninitialized private variable was not what caused the segmentation fault; in fact the Intel compiler initializes the value to 0. It looks like something was going wrong with the REDUCTION. I removed it and now the error does not occur anymore.
tch
 

Re: my first openmp program

Postby ejd » Thu Dec 11, 2008 5:28 pm

I don't agree with your "solution". The variable "i" is undefined in both uses and when used can cause problems, since there is no guarantee what the value will be. As for removing the reduction clause, that is also fine - if you don't care about getting a consistent or correct result.
ejd
 

Re: my first openmp program

Postby tch » Fri Dec 12, 2008 3:25 am

No, the value of i is not undefined in this case; it is the compiler that does the initialization. I think this feature can be disabled, but the default setting is 'enabled'. If you don't agree... discuss with Intel ;). Of course the initial value of 0 caused subroutine run_p2 to do many more iterations than the sequential code.
And yes, I don't care about consistent or correct results here. I only added the reduction and an output of the result to prevent the compiler from optimizing too aggressively, i.e. from skipping calculations whose results are never used. But it turned out that in the case of nested loops the compiler does not (cannot?) decide this.
tch
 

Re: my first openmp program

Postby ejd » Mon Dec 15, 2008 9:43 am

I am not that familiar with the Intel compiler and didn't know it had this kind of default. Most compilers don't initialize values; you often happen to get zero, though it isn't guaranteed, and when the variables are put on the stack you will generally get garbage. If you always run with the same compiler and it does the initialization, that is fine; however, if you change to a different compiler, relying on uninitialized values like this is dangerous. Also, since you read the p array dimensions from a file, I had no idea what values you were using. I didn't use zero as the minimum (I started at 1), and thus when I did the sum in subroutine run_p2, it picked up garbage for the last dimension and the program aborted.
ejd
 

