## Parallel program showing speedup, but same wall time [F90]

I'm following an introductory class on (parallel) programming, in which we devoted two lessons to making Fortran 90 code parallel with the aid of OpenMP. As an exam assignment, we had to parallelize a program by ourselves. I chose to work with Game of Life (http://www.pdc.kth.se/education/tutorials/summer-school/mpi-exercises/mpi-lab-codes/game_of_life-serial.f90/view) and this is what I came up with:
```fortran
!----------------------
!  Conway Game of Life
!    serial version
!----------------------

program life

  use omp_lib

  implicit none

  integer, parameter :: ni=2000, nj=2000
  integer :: i, j, n, im, ip, jm, jp, nsum, isum, num_thr, nsteps
  integer, allocatable, dimension(:,:) :: old, new
  real :: arand, et, t1, e1, t2, e2, tarray(2)

  ! request the amount of iterations
  ! (these two statements are missing from the posted listing; they are
  !  an assumption reconstructed from the prompt in the program output)

  write(*,'(A)',advance='no') 'Please enter the number of iterations: '
  read(*,*) nsteps

  ! initiate time measurement

  et=0.0
  num_thr=1
  t1=dtime(tarray)
  e1=secnds(et)

  ! allocate arrays, including room for ghost cells

  allocate(old(0:ni+1,0:nj+1), new(0:ni+1,0:nj+1))

  do j = 1, nj
     do i = 1, ni
        call random_number(arand)
        old(i,j) = nint(arand)
     enddo
  enddo

  !  iterate

  time_iteration: do n = 1, nsteps

     ! corner boundary conditions

     old(0,0) = old(ni,nj)
     old(0,nj+1) = old(ni,1)
     old(ni+1,nj+1) = old(1,1)
     old(ni+1,0) = old(1,nj)

     ! left-right boundary conditions

     old(1:ni,0) = old(1:ni,nj)
     old(1:ni,nj+1) = old(1:ni,1)

     ! top-bottom boundary conditions

     old(0,1:nj) = old(ni,1:nj)
     old(ni+1,1:nj) = old(1,1:nj)

     !$omp parallel private(jm,j,jp,im,i,ip,nsum)
     !$omp do
     do j = 1, nj
        do i = 1, ni

           im = i - 1
           ip = i + 1
           jm = j - 1
           jp = j + 1
           nsum = old(im,jm) + old(im,j) + old(im,jp) &
                + old(i,jm )             + old(i,jp ) &
                + old(ip,jm) + old(ip,j) + old(ip,jp)

           select case (nsum)
           case (3)
              new(i,j) = 1
           case (2)
              new(i,j) = old(i,j)
           case default
              new(i,j) = 0
           end select

        enddo
     enddo
     !$omp enddo
     !$omp end parallel

     ! copy new state into old state

     old = new

  enddo time_iteration

  ! Iterations are done; sum the number of live cells

  isum = sum(new(1:ni,1:nj))

  ! Calculate resources used

  t2=dtime(tarray)
  e2=secnds(et)

  ! Print final number of live cells, including resources used.

  write(*,'(A14,A9,A10,A14,A9)') " Living Cells"," Threads"," CPU time"," Elapsed time"," Speedup"
  write(*,'(I14,I9,F10.4,F14.4,F9.4)') isum,num_thr,t2,e2-e1,t2/(e2-e1)

  deallocate(old, new)

end program life
```

And this is the output:
```
.../Par_Prog/OpenMP $ ./ser_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   26.0216       26.0234   0.9999

.../Par_Prog/OpenMP $ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   87.2094       25.1367   3.4694
```

So the parallel code is showing a speedup, but the same wall time as the serial code, which I don't understand. Could somebody enlighten me?
Aertsvijand

Posts: 4
Joined: Mon Jun 17, 2013 4:33 am

### Re: Parallel program showing speedup, but same wall time [F90]

Hi,

There are some issues in your code and in your speedup calculation:
0) What are the differences between ./ser_game_of_life and ./par_game_of_life?
1) The number of actual OpenMP threads is not related to the variable num_thr.
2) You are computing serial time as the CPU time of the parallel code, which is not fair, since that counts parallel overhead as part of the serial time. The standard way of computing speedup is to take the serial and parallel wall-clock times, for which the OpenMP function omp_get_wtime() is usually suggested.
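
As an illustration of point 2, here is a minimal sketch of wall-clock timing with omp_get_wtime(). The program name, array and loop body are hypothetical stand-ins, not part of the Game of Life code, and the sketch assumes compilation with gfortran -fopenmp.

```fortran
program wtime_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 2000           ! hypothetical problem size
  integer :: i, j
  double precision :: t_start, t_end
  double precision, allocatable :: a(:,:)

  allocate(a(n,n))

  t_start = omp_get_wtime()                ! wall-clock time before the region

  !$omp parallel do private(i)
  do j = 1, n
     do i = 1, n
        a(i,j) = dble(i) + dble(j)         ! stand-in workload
     end do
  end do
  !$omp end parallel do

  t_end = omp_get_wtime()                  ! wall-clock time after the region

  write(*,'(A,I0)')    'Max threads available: ', omp_get_max_threads()
  write(*,'(A,F10.4)') 'Elapsed wall time (s): ', t_end - t_start
  write(*,'(A,F14.1)') 'Checksum: ', sum(a)

  deallocate(a)
end program wtime_sketch
```

Timing the serial build and the parallel build in the same way gives two wall times, and their ratio (serial over parallel) is the speedup in the usual sense.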

HTH,

Fernando.
PS: the source code seems a little bit strange to me, because the first and last rows and columns of array "new" are never assigned... but maybe I'm missing something...
ftinetti

Posts: 603
Joined: Wed Feb 10, 2010 2:44 pm

### Re: Parallel program showing speedup, but same wall time [F90]

0) ser_game_of_life is compiled as gfortran -o ser_game_of_life game_of_life.f90, while par_game_of_life is compiled as gfortran -o par_game_of_life -fopenmp game_of_life.f90, i.e. the serial and parallel versions of the program.
1) I fixed that with
```fortran
...
!$omp parallel private(jm,j,jp,im,i,ip,nsum)
!$omp master
!$omp end master
!$omp do
...
```
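
The body of the master block is not shown in the post. As a minimal sketch only, assuming the intent was to record the size of the thread team in num_thr, such a block could call omp_get_num_threads() (a standard OpenMP routine); the program name is hypothetical.

```fortran
program thread_count_sketch
  use omp_lib
  implicit none
  integer :: num_thr

  num_thr = 1                              ! value reported by a serial run

  !$omp parallel
  !$omp master
  ! Assumption: only the master thread records the team size.
  ! omp_get_num_threads() returns 1 when called outside a parallel region.
  num_thr = omp_get_num_threads()
  !$omp end master
  !$omp end parallel

  write(*,'(A,I0)') 'Threads used: ', num_thr
end program thread_count_sketch
```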

2) I'm not quite following the explanation you give about the measurement of time. I have looked into omp_get_wtime(), but that function is only recognised when used in combination with the -fopenmp compiler flag, so I'm not sure how to use it...
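
One possible way around this, sketched under the assumption that a single source file should build both with and without -fopenmp: lines starting with the `!$ ` sentinel are compiled only when OpenMP is enabled and are plain comments otherwise, so the standard system_clock intrinsic can serve as the fallback timer. The program name and the workload are hypothetical.

```fortran
program portable_timer_sketch
  !$ use omp_lib
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer(i8) :: c_start, c_end, c_rate
  double precision :: t_start, t_end, s
  integer :: i

  ! Fallback timer: standard Fortran, available without OpenMP.
  call system_clock(count=c_start, count_rate=c_rate)
  t_start = dble(c_start) / dble(c_rate)
  ! The next line is only compiled when OpenMP is enabled.
  !$ t_start = omp_get_wtime()

  s = 0.0d0
  !$omp parallel do reduction(+:s)
  do i = 1, 10000000
     s = s + 1.0d0 / dble(i)               ! stand-in workload
  end do
  !$omp end parallel do

  call system_clock(count=c_end)
  t_end = dble(c_end) / dble(c_rate)
  !$ t_end = omp_get_wtime()

  write(*,'(A,F10.4)') 'Elapsed wall time (s): ', t_end - t_start
  write(*,'(A,F10.6)') 'Partial harmonic sum : ', s
end program portable_timer_sketch
```

Built without -fopenmp, this runs serially and times itself with system_clock; built with -fopenmp, the same file uses omp_get_wtime().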

About the outermost rows/columns: they function as "dummy"-rows/columns, since the next-to-outermost cells wouldn't have the required neighbours. This is fixed by making some kind of torus of the map by copying the last "real" row to the upper dummy row, the first "real" row to the lower dummy row and the same for the dummy columns.

However, the main issue still stands: why does the program take just as much real time in both the serial and the parallel version?
Aertsvijand


### Re: Parallel program showing speedup, but same wall time [F90]

> However, the main issue still stands: why does the program take just as much real time in both the serial and the parallel version?

Hmmm... maybe it will take a little bit of work, but I think it would be possible.

First: please post
a) Computer: processor and number of processors. If you are on Linux, the output of
$ cat /proc/cpuinfo
would be good enough.
b) Compiler and compiler options used for generating the serial as well as the parallel version.

Second: run the serial version, i.e. the one generated without the OpenMP compiler option, and post the output (maybe it is the previous one you posted, but post it anyway, just for completeness).

Third: run the parallel version with 1, 2 and 4 threads, in sequence:
$ ./par_game_of_life
... <program output here>
$ ./par_game_of_life
... <program output here>
(if you have 4 cores or more)
$ ./par_game_of_life

Maybe at this point you'll find the explanation by yourself, but post the results anyway.
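
The thread does not show how the number of threads was chosen for each run; the usual way is the OMP_NUM_THREADS environment variable. As an alternative sketch, assuming one wants a single executable to sweep over thread counts, omp_set_num_threads() can be called before each parallel region. The program name, the counts array and the workload are hypothetical.

```fortran
program thread_sweep_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 20000000       ! hypothetical workload size
  integer, parameter :: counts(3) = (/ 1, 2, 4 /)
  integer :: i, k, nthreads
  double precision :: t0, t1, s

  do k = 1, size(counts)
     call omp_set_num_threads(counts(k))   ! request a team size
     s = 0.0d0
     nthreads = 1
     t0 = omp_get_wtime()

     !$omp parallel
     !$omp master
     nthreads = omp_get_num_threads()      ! team size actually granted
     !$omp end master
     !$omp do reduction(+:s)
     do i = 1, n
        s = s + sqrt(dble(i))              ! stand-in workload
     end do
     !$omp end do
     !$omp end parallel

     t1 = omp_get_wtime()
     write(*,'(A,I2,A,F8.4,A,ES14.6)') 'threads=', nthreads, &
          '  wall(s)=', t1 - t0, '  sum=', s
  end do
end program thread_sweep_sketch
```

Reporting the team size obtained inside the region, rather than the requested one, matters because the runtime may grant fewer threads than requested.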

HTH,

Fernando.
ftinetti


### Re: Parallel program showing speedup, but same wall time [F90]

I'm using Linux Mint running as a virtual machine in VirtualBox. I have dedicated my two cores, which have access to HyperThreading, to the virtual machine.

a) CPU info from $ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

b) Compiler and compiler options
Compiler: gfortran, so that would be the GCC compiler
$ gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.2-2ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-2ubuntu1)

Serial: gfortran -o ser_game_of_life game_of_life.f90
Parallel: gfortran -o par_game_of_life -fopenmp game_of_life.f90

c) Serial version
```
$ ./ser_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   27.6657       27.6680   0.9999
```

d) Sequence
```
$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   40.2865       40.2891   0.9999

$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        2   51.2912       26.4102   1.9421

$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        4   84.7533       24.5039   3.4588
```

I guess the program is speeding up, but the overhead of the parallelisation is causing the program to be just as fast as the serial version?
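
The "Speedup" column printed by the program is CPU time divided by elapsed time (t2/(e2-e1) in the listing), so it grows with the number of threads even when the wall time does not improve. A minimal sketch of that effect, assuming gfortran with -fopenmp, where cpu_time typically reports process CPU time summed over all threads; the program name and workload are hypothetical.

```fortran
program cpu_vs_wall_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 50000000       ! hypothetical workload size
  integer :: i
  real :: cpu_start, cpu_end
  double precision :: wall_start, wall_end, s

  call cpu_time(cpu_start)                 ! CPU time of the whole process
  wall_start = omp_get_wtime()             ! wall-clock time

  s = 0.0d0
  !$omp parallel do reduction(+:s)
  do i = 1, n
     s = s + sin(dble(i))                  ! stand-in workload
  end do
  !$omp end parallel do

  call cpu_time(cpu_end)
  wall_end = omp_get_wtime()

  write(*,'(A,F10.4)') 'CPU time (s)  : ', cpu_end - cpu_start
  write(*,'(A,F10.4)') 'Wall time (s) : ', wall_end - wall_start
  write(*,'(A,F12.6)') 'Result        : ', s
end program cpu_vs_wall_sketch
```

With several threads the CPU time should come out well above the wall time; their ratio measures how many cores were kept busy, not how much faster the program is than the serial build.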
Aertsvijand


### Re: Parallel program showing speedup, but same wall time [F90]

> I guess the program is speeding up, but the overhead of the parallelisation is causing the program to be just as fast as the serial version?

Exactly, and unfortunately you don't have more than 2 cores to see any actual improvement wrt serial time. What happens in a non-virtual pair of Xeon processors is similar:
```
$ ./ser_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   53.3193       53.3184   1.0000

$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   98.2101       98.2090   1.0000

$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1  102.4504       51.3730   1.9942
```

Now, please compile with some compiler optimization option, e.g.

gfortran -O2 -o ser_game_of_life game_of_life.f90
gfortran -O2 -o par_game_of_life -fopenmp game_of_life.f90

and please post the three runtimes (serial, parallel with one thread, and parallel with two threads). Usually, runtimes change a lot with optimized code (or, at least, with no debug-specific code generation).

HTH,

Fernando.
ftinetti


### Re: Parallel program showing speedup, but same wall time [F90]

Well indeed, what a difference!

```
$ gfortran -o ser_game_of_life game_of_life.f90
$ gfortran -o par_game_of_life -fopenmp game_of_life.f90
$ gfortran -O2 -o ser_game_of_life_opt game_of_life.f90
$ gfortran -O2 -o par_game_of_life_opt -fopenmp game_of_life.f90

$ ./ser_game_of_life
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   23.4895       23.4844   1.0002

$ ./ser_game_of_life_opt
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1    6.5964        6.6016   0.9992

$ ./par_game_of_life
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   38.0304       38.0508   0.9995

$ ./par_game_of_life_opt
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1    7.5285        7.5273   1.0001

$ ./par_game_of_life
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        2   47.7430       24.5195   1.9471

$ ./par_game_of_life_opt
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        2    9.4526        5.0625   1.8672
```
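
Taking the elapsed times of these runs as wall-clock times, the speedup relative to the serial build can be worked out as below. The symbols S (speedup), E (efficiency on 2 threads) and T (elapsed time) are notation introduced here, not used in the thread.

```latex
\begin{align*}
S_{\text{-O2}} &= \frac{T_{\text{serial}}}{T_{\text{parallel},\,2}}
               = \frac{6.6016}{5.0625} \approx 1.30,
  & E_{\text{-O2}} &= \frac{S_{\text{-O2}}}{2} \approx 0.65 \\
S_{\text{no opt}} &= \frac{23.4844}{24.5195} \approx 0.96,
  & E_{\text{no opt}} &= \frac{S_{\text{no opt}}}{2} \approx 0.48 \\
\frac{T_{\text{parallel},\,1}}{T_{\text{serial}}}\bigg|_{\text{-O2}}
  &= \frac{7.5273}{6.6016} \approx 1.14,
  & \frac{T_{\text{parallel},\,1}}{T_{\text{serial}}}\bigg|_{\text{no opt}}
  &= \frac{38.0508}{23.4844} \approx 1.62
\end{align*}
```

By this measure the optimized build gains about 30% from two threads, while the unoptimized parallel build is slightly slower than its serial counterpart. The last row is the single-thread cost of the OpenMP build relative to the serial build.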

I guess I'll have enough to explain about what I learned from this assignment when I meet up with the professor. Maybe it wasn't the best example of a program to parallelize, but at least I learned a lot. Thanks for the help!
Aertsvijand
