## Parallel program showing speedup, but same wall time [F90]

### Parallel program showing speedup, but same wall time [F90]

I'm following an introductory class on (parallel) programming, in which we devoted two lessons to parallelizing Fortran 90 code with OpenMP. As an exam assignment, we had to parallelize a program by ourselves. I chose to work with Game of Life (http://www.pdc.kth.se/education/tutorials/summer-school/mpi-exercises/mpi-lab-codes/game_of_life-serial.f90/view) and this is what I came up with:
```fortran
!----------------------
!  Conway Game of Life
!    serial version
!----------------------

program life
  use omp_lib
  implicit none

  integer, parameter :: ni=2000, nj=2000
  integer :: i, j, n, im, ip, jm, jp, nsum, isum, num_thr, nsteps
  integer, allocatable, dimension(:,:) :: old, new
  real :: arand, et, t1, e1, t2, e2, tarray(2)

  ! request the ammount of iterations
  write(*,'(A)',advance='no') "Please enter the number of iterations: "
  read(*,*) nsteps

  ! initiate time measurement
  et=0.0
  num_thr=1
  t1=dtime(tarray)
  e1=secnds(et)

  ! allocate arrays, including room for ghost cells
  allocate(old(0:ni+1,0:nj+1), new(0:ni+1,0:nj+1))

  do j = 1, nj
     do i = 1, ni
        call random_number(arand)
        old(i,j) = nint(arand)
     enddo
  enddo

  !  iterate
  time_iteration: do n = 1, nsteps

     ! corner boundary conditions
     old(0,0) = old(ni,nj)
     old(0,nj+1) = old(ni,1)
     old(ni+1,nj+1) = old(1,1)
     old(ni+1,0) = old(1,nj)

     ! left-right boundary conditions
     old(1:ni,0) = old(1:ni,nj)
     old(1:ni,nj+1) = old(1:ni,1)

     ! top-bottom boundary conditions
     old(0,1:nj) = old(ni,1:nj)
     old(ni+1,1:nj) = old(1,1:nj)

     !$omp parallel private(jm,j,jp,im,i,ip,nsum)
     !$omp do
     do j = 1, nj
        do i = 1, ni
           im = i - 1
           ip = i + 1
           jm = j - 1
           jp = j + 1

           nsum = old(im,jm) + old(im,j) + old(im,jp) &
                + old(i,jm )             + old(i,jp ) &
                + old(ip,jm) + old(ip,j) + old(ip,jp)

           select case (nsum)
           case (3)
              new(i,j) = 1
           case (2)
              new(i,j) = old(i,j)
           case default
              new(i,j) = 0
           end select
        enddo
     enddo
     !$omp enddo
     !$omp end parallel

     ! copy new state into old state
     old = new

  enddo time_iteration

  ! Iterations are done; sum the number of live cells
  isum = sum(new(1:ni,1:nj))

  ! Calculate resources used
  t2=dtime(tarray)
  e2=secnds(et)

  ! Print final number of live cells, including resources used.
  write(*,'(A14,A9,A10,A14,A9)') " Living Cells"," Threads"," CPU time"," Elapsed time"," Speedup"
  write(*,'(I14,I9,F10.4,F14.4,F9.4)') isum,num_thr,t2,e2-e1,t2/(e2-e1)

  deallocate(old, new)

end program life
```

And this is the output
```
.../Par_Prog/OpenMP $ ./ser_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   26.0216       26.0234   0.9999
.../Par_Prog/OpenMP $ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   87.2094       25.1367   3.4694
```

So the parallel code is showing a speedup, but the same wall time as the serial code, which I don't understand. Could somebody enlighten me?
Aertsvijand

Posts: 4
Joined: Mon Jun 17, 2013 4:33 am

### Re: Parallel program showing speedup, but same wall time [F90]

Hi,

There are some issues in your code and in your speedup calculation:
0) What is the difference between ./ser_game_of_life and ./par_game_of_life?
1) The variable num_thr is not tied to the actual number of OpenMP threads.
2) You are using the CPU time of the parallel code as the serial time, which is not fair, because that CPU time also includes the parallel overhead. The standard way to compute speedup is to take the wall clock time of the serial run and of the parallel run; the OpenMP function omp_get_wtime() is usually suggested for that.
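
To make point 2 concrete with the numbers from the first post: the "Speedup" column printed by the program is CPU time divided by elapsed time within a single run. Assuming dtime reports CPU time summed over all threads, that ratio roughly indicates how many cores were kept busy, not how much faster the run became. The symbols $R$ and $S$ below are introduced here only for illustration:

$$
R = \frac{T_{\text{CPU}}}{T_{\text{wall}}} = \frac{87.2094}{25.1367} \approx 3.47,
\qquad
S = \frac{T_{\text{wall}}^{\text{serial}}}{T_{\text{wall}}^{\text{parallel}}} = \frac{26.0234}{25.1367} \approx 1.04
$$

Under this reading, the parallel run keeps about 3.5 cores busy but finishes only about 4% sooner than the serial run.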

HTH,

Fernando.
PS: the source code seems a little bit strange to me, because the first and last rows and columns of array "new" are never assigned... but maybe I'm missing something...
ftinetti

Posts: 603
Joined: Wed Feb 10, 2010 2:44 pm

### Re: Parallel program showing speedup, but same wall time [F90]

> ftinetti wrote:
> Hi,
>
> There are some issues in your code and in your speedup calculation:
> 0) What is the difference between ./ser_game_of_life and ./par_game_of_life?
> 1) The variable num_thr is not tied to the actual number of OpenMP threads.
> 2) You are using the CPU time of the parallel code as the serial time, which is not fair, because that CPU time also includes the parallel overhead. The standard way to compute speedup is to take the wall clock time of the serial run and of the parallel run; the OpenMP function omp_get_wtime() is usually suggested for that.
>
> HTH,
>
> Fernando.
> PS: the source code seems a little bit strange to me, because the first and last rows and columns of array "new" are never assigned... but maybe I'm missing something...

0) ser_game_of_life is compiled as gfortran -o ser_game_of_life game_of_life.f90, while par_game_of_life is compiled as gfortran -o par_game_of_life -fopenmp game_of_life.f90, i.e. the serial and parallel versions of the program.
1) I fixed that with
```fortran
     ...
     !$omp parallel private(jm,j,jp,im,i,ip,nsum)
     !$omp master
     !$ num_thr = omp_get_num_threads()
     !$omp end master
     !$omp do
     ...
```

2) I'm not quite following the explanation you give about the measurement of time. I have looked into omp_get_wtime(), but that function is only recognised when used in combination with the -fopenmp compiler flag, so I'm not sure how to use it...
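
One possible way around this is sketched below, assuming the compiler honors the OpenMP conditional compilation sentinel `!$` (the same mechanism as in the num_thr snippet above). The idea is to fall back to the standard intrinsic system_clock when the program is built without -fopenmp. The names timing, wall_time and time_demo are hypothetical, and the loop is only placeholder work to be timed.

```fortran
! Minimal sketch: a wall-clock timer that builds with and without -fopenmp.
! Names (timing, wall_time, time_demo) are hypothetical, not from the thread.
module timing
  !$ use omp_lib
  implicit none
contains
  function wall_time() result(t)
    double precision :: t
    integer(kind=selected_int_kind(18)) :: ticks, rate
    ! Standard Fortran fallback: clock ticks divided by ticks per second.
    call system_clock(ticks, rate)
    t = dble(ticks) / dble(rate)
    ! Compiled only with -fopenmp; then it replaces the fallback value.
    !$ t = omp_get_wtime()
  end function wall_time
end module timing

program time_demo
  use timing
  implicit none
  double precision :: t_start, t_end, s
  integer :: i

  t_start = wall_time()
  ! Placeholder work standing in for the time_iteration loop.
  s = 0.0d0
  do i = 1, 10000000
     s = s + sqrt(dble(i))
  end do
  t_end = wall_time()

  write(*,'(A,F10.4,A)') "Elapsed wall time: ", t_end - t_start, " s"
  write(*,*) "checksum:", s
end program time_demo
```

Built either way, the program reports elapsed wall time, so the serial and parallel executables can be compared on the same quantity.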

About the outermost rows/columns: they function as "dummy"-rows/columns, since the next-to-outermost cells wouldn't have the required neighbours. This is fixed by making some kind of torus of the map by copying the last "real" row to the upper dummy row, the first "real" row to the lower dummy row and the same for the dummy columns.
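
The wrap-around described above can be seen on a tiny grid. The sketch below reuses the boundary assignments from the posted program on a hypothetical 3 x 4 interior (the program name halo_demo and the fill values are made up for illustration). It also suggests why the ghost cells of "new" never need to be assigned: only the ghost cells of "old" are read, and they are refreshed at the start of every iteration.

```fortran
! Minimal sketch of the periodic (torus) ghost-cell update on a tiny grid.
! Grid size and fill values are illustrative assumptions.
program halo_demo
  implicit none
  integer, parameter :: ni = 3, nj = 4
  integer :: grid(0:ni+1, 0:nj+1)
  integer :: i, j

  grid = 0
  ! Fill the interior with distinct values so the copies are easy to spot.
  do j = 1, nj
     do i = 1, ni
        grid(i,j) = 10*i + j
     end do
  end do

  ! Corners wrap to the diagonally opposite interior corner.
  grid(0,0)       = grid(ni,nj)
  grid(0,nj+1)    = grid(ni,1)
  grid(ni+1,nj+1) = grid(1,1)
  grid(ni+1,0)    = grid(1,nj)
  ! Ghost columns copy the opposite interior column.
  grid(1:ni,0)    = grid(1:ni,nj)
  grid(1:ni,nj+1) = grid(1:ni,1)
  ! Ghost rows copy the opposite interior row.
  grid(0,1:nj)    = grid(ni,1:nj)
  grid(ni+1,1:nj) = grid(1,1:nj)

  ! Print one grid row per line, ghost cells included.
  do i = 0, ni+1
     write(*,'(6I4)') grid(i,:)
  end do
end program halo_demo
```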

However, the main issue still stands: why does the program take just as much real time in both the serial and parallel versions?
Aertsvijand

Posts: 4
Joined: Mon Jun 17, 2013 4:33 am

### Re: Parallel program showing speedup, but same wall time [F90]

> However, the main issue still stands: why does the program take just as much real time in both the serial and parallel versions?

Hmmm... it may take a little bit of work, but I think it is possible to find out.

First: some information about your setup.
a) Computer: processor and number of processors. If you are on Linux, the output of
\$ cat /proc/cpuinfo
would be good enough.
b) Compiler and compiler options used for generating the serial as well as the parallel version.

Second: run the serial version, i.e. the one generated without the OpenMP compiler option, and post the output (maybe it is the one you already posted, but post it anyway, just for completeness).

Third: run the sequence
\$ par_game_of_life
... <program output here>
\$ par_game_of_life
... <program output here>
(if you have 4 cores or more)
\$ par_game_of_life

Maybe at this point you'll find the explanation by yourself, but post the results anyway.

HTH,

Fernando.
ftinetti

Posts: 603
Joined: Wed Feb 10, 2010 2:44 pm

### Re: Parallel program showing speedup, but same wall time [F90]

I'm using Linux Mint running as a virtual machine in VirtualBox. I have dedicated my two cores, which support Hyper-Threading, to the virtual machine.

a) CPU info by using \$ cat /proc/cpuinfo
```
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping        : 9
cpu MHz         : 2247.600
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 4495.20
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
```

b) Compiler and compiler options
Compiler: gfortran, so that would be the gcc compiler
```
$ gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.2-2ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-2ubuntu1)
```

Serial: gfortran -o ser_game_of_life game_of_life.f90
Parallel: gfortran -o par_game_of_life -fopenmp game_of_life.f90

c) Serial version
```
$ ./ser_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   27.6657       27.6680   0.9999
```

d) Sequence
```
$ export OMP_NUM_THREADS=1
$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        1   40.2865       40.2891   0.9999
$ export OMP_NUM_THREADS=2
$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        2   51.2912       26.4102   1.9421
$ export OMP_NUM_THREADS=4
$ ./par_game_of_life
Please enter the number of iterations: 300
  Living Cells  Threads  CPU time  Elapsed time  Speedup
        259004        4   84.7533       24.5039   3.4588
```

I guess the program is speeding up, but the overhead of the parallelisation is causing the program to be just as fast as the serial version?
Aertsvijand

Posts: 4
Joined: Mon Jun 17, 2013 4:33 am

### Re: Parallel program showing speedup, but same wall time [F90]

> I guess the program is speeding up, but the overhead of the parallelisation is causing the program to be just as fast as the serial version?

Exactly, and unfortunately you don't have more than 2 cores to see any actual improvement wrt serial time. What happens in a non-virtual pair of Xeon processors is similar:
\$ ./ser_game_of_life
Please enter the number of iterations: 300
Living Cells Threads CPU time Elapsed time Speedup
259004 1 53.3193 53.3184 1.0000

\$ ./par_game_of_life
Please enter the number of iterations: 300
Living Cells Threads CPU time Elapsed time Speedup
259004 1 98.2101 98.2090 1.0000

\$ ./par_game_of_life
Please enter the number of iterations: 300
Living Cells Threads CPU time Elapsed time Speedup
259004 1 102.4504 51.3730 1.9942

Now, please compile with some compiler optimization option, e.g.

gfortran -O2 -o ser_game_of_life game_of_life.f90
gfortran -O2 -o par_game_of_life -fopenmp game_of_life.f90

and please post the three runtimes (serial, parallel with one thread, and parallel with two threads). Usually, runtimes change a lot with optimized code (or, at least, with no debug-specific code generation).

HTH,

Fernando.
ftinetti

Posts: 603
Joined: Wed Feb 10, 2010 2:44 pm

### Re: Parallel program showing speedup, but same wall time [F90]

Well indeed, what a difference!

\$ gfortran -o ser_game_of_life game_of_life.f90
\$ gfortran -o par_game_of_life -fopenmp game_of_life.f90
\$ gfortran -O2 -o ser_game_of_life_opt game_of_life.f90
\$ gfortran -O2 -o par_game_of_life_opt -fopenmp game_of_life.f90

\$ ./ser_game_of_life
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
Living Cells Threads CPU time Elapsed time Speedup
259004 1 23.4895 23.4844 1.0002

\$ ./ser_game_of_life_opt
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
Living Cells Threads CPU time Elapsed time Speedup
259004 1 6.5964 6.6016 0.9992

\$ ./par_game_of_life
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
Living Cells Threads CPU time Elapsed time Speedup
259004 1 38.0304 38.0508 0.9995

\$ ./par_game_of_life_opt
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
Living Cells Threads CPU time Elapsed time Speedup
259004 1 7.5285 7.5273 1.0001

\$ ./par_game_of_life
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
Living Cells Threads CPU time Elapsed time Speedup
259004 2 47.7430 24.5195 1.9471

\$ ./par_game_of_life_opt
Please enter the number of iterations: 300
Please enter the number of rows: 2000
Please enter the number of columns: 2000
Living Cells Threads CPU time Elapsed time Speedup
259004 2 9.4526 5.0625 1.8672
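
Taking speedup as serial wall time divided by parallel wall time (the ratio $S$ is introduced here only for illustration, using the elapsed times listed above), the two-thread runs work out roughly as follows:

$$
S_{\text{no opt}} = \frac{23.4844}{24.5195} \approx 0.96,
\qquad
S_{\texttt{-O2}} = \frac{6.6016}{5.0625} \approx 1.30
$$

For comparison, the one-thread runs of the OpenMP builds are slower than the matching serial builds by a factor of about 38.0508 / 23.4844 ≈ 1.62 without optimization and about 7.5273 / 6.6016 ≈ 1.14 with -O2.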

I guess I'll have enough to explain about what I learned from this assignment when I meet up with the professor. Maybe it wasn't the best example of a program to parallelize, but at least I learned a lot. Thanks for the help!
Aertsvijand

Posts: 4
Joined: Mon Jun 17, 2013 4:33 am