### simdlen sometimes considered useful

Posted: Wed Jun 22, 2016 6:04 am
Just a quick note re: the simdlen clause on OpenMP simd directives (as it's so rarely mentioned anywhere and one of the few instances is the brief locked thread http://openmp.org/forum/viewtopic.php?f=18&t=1865...).

Although it may not be obvious what use simdlen could be, I'd suggest that it can be useful where vectorisation changes numeric results, in that it can be used to control vectorisation and therefore limit numeric variance.

We have a strict need to maintain numeric consistency (this may not be a strict issue for many others I realise) - a given sum should return precisely the same result on all machines as far as possible.
In many cases arbitrary simd operation is fine:

```
#pragma omp simd
for (i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
```
as the value assigned to y for each index is the same whether the loop is executed serially or vectorised.

But for some operations, vectorisation and unrolling will execute fp operations in a different order and may therefore produce different results:

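```
double xsum = 0.0;
/* the pragma must sit directly on the loop; the sum needs a reduction clause */
#pragma omp simd reduction(+:xsum)
for (i = 0; i < n; ++i)
    xsum += x[i];
```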
With the simd directive this will not accumulate a single sum but a number of partial sums, which are then added together on completion (remembering that ((a+b)+c) does not always equal (a+(b+c)) in finite-precision floating-point arithmetic).
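A minimal sketch of the effect, assuming IEEE 754 doubles with round-to-nearest and no fast-math style flags. The names `sum_serial` and `sum_partial4` are made up for illustration, and the second function only imitates by hand what a 4-lane vectorised reduction might do; a real compiler may combine the lanes differently.

```c
#include <stddef.h>

/* Plain left-to-right sum: ((x0 + x1) + x2) + ... */
double sum_serial(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Hand-written imitation of a 4-lane reduction: element i goes to
   lane i % 4, and the lanes are combined pairwise at the end.
   Assumption: a real compiler may pick a different combining order. */
double sum_partial4(const double *x, size_t n)
{
    double p[4] = {0.0, 0.0, 0.0, 0.0};
    for (size_t i = 0; i < n; ++i)
        p[i % 4] += x[i];
    return (p[0] + p[1]) + (p[2] + p[3]);
}
```

For the input `{1e16, 1.0, -1e16, 1.0}` the serial version returns 1.0 and the 4-lane version returns 0.0, while the exact sum is 2. Neither is "the" right answer; the point is only that the two orderings disagree.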

Now while many people do not need to concern themselves with precise numeric consistency, we are forced to use options such as /fp:precise which will normally prevent the compiler unrolling such a loop.
It is not so much that 2-way or 4-way partial summing produces the wrong answer, but that it may produce a different answer from serially summing the same array.

Previously we might have unrolled such loops by hand, but with simdlen we can tell the compiler the preferred vector length to use on whichever platform is available (e.g. the Intel C++ compiler can generate optimised code paths for SSE3, AVX, and AVX512 processors and dispatch at runtime), and thereby fix the order of the calculation, and therefore the result, on each platform.
If we specify simdlen(8) for a sum of doubles, the compiler can generate code that uses four 128-bit registers on older platforms, two 256-bit registers on newer platforms, and a single 512-bit register on the newest.
If these 8 partial sums are then added in the same order on each code path (which may require the compiler writers to understand this sort of situation), we know we will get the same final result regardless of the spec of the machine that actually performs the calculation.
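The two approaches above can be sketched side by side. This is only an illustration under stated assumptions: `sum8_fixed` and `sum8_simdlen` are hypothetical names, and `sum8_fixed` is just one way of unrolling by hand with a fixed combining order. For `sum8_simdlen`, note that the OpenMP specification describes simdlen as a preferred length, so whether the compiler really uses 8 lanes, and in what order it adds the partial sums, remains implementation-dependent.

```c
#include <stddef.h>

/* Hand-unrolled version: 8 partial sums, element i goes to lane i % 8,
   and the lanes are always combined in the same pairwise order. */
double sum8_fixed(const double *x, size_t n)
{
    double lane[8] = {0.0};
    size_t i = 0;

    /* full blocks of 8 elements */
    for (; i + 8 <= n; i += 8)
        for (size_t j = 0; j < 8; ++j)
            lane[j] += x[i + j];

    /* leftover elements go to lanes 0, 1, 2, ... in order */
    for (size_t j = 0; i < n; ++i, ++j)
        lane[j] += x[i];

    return ((lane[0] + lane[1]) + (lane[2] + lane[3]))
         + ((lane[4] + lane[5]) + (lane[6] + lane[7]));
}

/* Compiler-driven version: ask for 8 lanes and a sum reduction.
   Assumption: the compiler honours simdlen(8); the final combining
   order of the partial sums is not specified by OpenMP. */
double sum8_simdlen(const double *x, size_t n)
{
    double xsum = 0.0;
    #pragma omp simd simdlen(8) reduction(+:xsum)
    for (size_t i = 0; i < n; ++i)
        xsum += x[i];
    return xsum;
}
```

With GCC or Clang the pragma is enabled by `-fopenmp-simd` (or `-fopenmp`); without it the pragma is ignored and the loop is summed serially. Comparing the output of `sum8_simdlen` against `sum8_fixed` on each target is one way to check whether a given compiler and code path produce the ordering you expect.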

It's not a case that will necessarily arise for a lot of people, but I'd be interested to know of any other solid use cases for simdlen (or other techniques people use to address this issue).

--
Tim