Personally I have not seen a benefit to changing the chunk size in most cases.
For static, if no chunk size is specified, the default is to divide the iteration space into chunks of approximately equal size, so that each thread is assigned exactly one chunk. The idea is that each thread then gets, hopefully, the same workload. This of course depends on whether the same amount of work is being done in each iteration. You will also notice that since each thread gets one chunk, it never has to go back for another one, so the scheduling overhead is kept low.
For dynamic, if no chunk size is specified, it defaults to 1. This is a pretty low value and almost guarantees extra overhead from threads going back for more work. On the other hand, you would use dynamic when the workload varies from iteration to iteration, so that the work can be distributed more evenly. Here increasing the chunk size might help, depending on the workloads, since you can cut down the overhead of going back for more work. However, if you are not careful you can also reintroduce the very imbalance you were trying to avoid by using dynamic in the first place.
For guided, chunk size works differently. Here each chunk is proportional to the number of unassigned iterations divided by the number of threads, and the chunk size parameter sets the minimum size a chunk can have. As such, it feeds into that proportion and has a definite impact. In fact, many vendors allow further "tweaking" of this calculation through an implementation-specific environment variable.
For your specific question about matrix multiply, I would think that a static schedule with the default chunk size would be about the best choice, since the work is roughly the same for each iteration (though it wouldn't be the first time I was wrong). What would have a greater impact, depending on the size of the arrays in question, is array blocking. Blocking is not part of OpenMP, but it is used to take advantage of caching and has been discussed in other posts in this forum.
As a side note, the one time I have really seen specifying chunk size help was in code like this:
#pragma omp parallel for ordered schedule(static)
for (i = 0; i < N; i++)
{
    /* ... some large amount of work ... */
    #pragma omp ordered
    {
        /* ... some small amount of work ... */
    }
    /* ... */
}
Each chunk had several iterations, and the "next thread" had to wait for a previous thread to finish its iterations before it could really do anything (other than the first large amount of work). By changing the chunk size to 1 in this case, the user saw a very large speedup.