Hello,

I am wondering how to partition loop iterations among threads in an arbitrary, non-contiguous order in an OpenMP parallel for. It would be great if there were a prototype that supports such scheduling. As far as I know, there is no reference implementation for it apart from the user-defined schedule [1], which was recently proposed and is still under discussion.

More specifically, for my research, I'd like to evaluate different scheduling policies (more about partitioning than assignment) based on information supplied by the application or on performance profiled from the hardware. The information could be expressed simply as a vector with one entry per iteration, where each entry gives the number of the thread that runs that iteration. For example, [1] [2] [2] [1] says that the first and last iterations go to thread 1, while the rest go to thread 2.
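To make the mapping concrete, here is a minimal sketch of how such a per-iteration thread vector could be emulated in user code, without touching the runtime. It is only an illustration under assumptions: the names run_mapped_loop, loop_body_fn, and record_thread are hypothetical, thread numbers are 0-based (so [1] [2] [2] [1] becomes {0, 1, 1, 0}), and if the runtime grants fewer threads than requested, owners are folded onto the available threads with a modulo.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical callback type: the loop body receives the iteration
 * index, the id of the thread executing it, and a user context. */
typedef void (*loop_body_fn)(size_t i, int tid, void *ctx);

/* Hypothetical helper: execute iteration i on the thread named by
 * owner[i] (0-based). Every thread scans the whole owner vector and
 * runs only the iterations mapped to it. */
static void run_mapped_loop(const int *owner, size_t n, int num_threads,
                            loop_body_fn body, void *ctx)
{
    assert(num_threads > 0);
    for (size_t i = 0; i < n; ++i)
        assert(owner[i] >= 0 && owner[i] < num_threads);

    #pragma omp parallel num_threads(num_threads)
    {
        /* Serial fallback when compiled without OpenMP support. */
        int tid = 0;
        int team = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        team = omp_get_num_threads();
#endif
        for (size_t i = 0; i < n; ++i) {
            /* Fold owners onto the actual team size so that no
             * iteration is skipped if fewer threads were granted. */
            if (owner[i] % team == tid)
                body(i, tid, ctx);
        }
    }
}

/* Example body: record which thread executed each iteration. */
static void record_thread(size_t i, int tid, void *ctx)
{
    ((int *)ctx)[i] = tid;
}
```

Calling run_mapped_loop(owner, 4, 2, record_thread, ran_on) with owner = {0, 1, 1, 0} should leave the first and last entries of ran_on equal to each other, and likewise the middle two. Each thread pays an O(n) scan over the vector, so this is probably best seen as a baseline for comparing partitioning policies rather than a substitute for runtime support.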

I think the only way to evaluate this is to modify the runtime system, so I am now reviewing the runtime implementations of both LLVM/Intel and GCC. However, this seems challenging, and I feel somewhat overwhelmed.

If anyone has thought about this, any advice on where to begin would be greatly appreciated.

[1] https://sites.google.com/site/userdefschedopenmp/