Sunday, November 8, 2009

On Loop Parallelism, Task Queue, Graph Partitioning

I was already familiar with the loop parallelism pattern and some of the transformations that can be made to loops and loop nests in order to achieve efficient parallel execution as this material has been covered in the Computer Architecture class. The paper formalizes the steps required to transform a sequential program into a parallel one and offers an extended set of transformations. Additionally it offers few valuable suggestions on how these steps should be applied. For instance, parallelism should be applied incrementally. It is desirable for optimizations to be applied one-by-one, rather than all at once. This eases testing and debugging of individual optimizations. Also, it is important to select the set of loops most amenable to optimization as less code re-structuring will be necessary to achieve the desired speedup. The use of profiling tools (e.g. Intel VTune or the open-source gprof) is highly recommended for loop identification as they record what fraction of a program’s runtime is spent in various portions of the code. The author also emphasizes the importance of the memory system performance, as simply exploiting parallelism is not usually enough. It is necessary to consider the data-movement and data-use necessitated by parallelization.

Task queue pattern is yet another parallel pattern that seems very intuitive and that offers a very good coverage on the implementation of the pattern. The author describes the problem forces very clearly. The centralized vs. distributed is one of the tradeoffs that an implementation should consider. While centralized task queues are simpler to implement, they can quickly become a bottleneck as the number of processing elements increases. In contrast, distributed task queues require more overhead, however they are able to scale to much larger number of processing elements. Task size, latency to acquire a task and the number of processors are three factors that should be considered before deciding to implement the centralized version. If the task size is large and the overhead of acquiring a task by a processor is small and the number of processors is small, then centralized queue is an efficient mechanism to use. Otherwise, the distributed queue is the better choice. One method for implementing a distributed task queue is distributed task stealing. In this model, each processor has its own queue, it enqueues its local tasks to this queue and operates on them, keeping the computation local. If the amount of available parallelism is limited or the amount of local memory is small, hierarchical task queues, a hybrid approach between distributed and a centralized queue, can sometimes perform better. Eager vs. lazy scheduling is a tradeoff concerned with the load balancing of the tasks. The eager scheduling happens periodically and can provide better load balancing at the expense of extra overhead, while lazy scheduling (performed only in the case when one processor becomes idle) might not yield as good of load balance, but requires less overhead.

Graph partitioning pattern seems to have been already covered under Graph Algorithm Pattern. Even though it makes partitioning an explicit objective, the partitioning step seems to be central in the Graph Algorithm as well. Even the solution steps seem to be roughly similar. They mainly involve recognizing problem specific graph structure and designing partitioning and parallel traversal technique. There is a limited number of parallelization opportunities for many graph structures. Both papers cover the most common of them: Kernighan-Lin, Fiduccia-Matheyses, BFS, ParMetic, etc. The graph partitioning pattern though, seems to offer more coverage on the theoretical aspects of the partitioning algorithms.

No comments:

Post a Comment