zamolx3s

Wednesday, November 11, 2009

On Branch-and-Bound, Graphical Models, Structured Grids

We use Branch & Bound Pattern when we have an extremely large space which we need to search to make a decision or to find an optimal solution. The space is so large that enumerating every point in the space is infeasible. The branch-and-bound pattern has four basic operations. The solution involves four steps: Branching, Evaluation, Bounding and Pruning. Branching is when a subproblem p cannot be solved directly and consequently it is decompose into smaller subproblems. Evaluation is the application of the objective function to a point in the search space. Here the reader is assumed to have familiarity with optimization problems. For instance, he should know that an objective function is a function associated with an optimization problem which determines how good a solution is, for instance, the total cost of edges in a solution to a traveling salesman problem. In the Bounding step, we should find a way to derive upper and lower bounds on the solutions contained in the space by solving a simplified version of the search problem. In the Pruning step, if a subproblem has been solved, or it can be proved that the subproblem contains no points better than the current optimal solution, or the subproblem is infeasible, we should eliminate the subproblem. The issues involved with parallelization seem to be well explained. The three main approaches to parallelize the branch-and-bound algorithm are: Operation-based parallelization (e.g. apply the same operation on several subproblems in parallel), Structure-based parallelization (solve different subproblems in parallel) and Construction-based parallelization (construct several branch-and-bound trees in parallel).

Although less than three pages long, I found GraphicalModels hard to understand. The problem is not clearly enunciated, some sections are very vague and the example section is missing which makes hard to understand. I preferred to google for other versions and try to understand more from there.

In my opinion Structured Grid is very well written. With limited knowledge in grid computation I was able to understand most of it in the read. Structured Grid offers a scalable solution for efficient updating of a structured grid, where updates are local in the physical space. Besides a few well explained examples, the pattern clearly explains all the domain specific terms (e.g. ghost nodes, spencils). In a structured grid problem, the location of nodes in the physical space directly maps to their location in data array (this is in contrast to the Unstructured Grid where the location of the element in the array cannot be directly inferred from the physical location). The updates are logically concurrent, but in practice are implemented as a sequential sweep through the computational domain. The most important aspect of solving a problem with a structured grid pattern is the domain decomposition. To decompose the problem into tasks, the grid is sliced into contiguous chunks, and each processing element is assigned a chunk for work. The chunks should be of small enough granularity in order to exploit parallelism and of reasonable size in order to fully exploit the computational resources of each processing element. The forces governing the solution are the typical task granularity vs. parallelism, simple task partitioning (e.g. ignoring unequal workloads) vs. fault tolerance.

Tuesday, November 10, 2009

On Shared Queue, Speculation, Digital Circuits

Shared queue: a queue is very common data structure that can be shared among multiple units of execution in a parallel program. Its implementation should tradeoff between simple synchronization protocols and increased concurrency. The easiest approach employs a single synchronization construct, however such a protocol most likely will encompass too much of the shared queue in that single synchronization construct, thus increasing the chances that UEs will remain blocked waiting to access the queue and will limit available concurrency. Fine tuning can increase the concurrency by providing concurrent access to multiple parts of the queue however this approach tends to be more complicated and thus error prone. On machines where the memory hierarchy of the underlying hardware and the number of UEs are limited, a distributed queue can avoid contention and decrease the parallel overhead. One idea of implementation is the distributed queue used in fork/join. Here each thread from the thread pool is associated a non-blocking distributed queue. When a thread generates a new task, it is put in its own task queue. When a thread executes a task, it first tries to take one from its own queue. If it is empty, it randomly chooses another thread and tries to steal one from that threads’ queue. As each task tries to take a new task first from its own task and there is no single hot spot, it eliminates the performance degradation.

I found speculation quite an interesting pattern. I was especially familiar with hardware speculation and particularly with branch predication, which is present in almost all major desktop CPUs. With branch predication, for each branch instruction, one path is speculatively chosen based on the past behavior of the program. This can be extended to speculatively execute both branches and terminate one of them when the result of the branch predicate is available. Speculation as a parallel algorithmic strategy is similar but it is usually at a much higher granularity. The main assumption in the speculation algorithmic strategy pattern is that we can break long dependency chains by ignoring them. Creating independent tasks is then a speculative action as ignoring these dependencies may not always hold good for all inputs. In such cases, we might not be much better off than a serial algorithm, but on average we will be able to take advantage of the available hardware parallelism. Predicates are constructs that are used to keep track of what needs to be true for the decomposition to hold. Predicate validity check is a very important aspect of this pattern. A quick way of checking the predicate validity is essential. A last step of the algorithm is having a commit/recovery mechanism depending on whether the predicate was true/false respectively.

The link assigned to the third pattern reads Digital circuits, while the description attached is actually for the Circuits pattern. Even though related the two are still different patterns. Digital circuits offers solutions for implementing an hardware architecture instruction set that meets the performance and power requirements. The circuits pattern applies to problems where the output is a logical function or bitwise calculation of the input. Since most of the instruction sets are restricted to operate on a word size of data, the pattern provides a way to efficiently compute the output. Digital circuits deals with bitwise operations as well as the arithmetic operations (ADD, SUBTRACT, ALU) on bits or bit-vectors. Therefore, it is a natural fit for the circuits pattern from the view of algorithmic strategies.

Sunday, November 8, 2009

On Loop Parallelism, Task Queue, Graph Partitioning

I was already familiar with the loop parallelism pattern and some of the transformations that can be made to loops and loop nests in order to achieve efficient parallel execution as this material has been covered in the Computer Architecture class. The paper formalizes the steps required to transform a sequential program into a parallel one and offers an extended set of transformations. Additionally it offers few valuable suggestions on how these steps should be applied. For instance, parallelism should be applied incrementally. It is desirable for optimizations to be applied one-by-one, rather than all at once. This eases testing and debugging of individual optimizations. Also, it is important to select the set of loops most amenable to optimization as less code re-structuring will be necessary to achieve the desired speedup. The use of profiling tools (e.g. Intel VTune or the open-source gprof) is highly recommended for loop identification as they record what fraction of a program’s runtime is spent in various portions of the code. The author also emphasizes the importance of the memory system performance, as simply exploiting parallelism is not usually enough. It is necessary to consider the data-movement and data-use necessitated by parallelization.

Task queue pattern is yet another parallel pattern that seems very intuitive and that offers a very good coverage on the implementation of the pattern. The author describes the problem forces very clearly. The centralized vs. distributed is one of the tradeoffs that an implementation should consider. While centralized task queues are simpler to implement, they can quickly become a bottleneck as the number of processing elements increases. In contrast, distributed task queues require more overhead, however they are able to scale to much larger number of processing elements. Task size, latency to acquire a task and the number of processors are three factors that should be considered before deciding to implement the centralized version. If the task size is large and the overhead of acquiring a task by a processor is small and the number of processors is small, then centralized queue is an efficient mechanism to use. Otherwise, the distributed queue is the better choice. One method for implementing a distributed task queue is distributed task stealing. In this model, each processor has its own queue, it enqueues its local tasks to this queue and operates on them, keeping the computation local. If the amount of available parallelism is limited or the amount of local memory is small, hierarchical task queues, a hybrid approach between distributed and a centralized queue, can sometimes perform better. Eager vs. lazy scheduling is a tradeoff concerned with the load balancing of the tasks. The eager scheduling happens periodically and can provide better load balancing at the expense of extra overhead, while lazy scheduling (performed only in the case when one processor becomes idle) might not yield as good of load balance, but requires less overhead.

Graph partitioning pattern seems to have been already covered under Graph Algorithm Pattern. Even though it makes partitioning an explicit objective, the partitioning step seems to be central in the Graph Algorithm as well. Even the solution steps seem to be roughly similar. They mainly involve recognizing problem specific graph structure and designing partitioning and parallel traversal technique. There is a limited number of parallelization opportunities for many graph structures. Both papers cover the most common of them: Kernighan-Lin, Fiduccia-Matheyses, BFS, ParMetic, etc. The graph partitioning pattern though, seems to offer more coverage on the theoretical aspects of the partitioning algorithms.

On PRES: Probabilistic Replay with Execution Sketching on Multiprocessors

Concurrency bugs, such as atomicity violations and deadlocks, are some of the most difficult bugs to detect and diagnose due to their non-deterministic nature. While on uniprocessor machines solutions based on logic time have been used in order to deal with the nondeterminism caused by the timing variations, on multi-processors (SMPs and multi-cores), besides thread scheduling, signals and asynchronous events, timing variations exert another major influence on thread interleaving. Traditional solutions in the area of multi-processors suggest deterministic replay and try to record almost every inter-thread interaction such as synchronization and shared memory communication, in addition to those data necessary for uni-processor replay. Because hardware modification always comes with a high cost of fabrication and increased hardware complexity software solutions are generally preferred.

Next to CHESS, PRES is another novel software-only approach that was proposed for exposing concurrency bugs. However, while both navigate a non-deterministic execution space, CHESS tries to find unexplored interleavings to test, while PRES uses execution sketches as guidelines and feedback from failed replays to get closer to reproducing a target concurrency bug. In order to lower production-run recording overhead, PRES relaxes the traditional (perhaps idealistic) objective of “reproducing the bug on the first replay attempt”. It achieves that by recording only partial execution information during the production run. In a subsequent step, using an intelligent partial-information based replayer that systematically explores the unrecorded nondeterministic space via multiple coordinated replay, PRES attempts to reconstruct the complete information necessary for bug reproduction. After a bug is reproduced successfully once after several replay attempts, PRES can then reproduce it with 100% probability on every subsequent replay for diagnostic purposes.

PRES leverages three types of information: inter-thread global order from the production run, intra-thread local information from the production run, and feedback from previous unsuccessful replays. In order to record this information PRES uses five sketching (recording) methods that trade reproduction probability for recording overhead at different levels, and two basic schemes (Base and RW) that represent two extremes of the tradeoff for comparison. PI-Replayer, the central component of the architecture, conducts a repeated and automated process, i.e., replaying, generating feedback, using feedback, and replaying again. After an unsuccessful replay run, the system first identifies all dynamic race instruction pairs in the previous unsuccessful replay using an happens-before race detection algorithm and filters out all ordered-determined production run races. Next, from the suspect set, PI-Replayer follows certain policies to flip the execution order of one pair in the next replay and executes next replays deterministically until the suspect racing pair is reached (a deterministic execution is feasible because every non-deterministic event in the previous replay was recorded). The above steps may be repeated many times until the bug is reproduced. The mistakes made by early replays will be corrected, and PI-Replay will gradually get closer to bug reproduction. Since there are a limited number of execution paths and therefore a finite number of races, repeating the above process will eventually find a correct interleaving and successfully reproduce the bug.

While PRES’s overheads are significantly lower than those of previous methods, they may still be considered high for some performance-critical applications, so further optimizations may be useful. Nevertheless, PRES seems to be of great help in reproducing concurrency bugs on multi-processors systems.

Thursday, November 5, 2009

On PipelineProcessing, GeometricDecomposition, DataParallelism

Instruction pipeline in modern CPUs was one of the main topics I studied in the Computer Architecture class and therefore I could assume some familiarity with this pattern. Vector processing (loop level pipelining) was also a subject covered in same class. In loop level pipelining, specialized hardware in super computers allows operations on vectors to be performed in a pipelined fashion. Pipeline processing pattern is very similar to the Pipes and Filter patter, with the key difference being that this pattern explicitly discusses concurrency. It is also similar with Discrete Event pattern in the way that both apply to problems where the computation is naturally decomposed into a collection of semi independent tasks. However, the Discrete Event pattern is irregular and asynchronous where the Pipeline pattern is regular and synchronous: In the Pipeline pattern, the semi-independent tasks represent the stages of the pipeline, the structure of the pipeline is static, and the interaction between successive stages is regular and loosely synchronous. In the Discrete Event pattern, however, the tasks can interact in very irregular and asynchronous ways, and there is no requirement for a static structure.

The Geometric decomposition pattern was a first time read for me, but the explanation was clear and very intuitive. The authors discuss four key areas of geometric decomposition: data decomposition, data exchange, update operation, data distribution and task schedule. The decomposition of data into chunks implies a decomposition of the update operations into tasks. Each task represents the update of one chunk, and tasks execute concurrently. If each task can be executed with information entirely local to its chunk, then the concurrency is embarrassingly parallel and the Task Parallelism pattern should be used. However, when this is not the case Geometric decomposition should be used. A key factor in Geometric decomposition is ensuring that nonlocal data required for the update operation is obtained before it is needed. This is specified by the Data exchange. We must also ensure the data required for each chunk update is present when needed. This is analogous to managing data dependencies in the Task Parallelism pattern. The overall program structure for applications of this pattern will normally use either the Loop Parallelism pattern or the SPMD pattern, with the choice determined largely by the target platform.

Data parallelism is very generic. The operations are often uniformly applied to the data elements. This is when in a way data parallelism relates to the composite pattern. This is mainly because the pattern allows the programmer to represent a complex hierarchical structure so that individual objects and compositions of those objects are handled in a uniform way. In parallel computing domain decomposition is one way of utilizing the power of a parallel architecture. For data parallelism and domain decomposition into sub-domains, the composite pattern may provide a useful paradigm.

Wednesday, November 4, 2009

On Armstrong Thesis Chapter 6

Erlang behaviors are valuable constructs. They are similar to interfaces in other languages. They are essentially a required set of callbacks. The OTP libraries use them to separate the functional parts of a server from the non-functional parts, letting developers quickly leverage common design patterns. In the Erlang sample application presented in chapter 6, the behaviours solve orthogonal problems (e.g. client–server has nothing to do with worker-supervisor). In building a real system behaviours may be combined and mixed up in many different ways to solve problems.

In a supervision tree, many of the processes have similar structures, they follow similar patterns. For example, the supervisors are very similar in structure. The only difference between them is which child processes they supervise. Also, many of the workers are servers in a server-client relation, finite state machines, or event handlers such as error loggers. Behaviours are formalizations of these common patterns. The idea is to divide the code for a process in a generic part (a behaviour module) and a specific part (a callback module). The advantages of offering a small and fixed set of behaviours are similar to those of design patterns. They focus attention on a small set of well-proven techniques, provide a common vocabulary, allowing the designer to structure and talk about the design in a precise manner. It is of course possible to implement custom behaviours, apparently a very simple task, but for some reason completely undocumented.

Intentional programming as described by Joe Armstrong seem to be a light version of the more novel concept introduced by Charles Simonyi. In Erlang, the concept goes as far as far suggesting that the programmer should write code in such a way that the reader of a program can easily see what the programmer intended by their code. As envisioned by Simonyi, developing a new application via the Intentional Programming paradigm is a bit more complex. A programmer first builds a toolbox specific to a given problem domain, then together with domain experts describes the application's intended behavior based on some intends (e.g. "print the numbers 1 to 10"). In the end an automated system uses the program described in terms of intends and the toolbox to generate the final program.

Gen_server and gen_event seem to be similar in form and behaviour however they serve entirely different purposes. Gen_server describes the client-server model, which is generally used for resource management operations, where several different clients want to share a common resource. The server is responsible for managing this resource. Gen_event is an implementation of the event based, implicit invocation pattern. When an event arrives at an event manager it will be processed by all the event handlers which have been installed within the event manager. Event managers can be manipulated at run-time. In particular we can install an event handler, remove an event handler or replace one event handler with a different handler.

As a final word on Erlang, I found Joe Armstrong’s thesis very impressive. Erlang itself is very interesting from an intellectual standpoint. Joe Armstrong is doing a great job at explaining a language that makes concurrent programming easier and faster than any other language so far, and at the same time is quite arcane for anyone not familiar with its underlying concepts, especially those related to fault-tolerance.

Tuesday, November 3, 2009

On Task Parallelism, Recursive Splitting, Discrete Event

Out of the three patterns, I would have expected Task Parallelism to be the easiest read. However the overall feeling was that it provided an abundance of details. The examples are various, a bit too elaborate and sometimes a bit too complex. There are around two tables and one diagram to support the importance of understanding implementation platform characteristics. I suppose that the complexity of the pattern writing form comes from its applicability to a numerous class of problems combined with the wish of the authors to cover a lot material. In the end, I would say that this pattern constitutes more of a good reference on Task Parallelism.

I found recursive splitting very intuitive because of its similarity to algorithms that can be expressed recursively or the commonly known Divide and Conquer (e.g. many flavors of sorting). In such cases, usually a problem is recursively split into smaller problems until the problem is small enough to solve directly. Similarly the solutions to the recursive splitting problems can be viewed as solving recursively generated tasks (e.g. quick sort or binary tree search). Recursive splitting is particularly useful when the knowledge of the hardware resources available is not always known, because it offers an algorithmic strategy that produces tasks but does not know about handling them. The handling of the generated tasks can then be handled by implementation strategy patterns like task queue and fork join etc, strategies that I was also already aware of from the previous readings on patterns on parallelism. The major concerns with mapping such recursive structures to parallel platforms are determining recursion depth vs. computation per node, load balancing and locality considerations. Defining a small base case for the recursion generates a large amount of concurrent tasks and keeps the hardware busy, however, the efficiency per thread can be low. Defining larger base cases can improve the computational efficiency per thread by keeping the overheads low, but might not generate enough concurrency. For instance, in the quicksort example, the base case is too small – an array of size 1 is sorted by default. This is not the ideal scenario – a better option is to use small efficient sorts like insertion sort or a serial version of quicksort within a task if the array size is less than a threshold (depending on the problem size and machine capabilities). I also found particularly interesting the description of possible optimizations to improve locality.

I was not familiar with the discrete event pattern. The pattern seems to be a generalization of the Pipeline pattern. In the Discrete Event pattern, there is no restriction to a linear structure, no restriction that the flow of data would be one-way, and the interaction takes place at irregular and sometimes unpredictable intervals. Sometimes it is desirable to compose existing, possibly sequential, program components that interact in possibly irregular ways into a parallel program without changing the internals of the components. For problems such as this, it might make sense to base a parallel algorithm on defining a task (or a group of tightly coupled tasks) for each component, or in the case of discrete-event simulation, simulation entity. Interaction between these tasks is then based on the ordering constraints determined by the flow of data between them. Load balancing is a difficult problem in this pattern due to its potentially irregular structure and possible dynamic nature. Some infrastructures that support this pattern allow task migration so that the load can be balanced dynamically at runtime. The solution is based on expressing the data flow using events as abstractions, with each event having a task that generates it and a task that processes it and defining ordering constraints between the tasks. Each task is an instance of the Facade pattern by providing a consistent event-based interface to the component. Optimistic and pessimistic approaches are used to deal with out of order invents. It is also possible for systems using this pattern to reach deadlock. In these cases, a middle-ground solution is to use timeouts instead of accurate deadlock detection, and is often the best approach.