Wednesday, November 11, 2009

On Branch-and-Bound, Graphical Models, Structured Grids

We use Branch & Bound Pattern when we have an extremely large space which we need to search to make a decision or to find an optimal solution. The space is so large that enumerating every point in the space is infeasible. The branch-and-bound pattern has four basic operations. The solution involves four steps: Branching, Evaluation, Bounding and Pruning. Branching is when a subproblem p cannot be solved directly and consequently it is decompose into smaller subproblems. Evaluation is the application of the objective function to a point in the search space. Here the reader is assumed to have familiarity with optimization problems. For instance, he should know that an objective function is a function associated with an optimization problem which determines how good a solution is, for instance, the total cost of edges in a solution to a traveling salesman problem. In the Bounding step, we should find a way to derive upper and lower bounds on the solutions contained in the space by solving a simplified version of the search problem. In the Pruning step, if a subproblem has been solved, or it can be proved that the subproblem contains no points better than the current optimal solution, or the subproblem is infeasible, we should eliminate the subproblem. The issues involved with parallelization seem to be well explained. The three main approaches to parallelize the branch-and-bound algorithm are: Operation-based parallelization (e.g. apply the same operation on several subproblems in parallel), Structure-based parallelization (solve different subproblems in parallel) and Construction-based parallelization (construct several branch-and-bound trees in parallel).

Although less than three pages long, I found GraphicalModels hard to understand. The problem is not clearly enunciated, some sections are very vague and the example section is missing which makes hard to understand. I preferred to google for other versions and try to understand more from there.

In my opinion Structured Grid is very well written. With limited knowledge in grid computation I was able to understand most of it in the read. Structured Grid offers a scalable solution for efficient updating of a structured grid, where updates are local in the physical space. Besides a few well explained examples, the pattern clearly explains all the domain specific terms (e.g. ghost nodes, spencils). In a structured grid problem, the location of nodes in the physical space directly maps to their location in data array (this is in contrast to the Unstructured Grid where the location of the element in the array cannot be directly inferred from the physical location). The updates are logically concurrent, but in practice are implemented as a sequential sweep through the computational domain. The most important aspect of solving a problem with a structured grid pattern is the domain decomposition. To decompose the problem into tasks, the grid is sliced into contiguous chunks, and each processing element is assigned a chunk for work. The chunks should be of small enough granularity in order to exploit parallelism and of reasonable size in order to fully exploit the computational resources of each processing element. The forces governing the solution are the typical task granularity vs. parallelism, simple task partitioning (e.g. ignoring unequal workloads) vs. fault tolerance.

Tuesday, November 10, 2009

On Shared Queue, Speculation, Digital Circuits

Shared queue: a queue is very common data structure that can be shared among multiple units of execution in a parallel program. Its implementation should tradeoff between simple synchronization protocols and increased concurrency. The easiest approach employs a single synchronization construct, however such a protocol most likely will encompass too much of the shared queue in that single synchronization construct, thus increasing the chances that UEs will remain blocked waiting to access the queue and will limit available concurrency. Fine tuning can increase the concurrency by providing concurrent access to multiple parts of the queue however this approach tends to be more complicated and thus error prone. On machines where the memory hierarchy of the underlying hardware and the number of UEs are limited, a distributed queue can avoid contention and decrease the parallel overhead. One idea of implementation is the distributed queue used in fork/join. Here each thread from the thread pool is associated a non-blocking distributed queue. When a thread generates a new task, it is put in its own task queue. When a thread executes a task, it first tries to take one from its own queue. If it is empty, it randomly chooses another thread and tries to steal one from that threads’ queue. As each task tries to take a new task first from its own task and there is no single hot spot, it eliminates the performance degradation.

I found speculation quite an interesting pattern. I was especially familiar with hardware speculation and particularly with branch predication, which is present in almost all major desktop CPUs. With branch predication, for each branch instruction, one path is speculatively chosen based on the past behavior of the program. This can be extended to speculatively execute both branches and terminate one of them when the result of the branch predicate is available. Speculation as a parallel algorithmic strategy is similar but it is usually at a much higher granularity. The main assumption in the speculation algorithmic strategy pattern is that we can break long dependency chains by ignoring them. Creating independent tasks is then a speculative action as ignoring these dependencies may not always hold good for all inputs. In such cases, we might not be much better off than a serial algorithm, but on average we will be able to take advantage of the available hardware parallelism. Predicates are constructs that are used to keep track of what needs to be true for the decomposition to hold. Predicate validity check is a very important aspect of this pattern. A quick way of checking the predicate validity is essential. A last step of the algorithm is having a commit/recovery mechanism depending on whether the predicate was true/false respectively.

The link assigned to the third pattern reads Digital circuits, while the description attached is actually for the Circuits pattern. Even though related the two are still different patterns. Digital circuits offers solutions for implementing an hardware architecture instruction set that meets the performance and power requirements. The circuits pattern applies to problems where the output is a logical function or bitwise calculation of the input. Since most of the instruction sets are restricted to operate on a word size of data, the pattern provides a way to efficiently compute the output. Digital circuits deals with bitwise operations as well as the arithmetic operations (ADD, SUBTRACT, ALU) on bits or bit-vectors. Therefore, it is a natural fit for the circuits pattern from the view of algorithmic strategies.

Sunday, November 8, 2009

On Loop Parallelism, Task Queue, Graph Partitioning

I was already familiar with the loop parallelism pattern and some of the transformations that can be made to loops and loop nests in order to achieve efficient parallel execution as this material has been covered in the Computer Architecture class. The paper formalizes the steps required to transform a sequential program into a parallel one and offers an extended set of transformations. Additionally it offers few valuable suggestions on how these steps should be applied. For instance, parallelism should be applied incrementally. It is desirable for optimizations to be applied one-by-one, rather than all at once. This eases testing and debugging of individual optimizations. Also, it is important to select the set of loops most amenable to optimization as less code re-structuring will be necessary to achieve the desired speedup. The use of profiling tools (e.g. Intel VTune or the open-source gprof) is highly recommended for loop identification as they record what fraction of a program’s runtime is spent in various portions of the code. The author also emphasizes the importance of the memory system performance, as simply exploiting parallelism is not usually enough. It is necessary to consider the data-movement and data-use necessitated by parallelization.

Task queue pattern is yet another parallel pattern that seems very intuitive and that offers a very good coverage on the implementation of the pattern. The author describes the problem forces very clearly. The centralized vs. distributed is one of the tradeoffs that an implementation should consider. While centralized task queues are simpler to implement, they can quickly become a bottleneck as the number of processing elements increases. In contrast, distributed task queues require more overhead, however they are able to scale to much larger number of processing elements. Task size, latency to acquire a task and the number of processors are three factors that should be considered before deciding to implement the centralized version. If the task size is large and the overhead of acquiring a task by a processor is small and the number of processors is small, then centralized queue is an efficient mechanism to use. Otherwise, the distributed queue is the better choice. One method for implementing a distributed task queue is distributed task stealing. In this model, each processor has its own queue, it enqueues its local tasks to this queue and operates on them, keeping the computation local. If the amount of available parallelism is limited or the amount of local memory is small, hierarchical task queues, a hybrid approach between distributed and a centralized queue, can sometimes perform better. Eager vs. lazy scheduling is a tradeoff concerned with the load balancing of the tasks. The eager scheduling happens periodically and can provide better load balancing at the expense of extra overhead, while lazy scheduling (performed only in the case when one processor becomes idle) might not yield as good of load balance, but requires less overhead.

Graph partitioning pattern seems to have been already covered under Graph Algorithm Pattern. Even though it makes partitioning an explicit objective, the partitioning step seems to be central in the Graph Algorithm as well. Even the solution steps seem to be roughly similar. They mainly involve recognizing problem specific graph structure and designing partitioning and parallel traversal technique. There is a limited number of parallelization opportunities for many graph structures. Both papers cover the most common of them: Kernighan-Lin, Fiduccia-Matheyses, BFS, ParMetic, etc. The graph partitioning pattern though, seems to offer more coverage on the theoretical aspects of the partitioning algorithms.

On PRES: Probabilistic Replay with Execution Sketching on Multiprocessors

Concurrency bugs, such as atomicity violations and deadlocks, are some of the most difficult bugs to detect and diagnose due to their non-deterministic nature. While on uniprocessor machines solutions based on logic time have been used in order to deal with the nondeterminism caused by the timing variations, on multi-processors (SMPs and multi-cores), besides thread scheduling, signals and asynchronous events, timing variations exert another major influence on thread interleaving. Traditional solutions in the area of multi-processors suggest deterministic replay and try to record almost every inter-thread interaction such as synchronization and shared memory communication, in addition to those data necessary for uni-processor replay. Because hardware modification always comes with a high cost of fabrication and increased hardware complexity software solutions are generally preferred.

Next to CHESS, PRES is another novel software-only approach that was proposed for exposing concurrency bugs. However, while both navigate a non-deterministic execution space, CHESS tries to find unexplored interleavings to test, while PRES uses execution sketches as guidelines and feedback from failed replays to get closer to reproducing a target concurrency bug. In order to lower production-run recording overhead, PRES relaxes the traditional (perhaps idealistic) objective of “reproducing the bug on the first replay attempt”. It achieves that by recording only partial execution information during the production run. In a subsequent step, using an intelligent partial-information based replayer that systematically explores the unrecorded nondeterministic space via multiple coordinated replay, PRES attempts to reconstruct the complete information necessary for bug reproduction. After a bug is reproduced successfully once after several replay attempts, PRES can then reproduce it with 100% probability on every subsequent replay for diagnostic purposes.

PRES leverages three types of information: inter-thread global order from the production run, intra-thread local information from the production run, and feedback from previous unsuccessful replays. In order to record this information PRES uses five sketching (recording) methods that trade reproduction probability for recording overhead at different levels, and two basic schemes (Base and RW) that represent two extremes of the tradeoff for comparison. PI-Replayer, the central component of the architecture, conducts a repeated and automated process, i.e., replaying, generating feedback, using feedback, and replaying again. After an unsuccessful replay run, the system first identifies all dynamic race instruction pairs in the previous unsuccessful replay using an happens-before race detection algorithm and filters out all ordered-determined production run races. Next, from the suspect set, PI-Replayer follows certain policies to flip the execution order of one pair in the next replay and executes next replays deterministically until the suspect racing pair is reached (a deterministic execution is feasible because every non-deterministic event in the previous replay was recorded). The above steps may be repeated many times until the bug is reproduced. The mistakes made by early replays will be corrected, and PI-Replay will gradually get closer to bug reproduction. Since there are a limited number of execution paths and therefore a finite number of races, repeating the above process will eventually find a correct interleaving and successfully reproduce the bug.

While PRES’s overheads are significantly lower than those of previous methods, they may still be considered high for some performance-critical applications, so further optimizations may be useful. Nevertheless, PRES seems to be of great help in reproducing concurrency bugs on multi-processors systems.

Thursday, November 5, 2009

On PipelineProcessing, GeometricDecomposition, DataParallelism

Instruction pipeline in modern CPUs was one of the main topics I studied in the Computer Architecture class and therefore I could assume some familiarity with this pattern. Vector processing (loop level pipelining) was also a subject covered in same class. In loop level pipelining, specialized hardware in super computers allows operations on vectors to be performed in a pipelined fashion. Pipeline processing pattern is very similar to the Pipes and Filter patter, with the key difference being that this pattern explicitly discusses concurrency. It is also similar with Discrete Event pattern in the way that both apply to problems where the computation is naturally decomposed into a collection of semi independent tasks. However, the Discrete Event pattern is irregular and asynchronous where the Pipeline pattern is regular and synchronous: In the Pipeline pattern, the semi-independent tasks represent the stages of the pipeline, the structure of the pipeline is static, and the interaction between successive stages is regular and loosely synchronous. In the Discrete Event pattern, however, the tasks can interact in very irregular and asynchronous ways, and there is no requirement for a static structure.

The Geometric decomposition pattern was a first time read for me, but the explanation was clear and very intuitive. The authors discuss four key areas of geometric decomposition: data decomposition, data exchange, update operation, data distribution and task schedule. The decomposition of data into chunks implies a decomposition of the update operations into tasks. Each task represents the update of one chunk, and tasks execute concurrently. If each task can be executed with information entirely local to its chunk, then the concurrency is embarrassingly parallel and the Task Parallelism pattern should be used. However, when this is not the case Geometric decomposition should be used. A key factor in Geometric decomposition is ensuring that nonlocal data required for the update operation is obtained before it is needed. This is specified by the Data exchange. We must also ensure the data required for each chunk update is present when needed. This is analogous to managing data dependencies in the Task Parallelism pattern. The overall program structure for applications of this pattern will normally use either the Loop Parallelism pattern or the SPMD pattern, with the choice determined largely by the target platform.

Data parallelism is very generic. The operations are often uniformly applied to the data elements. This is when in a way data parallelism relates to the composite pattern. This is mainly because the pattern allows the programmer to represent a complex hierarchical structure so that individual objects and compositions of those objects are handled in a uniform way. In parallel computing domain decomposition is one way of utilizing the power of a parallel architecture. For data parallelism and domain decomposition into sub-domains, the composite pattern may provide a useful paradigm.

Wednesday, November 4, 2009

On Armstrong Thesis Chapter 6

Erlang behaviors are valuable constructs. They are similar to interfaces in other languages. They are essentially a required set of callbacks. The OTP libraries use them to separate the functional parts of a server from the non-functional parts, letting developers quickly leverage common design patterns. In the Erlang sample application presented in chapter 6, the behaviours solve orthogonal problems (e.g. client–server has nothing to do with worker-supervisor). In building a real system behaviours may be combined and mixed up in many different ways to solve problems.

In a supervision tree, many of the processes have similar structures, they follow similar patterns. For example, the supervisors are very similar in structure. The only difference between them is which child processes they supervise. Also, many of the workers are servers in a server-client relation, finite state machines, or event handlers such as error loggers. Behaviours are formalizations of these common patterns. The idea is to divide the code for a process in a generic part (a behaviour module) and a specific part (a callback module). The advantages of offering a small and fixed set of behaviours are similar to those of design patterns. They focus attention on a small set of well-proven techniques, provide a common vocabulary, allowing the designer to structure and talk about the design in a precise manner. It is of course possible to implement custom behaviours, apparently a very simple task, but for some reason completely undocumented.

Intentional programming as described by Joe Armstrong seem to be a light version of the more novel concept introduced by Charles Simonyi. In Erlang, the concept goes as far as far suggesting that the programmer should write code in such a way that the reader of a program can easily see what the programmer intended by their code. As envisioned by Simonyi, developing a new application via the Intentional Programming paradigm is a bit more complex. A programmer first builds a toolbox specific to a given problem domain, then together with domain experts describes the application's intended behavior based on some intends (e.g. "print the numbers 1 to 10"). In the end an automated system uses the program described in terms of intends and the toolbox to generate the final program.

Gen_server and gen_event seem to be similar in form and behaviour however they serve entirely different purposes. Gen_server describes the client-server model, which is generally used for resource management operations, where several different clients want to share a common resource. The server is responsible for managing this resource. Gen_event is an implementation of the event based, implicit invocation pattern. When an event arrives at an event manager it will be processed by all the event handlers which have been installed within the event manager. Event managers can be manipulated at run-time. In particular we can install an event handler, remove an event handler or replace one event handler with a different handler.

As a final word on Erlang, I found Joe Armstrong’s thesis very impressive. Erlang itself is very interesting from an intellectual standpoint. Joe Armstrong is doing a great job at explaining a language that makes concurrent programming easier and faster than any other language so far, and at the same time is quite arcane for anyone not familiar with its underlying concepts, especially those related to fault-tolerance.

Tuesday, November 3, 2009

On Task Parallelism, Recursive Splitting, Discrete Event

Out of the three patterns, I would have expected Task Parallelism to be the easiest read. However the overall feeling was that it provided an abundance of details. The examples are various, a bit too elaborate and sometimes a bit too complex. There are around two tables and one diagram to support the importance of understanding implementation platform characteristics. I suppose that the complexity of the pattern writing form comes from its applicability to a numerous class of problems combined with the wish of the authors to cover a lot material. In the end, I would say that this pattern constitutes more of a good reference on Task Parallelism.

I found recursive splitting very intuitive because of its similarity to algorithms that can be expressed recursively or the commonly known Divide and Conquer (e.g. many flavors of sorting). In such cases, usually a problem is recursively split into smaller problems until the problem is small enough to solve directly. Similarly the solutions to the recursive splitting problems can be viewed as solving recursively generated tasks (e.g. quick sort or binary tree search). Recursive splitting is particularly useful when the knowledge of the hardware resources available is not always known, because it offers an algorithmic strategy that produces tasks but does not know about handling them. The handling of the generated tasks can then be handled by implementation strategy patterns like task queue and fork join etc, strategies that I was also already aware of from the previous readings on patterns on parallelism. The major concerns with mapping such recursive structures to parallel platforms are determining recursion depth vs. computation per node, load balancing and locality considerations. Defining a small base case for the recursion generates a large amount of concurrent tasks and keeps the hardware busy, however, the efficiency per thread can be low. Defining larger base cases can improve the computational efficiency per thread by keeping the overheads low, but might not generate enough concurrency. For instance, in the quicksort example, the base case is too small – an array of size 1 is sorted by default. This is not the ideal scenario – a better option is to use small efficient sorts like insertion sort or a serial version of quicksort within a task if the array size is less than a threshold (depending on the problem size and machine capabilities). I also found particularly interesting the description of possible optimizations to improve locality.

I was not familiar with the discrete event pattern. The pattern seems to be a generalization of the Pipeline pattern. In the Discrete Event pattern, there is no restriction to a linear structure, no restriction that the flow of data would be one-way, and the interaction takes place at irregular and sometimes unpredictable intervals. Sometimes it is desirable to compose existing, possibly sequential, program components that interact in possibly irregular ways into a parallel program without changing the internals of the components. For problems such as this, it might make sense to base a parallel algorithm on defining a task (or a group of tightly coupled tasks) for each component, or in the case of discrete-event simulation, simulation entity. Interaction between these tasks is then based on the ordering constraints determined by the flow of data between them. Load balancing is a difficult problem in this pattern due to its potentially irregular structure and possible dynamic nature. Some infrastructures that support this pattern allow task migration so that the load can be balanced dynamically at runtime. The solution is based on expressing the data flow using events as abstractions, with each event having a task that generates it and a task that processes it and defining ordering constraints between the tasks. Each task is an instance of the Facade pattern by providing a consistent event-based interface to the component. Optimistic and pessimistic approaches are used to deal with out of order invents. It is also possible for systems using this pattern to reach deadlock. In these cases, a middle-ground solution is to use timeouts instead of accurate deadlock detection, and is often the best approach.

On Armstrong Thesis Chapter 5

Dugan says that in order “to design and build a fault-tolerant system, you must understand how the system should work, how it might fail, and what kinds of errors can occur”. Also “error detection is an essential component of fault tolerance”. Following Dugan’s advice Armstrong designed a software structure, which detects and tolerates errors. The three important parts of this design are: a supervision hierarchy for tasks, a strategy for programming fault tolerance that applies to this hierarchy, and an implicit mechanism (well behaved functions) that corresponds to our intuitive idea of what an error is which compensates the lack of explicit specification.

The idea of decomposing an application into a hierarchical structure of tasks is of primary importance in designing any type of fault tolerant system. This is because in tolerant systems the components should be easily restartable to a stable previous state. Structuring the application as a hierarchy of tasks inherently supports this property. Monolithic, tightly coupled applications become crippled and cannot be restarted without losing all the work in progress. Tightly coupled operating systems belong in this category as well. Decomposition and fault isolation are so central to fault tolerance that they were employed even by system software. The ability to treat operating system services as separate components enables OSs to tolerate failures, as evidenced by true microkernels.

The other important aspect of programming a fault-tolerant system is what kind of strategy we employ when an error is discovered. Typically, when an error is detected in an isolated component, besides trying to handle the error, the recovery procedure will most commonly try to restart the failing component. In Erlang for instance, the error recovery procedure is to restart the worker associated to a task, or failing this would try to do something simpler. This approach perfectly matches the concept of performing error handling outside the components so that the error recovery code does not get compromised and the component can be safely brought to an accepted state. However the one potential issue with this approach is that, even though Erlang provides language support for task supervisors, the supervisors still have to be written by the programmer, who can introduce bugs. With this in mind, I wonder if it is still safe to say that Erlang’s approach is more robust than the traditional ones (e.g. using exceptions handling to reset to an acceptable state).

Other systems employ the idea of moving error recovery outside the component to an extreme where the developer is completely removed from the action of designing recovery code. For instance, Crash-only systems are systems built of software components that crash safely and recover quickly. Faults are handled by crashing and restarting the faulty component and retrying any requests which have timed out, no recovery sequence other that crash and/or restart is employed. The developer is only concerned with writing components that comply with the crash-only philosophy. The resulting system is often more robust and reliable because crash recovery is a first-class citizen in the development process.

The task hierarchy in Erlang allows for different relationships between tasks. Tasks with the similar complexity are arranged on the same level on the hierarchy. OR and AND supervisors can model dependencies between tasks on the same level: OR supervisors between independent tasks on the same level, AND supervisors between dependent, coordinated processes on the same level. Supervisor – Parent supervisor relationship can be used to describe relationships between tasks on different levels. Recursive restartability (RR) is the ability of a system to tolerate restarts at multiple levels. Erlang, through its Supervisor – Parent supervisor relationship, supports recursive restartability. Such systems possess a number of valuable properties that by themselves improve availability. For instance, a RR system’s fine granularity permits partial restarts to be used as a form of bounded healing, reducing the overall time-to-repair, and hence increasing availability. On top of these desirable intrinsic properties, we can employ an automated, recursive policy of component revival/rejuvenation to further reduce downtime.

Wednesday, October 21, 2009

On Dense Liniar Algebra, Graph Algorithms, Monte Carlo

I must admit that before reading about OPL I was not at all familiar with computation patterns in general. After doing a bit of internet research I came to understand the following about computational patterns. They are not exactly design patterns. OPL’s structural design patterns for instance, define the structure of a system, without indicating what is actually computed. Computational patterns (dwarfs) however, embody the generalized idiom for parallelism used to implement these structural patterns. The analogy is with civil engineering, where structural patterns would describe a factory’s physical structure and general workflow while computational patterns describe the factory’s machinery, flow of resources, and work-products. As with OPL, structural and computational patterns can be combined in the “pattern language” to provide a template for architecting arbitrarily complex parallel software systems.

I expect that the Graph Algorithm Pattern must have been a straightforward read for everybody. Once the graph concepts are clear, most of the details of the first three steps recommend by the pattern are clear: recognizing the problem, determining the data structure, defining the temporary data structures to store traversal variables. Very similar challenges are to be faced with every sequential graph problem. However, I found particularly interesting the graph related aspects of parallelism scattered throughout the four steps. Since for many graph structures there is a limited number of parallelization opportunities, I think that this information is very important. For instance, based on knowledge graph theory, we can choose to eliminate some edges and scope the traversal operation into independent localized partitions such that the partitions can be solved in parallel. Each type of graph is presented in the paper with its specific parallelization opportunities.

Dense Linear Algebra Pattern also seems to be well written. I especially appreciated the figures in this pattern, which support the good understanding of the pattern applied to matrix multiplication. In the Monte Carlo description, one can also get a very good visual understanding of the pattern.

One difference I noticed in the Dense Linear Algebra Pattern compared with Graph Algorithm and Monte Carlo is the way in which the Solution section is structured. In comparison with the latter two patterns, which offer a formal guideline as a sequence of steps in which a particular problem should be approached, Dense Linear Algebra Pattern’s solution suggests the usage of linear algebra libraries or BLAS routines and templates. Finally, if no such library is available or the problem is too specialized, the pattern goes over details on how to structure computation in order to optimize utilization of the memory hierarchy. These are all valuable and interesting details however I am still missing a formal sequence of steps to tell me how to approach this type of problems. For instance, I would have expected something like: 1) recognize linear algebra operations to solve 2) figure out if linear algebra library is available and offers this particular operation 3) figure out whether structuring data to pass as input to library does not lead to potentially suboptimal algorithms 4) if that is the case implement yourself optimized operation using domain specific language.

On Armstrong Thesis Ch 4

A side effect is any computation or operation by a procedure that is not the primary purpose of that procedure. Obviously sharing data between concurrent processes represents a context favoring side-effects, because the view on the shared resource in each process can easily be compromised by another’s action on the same resource. And the problem is not breaking preconditions and invariants, because this can happen in sequential code as well; it is the fact that, in concurrent code, they can be broken even in the middle of performing a certain function, after the invariants or preconditions have been verified by the concurrent code. Consequently, our process can easily run into side effects. After checking preconditions and invariants sequential code is guaranteed to perform the function until the end. Of course, sequential code is also concerned with the manipulation of hardware, which also involves side effects. In the Erlang sequential subset there are a few operations with side-effects, but they are virtually never needed.

Making processes and concurrency part of the Erlang programming language as opposed to relying on the host operating system has a number of advantages. Concurrent programs run identically on different OSs, all issues of synchronization and interaction between processes are the same irrespective of the properties of the host operating system. Also, the separation from OS allows Erlang processes to be implemented as light weight processes without the overhead associated to context switching (thus improved performance). Besides performance, this separation or very little need for an OS, offers a lot of flexibility, an Erlang system being easily ported on specialized environments as for example embedded systems.

I mainly agree with Armstrong saying that “abstracting out concurrency is one of the most powerful means available for structuring large software systems”, an idea definitely supporting the suggested dirty/clean paradigm for structuring code for highly reliable systems. However, I wonder whether Armstrong’s idea should be generalized to any types of software systems or it should only be applied to systems that aim for high reliability.

I recently read a very interesting article where the authors were trying to foresee the future of programming languages. “How high level can we get?” was the question posted by the authors. The current trends favor high level abstractions in everything, higher level languages and tools, from compiler to programming environment. However, it is to not really clear whether languages should take Erlang’s path and change fundamentally to handle distributed systems and concurrency. In fact, the authors even give a good example of a system architecture where concurrency should not be abstracted away altogether: distributed systems. In these systems, local resources (memory and objects etc) are generally fast to access, however accessing remote resources may be hundreds of times slower, or even worse, the resources may not be available at all. These differences should not be abstracted away and hidden for the programmer. An abstraction that allows us to handle different resources using the same techniques is fine, one which prevents us from knowing how our resources are allocated or created is not.

Probably a reasonable idea would be for a language to stay away from concurrency and other such restrictive abstractions or support them only through add-on libraries to facilitate what is a new programming paradigm. Obviously Erlang took the opposite path and built concurrency support into the core language. Would this be among the reasons Erlang not being mainstream?

Tuesday, October 20, 2009

Event based, Implicit invocation & Map-Reduce

Map-Reduce

Interestingly enough the idea of the MapReduce pattern is based on the “map” and “reduce” combinators from a functional language like Lisp. In Lisp, a map takes as input a function and a sequence of values. It then applies the function to each value in the sequence. A reduce combines all the elements of a sequence using a binary operation. For example, it can use "+" to add up all the elements in the sequence. Inspired by this concepts the MapReduce algorithm developed within Google offers a mechanism for processing large amounts of raw data, for example, crawled documents or web request logs.

The pattern is mainly related to Master Worker pattern. The Master-Worker parallel structural pattern is well-known for its load balancing characteristics when there are a large number of (relatively) small requests. In the big picture the mapping phase is carried out by a structure very similar to the Master Worker. Additionally MapReduce exhibits the reduce step, which it is fundamentally a concurrent operation with communication costs that are difficult to hide. Scalable reduction algorithms as PLPP and Reduction are crucial to use in this phase. The key to the load balancing characteristics of the Master-Worker is the dynamic assignment of requests to idle workers. In this structure workers execute at their own speed, taking and processing requests when they need more work. In order to avoid the large scheduling overhead of workers, MapReduce however may employ a slightly different mapping strategy. For instance, if the cost of computing the tasks are similar across the full set of tasks, if they can be ordered from large time to small time, a static round-robin schedule may more adequate.

Event based, Implicit invocation

Implicit invocation architectures differ from explicit invocation systems in that implicit invocation system components use events to communicate with each other. Connectors in such architectures are bindings between events and component methods. Because these bindings are determined dynamically at runtime, components are loosely coupled; there is no compile time determination of which method calls will be made. Loose coupling offers software architects the great benefit of increased flexibility and maintainability: new components can be added by simply registering them as event listeners.

Loosely coupled components work together, but do not rely on each other to do their own job. The interaction policy is separate from the interacting components, providing flexibility. Components can be introduced into a system simply by registering them for events of the system, aiding greatly in reusability. Introduction of new components does not require change in other component interfaces, providing scalability as new features are added. Overall, implicit invocation eases system evolution.

Event-based, implicit invocation is an example of a well-crafted architectural style with high cohesion and loose coupling. As such, it is one of the more broadly accepted architectural styles in software engineering. Examples of implicit invocation systems abound, including virtually all modern operating systems, integrated development environments, and database management systems.

Monday, October 19, 2009

On Armstrong Thesis Ch 2

Joe Armstrong advocates an architecture of inter-process communication solely via messages with no data sharing as the way to construct fault isolating systems. He claims this architecture also facilitates parallelism, however parallelism is not here the main goal. Obviously there is a potential efficiency associated with such an architecture, since components share nothing so a number of independent concurrent processes can be implemented on a multi-processor or run on a distributed network of processors. However parallelism is not a feasible approach when the application cannot easily be partitioned into a number of truly independent tasks. Employing parallelism enforces a new set of design constraints on the application, besides constraints for fault tolerance (e.g. no strong data dependencies between tasks).

Concurrent processes with no data sharing providing a strong measure of fault isolation, is the main design decision of the architecture. In this respect, the message passing abstraction hides underlying state changes that may be used in the implementation of sending messages. This is to limit the consequences of a software error. An important example of an architecture using a similar idea for enforcing fault isolation is the Microkernel architecture. The main goal of the microkernels is to make operating systems more reliable and more secure. With this in mind microkernels break the typical monolithic kernel architecture into a combination of minimal kernel running the privileged code and several servers running the not trusted code. The communication between the kernel and the server components is done through message passing.

A characteristic feature of microkernels is minimalism. Minimalism requires that OS’s trusted computing base (TCB) should be kept minimal. As the kernel (the code that executes in the privileged mode of the hardware) is always part of the TCB, minimizing it is natural in a security-driven and fault tolerant design. I mostly embraced Armstrong philosophy of designing a fault tolerant system. I guess it is a good approach of dealing with the problem of fault isolation while keeping a reasonable behavior. Looking at the microkernel approach I feel that the principle of minimalism should be also taken into account when designing the hierarchy of tasks. I agree that the most complex tasks should be organized on the first level that should be dealt with first. But how about the granularity of each level and applying the principle of minimalism of Microkernels in between layers? Employing such a design philosophy would mean breaking otherwise complex tasks from upper levels into a set of subtasks that can be downgraded to inferior levels in the task hierarchy. Consequently, the task hierarchy tree will tend to grow deeper. The task decomposition should of course keep the same invariants and behavior of the application. Designing a tool to visualize the task hierarchy would offer a very good illustration of this principle. Applications with deep hierarchical task trees will likely be more reliable than the rest.

Erlang makes no assumptions about reliable message passing. The programmer must write his or her application so that it works in the presence of unreliable message passing. I guess Armostron’s point is that since there is no guarantee for a mechanism not to fail, he advocates moving the responsibility of dealing with the reliability of passing messages between components, from the language or framework, into the components themselves and make it part of the philosophy of designing reliable components with respect to any type of interaction with the environment.

Thursday, October 15, 2009

On OPL Patterns

The first versions of the Pipes & Filters and Layered Systems patterns seem to have been written in the GOF form, which I personally prefer because it is a very structured form, breaking up the pattern into many headings: Intent, Motivation, Applicability, Structure, Participants, Collaborations, Consequences, Implementation, Sample Code, Known Uses, and Related Patterns. The GOF pattern are quite large, but if the user does not assume familiarity with the pattern the amount of information is very useful.

The new Pipes & Filters and Layered Systems patterns are written in the Compact form, which, in my opinion, for people closer to the development process, may not be enough. I understand that the intent is that longer pattern languages will benefit from this form because it is easier to read quickly, the structure of the overall language can be understood easier. However, since there is no doubt that both forms offer you at least a section with the crux of the problem (e.g. Intent/Motivation in GOF and Problem/Solution in Compact form), it feels like this short form supports mainly pattern theoreticians, than those who use the patterns in practice. And we all know that nothing is more useful than a code example. I cannot help but remember again a quote from Christopher Alexander: “People who study design methods without also practicing design are almost always frustrated designers who have no sap in them, who have lost, or never had, the urge to shape things”.

Iterative Refinement Pattern helps one exploit the concurrency implied by computational structures consisting of a sequence of high level steps which repeat until some exit condition is met, with each step, comprising a number of mostly independent computations. The Bulk Synchronous Processing (BSP) model, proposed by Leslie Valiant in 1990, is a decomposition explicit, mapping implicit model with communication being implied by the location of the processes and synchronization taking place across the whole program. BSP is designed specifically to support the Iterative Refinement Pattern.

A BSP machine consists of an arbitrary number of processors, each with local memory, connected by an interconnection network, providing functionality as data delivery and barrier synchronization of the processors. A BSP program is divided into supersteps, with each superstep consisting of a local computation for each individual processor, a global message transmission from each processor to any subset of the others and a barrier synchronization. At the end of the superstep, the transmitted messages become available as local data for the next superstep. Because communication all happens together at the end of the computation phase of the superstep, it is possible to perform automatic optimization of the communications pattern.

Coming to the question of what was hardest to understand about the Iterative Refinement Pattern, I feel I should mention again the minimalistic form in which the pattern was written. I think that understanding the solution to the iterative refinement pattern, and how the four parts (computation initialization, the sequence of computational steps, the collection of tasks that execute inside a step, the exit condition) work, an implementation example would have helped a lot.

Wednesday, October 14, 2009

On CHESS

“Heisenbugs” are bugs resulted from unexpected interference among threads, which are typically extremely difficult to reproduce and eliminate. This type of bugs may show up in systems that otherwise have been running for months and their cause may be as trivial as for example adding some extra debug statements. CHESS is a tool able to reproduce such bugs, which has been integrated into many code bases inside Microsoft.

The core design principle in CHESS is that, when attached to a concurrent program, it takes complete control over the scheduling of threads and asynchronous events, thereby capturing all the interleaving nondeterminism in the program. There are two main benefits resulting from such a design. First, in case of executions resulting in errors, CHESS is able to reproduce the erroneous thread interleaving. And secondly, CHESS offers methods for systematic exploration of possible thread interleavings. Consequently it can force every run of the program to follow a different thread interleaving, thus increasing the chances for finding errors in existing tests. Another consequence is that CHESS can find in simple configurations bugs that would otherwise be exhibited by more complex configurations, thus rendering useless artificial system stress tests. The tradeoff for having this design is that CHESS has to deal with the complexity of the concurrency APIs; it should understand the precise signatures of those threading and synchronization functions.

CHESS’s primary goals are capturing all the nondeterministic choices during the execution of the program and exposing them to a search engine that systematically enumerate the range of all possible executions of the concurrent program. This is the basis of reproducing and replay of a chosen concurrent execution. The challenge here is not only to integrate CHESS with complex APIs but also to make sure that it does not introduce extra behavior that is not related to the concurrent program.

The most important component of CHESS, the scheduler, is implemented by redirecting calls to concurrency primitives to alternate implementations from a wrapper library. Including an implementation of these primitives as part of the program is a design choice that may simplify the schedule which now needs to understand only simpler system primitives that the lock implementation uses. However the tradeoff with this approach is that it does not allow the scheduler to expose all the nondeterminism in the lock implementation.

Another design decision meant to simplify the implementation of the scheduler is to abstract the execution of the concurrent program using Lamport’s happens-before graph, which is a graph that captures the relative execution order of threads in a concurrent program. This approach offers a common framework for reasoning about all synchronization primitives used in the program which abstracts away the timing of instructions in the execution. In order to build the tree, CHESS determines whether a task may be disabled by executing a potentially blocking API call, then it labels each call to API by a triplet of task, synchronization variable and operation, and finally informs the scheduler about the creation and termination of a task.

One CHESS innovation is the way it deals with data races: Instead of using a dynamic data-race detection tool, CHESS schedules the outcome of data-races indirectly by enforcing single threaded execution. Consequently since no two threads can concurrently access memory locations, all data-races occur in the order in which CHESS schedule the threads. This may slow down the execution a bit but it can be compensated by running multiple CHESS instances in parallel, each exploring a different part of the interleaving state space.

Monday, October 12, 2009

On Reentrancer

Reentrancer is an Eclipse based refactoring tool that deals with reentrancy related problems in Java programs. The main refactoring steps are as follows:

- Handling library globals. Since removing access to Library globals is kind of hard to achieve, Reentrancer only reports possibly problematic accesses to global library state and leaves it to the programmer to handle them.

- Static fields are replaced by getter and setter methods and every access to these fields are replaced by calls to these accessor methods. This step is mainly designed to support the lazy initialization step. Since the transformation does not preserve static initializer semantics in all cases, Reentrancer have formulated preconditions that ensure that the initializers in an input program are safe for refactoring. Reentrancer's analyses for checking the refactoring preconditions are implemented based on the WALA framework, which is primarily used to compute call graphs from static initializers to discover their transitive callees.

- Reentracer then replaces static initializers with explicit lazy initialization. Static initializer methods are an impediment to reentrancy because they are only executed once, upon the first use of the declaring class. With the help of thread local state, the lazy initialization step will be executed once per thread. Since lazy initialization may alter the point at which initialization code executes thus potentially altering program behavior, Reentrancer performs a static analysis to reveal such cases and issues warnings to the programmer if needed.

- In the next step, global state is replaced with thread-local state with the effect that each thread will get its own copy of this state. This is accomplished by wrapping each appropriate static field in a java.lang.ThreadLocal object and using the methods ThreadLocal.get() and ThreadLocal.set() to read/write the value of the wrapped object.

- The final transformation creates a fresh thread for each execution of the application, thus ensuring that each execution observes gets a different copy of the static fields (now made thread local) and the lazy initialization is executed once per thread.

Reentrancer has been evaluated by observing its behavior on several single-threaded, non-reentrant benchmarks. For each individual benchmark, the reentrancy problems have been first confirmed to exist, either by observing failures when running existing tests in parallel or by writing new tests that exposed problems. After refactoring and checking that the preconditions are still met, the refactored version was run in parallel on a dual-core processor, which yielded performance benefits.

In the recent years, significant advances have been made in the area of automated tool support for refactoring. However, Reentrancer is the first enabling semi-automatic refactoring support for making existing Java programs reentrant.

On BA Chapter 14 (Reading the classics)

I guess the main idea of the chapter is more like a conclusion of the all BA book. Read the classics, get inspired from the good old beautiful architectures, but try to stay practical. This is because the difficult thing about theoretical ideas that also have some power is that they also need to be practical.

Underlying the complexity of beautiful architectures, one can usually discover some simple and elegant principles of functional organization and functional order. Discovering these principles takes intellectual engagement, crucial part of the experience and pleasure of architecture. Interestingly enough some of these principles may contradict each other in different contexts; however this should not diminish their importance to one’s baggage of experience.

An interesting example given by the author is how different classics look at inheritance. Formally, OOP principles would advocate making commonality explicit by using inheritance (“is-a”) whereas the GoF favors object composition (“has”) over class inheritance. One can still remember what a great impact had frameworks as MFC on Windows programming. Being able to just subclass classes supplied by the framework and avoid all the complexity of Win32 programming was a great achievement. At the same by carefully analyzing the “is-a” inheritance pattern, one can notice that it is really hard to come up with class hierarchies that really fit this pattern. Very often we notice methods in base classes that do not necessarily make sense in the derived class. This goes beyond the scope of public inheritance and it is usually a sign of bad design.

Smalltalk, with its innovative interface (when programming was done in a monochrome text terminal), has become a classic in itself: a pure object oriented language. Today’s popular object oriented languages, with their primitive types, are not purely object oriented. In Smalltalk, it is possible to create, modify or delete classes during runtime. This is basically the concept of metaprogramming and it is mainly possible because everything is an object. In Smalltalk we also have the notion of latent typing, which is the only typing mechanism available. In fact, latent typing is at the basis of generic programming via templates in C++. With latent typing, types are defined implicitly by what they do and by their interfaces. This offers another alternative to polymorphism because any type can be used anywhere it offers methods fitting the context. One can easily see the relationship to templates.

As classic famous buildings influence young architects, Smalltalk has left his mark on the computer science community. Adamant but beautiful, innovative in its programming model but not compromising, rich in programming concepts that were later embraced by many of the modern programming languages, Smalltalk is a classic that set examples in many ways and has been imitated by more practical approaches. As with building architecture, software architecture is a chaotic adventure at the end of which, beauty and usefulness should balance in harmony. And the recipe for this is reading the classic and practicing and experiencing on your own. Christopher Alexander, the architect father of design patterns, wisely noticed that “people who study design methods without also practicing design are almost always frustrated designers who have no sap in them, who have lost, or never had, the urge to shape things”.

Thursday, October 8, 2009

On ReLooper

Retrofitting parallelism into existing sequential applications is likely to be successful only for certain classes of problems such as for example the so-called embarrassingly parallel problems. These are problems where either computations are large, data sets are large, or some combination of the two, and where the processing that goes on is easily divisible. Re-architecting a complete system for parallelism may take significant amounts of time and efforts. When possible, refactoring for parallelism by using a refactoring tool is a good choice.

The paper introduces ReLooper, an Eclipse based refactoring tool for Java that reduces the burden of analyzing and rewriting parallel loops, and is fast enough to be used interactively. There are two ways in which ReLooper helps Java programmers who want to parallelize their programs by using ParallelArray: it helps them discover when parallelizing a loop is unsafe and it performs the messy conversion of the loop, selecting a good operator from the 132 that come with ParallelArray.

ReLooper relies on Java’s ParallelArray. There are many libraries that target concurrency and parallelism (e.g., Java's ParallelArray, Microsoft's TPL, Intel's TBB, OpenMP, MPI, etc.). The disadvantage of using such a library directly is that the parallel constructs provided by libraries are in general more verbose than parallel constructs provided by programming languages, thus they require many code changes. This is where a refactoring tool comes in handy. In addition, these libraries assume that all parallel computations do not interfere with each other, so they run without any synchronization. It is the programmer’s responsibility to verify non-interference; and this is why ReLooper is interactive, with minimal user input.

Automatic loop parallelization has been a long time topic in the Fortran community, with compilers having various degrees of success. Much of this work is done in the context of numerical computation on scalar arrays and does not deal with the problems posed by sharing heap-allocated array elements, which are common to object oriented languages.

Automatic parallelizing compilers perform full program analysis to find loops that can be safely and automatically parallelized. In contrast ReLooper attempts to parallelize loops explicitly chosen by the programmer, employing a demand driven analysis to signal problems that the human may have missed. The aim here is to quickly and correctly identify dependencies for most cases occurring in practice.

Some of the transformations presented in the paper make the code harder to understand (e.g., replacing a loop with a parallel operation). ReLooper infers the parallel operations based on the kind of array accesses in the loop. If a loop contains accesses triggering more than one operation, ReLooper chooses the most specific operation. This is both simpler to understand, and is faster at runtime.

Wednesday, October 7, 2009

On BA Chapter 13 (OO vs Functional)

Roughly, the differences between functional and objected oriented programming can be summed up as follows: In object oriented programming everything is treated as an object that has state and behavior. Among others this offers modularity and information hiding. In functional programming one is using a set of functions each of which performing a task. Selectively executing these function results in the solution to the problem at hand. While one of OO's first principles is modularity, functional programming is also achieving modularity, by means of systematic use of stateless functions, high level functions (e.g. combinators), lists and other recursively defined types, as well as lazy evaluation. However this is only fine-grain modularization. Functional programming seems to have no contribution to large-grain modularity, which pertains to software architecture.

The author thinks that the stateless nature of functional programming does not seem to affect in any way software architecture. While I agree with this thought, I also feel that this mostly applies to the development view of the 4+1 views. In the process and physical view however, being stateless may become an important feature. This is because purely functional programs have no side effects, which makes them trivially parallelizable. Nevertheless, by looking at how a _good_ functional program is written (e.g. as a set of modules that are not dependent on each other), in the development view it almost feels like it is designed with object oriented principles in mind.

Proponents of functional programming believe that having functions as first-class citizens lets you find more opportunities for abstraction, which means your code is smaller, tighter, more reusable, and more scalable. But having functions as first class citizens can be achieved in OO as well. However, since in the OO context, the only first-class citizens are, at runtime, objects, corresponding in the static structure to classes, a new construction has been designed: the notion of agents. An agent is an object representing a feature of a certain class, ready to be called. Following a parallel with functional languages, agents are high level routine objects; that is the equivalent of combinators (high level functions).

Reusability and dynamic binding are some aspects where OO excels. Depending on the problem, owe to commonalities captured by inheritance, the number of feature definitions may be significantly smaller in OO as if one were to implement the same problem in a functional language. By collecting the features applicable to many variants into correct abstract data types and grouping them toward the top of the inheritance hierarchy, one can benefit from effective code reduction. There seems to be no equivalent to inheritance in functional programming, where it is required to define the variant of every operation for every combinator, repeating any common ones. Dynamic binding is also essential with OO, because it removes the need for client classes to perform multibranch discriminations to perform operations. Dynamic binding solves what traditionally used to be a major source for architecture degradation and obsolescence.

After reading the chapter, my thoughts converge with those of the author’s, that OOD in his modern form, subsumes the functional approach, retaining its benefits while providing more alternatives for extension and reuse.

Monday, October 5, 2009

On BA Chapter 12 (KDE arch)

KDE is working proof that the Open Source Bazaar style is a successful software development model that can yield great technologies comparable and often even superior to some of the most complex commercial software. Now, in its 4.0 version, KDE has learned to deal with all the demands of a modern platform: pervasive concurrency, increasingly distributed processing, easier to use, cleaner and more beautiful interfaces, meaningful responsiveness, clean interactions with the software, high reliability, stability and data safety.

The social and coordinative aspects of the context in which KDE is being developed are even more interesting. In this environment achieving good communication and getting favorable consensus are imperative; these are usually hampered by distance, cultural barriers, one’s own preferences and prejudices. For larger and successful open source projects as KDE its structure is very important, as it needs to hold trademarks, receive donations organize conferences etc. In fact, finding a formal structure, to solve problems KDE faced once it became significant and at the same time it did not hinder the further development of the project, contributed most significantly to the long-term stability of the project community.

The Akonadi framework, one of the so called pillars of the KDE platform aims to provide access to the user’s personal information, the associated metadata, and the relationships between these data, as well as the services operating on them. This information is aggregated from a variety of sources, such as email and groupware servers, web and grid services, local apps that feed into it. Additionally Akonadi caches this information and provides access to it. Most of the important design decisions of the project have been taken over a series of meetings beginning with the meeting that brought the initial fundamental idea, which was the need for a robust, reliable, transactional storage layer, clearly separated from the interface and allowing concurrent multi-client access. In its current version Akonadi-PIM offers a type independent architecture, with a simple integration interface that allows for new types to be easily added by the user. This is achieved by the so-called serializer plug-ins, which are runtime libraries capable of converting back and forth from a type into a binary representation stored in the Akonadi server. From a concurrency point of view, an application generally resides in a different address space and can open one or more connections to the server, each represented internally by a thread. The storage layers rely on a solution inspired by the maildir standard, which allows lock-free ACID access by relying on atomic rename operations on the file system.

Another core library of KDE 4 is ThreadWeaver, a job scheduling library for concurrency, with its main purpose to manage and arbitrate resource usage in multi-threaded environments. By using job sequences and job collections registered in a global application job queue, ThreadWeaver assigns worker threads to the job queue thus decoupling it from the main application thread. Also, instead of relying on the direct implementation of queuing behavior, ThreadWeaver uses queue policies: one based on dependencies between resources and another based on resource restrictions. Priorities can also be used to influence the order of execution as well. Priorities can be changed without changing the job implementation by means of Decorators.

Refactoring Sequential Java Code for Concurrency

Concurrencer is a refactoring tool, which helps programmers refactor sequential java code with the help of j.u.c (java.util.concurrent). Compared with find-and-replace refactoring transformations, its transformations are more complex as they require program analysis and may spawn multiple, non-adjacent, program statements.

The motivation behind implementing such a tool is based on the fact that there are many cases where retrofitting concurrency is easier than rewriting it. Doing this manually is tedious as it may require changing many lines of code; it is error-prone because one can easily choose the wrong refactoring API when multiple of them are available; and at the same time omission-prone as it is always hard to manually choose the most efficient API alternative when more of them are available.

There are three types of behavior preserving refactorings supported in Concurrencer: converting int to AtomicInteger, converting HashMap to ConcurrentHashMap and the most laborious, converting recursion to ForkJoinTask. The first two transformations enable a sequential program to become thread safe, a thread safe program become more scalable, while the third makes a sequential program run concurrently, with obvious performance improvements. When refactoring, it is the programmer’s responsibility to identify all the shared data and a target refactoring. It is then Concurrencer’s responsibility to analyze all the accesses to shared data and apply the transformations in the best possible way.

When refactoring to AtomicInteger and ConcurrentHashMap there are interesting aspects regarding the initialization and field access of the affected variables, however what is worth mentioning is how Concurrencer is dealing with existing synchronization primitives. Typically if the original code contains synchronized accesses around the fields, Concurrencer removes them since this becomes superfluous after the transformation as the new types have built-in thread safety.

The third and the most interesting transformation, relies on the support of the ForkJoinTask framework available in Java 7. Divide-and-conquer programs are the best candidates for this framework offering support for fine grained parallelism in computationally intensive applications. Using the user supplied threshold, which defines the size of the problem that can be solved sequentially, along with the divide-and-conquer method, Concurrencer creates the task java classes by subclassing from RecursiveAction class and encapsulating the parallel computation of the original recursive method, changes the base of the recursion, replaces recursive calls with task instantiations, then executes the parallel tasks and combines the results of the subtasks.

While covering most common refactoring scenarios, Concurrencer is not complete. However even though the approach is neither sound nor complete, it is still useful as it saves programmer’s time overall. Moreover, it is very easy to extend these transformation patterns to cover more refactoring scenarios.

Wednesday, September 30, 2009

On A Java Fork/Join Framework

Fork/Join is a simple design technique acting as the parallel version of the well known divide and conquer algorithm. As the name suggests the algorithm is based on two basic operations, fork - by which the current task starts a new parallel task, and join, which forces the current task not to proceed until the current subtask has completed. These two methods confer the algorithm its recursive nature, with tasks repeatedly splitting subtasks until they are small enough to solve using simple sequential methods.

While at a first glance, any framework (e.g. pthreads, java threads) supporting threads creation and ways to make them wait on their completion would seem a good candidate for implementing the Fork/Join framework, standard thread frameworks are in general to heavy to support the Fork/Join programs. One reason for that is that the synchronization requirements for fork/join tasks are more limited than those for regular threads, so the overhead associated with tracking blocked general purpose threads is wasted. In fact, the significance of the thread overhead is increasing with the granularity of the tasks. Consequently, there should be a good balance between task granularity and thread overhead that should maximize the effect of parallelism. Cilk, one of the first frameworks to tackle these problems, implements fork/join support on top of an operating system’s basic thread or process mechanisms. FJTask framework offers a variant of the design used in Cilk, but with a Java implementation, benefiting from its portability. The core design principle of these frameworks is to map the task to threads as an operating system maps threads to CPUs, but exploit the simplicity, regularity, and constraints of fork/join programs in performing the mapping.

FJTask architecture is simple: a pool of standard threads, as many as CPUs on the system is allocated, with each worker processing tasks held in an internal deque. The JVM and OS should be trusted to allow the mapping of tasks to different CPUs. Fork/join tasks (FJTasks) act as lightweight executable classes, by implementing the Runnable interface and its run method. The scheduling mechanic, inspired from Cilk, is the core of FJTasks framework. Each worker thread maintains runnable tasks in its internal deque, supporting push, pop, take operations. Subtasks generated in tasks run by a given worker thread are pushed onto that worker's own deque. Workers process their own deques in LIFO, or can steal tasks from others workers (using a FIFO), when they are out of tasks. When idle, workers enter a special priority adjustment sequence with attempts to get new tasks, which may eventually end up with the worker blocking until another task is invoked from top level. The mechanism is great because it reduces contention by having stealers operate on the opposite side of the deque as owners. Also, with this scheme programs adopting small task granularities for base actions are likely to run faster than those that only use coarse grained partitioning.

A potential drawback of having such a framework implemented is Java is related to garbage collection. At the first thought, with fork/join programs generating huge numbers of tasks, GC may come in handy when all the tasks are done processing and should quickly turn into garbage. However if garbage generation rates force frequent collections, this may affect the scalability of the framework, because stopping threads for collection takes time approximately proportional to the number of running threads.

On BA Chapter 11 (Emacs Architecture)

With its development beginning in the mid-70s and continuing actively even today, “Emacs text editors are most popular with technically proficient computer users and computer programmers. The most popular version of Emacs is GNU Emacs, a part of the GNU project, which is commonly referred to simply as Emacs.”

Structured as a Model-View-Controller architecture for interactive applications, with the Model, the underlying representation of the data, the View presenting the data to the user and the Controller taking care of the user’s interaction with the View, Emacs delivers a very influential architecture. And this comes from Emacs’s most striking feature, extensibility.

Applications like Eclipse, Firefox and other architectures are extensible through user extensions. In a similar manner, if the user wants to customize Emacs to meet his own needs, he has to write his own customization code in Lisp. Emacs Lisp, the flavor of Lisp implemented by Emacs, is key to Emacs’s ability to accommodate a wide range of new functionality. This is mainly because Emacs Lisp acts as an important abstraction boundary, which hides away the complexity of the Lisp interpreter and of the underlying processor architecture. Through the contributions of the users (now isolated from details), Emacs has grown more into some sort of platform rather than a unified whole, a platform comprising of a multitude of Lisp packages. In fact, the concept of Emacs is more than that of an editor. Same even say that Emacs was many years ahead of his time.

Another interesting aspect of Emacs is the Emacs compiler, capable of translating Emacs Lisp source files into a special representation known as bytecode. Compared to source files, bytecode files load faster, occupy less space on the disk, use less memory when loaded, and run faster. From this description, the analogy with the Java, .NET and other interpreted languages seems to be clear. One can almost claim that Java is an extension of Emacs as a platform concept: it has a byte compiled, garbage collected language, a display library and network functionality, etc.

The dissociation of the Lisp code from handling events from an event loop because of the automatic display management is another great important feature which speaks again for the simplicity of integrating new features into Emacs. The same automatic display management we notice with JavaScript, which only modifies the DOM tree representing the web page, while the browser takes care of updating the display when needed.

Even from only the perspective of designing an editor, analyzing the Emacs architecture, one may notice again design decisions that make Emacs unique. For instance, in order to manipulate text, the Model is a buffer, which is a flat string, where newline characters mark line endings. This is way simpler than the approach of most other editors where the text is represented as an object, a data structure, a tree etc. This representation matches perfectly with the Emacs Lisp primitive operations on buffers that can insert, delete, extract portions of buffers as strings etc. With its UI design, with frames, windows, with its ability to manipulate commands and output of commands inside the editor, with its command line, Emacs, as an editor, has again a different fresh approach, which makes its proponents swear nothing else comes even close.

Sunday, September 27, 2009

On OPL

The paper on OPL offers a great survey of patterns restricted to describing the design of parallel software; it is basically concerned with software architecture and ways to design and implement parallel algorithms. Its target audience is the application programmer and not compiler writers or OS or parallel libraries developers. OPL is also not specific to any specific application domain.

OPL is structured as stacked layered system that defines five categories: architectural patterns - describing overall organizations of a parallel system and how the computing elements interact, computational patterns - describing the core classes of computations that make up the application, parallel algorithm strategy patterns - covering the methods to exploit concurrency in a parallel application, implementation strategy patterns – parallel program organization and common data structures specific to parallel programming and concurrent execution patterns.

While being familiar with many of the patterns I also found some that I was not that familiar with. Therefore, I appreciate the initiative of the authors to define the OPL layers and list the patterns and I also look forward for their next steps where they promised to follow up with pattern descriptions and careful review.

From the implementation strategy patterns the Master-worker/Task-queue is an interesting one. Structurally, the pattern is represented as a Master, maintaining a task queue and controlling a group of processing elements or workers. Usually, only one master and several identical worker components simultaneously exist and process during the execution time.

In this pattern, the same operation is simultaneously applied in effect to different pieces of data. Operations in each worker component are independent of operations in other components. The structure of the solution involves a central Master that distributes data among workers by request. Parallelism is introduced by having multiple data sets processed at the same time.

The tasks or the data pieces may have different sizes. This means that the independent computations of each task should adapt to the data size to be processed, in order to obtain automatic load-balancing. Also, the coordination of the independent computations has to take up a limited amount of time in order not to impede performance of the processing elements. The solution has to scale over the number of workers. Changes in the number of workers should be reflected by the execution time. Improvement in performance is achieved when execution time decreases.

On BA Chapter 10 (Jikes RVM)

The core principle of Jikes RVM is the meta-circularity, or the property of being a self-hosting runtime. While this is an important principle for compilers, many runtime environments are not written in the language in which they typically run, which may have certain limitations. For instance, for a Java runtime written in C or C++, a bug in the memory safety may have serious consequences, even though the Java application itself does have memory safety. Having a self-hosting environment also allows the runtime to easily take advantages of better libraries and abstractions. Very often, the application and the runtime need to communicate with each other. Implementing this functionality is considerably more complex, when the runtime and the application running on it are written in different languages. Meta-circularity also helps the developers to gain from the features they introduce by relying on a system where application, runtime and compiler have a consistent view of the system.

Jikes RVM does not include an interpreter. All the bytecode must first be translated into native machine code. An initial, basic and non optimized compilation is performed by the baseline compiler, which relies on lazy compilation. With lazy compilation, methods are compiled first time when they are invoked by the program. In the later execution stages, the adaptive optimization system starts detecting program hot spots and selectively recompiles them with the Jikes RVM’s optimizing compiler. Selective optimization is the key to enabling the deployment of sophisticated optimizing compilers as dynamic compilers. Since Jikes RVM is written in Java, the implementation of the adaptive optimization system has inherent benefits such as threads and monitors to structure the code. Consequently, compiler tasks are being carried out by separated threaded components that run concurrently in the java thread safe environment. Also, the easy to understand Java collection libraries confer simplicity to each component, hiding away details of the underlying data structure management.

Another benefit of Jikes RVM that comes from using Java is the clean threading interface between the language and the operating system threads, which allows Jikes RVM to have different underlying threading models in order to adapt to new programmer behaviors. For garbage collection Jikes RVM relies on Memory Management Toolkit (MMTk), which provides a powerful and popular set of precise garbage collectors. As MMTK is also written in Java, it can be directly linked into the code being compiled for efficiency. Thus, during the initial creation of object representation, MMTk naturally comes into play providing Jikes RVM with iterators for process references, object allocation and barrier implementations. Obviously, with Java’s threading model, all garbage collectors are parallel and integrate with the runtime model.

Jikes RVM provides not only a very performant virtual machine but also a very promising research platform. The extensibility coming from its meta-circular nature provides a great platform for multilanguage virtual machine research, an extension that would also allow aspects of Jikes RVM to be written in different programming languages. Providing aspect oriented programming within the virtual machine, making the entire virtual machine into an OS to remove barriers to runtime optimization are only a few other interesting extension that can be built into Jikes RVM.

Thursday, September 24, 2009

On BA Chapter 9 (JPC Emulator)

JPC is an x86 emulator written in pure java. Its greatest feature is abstracting away the underlying hardware and operating system. In other words x86 code is converted to Java byte code, which in turn is interpreted by the processor specific JVM. And since there is no native code in it, JPC can emulate all the standard components of an x86 PC while remaining entirely inside the browser. While this architecture provides great flexibility and means to isolate the software behind two independent verified security layers (Java Applet sandbox and JVM), designing the architecture was a rather difficult task.

Disregarding JPC’s biggest challenge, emulation speed (and how the JVM architects tackled java performance limitations to achieve speed), and also the obvious benefits of JPC that come from running virtual hardware in isolation, I think that JPC as a project has a great vision. I personally liked Bochs and Qemu for their support for debugging kernel code. For instance, Bochs can set breakpoints in any kind of software (even if it is compiled without debugging info!), and provides an additional "debugging out port" you can easily access from within your kernel code to print debug messages. Qemu also can be configured to listen for a "gdb connection" before it starts executing any code to debug it. And those who spent sleepless nights trying to debug a kernel oops would greatly appreciate these features.

However, I think JVM’s vision is far beyond just being another emulator. Host platform and OS independency, the ability to run a virtual machine over the Web are great achievements. One can have its hard drive reside on his own server on the Internet, and access it from anywhere in the world by loading a local JVM and pointing it to the server. Furthermore, the core emulation task can be carried out on a remote server. While having the screen output and user input pushed via the Internet to the virtual machine owner we can imagine a model where there is no one to one mapping from user to virtual machine. We can model this as N users using M virtual machines. With this concept, if a machine is idle, any one of the users can use it, remotely launching a JPC image to work on their personal disk image data. This mostly fits users who use larger batch farms to run massively parallel tasks, such as rendering frames, optimizations, etc.

With cloud computing becoming more popular, JPC designers wonder why not use the millions of idle desktop computers worldwide instead and save the financial and environmental costs of using a datacentre. Old issues associated with cloud computing are easily overcome by JPC’s architecture. Gaining the trust of potential computing power donors would no longer be an issue as JPC provides a very secure approach. The cost associated with downloading, installing, maintaining foreign software is always a significant problem for the donor and/or system admin, but this is again no issue because all these tasks will be confined to the boundaries of the virtual machine. Last but not least, the available hardware and operating system provided by the donors will most likely be heterogeneous, but this again is not issue for JVM.

However, the road to creating an emulator hardware and OS independent, without compromising the requirement for maximum performance, was hard. In order to harness the complexity of the IA-32 architecture, the JPC designers had to come up with a clear and modular design and at the same time exploit all the little performance tips associated with the java language. I particularly liked a phrase where the authors catch the real measure of designing and implementing an emulator as opposed to designing hypervisors. During the design of JPC, they felt like they became “as schizophrenic as the codebase was”. And that was because ultimately, not having access to the memory or processor systems, called for different design decisions, that aimed for code clarity and modular design. While working within the processor and memory system, the design usually aims for ultimate performance, so breaking modularity or isolation between layers is often a good design choice.

On Adaptive Object-Model Architecture

The Adaptive Object-Model Architectural (AOM) Style provides an alternative to usual object-oriented design. While the traditional OOD generates classes with attributes and methods, for every business entity, AOM does not treat business entities as first class objects. Instead AOM represents classes, attributes, relationships and behavior as metadata. Every domain change that would otherwise, in OOD, require recompilation, can be performed by the actual user that can change the metadata. Furthermore, the metadata is interpreted at runtime, which means, if a business rule changed it is immediately reflected in the running application. Consequently the model is adaptable.

Transforming an OOD class hierarchy into an AOM one only makes sense when the behavior between subclasses that would represent business entities is very similar or can be broken out into separate objects. This usually reduces the number of classes in the object model and creates a class structure that does not change. Changing the spec of an AOM application usually means changing the content of the database where metadata is saved. So we can safely say that AOM, when applicable, reduces time-to-market by allowing users to experiment and provide feedback.

The adaptability of the system is implemented by means of a few design patterns as TypeObject, Properties, Composite and Strategy. Strategy and RuleObjects are usually used to specify naturally complex rules, while combinations, of rules (e.g. predicates and sets) are modeled through the Composite pattern. Metadata is usually read and interpreted first time when the object model is instantiated and also at runtime when the business rules should be applied. The easiest way to implement the metadata persistence model is by means of object oriented databases. However it is also possible to use a relational database model and even XML.

While very benefic for systems constantly changing, or for those that want to enable their users to dynamically configure and extend their system, the Adaptive Object Model has also disadvantages. Among these, the most important is the complexity of implementing such a system. Beside the several design patterns involved, the system should also provide new tools and GUIs for defining the objects in the system. Further complexity is added from having to implement the model interpreter and by the fact that two object systems coexist: the AOM model that is interpreted and the interpreter itself written in an object oriented programming language. Finally, since the tendency for an AOM is to lead to a domain specific language, all the problems associated with developing a language, such as providing debuggers, version control, etc., there will be extra burden for the AOM designer. On the performance side, as with every interpreter there are certain performance issues associated with AOMs.

Nevertheless, when applied correctly, AOMs are very interesting design models. While developers writing them have split opinions (with the ones understanding it, praising it, and the others claiming that it is too complex), the architects developing AOMs are usually very proud of them.

Monday, September 21, 2009

On BA Chapter 8 (Tandem Architecture)

Tadem is a fault-tolerant computer system, marketed to the transaction processing customers, using ATMs, banks, stock exchanges, etc. Guardian, the OS running on Tandem machines of the NonStop series, was designed in parallel with the hardware, in order to provide fault tolerance with minimal overhead costs.

In many ways, Tandem/16 was a revolutionary system, however with an unexpectedly low impact on the industry and design of modern machines. This is most probably because Tandem/16 was very different from most systems and it was developed in a purely commercial development.

A key design principle of the Tandem’s fault-tolerant architecture was modularity, both hardware and software being decomposed into modules, acting as units of failure, diagnosis, repair and growth. Modularity was very important for Tandem as a fault tolerant system, because individual modules had to be replaceable online. Furthermore, the isolation that comes with modularity decreases the chances that the failure of one module affects the operation of another. In Tandem, the process model and the messaging system are the two important mechanisms used in implementing fault isolation.

Furthermore, each module is designed based on Fail Fast principle. By implementing a mechanism of self checking, each module is designed to either work properly or stop, first time when it detects a fault. This is imperative for guaranteeing data integrity in the event of a failure.

Another important design principle for Tandem as a fault tolerant system was Single Failure: when a hardware or software module fails, its functionality is immediately taken over by another one, given a mean time to repair measured in milliseconds. For instance, for a CPU, there is always a second CPU, ready to assume duty in case the first one fails. The same goes for a running processes, that always run in process pairs, a primary and a backup process.

Tandem is also designed to support online maintenance. Hardware and software can be diagnosed and repaired while the rest of the system continues to deliver services to the user. Hardware components, data and programs can be reintegrated into the system without interrupting the service.

The general feeling is that Tandem was a revolutionary machine, but it was the small things that got in Tandem’s way of imposing its visions. Naming issues and certain incompatibilities, as for instance the interprocess communication unusual concept, are just a few of the issues that prevented Tandem from being broadly accepted. In the nineties, factors as computer hardware becoming generally more reliable and significantly much faster accelerated the decline of the Tandem architecture.

On The BIG BALL OF MUD

A BIG BALL OF MUD is a casually, structured system, whose organization or lack of organization, is rather dictated by expediency than design. Its success and popularity speak of the BIG BALL OF MUD, as architecture in its own right. However, the big questions still remain: why these kinds of systems are architecturally undistinguished and still so popular, what makes good programmers build ugly systems, and what can we do to improve them?

The root cause of BIG BALLS OF MUD is complex. Factors as the time constraints, the cost of investing in the architecture of a new domain (whose benefits are initially hard to estimate), the experience and skill of the designer, the inherent complexity of the application domain and the scalability issues associated with design decisions can all contribute to the appearance of the BIG BALLS OF MUD. The very nature of software architecture as hypothesis about the future, that holds that subsequent change will be confined to that part of the design space encompassed by that architecture, seems to give a philosophical explanation for the existence of BIG BALLS OF MUD architecture.

From systems emerging from quick-and-dirty code (THROWAWAY CODE), that was intended to be used only once and then discarded to those with well-defined architectures, the architectures are all prone to structural erosion. With time, those clean architectures may become overgrown as PIECEMEAL GROWTH gradually allows elements of the system to sprawl in an uncontrolled fashion. And the only way to deal with entropy in software is to refactor it. A sustained commitment to refactoring can keep a system from subsiding into a BIG BALL OF MUD. It is also important to KEEP IT WORKING, from the point of designing a change, through implementation, testing and maintenance. By taking small design steps in any direction, we can make sure that it is never more than a few steps back to a working system. Daily builds or keeping around the last working version are successful maintenance practices. The importance of testing in keeping a working system is emphasized by both the traditional Waterfall approach as well as newer techniques as Extreme Programming. The dynamics of a growing architecture are very complex. The system itself and all of its components evolve at different rates, with the general tendency for the components that change faster to become distinct from those employing slow changes. The SHEARING LAYERS form between the components and identifying these layers, understanding component interactions and grouping components based on how similar their change rates are, help balancing adaptability and stability, forces that are usually in constant tension. Often times, when facing the mess of the BIG BALL OF MUD, the architect should choose between SWEEPING IT UNDER THE RUG and RECONSTRUCTION. However, distilling meaningful abstractions from a BIG BALL OF MUD is a difficult and demanding task, requiring skill, insight, and persistence. At times, RECONSTRUCTION may seem like the less painful course.

In the end the authors note that there are good reasons that good programmers build BIG BALLS OF MUD and accept that expedient programming is, in fact, a state-of-the-art strategy. While they agree that, casual architecture is natural during the early stages of a system’s evolution, the authors also hope that at least there are ways that we can do better. The key to finding those ways is learning about the domain and the architectural opportunities looming within it, as the system grows and matures.

Wednesday, September 16, 2009

On the Layers Pattern

Layers is a common architectural system fit for large systems that structures applications so that they can be decomposed into groups of subtasks such that each group of subtasks is a particular level of abstraction. It is common sense in software architecture that implementing an application following a layered model has more advantages than the monolithic approach. A direct advantage is logical segmentation, facilitating team development, incremental coding and testing.

It is essential that within a layer all components work at the same level of abstraction. The main structural property of the pattern is that the services in layer J are only used by layer J+1 and there are no further dependencies between layers. In other words, each individual layer shields all lower layers from direct access from the higher ones.

Finding the right decomposition is not trivial. Defining the layers can be done in bottom-up fashion, following a ‘yo-yo’ approach or else by refining the structure based on a sequence of steps, which involves defining the abstraction criterion for grouping tasks into layers (usually the distance from the platform – the first layer), determining the number of layers, layer naming and task assignment to each layer and finally specification of services between layers. Strategies involving refining seem natural choices as it is generally impossible to define an abstraction criterion right before defining the layers and their services. Defining the components and services first and later forcing the layer architecture based on usage relationships is also problematic since the pattern does not capture an inherent ordering principle. This means that new components, usually added to the architecture as part of system maintenance, may easily break the strict layering principle.

Once the decomposition is defined, specifying the interface to each layer and the communication between them, structuring the layers, decoupling adjacent layers and last but the not least, designing an error handling strategy are actions that are equally important for designing a good layered architecture. Decoupling layers is usually a nice exercise of design. A top-down, one-way coupling can be achieved by fixing the interface and the semantics of the previous layer. While this allows for top-down communication, for bottom-up communication one may use callbacks. In OOD, one can decouple the lower layer from the upper layer, and even have the upper layers change implementation of lower ones at runtime by means of base classes. This principle is at the basis of Layering Through Inheritance.

I once read about an interesting application of this pattern where the layers are actually used to isolate unrelated concepts that are part of the same application. Having multiple unrelated concepts is common in large systems. The usual tendency to tie these concepts closely together may lead to applications that are hard to implement, change and even understand. Writing a layer that isolates all the concepts from one another solves the issues above with the tradeoff of having a possibly very complex layer, which represents a single point of failure.