Workshop Abstracts


Automatic Data and Computation Partitioning on Scalable Shared Memory Multiprocessors

Sudarsan Tandri and Tarek Abdelrahman
University of Toronto

This paper describes an algorithm for deriving data and computation partitions to improve memory locality on scalable shared memory multiprocessors. The algorithm first determines computation partitions by establishing affinity between where the computations are performed and where the data is located based on data accesses in the program. In the process of deriving the computation partitions, static data partitions are derived for the arrays. These static partitions may not be the best choice for some arrays. The partitioning of these arrays is reconsidered by evaluating the costs of all possible partitions for each such array individually using depth-first search with pruning. The cost of a given data partition is computed using knowledge of cache, local and remote memory access costs as well as the costs of contention and synchronization. Experimental results from a prototype implementation of the algorithm demonstrate that it is computationally efficient and that it is effective in parallelizing standard benchmarks. The results also show the necessity of taking shared memory effects into account.


dHPF: A Platform for Data-Parallel Compiler Research

Vikram Adve
Rice University

In this talk, I will provide an overview of the dHPF compiler effort at Rice University. The project so far has focused on techniques to achieve high performance for regular HPF applications using a Message Passing Interface (MPI) communication substrate. In addition to traditional optimizations, several features distinguish dHPF from previous generations of data-parallel compilers:


Two-dimensional Data Distribution with Constant Topology

Jordi Garcia, Eduard Ayguade and Jesus Labarta
Universitat Politecnica de Catalunya

Data distribution is one of the key aspects in the design of a parallelizing compiler for a distributed-memory architecture to get efficiency from the system. This paper describes the enhancement of our automatic data distribution tool in order to generate two-dimensional distributions. Our tool automatically derives the data mapping for the arrays and the parallelization strategy for the loops in a Fortran 77 program. The layout generated may be dynamic, and the distribution fashion is either BLOCK or CYCLIC. All the information regarding data movement and parallelism is contained in a single data structure named Communication-Parallelism Graph (CPG). The CPG is used to model a minimal path problem in which time is the objective function to minimize, and solved using a general purpose linear programming solver, which finds the optimal solution for the whole problem. The tool actually asumes that the processors grid is two-dimensional, the topology is fixed, and that no changes will be made during execution time. This framework is a first step towards a more general case, in which the processors topology may change along the program execution.


A General Solution to the Alignment Problem
on Distributed Memory Machines

Claude G. Diderich, Swiss Federal Institute of Technology
Marc Gengler, Ecole Normale Superieure de Lyon, LIP

The two major aspects one has to address when automatically parallelizing loop nests on distributed memory computers are the maximization of the parallelism and the minimization of the communication overhead due to non local data accesses. We study the problem of finding computation mappings and data distributions that minimize the numbeer of remote data access for a given degree of parallelism. This problem is known to be NP-hard. The algorithm presented uses a linear algebra framework and assumes affine data access functions. It proceeds by enumerating all interesting bases of the set of vectors representing the alignments between computation and data accesses. Comparison with related work shows that previous results are special cases of our formulation. We discuss exact enumeration techniques as well as heuristics.


Modeling trade-offs in automatic performance
optimization for message-passing architectures

Arjan J.C. van Gemund
Delft University of Technology

In program optimization based on performance prediction, feedback is either based on a model of the program or on a model of the actually generated machine code. Especially in the case of a message-passing architecture, the difference in abstraction is large. In this paper we study the trade-off between modeling at program level and machine level in the context of automatic performance optimization for message -passing architectures. We present a modeling technique based on modeling the various optimizations in terms of resource contention. Despite the abstractions involved, we show that program level modeling is indeed adequate when this technique is used. We demonstrate the effectiveness of our approach by deriving various optimizations of a line relaxation kernel for distributed-memory machines.


On the Applicability of Program Comprehension Techniques to the Automatic Parallelization of Sparse Matrix Computations

Christoph Kessler
University of Trier

Space-efficient data structures for sparse matrices are an important concept in numerical programming because they allow for considerable savings in space and time compared with common two-dimensional arrays. Unfortunately, for such programs it is usually impossible to statically determine all data dependencies. Thus, automatic parallelization of such codes is usually done at run time by applying the inspector-executor techniquee, incurring tremendous overhead. Program comprehension techniques exploit knowledge on frequently occurring implementation variations of important computations. They have been shown to improve many important fields of automatic parallelization of dense matrix computations, such as automatic program transformation and local algorithm replacement, data flow analysis, array distribution, and performance prediction. In this study we investigate up to which deegree this approach could be generalized to sequential codes implementing sparse matrix computations, and how much additional information, e.g. on run-time values of some variables, is needed. We propose a speculative program comprehension and parallelization method combined with compile-time program flow analysis techniques. Only where static analysis does not supply sufficient information, a parallelized run time test must check wether a code fragment speculatively recognized and parallelized at compile time as a potential sparse matrix computation, really implements this computation. Beyond automatic parallelization, program comprehension for sparse matrix computations may also support program maintenance and debugging, and could guide the selection of a more suitable data structure for a sparse matrix.


Data-centric Transformations

Induprakas Kodukula, Nawaaz Ahmed and Keshav Pingali
Cornell University

Data reuse is imperative for good performance on modern shared memory machines because of the deep hierarchy of the memory architecture of these machines. However, programs with good data reuse are much more complex than the usual natural expression of the algorithms they implement. The dense linear algebra community has addressed this complexity with the use of libraries such as LAPACK. LAPACK dealing with only two levels of the memory hierarchy, is quite complex. For multiple levels of memory hierarchy, the complexity of libraries will only increase.

Standard compiler technology to address data reuse has been "tiling", which works quite well for perfectly nested loops. However, for imperfectly nested loops, either code sinking must be applied to convert them to perfectly nested loops after which tiling can be applied, or a sequence of alternate transformations such as index-set splitting, loop fusion and distribution must be applied in some order to achieve data locality. In either case, it is not clear how the compiler can automatically choose the right sequence of transformations to perform. It is also unclear whether these approaches work for multiple levels of memory hierarchy.

In previous work, we addressed the data reuse problem by using a novel approach to program restructuring, which we called data-centric to contrast it with the traditional approach to restructuring programs, which is control-centric. Our tools reasoned about the data and its flow directly instead of dealing with loops, loop transformations and iteration spaces. In this framework, loops were simply an implementation of the desired data flow. The key idea was to partition the data arrays into blocks, walk the blocks in some order and execute some set of iterations "associated" with that block. By dealing with data directly, we showed how to seamlessly deal with perfectly as well as imperfectly nested loops, avoided the phase-ordering problem inherent in control-centric methods, and reasoned naturally about data locality for multiple levels of the memory hierarchy.

One of the limitations of our previous approach was a limitation on how "walks" on the blocked data could be performed. This led to an inability to handle certain classes of programs such as relaxation codes over arrays. In this paper, we develop generalizations of the mechanisms we developed in our earlier work to remove this limitation.


Automatic Data Layout with Resource Constraints

Uli Kremer
Rutgers University

The memory requirement characteristics of a data layout are particularly important for applications that are executed on a parallel machine mainly because of the amount of main memory that the machine provides, rather than its computation power. It may not be possible to executing such a memory intensive program on a conventional uniprocessor due to the lack of the necessary memory resources.

This talk discusses a framework for automatic data layout that considers read--only data replication and minimizes the overall execution time under a given memory constraints. The framework can be used to generate data layout specifications with additional read--only array copies. Many applications use arrays that keep an assigned value over large portions of the program. In such read--only regions, multiple read--only copies of the array, each copy with a different data layout, may avoid otherwise necessary communication. As a result, the overall execution time can be significantly reduced. However, multiple copies will increase the memory requirements of the program.

The presented approach addresses the necessary tradeoff decisions between read--only replication and memory requirements in a new, unified framework that extends our previous framework for automatic data layout. As in our previous work, the data layout selection problem is formulated as an efficient 0--1 integer programming problem. Preliminary experiments show the performance tradeoffs between the new and old formulations.


Interprocedural Array Alignment Analysis

Erwin Laure and Barbara Chapman
University of Vienna

The specification of efficient data distribution schemes is one of the major tasks in programming DMMPs with state of the art parallel languages. Although there are no optimal strategies for generating such data distributions, several heuristics have been developed to provide some support to the user. Alignment analysis, for instance, is able to help find a good distribution scheme and is furthermore a useful preresquisite for automatic data distribution tools. We presented an overview of an automatic alignment analysis tool within the framework of VFCS elsewhere, which is able to automatically generate alignment proposals for the arrays accessed in a proceduree and thus simplifies the data distribution problem. In this paper we extend our previous work to interprocedural analysis taking into account dynamic realignment. This feature is essential for applying alignment analysis to real programs. We present an heuristic algorithm for interprocedural dynamic alignment analysis as well as our cost models based upon sequential profiling data. Furthermore we show that the locality concept introduced elsewhere for intradimensional alignment preferences proved to be applicable and helpful also in this context.


Data Privatization/Localization Method for Automatic Parallelization on DMMs

M.A. Navarro (1), D. Padua (2), Y. Paek (2) and E.L. Zapata (1)
(1) University of Malaga and (2) University of Illinois at Urbana-Champaign

The Polaris restructurer developed at Illinois is a source-to-source translator of conventional Fortran programs, generating parallel codes for various types of multiprocessor systems. In order to evaluate the compiler techniques implemented in Polaris, we apply these techniques to real sequential programs, experiment the parallel codes on the target machine, and analyze the performance for each program. Through the performance analysis, we identify several techniques that would improve the original performance. Our recent efforts have been focused on to identify these techniques for Polaris targeting distributed memory multiprocessors. Some optimization issues, such as reduction communication overhead and more sophisticated data privatization techniques have been evaluated.


Generation of distributed object oriented programs

Pascale Launay and Jean-Louis Pazat
IRISA, France

Today's applications have to work both on distributed systems and over large networks of computers (eg. internet). The well known client/server paradigm is often too restrictive and program designers have to define new paradigms for cooperative work, games, distributed databases or information retrieval. These new applications have also to be scalable and should not need to be rewritten for different network configurations. This is very difficult because parallelism and mapping of program objects interact and have often to be taken into account in the design phase.

Our claim is that large grain parallelism can be easily defined at an early stage of program design, because this parallelism is tightly linked to the application itself whereas mapping of program objects should be done later on and should not interfere with the program design. We propose to separate parallelism and mapping.

The application designer is provided with an object oriented parallel language using a global name space. An easy way to do this without any modification of a language is to provide a library for parallelism (such as the Thread package of Java). The application designer can describe the mapping of some objects using program annotations (embedded in comments for example).

Then, an interactive tool is used to propose the mapping of other objects. Heuristic techniques similar to those used in the context of task and data mapping can be used. At this stage the user can modify the mapping according to his/her insight.

In a second step, the program is automatically transformed into a set of communicating processes by a source-to-source compiler.

For the design of our prototype we have chosen to work on the Java language because this language is both simple and widely used. We use the RMI (remote method invocation) package for implementing communications among processes in the transformed program. The current state of the project is the following: program transformations have been defined, and test cases have been coded by hand. The next phase of the project is to embed these transformations in the Centaur environment which will allow us to demonstrate this approach.


Data Parallel Language Extensions for Exploiting Locality
in Irregular Problems

Guillermo P. Trabado and Emilio L. Zapata
Universidad de Malaga

Many large-scale computational appications contain irregular data access patterns related to unstructured problem domeins. Examples include finite element methods, computational fluid dynamics, and molecular dynamics codes. Such codes are difficult to parallelize efficiently with current HPF compilers. However, most of these problems exhibit spatial locality. This property is exploited by our approach.

In the sequential program, unstructured domains are accessed via indirection arrays. We introduce a new directive that serves to identify directives indirection arrays and the boundaries of the associated domains. The data domains are distributed using Multiple Recursive Decomposition (MRD), a pseudo-regular distribution, which combines efficient implementation with good load balancing and communication behavior. Indirection arrays are aligned with the data arrays. Using the information provided in the directive, the compiler can produce a target program with significantly better performancee than an approach based on indirect distributions and the inspector/executor paradigm.