This paper describes an algorithm for deriving data and computation partitions to improve memory locality on scalable shared memory multiprocessors. The algorithm first determines computation partitions by establishing affinity between where the computations are performed and where the data is located based on data accesses in the program. In the process of deriving the computation partitions, static data partitions are derived for the arrays. These static partitions may not be the best choice for some arrays. The partitioning of these arrays is reconsidered by evaluating the costs of all possible partitions for each such array individually using depth-first search with pruning. The cost of a given data partition is computed using knowledge of cache, local and remote memory access costs as well as the costs of contention and synchronization. Experimental results from a prototype implementation of the algorithm demonstrate that it is computationally efficient and that it is effective in parallelizing standard benchmarks. The results also show the necessity of taking shared memory effects into account.
In this talk, I will provide an overview of the dHPF compiler effort at Rice
University. The project so far has focused on techniques to achieve high
performance for regular HPF applications using a Message Passing Interface
(MPI) communication substrate. In addition to traditional optimizations,
several features distinguish dHPF from previous generations of data-parallel
compilers:
Data distribution is one of the key aspects in the design of a parallelizing
compiler for a distributed-memory architecture to get efficiency from the
system. This paper describes the enhancement of our automatic data distribution
tool in order to generate two-dimensional distributions. Our tool automatically
derives the data mapping for the arrays and the parallelization strategy for
the loops in a Fortran 77 program. The layout generated may be dynamic, and the
distribution fashion is either BLOCK or CYCLIC.
All the information regarding data movement and parallelism is contained
in a single data structure named Communication-Parallelism Graph (CPG).
The CPG is used to model a minimal path problem in which time is the
objective function to minimize, and solved using a general purpose
linear programming solver, which finds the optimal solution for the
whole problem.
The tool actually asumes that the processors grid is two-dimensional,
the topology is fixed, and that no changes will be made during execution
time. This framework is a first step towards a more general case, in which
the processors topology may change along the program execution.
The two major aspects one has to address when automatically
parallelizing loop nests on distributed memory computers are the
maximization of the parallelism and the minimization of the
communication overhead due to non local data accesses. We study the
problem of finding computation mappings and data distributions that
minimize the numbeer of remote data access for a given degree of
parallelism. This problem is known to be NP-hard. The algorithm
presented uses a linear algebra framework and assumes affine data
access functions. It proceeds by enumerating all interesting bases of
the set of vectors representing the alignments between computation and
data accesses. Comparison with related work shows that previous results
are special cases of our formulation. We discuss exact enumeration
techniques as well as heuristics.
In program optimization based on performance prediction, feedback is
either based on a model of the program or on a model of the actually
generated machine code. Especially in the case of a message-passing
architecture, the difference in abstraction is large. In this paper we
study the trade-off between modeling at program level and machine
level in the context of automatic performance optimization for message
-passing architectures. We present a modeling technique based on
modeling the various optimizations in terms of resource contention.
Despite the abstractions involved, we show that program level modeling
is indeed adequate when this technique is used. We demonstrate the
effectiveness of our approach by deriving various optimizations of a
line relaxation kernel for distributed-memory machines.
Space-efficient data structures for sparse matrices are an important
concept in numerical programming because they allow for considerable
savings in space and time compared with common two-dimensional arrays.
Unfortunately, for such programs it is usually impossible to
statically determine all data dependencies. Thus, automatic
parallelization of such codes is usually done at run time by applying
the inspector-executor techniquee, incurring tremendous overhead.
Program comprehension techniques exploit knowledge on frequently
occurring implementation variations of important computations. They
have been shown to improve many important fields of automatic
parallelization of dense matrix computations, such as automatic
program transformation and local algorithm replacement, data flow
analysis, array distribution, and performance prediction.
In this study we investigate up to which deegree this approach could
be generalized to sequential codes implementing sparse matrix
computations, and how much additional information, e.g. on run-time
values of some variables, is needed. We propose a speculative program
comprehension and parallelization method combined with compile-time
program flow analysis techniques. Only where static analysis does not
supply sufficient information, a parallelized run time test must check
wether a code fragment speculatively recognized and parallelized at
compile time as a potential sparse matrix computation, really
implements this computation. Beyond automatic parallelization, program
comprehension for sparse matrix computations may also support program
maintenance and debugging, and could guide the selection of a more
suitable data structure for a sparse matrix.
Data reuse is imperative for good performance on modern shared memory
machines because of the deep hierarchy of the memory architecture of
these machines. However, programs with good data reuse are much more
complex than the usual natural expression of the algorithms they
implement. The dense linear algebra community has addressed this
complexity with the use of libraries such as LAPACK. LAPACK dealing
with only two levels of the memory hierarchy, is quite complex. For
multiple levels of memory hierarchy, the complexity of libraries will
only increase.
Standard compiler technology to address data reuse has been "tiling",
which works quite well for perfectly nested loops. However, for
imperfectly nested loops, either code sinking must be applied to
convert them to perfectly nested loops after which tiling can be
applied, or a sequence of alternate transformations such as index-set
splitting, loop fusion and distribution must be applied in some order
to achieve data locality. In either case, it is not clear how the
compiler can automatically choose the right sequence of
transformations to perform. It is also unclear whether these
approaches work for multiple levels of memory hierarchy.
In previous work, we addressed the data reuse problem by using a novel
approach to program restructuring, which we called data-centric
to contrast it with the traditional approach to restructuring
programs, which is control-centric. Our tools reasoned about the
data and its flow directly instead of dealing with loops, loop
transformations and iteration spaces. In this framework, loops were
simply an implementation of the desired data flow. The key idea was to
partition the data arrays into blocks, walk the blocks in some order
and execute some set of iterations "associated" with that block. By
dealing with data directly, we showed how to seamlessly deal with
perfectly as well as imperfectly nested loops, avoided the
phase-ordering problem inherent in control-centric methods, and
reasoned naturally about data locality for multiple levels of the
memory hierarchy.
One of the limitations of our previous approach was a limitation on
how "walks" on the blocked data could be performed. This led to an
inability to handle certain classes of programs such as relaxation
codes over arrays. In this paper, we develop generalizations of the
mechanisms we developed in our earlier work to remove this limitation.
The memory requirement characteristics of a data layout are particularly
important for applications that are executed on a parallel machine
mainly because of the amount of main memory that the machine provides,
rather than its computation power. It may not be possible to executing
such a memory intensive program on a conventional uniprocessor due to the
lack of the necessary memory resources.
This talk discusses a framework for automatic data
layout that considers read--only data replication and minimizes the
overall execution time under a given memory constraints.
The framework can be used to generate data layout specifications
with additional read--only array copies.
Many applications use arrays that keep an assigned value
over large portions of the program. In such read--only regions, multiple
read--only copies of the array, each copy with a different data layout,
may avoid otherwise necessary communication. As a result, the
overall execution time can be significantly reduced. However, multiple
copies will increase the memory requirements of the program.
The presented approach addresses the necessary tradeoff
decisions between read--only replication and memory requirements in a
new, unified framework that extends our previous framework for
automatic data layout. As in our previous work, the data layout selection
problem is formulated as an efficient 0--1 integer programming problem.
Preliminary experiments show the performance tradeoffs between
the new and old formulations.
The specification of efficient data distribution schemes is one of the
major tasks in programming DMMPs with state of the art parallel
languages. Although there are no optimal strategies for generating
such data distributions, several heuristics have been developed to
provide some support to the user. Alignment analysis, for instance, is
able to help find a good distribution scheme and is furthermore a
useful preresquisite for automatic data distribution tools. We
presented an overview of an automatic alignment analysis tool within
the framework of VFCS elsewhere, which is able to automatically
generate alignment proposals for the arrays accessed in a proceduree
and thus simplifies the data distribution problem.
In this paper we extend our previous work to interprocedural analysis
taking into account dynamic realignment. This feature is essential for
applying alignment analysis to real programs. We present an heuristic
algorithm for interprocedural dynamic alignment analysis as well as
our cost models based upon sequential profiling data. Furthermore we
show that the locality concept introduced elsewhere for
intradimensional alignment preferences proved to be applicable and
helpful also in this context.
The Polaris restructurer developed at Illinois is a source-to-source
translator of conventional Fortran programs, generating parallel codes
for various types of multiprocessor systems. In order to evaluate the
compiler techniques implemented in Polaris, we apply these techniques
to real sequential programs, experiment the parallel codes on the
target machine, and analyze the performance for each program. Through
the performance analysis, we identify several techniques that would
improve the original performance. Our recent efforts have been focused
on to identify these techniques for Polaris targeting distributed
memory multiprocessors. Some optimization issues, such as reduction
communication overhead and more sophisticated data privatization
techniques have been evaluated.
dHPF: A Platform for Data-Parallel Compiler Research
Vikram Adve
Rice University
Two-dimensional Data Distribution with Constant Topology
Jordi Garcia, Eduard Ayguade and Jesus Labarta
Universitat Politecnica de Catalunya
A General Solution to the Alignment Problem
on Distributed Memory Machines
Claude G. Diderich, Swiss Federal Institute of Technology
Marc Gengler, Ecole Normale Superieure de Lyon, LIP
Modeling trade-offs in automatic performance
optimization for message-passing architectures
Arjan J.C. van Gemund
Delft University of Technology
On the Applicability of Program Comprehension Techniques to the Automatic Parallelization of Sparse Matrix Computations
Christoph Kessler
University of Trier
Data-centric Transformations
Induprakas Kodukula, Nawaaz Ahmed and Keshav Pingali
Cornell University
Automatic Data Layout with Resource Constraints
Uli Kremer
Rutgers University
Interprocedural Array Alignment Analysis
Erwin Laure and Barbara Chapman
University of Vienna
Data Privatization/Localization Method for Automatic Parallelization on DMMs
M.A. Navarro (1), D. Padua (2), Y. Paek (2) and E.L. Zapata (1)
(1) University of Malaga and (2) University of Illinois at Urbana-Champaign