4 Identification of Scans
Identification of scans takes place after each phase of normalization:
we thus need a way of memorizing the results before starting the next
phase. We will first explain our notation for representing
scans. Observe that this notation is not to be considered as a kind of
function call or language statement. Its conversion to executable code
is quite another problem. We will then explain how to extract scans
from one-variable recurrences.
4.1 An Operator to Denote Scans
There exists
some languages (mostly languages in the area of systolic arrays design)
which include primitives for scan denotation. We can cite Alpha
and Crystal (cf. [12]). Other formalisms like
Lacs [17] or Pei [4]
can also describe scans in an elegant way.
Some imperative languages have primitives for scans. A
recent language with scan ability is HPF.
It includes reduction primitives à la Fortran 90 such as sum.
The sum of the elements of a vector x can be computed by
merely writing sum(x(1:n)). Our Scan operator is more
general, and hence more complex:
Scan( i | 0 <= i <= n , ( [1] ), +, x, 0)
Let us explain the meaning of this notation. The binary operation to
be applied (here, addition) is given as a parameter. The
first two parameters (the accumulation domain and the direction) state
that the reduction applies to the elements of x from
subscript 1 to n with a step of 1. Subscript 0
is used for storing the initial value (the last parameter). Moreover
the Scan operator computes the whole set of partial results. To
extract the last value of the resulting vector, we use subscripting:
Scan( i | 0 <= i <= n , ( [1] ), +, x, 0) [n]
The HPF reduction primitives can use very complex patterns which are
described by indirection arrays. This power of expression is
interesting but the use of indirection arrays is not very convenient,
since they have to be initialized beforehand. We favor a more
synthetic information even if we lose some expressive power (for
example, HPF allows the computation of reductions along circles, but
that is not used very often in scientific programs). Our solution is
to use a single vector to describe the reduction path. Two points in
the scan domain are in the same path (and contribute to the same
result) if one of them can be reached from the other by iterated
translation along the direction vector. Points which cannot be reached
in this way from other points in the domain are given initial
values.
Returning to the sum example, we easily see that there is only one
path which include all points in the domain. Furthermore, point 0
cannot be reached by translation from other points,
hence it is given the initial value, 0. The value of any other point,
says i ³ 1, is obtained by applying + to the value of its
predecessor, i-1 and to the local value, xi.
In opposition to the reduction primitive, HPF is more restrictive concerning
the scan primitives because it doesn't allow the use of indirection arrays;
a scan is only possible along one dimension of the original array.
Since our first transformation for program normalization uses the expansion
of variables, we must deal with full scans, so we need more
than a reduction operator. Hence we consider that the Scan operator
compute an array of full scans; its result is of the same shape as the
accumulation domain minus the initial values domain. To obtain a
reduction one must reference the adequate element in the result array.
Our Scan operator has been enhanced in two aspects.
Firstly, while we consider only single variable recurrences, these may
have to be converted to vector or matrix scans in order to show
associativity and hence parallelism. An example will be given later in
Sect. 4.2.2. As a consequence, the data of a
scan may be a complex object, and its operator a complex operator.
Secondly, in order to deal with scans computed along
a piecewise rectilinear path, we have to introduce multi-directional
scans in Sect. 5.
4.2 Scan Recognition
Suppose now that the result of the program normalization step is one (or more)
self referencing equations:
i Î D :
xi= Exp(
xd(i), ...)
(2)
where the ellipsis denotes occurrences of other variables and Exp is
an arbitrary expression, which may contains several several
occurrences of x with distinct dependence functions.
Equation (3) is tractable if all
references to x are of the form xi-d where d is a
unique integral vector.
If this condition is fulfilled, d is the direction of the scan.
The domain of the scan is the domain of i, and the initial data is
found from other, non recursive clauses in the system.
It remains to find the operator of the scan and its data. This process
will be presented here for the case d = 1.
The generalization is just a matter of notations.
All expressions which are handled by a compiler are terms
on a system of operators W, and a system of basic terms
A. Here the basic terms are xi-1, the other variables,
and the constants of our language --- integers, reals, truth values,
and so on. To each operator w is associated an arity: an
integer denoted as ¶(w). The rules for constructing
legal terms are three:
-
A basic term is a term.
- If t1, ..., tn are terms, and if w is an operator
with ¶(w) = n, then w(t1, ..., tn) is a term.
w is the head of this new term, and t1, ..., tn are
its arguments.
- There are no other terms.
It is not even necessary that W and A be finite. All we need are
well defined procedures for recognizing a basic term, recognizing
an operator, and computing its arity.
The set of operators depends on the underlying programming language. It
may include arithmetic operators (+, -, *, /, ...), Boolean
operators, elementary functions (sin, cos, log, exp, ...),
comparison operators, and so on. Legal terms must also conform to
type rules. We will suppose that these rules have been checked by a first
pass of the compiler. It is an easy matter to verify that the system of
recurrence equations associated to a well typed program is well typed, and
stays well typed if subjected to substitution.
4.2.1 Simple pattern matching
Let t be the term which is associated to Exp in (3).
The simplest possibility is to analyze the head of t, and its
arguments. Two trivial cases must be detected first. t may simply
be xi-1. In that case, the recurrence is a value propagation,
and its solution is xi = x0. Similarly, if xi does not occur
in t, the solution is xi = t. In both these situations, the
recurrence has a trivial parallelization.
In other cases, we must check that the head of t is binary, and
that, of its arguments, one is xi-1 and there is no occurrence of
xi-1 in the other one. The head of t must belong to a list of
tractable operators. This is the simplest form of pattern
matching.
The disadvantage of simple pattern matching is that it fails as soon
as t becomes too complicated. For instance, it cannot do
anything with t = (1 + xi-1) + ai.
On the other hand, adding new operators is easy. If we devise the
proper data structures, it can even be done without recompiling the
detector program, simply by reading a ``rule file''.
4.2.2 Partial Normalization
The solution for finding the Scan operator in the recurrence
xi=(1+xi-1)+ai
is simply to rearrange t as xi-1 + (1 + ai) by using associativity and
commutativity of +. Properties like associativity, commutativity,
and many others can be expressed as equational axioms as in:
x+(y+z) = (x+y)+z .
Such axioms, when completed by the usual properties of equality give
the set of terms the structure of an
equational theory. Such a theory has a normal form if there exists a
function N from terms to terms such that:
x = N(x)
and
x = y Þ N(x) º N(y) ,
where º denotes structural identity (two terms are structurally
identical if they are constructed in the same way from the same basic
terms). Since identity can always be checked mechanically, an
equational theory which has a normal form is decidable provided that
N is computable. Since we know that there are undecidable
equational theories, we deduce that some axiom systems admit no computable
normal forms.
In the case of addition, for example, we can obtain a normal form by
sorting addends according to an arbitrary order in which constant
terms come last. The normal form is obtained from the sorted
expression by reducing the constants according to the rules of
arithmetics.
When submitted to pattern matching, equal terms should give
``equivalent'' results. One way of guaranteeing this property is to
normalize the given term before applying pattern matching. If we are
clever enough, we can even define the normal form in such a way that
pattern matching is simplified. For instance, in the additive example,
we can select the ordering in such a way that variable xi-1 comes
first.
Unfortunately, most interesting theories do not have a normalization
algorithm. A heuristic solution is to use partial
normalization, i.e. normalization which takes into account only a
subset of the operators and axioms of the theory.
Let O Ì W and let t be a term. The O-skeleton
of t is a construction of t in which only operators from O are
used, and in which basic terms are terms from A or terms whose
head does not come from O. Partial normalization can handle only terms
in whose skeleton the elementary terms either are xi-1 or do not
include xi-1. Let us suppose that O has a normalization
procedure. We can then normalize the skeleton of t, handling its
elementary terms as baggage, and perform pattern matching on the
resulting normal form.
Let us suppose that W is the set of operators in Fortran, and
that xi is a logical variable. Let us take the set of Boolean
operators { Ù, Ú, ¬} as O. Viewing a Boolean expression
as a tree, its O-skeleton is the ``upper part'' of the tree, from
the root to the first occurences of a variable or of a comparison
operator. Furthermore, since Fortran has no conditional expression and
since the 0/1 convention of C does not hold, it is unlikely that a
Boolean variable may occur beyond a comparaison operator (one would
have to use a user defined function). Hence, Boolean expression have a
high probability of being tractable.
Boolean algebra has, in fact, several normal forms. Conjunctive normal
forms (CNF) can be used for detecting and-reductions, and disjunctive normal
forms (DNF) for or-reductions.
Let us consider the following somewhat contrived recurrence:
xi = (xi-1 Ú (ai ³ bi)) Ù (xi-1 Ú
(ai < bi)) .
The Boolean skeleton of the right hand side is:
t = (xi-1 Ú ai) Ù (xi-1 Ú bi) ,
where ai = ai ³ bi and bi = ai < bi.
The heads of these
terms are not Boolean operators. t is in DNF, and xi-1 occurs
twice, hence our recurrence is not an and-reduction. The CNF of t is
t = xi-1 Ú (ai Ù bi) ,
which is an or-reduction. With a more powerfull normalization system,
we would have noticed that ai Ù bi = false, and hence
that the recurrence has the closed form solution
xi = x0.
Such a normalization system must have the same power as linear
programming; its construction is probably very difficult.
4.2.3 Marshalling Associative Operators
We still have to construct a list of associative operators. Some of
them are well known: +, *, max, min, Ù, Ú, etc. Is
there a more systematic procedure?
Let us put the basic equation (3) into the form:
This can be rewritten [5] as:
|
g0 |
= |
l y. y, |
(4) |
|
gi |
= |
fi ° gi-1, |
(5) |
|
xi |
= |
gi(x0). |
(6) |
Since ° is associative, we may hope to compute the
gi's by a scan, and then to compute all xi's
in parallel. But this is a mere formal manipulation, without
any practical interest, unless the complexity of gi stays bounded
as the computation proceeds. In practical terms, this means that
the functions fi must belong to a family which
can be described by a few parameters, and that this family must
be closed under function composition.
As an exemple, consider the case where fi is a polynomial in
xi. Since the composition of a polynomial of degree m with
a polynomial of degree n is of degree mn, the family is closed
only for the case n £ 1, m £ 1.
Let us take as O the set {+, -, *} with the usual properties:
associativity and commutativity of + and *, the rule of signs,
distributivity of * with respect to +, and the familiar rules of
arithmetic (2 + 2 = 4 and so on). If t is tractable for this set of
operators, its normal form is a polynomial in xi-1 whose
coefficients are combinations of its elementary terms. We conclude
that a recurrence xi = P(xi-1), is a scan only if polynomial
P is of first degree:
There are two simple cases : ai = 1, which gives a sum, and bi =
0, which gives a product. In the general case, there are many ways of
extracting ai and bi. Since
practical normalization procedures are not always up to their mathematical
specification, we may have to resort to ``mathematical pattern
matching''. For instance, in the linear case, we can set
bi = t[xi-1 ¬ 0], where e[x ¬ f] is
the substitution of f for x in e, and
ai = t[xi-1 ¬ 1] - bi. The normalization system is
then used to prove that t - ai xi-1 - bi º 0, a much
simpler proposition.
Once we know ai and bi, recurrence (8) can be written in
matrix form:
and hence is equivalent to the computation of matrix products, which
are associative.
Another example is the family of homographic functions:
which is closed under function composition, It so happens, in fact, that if
we associate to Homabcd the matrix
( )
then function composition for the Hom family is associated to the
matrix product. A mathematical pattern matching routine can be devised
for Hom provided our underlying computer algebra system is able to
normalize rational fractions.
It would be interesting to systematically explore closed families of
functions, since each one is the basis of an efficient parallel algorithm.
The study of [1] may be a step in that direction.
4.2.4 Recurrences on finite domains
Let us suppose that the variable xi in (4)
has a finite domain. It is clear that functions from a finite
set to the same finite set are closed under composition. As a
consequence, a recurrence on a finite domain is always a scan.
A function f on a finite domain { a1, ..., an} can
be defined by a table of values { f(a1), ..., f(an) }.
Computing f ° g simply consists in the computation
of { f(g(a1)), ... f(g(an)) }. This can be done
from the tables of f and g, and is necessarily an associative
operation, as it is a representation of function composition. Hence
the algorithm for computing a scan (4) when
xi has a finite domain is:
-
Compute the tables associated to each fi.
- Combine these tables by any reduction scheme, obtaining the
tables for the functions gi of (6).
- Compute the results of the scan in parallel by:
xi = gi(x0) .
It is interesting to compare the parallel and sequential scheme
for complexity. Let us suppose that the domain of the variables (xi)
has n elements, that the recurrence has m steps, and that
our parallel computer has P processors. Let us take the time
for a table access as the unit. Let us suppose that the functions
fi are initially given as tables. In that case the
sequential time is simply m. The elementary step of the scan
is an algorithm for computing the composition of two tables f1
and f2 giving f3:
do i = 1,n
f3(i) = f2(f1(i))
end do
which takes 3n table accesses. The total time for the reduction is thus
3n(m/P + log2 P). The last step takes m/P times unit,
for a total of 3n(m/P + log2 P) + m/P.
Supposing that m is large enough that we can neglect
additional terms, we have to compare m to (3n+1)m/P. The
conclusion is that the method is advantageous provided that the size
of the finite domain is small compared to the number of processors.
4.2.5 Tactics
As the reader may have noticed, the methods we have presented are
overlapping. The same Scan operator may be recognized by elementary
pattern matching, or only after normalization, or as a computation
on a finite domain. For example, a boolean recurrence can be recognized by
pattern matching if its operator is Ù or Ú, or as a recurrence over
a finite domain if not. Our proposal is first to distinguish cases
according to the type of xi. To each type is associated one special
purpose normalisation system. Pattern matching is then applied to the
result of the normalization. If this fails, and if it can be proved
that xi has a finite domain, then the general method of
Sect. 4.2.4, which is less efficient, can be applied.
Consider the recurrence:
xi = (xi-1 +1)/3 .
Suppose first that xi is a so-called real (i.e., belongs to a
subset of the rationals). Then,
up to rounding errors, division distributes into addition, and the
recurrence can be normalized, with the permission of the user, into:
xi = 0.333... xi-1 + 0.333... ,
which is linear, hence, a scan. If xi is integral, then the above
normalization is no longer valid, and the original recurrence is not
a scan. Lastly, if the recurrence is slightly modified into:
xi = [(xi-1 +1)/3] mod N ,
where N is a constant integer, then xi belongs to the finite set
[0, N-1] and the recurrence is again a scan. This observation is
usefull only if N is much smaller than the number of processors.
Note that scan detection can benefit from information obtained by a
static analysis of the program. For instance, an expression which is
apparently a second degree polynomial may be proved linear by
a constant propagation which shows that the second degree coefficient
is zero. A variable which is declared an integer may be found
to have a finite domain by interval analysis. Conversely, the result of a
scan detection may enable the computation of the DFG of an hitherto
intractable part of a program, by exhibiting inductive variables or
giving in closed form the values of the entries in an array which is
used as a subscript. The organization of such iterative analyses is a
very difficult problem.