Identification of Scans

4 Identification of Scans

Identification of scans takes place after each phase of normalization: we thus need a way of memorizing the results before starting the next phase. We will first explain our notation for representing scans. Observe that this notation is not to be considered as a kind of function call or language statement. Its conversion to executable code is quite another problem. We will then explain how to extract scans from one-variable recurrences.

4.1 An Operator to Denote Scans

There exists some languages (mostly languages in the area of systolic arrays design) which include primitives for scan denotation. We can cite Alpha and Crystal (cf. [12]). Other formalisms like Lacs [17] or Pei [4] can also describe scans in an elegant way. Some imperative languages have primitives for scans. A recent language with scan ability is HPF. It includes reduction primitives à la Fortran 90 such as sum.

The sum of the elements of a vector x can be computed by merely writing sum(x(1:n)). Our Scan operator is more general, and hence more complex:

Scan( i | 0 <= i <= n , ( [1] ), +, x, 0)

Let us explain the meaning of this notation. The binary operation to be applied (here, addition) is given as a parameter. The first two parameters (the accumulation domain and the direction) state that the reduction applies to the elements of x from subscript 1 to n with a step of 1. Subscript 0 is used for storing the initial value (the last parameter). Moreover the Scan operator computes the whole set of partial results. To extract the last value of the resulting vector, we use subscripting:

Scan( i | 0 <= i <= n , ( [1] ), +, x, 0) [n]

The HPF reduction primitives can use very complex patterns which are described by indirection arrays. This power of expression is interesting but the use of indirection arrays is not very convenient, since they have to be initialized beforehand. We favor a more synthetic information even if we lose some expressive power (for example, HPF allows the computation of reductions along circles, but that is not used very often in scientific programs). Our solution is to use a single vector to describe the reduction path. Two points in the scan domain are in the same path (and contribute to the same result) if one of them can be reached from the other by iterated translation along the direction vector. Points which cannot be reached in this way from other points in the domain are given initial values.

Returning to the sum example, we easily see that there is only one path which include all points in the domain. Furthermore, point 0 cannot be reached by translation from other points, hence it is given the initial value, 0. The value of any other point, says i ³ 1, is obtained by applying + to the value of its predecessor, i-1 and to the local value, x_i.

In opposition to the reduction primitive, HPF is more restrictive concerning the scan primitives because it doesn't allow the use of indirection arrays; a scan is only possible along one dimension of the original array. Since our first transformation for program normalization uses the expansion of variables, we must deal with full scans, so we need more than a reduction operator. Hence we consider that the Scan operator compute an array of full scans; its result is of the same shape as the accumulation domain minus the initial values domain. To obtain a reduction one must reference the adequate element in the result array.

Our Scan operator has been enhanced in two aspects. Firstly, while we consider only single variable recurrences, these may have to be converted to vector or matrix scans in order to show associativity and hence parallelism. An example will be given later in Sect. 4.2.2. As a consequence, the data of a scan may be a complex object, and its operator a complex operator.

Secondly, in order to deal with scans computed along a piecewise rectilinear path, we have to introduce multi-directional scans in Sect. 5.

4.2 Scan Recognition

Suppose now that the result of the program normalization step is one (or more) self referencing equations:

i Î D : x_i= Exp(x_d(i), ...) (2)

where the ellipsis denotes occurrences of other variables and Exp is an arbitrary expression, which may contains several several occurrences of x with distinct dependence functions. Equation (3) is tractable if all references to x are of the form x_i-d where d is a unique integral vector. If this condition is fulfilled, d is the direction of the scan. The domain of the scan is the domain of i, and the initial data is found from other, non recursive clauses in the system. It remains to find the operator of the scan and its data. This process will be presented here for the case d = 1. The generalization is just a matter of notations.

All expressions which are handled by a compiler are terms on a system of operators W, and a system of basic terms A. Here the basic terms are x_i-1, the other variables, and the constants of our language --- integers, reals, truth values, and so on. To each operator w is associated an arity: an integer denoted as ¶(w). The rules for constructing legal terms are three:

A basic term is a term.
If t₁, ..., t_n are terms, and if w is an operator with ¶(w) = n, then w(t₁, ..., t_n) is a term. w is the head of this new term, and t₁, ..., t_n are its arguments.
There are no other terms.

It is not even necessary that W and A be finite. All we need are well defined procedures for recognizing a basic term, recognizing an operator, and computing its arity.

The set of operators depends on the underlying programming language. It may include arithmetic operators (+, -, *, /, ...), Boolean operators, elementary functions (sin, cos, log, exp, ...), comparison operators, and so on. Legal terms must also conform to type rules. We will suppose that these rules have been checked by a first pass of the compiler. It is an easy matter to verify that the system of recurrence equations associated to a well typed program is well typed, and stays well typed if subjected to substitution.

4.2.1 Simple pattern matching

Let t be the term which is associated to Exp in (3). The simplest possibility is to analyze the head of t, and its arguments. Two trivial cases must be detected first. t may simply be x_i-1. In that case, the recurrence is a value propagation, and its solution is x_i = x₀. Similarly, if x_i does not occur in t, the solution is x_i = t. In both these situations, the recurrence has a trivial parallelization.

In other cases, we must check that the head of t is binary, and that, of its arguments, one is x_i-1 and there is no occurrence of x_i-1 in the other one. The head of t must belong to a list of tractable operators. This is the simplest form of pattern matching.

The disadvantage of simple pattern matching is that it fails as soon as t becomes too complicated. For instance, it cannot do anything with t = (1 + x_i-1) + a_i. On the other hand, adding new operators is easy. If we devise the proper data structures, it can even be done without recompiling the detector program, simply by reading a ``rule file''.

4.2.2 Partial Normalization

The solution for finding the Scan operator in the recurrence

x_i=(1+x_i-1)+a_i

is simply to rearrange t as x_i-1 + (1 + a_i) by using associativity and commutativity of +. Properties like associativity, commutativity, and many others can be expressed as equational axioms as in:

x+(y+z) = (x+y)+z .

Such axioms, when completed by the usual properties of equality give the set of terms the structure of an equational theory. Such a theory has a normal form if there exists a function N from terms to terms such that:

x = N(x)

and

x = y Þ N(x) º N(y) ,

where º denotes structural identity (two terms are structurally identical if they are constructed in the same way from the same basic terms). Since identity can always be checked mechanically, an equational theory which has a normal form is decidable provided that N is computable. Since we know that there are undecidable equational theories, we deduce that some axiom systems admit no computable normal forms.

In the case of addition, for example, we can obtain a normal form by sorting addends according to an arbitrary order in which constant terms come last. The normal form is obtained from the sorted expression by reducing the constants according to the rules of arithmetics.

When submitted to pattern matching, equal terms should give ``equivalent'' results. One way of guaranteeing this property is to normalize the given term before applying pattern matching. If we are clever enough, we can even define the normal form in such a way that pattern matching is simplified. For instance, in the additive example, we can select the ordering in such a way that variable x_i-1 comes first.

Unfortunately, most interesting theories do not have a normalization algorithm. A heuristic solution is to use partial normalization, i.e. normalization which takes into account only a subset of the operators and axioms of the theory.

Let O Ì W and let t be a term. The O-skeleton of t is a construction of t in which only operators from O are used, and in which basic terms are terms from A or terms whose head does not come from O. Partial normalization can handle only terms in whose skeleton the elementary terms either are x_i-1 or do not include x_i-1. Let us suppose that O has a normalization procedure. We can then normalize the skeleton of t, handling its elementary terms as baggage, and perform pattern matching on the resulting normal form.

Let us suppose that W is the set of operators in Fortran, and that x_i is a logical variable. Let us take the set of Boolean operators { Ù, Ú, ¬} as O. Viewing a Boolean expression as a tree, its O-skeleton is the ``upper part'' of the tree, from the root to the first occurences of a variable or of a comparison operator. Furthermore, since Fortran has no conditional expression and since the 0/1 convention of C does not hold, it is unlikely that a Boolean variable may occur beyond a comparaison operator (one would have to use a user defined function). Hence, Boolean expression have a high probability of being tractable.

Boolean algebra has, in fact, several normal forms. Conjunctive normal forms (CNF) can be used for detecting and-reductions, and disjunctive normal forms (DNF) for or-reductions.

Let us consider the following somewhat contrived recurrence:

x_i = (x_i-1 Ú (a_i ³ b_i)) Ù (x_i-1 Ú (a_i < b_i)) .

The Boolean skeleton of the right hand side is:
t = (x_i-1 Ú a_i) Ù (x_i-1 Ú b_i) ,
where a_i = a_i ³ b_i and b_i = a_i < b_i. The heads of these terms are not Boolean operators. t is in DNF, and x_i-1 occurs twice, hence our recurrence is not an and-reduction. The CNF of t is
t = x_i-1 Ú (a_i Ù b_i) ,
which is an or-reduction. With a more powerfull normalization system, we would have noticed that a_i Ù b_i = false, and hence that the recurrence has the closed form solution x_i = x₀. Such a normalization system must have the same power as linear programming; its construction is probably very difficult.

4.2.3 Marshalling Associative Operators

We still have to construct a list of associative operators. Some of them are well known: +, *, max, min, Ù, Ú, etc. Is there a more systematic procedure?

Let us put the basic equation (3) into the form:

x_i = f_i(x_i-1). (3)

This can be rewritten [5] as:

g₀	=	l y. y,	(4)
g_i	=	f_i ° g_i-1,	(5)
x_i	=	g_i(x₀).	(6)

Since ° is associative, we may hope to compute the g_i's by a scan, and then to compute all x_i's in parallel. But this is a mere formal manipulation, without any practical interest, unless the complexity of g_i stays bounded as the computation proceeds. In practical terms, this means that the functions f_i must belong to a family which can be described by a few parameters, and that this family must be closed under function composition.

As an exemple, consider the case where f_i is a polynomial in x_i. Since the composition of a polynomial of degree m with a polynomial of degree n is of degree mn, the family is closed only for the case n £ 1, m £ 1.

Let us take as O the set {+, -, *} with the usual properties: associativity and commutativity of + and *, the rule of signs, distributivity of * with respect to +, and the familiar rules of arithmetic (2 + 2 = 4 and so on). If t is tractable for this set of operators, its normal form is a polynomial in x_i-1 whose coefficients are combinations of its elementary terms. We conclude that a recurrence x_i = P(x_i-1), is a scan only if polynomial P is of first degree:
x_i = a_i x_i-1 + b_i . (7)
There are two simple cases : a_i = 1, which gives a sum, and b_i = 0, which gives a product. In the general case, there are many ways of extracting a_i and b_i. Since practical normalization procedures are not always up to their mathematical specification, we may have to resort to ``mathematical pattern matching''. For instance, in the linear case, we can set b_i = t[x_i-1 ¬ 0], where e[x ¬ f] is the substitution of f for x in e, and a_i = t[x_i-1 ¬ 1] - b_i. The normalization system is then used to prove that t - a_i x_i-1 - b_i º 0, a much simpler proposition.

Once we know a_i and b_i, recurrence (8) can be written in matrix form:

æ
è

x_i

1
ö
ø = æ
è

a_i b_i

0 1
ö
ø æ
è

x_i-1

1
ö
ø ,
and hence is equivalent to the computation of matrix products, which are associative.

Another example is the family of homographic functions:

Hom_abcd(x) =

ax + b

cx + d

which is closed under function composition, It so happens, in fact, that if we associate to Hom_abcd the matrix (

a	b
c	d

) then function composition for the Hom family is associated to the matrix product. A mathematical pattern matching routine can be devised for Hom provided our underlying computer algebra system is able to normalize rational fractions. It would be interesting to systematically explore closed families of functions, since each one is the basis of an efficient parallel algorithm. The study of [1] may be a step in that direction.

4.2.4 Recurrences on finite domains

Let us suppose that the variable x_i in (4) has a finite domain. It is clear that functions from a finite set to the same finite set are closed under composition. As a consequence, a recurrence on a finite domain is always a scan.

A function f on a finite domain { a₁, ..., a_n} can be defined by a table of values { f(a₁), ..., f(a_n) }. Computing f ° g simply consists in the computation of { f(g(a₁)), ... f(g(a_n)) }. This can be done from the tables of f and g, and is necessarily an associative operation, as it is a representation of function composition. Hence the algorithm for computing a scan (4) when x_i has a finite domain is:

Compute the tables associated to each f_i.
Combine these tables by any reduction scheme, obtaining the tables for the functions g_i of (6).
Compute the results of the scan in parallel by:
x_i = g_i(x₀) .

It is interesting to compare the parallel and sequential scheme for complexity. Let us suppose that the domain of the variables (x_i) has n elements, that the recurrence has m steps, and that our parallel computer has P processors. Let us take the time for a table access as the unit. Let us suppose that the functions f_i are initially given as tables. In that case the sequential time is simply m. The elementary step of the scan is an algorithm for computing the composition of two tables f1 and f2 giving f3:

    do i = 1,n
      f3(i) = f2(f1(i))
    end do

which takes 3n table accesses. The total time for the reduction is thus 3n(m/P + log₂ P). The last step takes m/P times unit, for a total of 3n(m/P + log₂ P) + m/P. Supposing that m is large enough that we can neglect additional terms, we have to compare m to (3n+1)m/P. The conclusion is that the method is advantageous provided that the size of the finite domain is small compared to the number of processors.

4.2.5 Tactics

As the reader may have noticed, the methods we have presented are overlapping. The same Scan operator may be recognized by elementary pattern matching, or only after normalization, or as a computation on a finite domain. For example, a boolean recurrence can be recognized by pattern matching if its operator is Ù or Ú, or as a recurrence over a finite domain if not. Our proposal is first to distinguish cases according to the type of x_i. To each type is associated one special purpose normalisation system. Pattern matching is then applied to the result of the normalization. If this fails, and if it can be proved that x_i has a finite domain, then the general method of Sect. 4.2.4, which is less efficient, can be applied.

Consider the recurrence:
x_i = (x_i-1 +1)/3 .
Suppose first that x_i is a so-called real (i.e., belongs to a subset of the rationals). Then, up to rounding errors, division distributes into addition, and the recurrence can be normalized, with the permission of the user, into:
x_i = 0.333... x_i-1 + 0.333... ,
which is linear, hence, a scan. If x_i is integral, then the above normalization is no longer valid, and the original recurrence is not a scan. Lastly, if the recurrence is slightly modified into:
x_i = [(x_i-1 +1)/3] mod N ,
where N is a constant integer, then x_i belongs to the finite set [0, N-1] and the recurrence is again a scan. This observation is usefull only if N is much smaller than the number of processors.

Note that scan detection can benefit from information obtained by a static analysis of the program. For instance, an expression which is apparently a second degree polynomial may be proved linear by a constant propagation which shows that the second degree coefficient is zero. A variable which is declared an integer may be found to have a finite domain by interval analysis. Conversely, the result of a scan detection may enable the computation of the DFG of an hitherto intractable part of a program, by exhibiting inductive variables or giving in closed form the values of the entries in an array which is used as a subscript. The organization of such iterative analyses is a very difficult problem.