
4   Identification of Scans

After each phase of program normalization, the detection of interesting recurrences can take place. We are interested in trivial recurrences (whose solutions can be computed at compilation time) and in scans. Scans are recurrences that can be computed from a set of data by applying a binary associative function to this set. Thanks to the associativity of the function, the computations can be reordered to extract parallelism. A full scan computes a whole set of values; there is a special family of scans which return only one value: the reductions.

4.1   An Operator to Denote Scans

Before detecting scans we need a notation for them. There exist some languages (mostly languages in the area of systolic array design) which include primitives for scan computation. We can cite Alpha and Crystal (cf. [10]). Other formalisms like Lacs [14] or Pei [3] can also describe scans in an elegant way. Some imperative languages have primitives for scans. A recent language with scan abilities is HPF. It includes reduction primitives à la Fortran 90 such as sum.
The sum of the elements of a vector v can be computed by the mere expression sum(v(1:n)). With our operator the syntax is a bit more complex:
Scan({ i | 0 <= i <= n }, ( [1] ), +, v, 0)
This operator is described in more detail later, but one can note that the binary operation to be applied (here the + operation) is given as a parameter. The first two parameters (the accumulation domain and the direction) state that the reduction takes into account the elements of v from subscript 1 to n with a step of 1. Subscript 0 is for the initial value, which is the last parameter. Moreover, the Scan operator computes the whole set of partial results. In the example, only the last value of the resulting vector is needed:
Scan({ i | 0 <= i <= n }, ( [1] ), +, v, 0) [n]  
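To make these semantics concrete, here is a minimal Python sketch; the helper name scan_1d and its calling convention are ours, not part of the operator's definition. It keeps all partial results, the reduction being the last one:

```python
# Hypothetical helper illustrating a one-directional Scan:
# it returns the whole vector of partial results, so the
# reduction is just the last element, as in Scan(...)[n].
def scan_1d(op, v, init):
    """Partial results [init, op(init, v1), op(op(init, v1), v2), ...]."""
    out = [init]
    for x in v:
        out.append(op(out[-1], x))
    return out

partials = scan_1d(lambda a, b: a + b, [3, 1, 4, 1, 5], 0)
total = partials[-1]   # the reduction: 3+1+4+1+5 = 14
```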
HPF also includes more general reduction primitives with names like xxx_SCATTER, and scan primitives are provided under names like xxx_SUFFIX and xxx_PREFIX. The HPF reduction primitives can describe very complex patterns thanks to indirection arrays. This expressive power is interesting, but the use of indirection arrays is not very convenient. We favor a more synthetic description even if we lose some expressive power (for example, HPF allows the computation of reductions along circles, but this is seldom used in scientific programs). Our solution is to associate a direction with a reduction. Note that the Scan operator is easier to use than the HPF primitives, since there is no need to initialize huge indirection arrays: one just gives the scan directions. In contrast to its reduction primitives, HPF is more restrictive concerning scans, because it does not allow the use of indirection arrays there; a scan is only possible along one dimension of the original array.

Since our first transformation for program normalization uses the expansion of variables, we often have to deal with full scans, hence we need more than a reduction operator. We therefore consider that the Scan operator computes an array of full scans; its result has the same shape as the accumulation domain minus the domain of the initial values. To obtain a reduction, one must reference the adequate element of this array.

Our Scan operator has been enhanced to deal with scans computed along a piecewise rectilinear path. These paths are defined by a list of vectors; hence the corresponding scans are called multi-directional scans. The last vector indicates the main direction of the scan; the other vectors are only used to jump from one segment along this direction to the next when a dead end occurs.
The classical example of a multi-directional scan is the sum of the elements of a matrix:
s=0
DO i=1,n
 DO j=1,n
   s=s+a(i,j)
 END DO
END DO
In the equivalent system of equations, the piecewise rectilinear path appears clearly:
v2[i,j] =
   case
    { i,j | i=1, j=1 }        : a[i,j] ;
    { i,j | 2<=i<=n, j=1 }    : v2[i-1,n] + a[i,j] ;
    { i,j | 1<=i<=n, 2<=j<=n }: v2[i,j-1] + a[i,j] ;
   esac ;
This path may be defined by two vectors (the two directions of the multi-directional scan): [1 0] and [0 1]. The second direction is used until the last element of a row is reached, and then the first direction is used to skip to the next row, which must be scanned from beginning to end, and so on. Since the path traverses the whole array, only one scan is defined here (and not an array of scans). Hence the domain of the initial value is the single point (1,1). To summarize, the Scan expression that computes the sum of the elements of a matrix a is:
Scan( { i,j | 1<=i<=n, 1<=j<=n }, ( [1 0] [0 1] ), +, a, a[1,1])
At this point we can give a more formal definition of the Scan operator. First we need to define a variant of the lexicographic minimum. Let D be a convex subset of Z^p and let the ei be vectors of the same space Z^p.

min_D^{e1,...,ek}(z) = z - Σ_{i=1}^{k} ni·ei   where
(n1,...,nk) = lexmax{ (µ1,...,µk) ∈ N^k | z - Σ_{i=1}^{k} µi·ei ∈ D },

(the lexmax function represents the classical lexicographic maximum). A maximum is defined in the same way:

max_D^{e1,...,ek}(z) = z + Σ_{i=1}^{k} ni·ei   where
(n1,...,nk) = lexmax{ (µ1,...,µk) ∈ N^k | z + Σ_{i=1}^{k} µi·ei ∈ D }.

When one of these lexicographic maxima does not exist, the value of the corresponding extremum (min_D^{e1,...,ek}(z) or max_D^{e1,...,ek}(z)) is said to be ⊥ (for instance, this may happen when z is not in the convex D). In this context, the general Scan operator
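For intuition, the displaced minimum can be computed by brute force when D is finite; the sketch below is our own code (the finiteness of D and the search bound are assumptions made here so we can search exhaustively), and None plays the role of ⊥:

```python
from itertools import product

def scan_min(D, dirs, z, bound=20):
    """z - sum(ni*ei) for the lexicographically maximal (n1,...,nk)
    in N^k that keeps the displaced point inside D; returns None
    (standing for 'bot') when no such tuple exists."""
    k = len(dirs)
    dim = len(z)
    best = None
    for ns in product(range(bound), repeat=k):
        p = tuple(z[d] - sum(n * e[d] for n, e in zip(ns, dirs))
                  for d in range(dim))
        # tuple comparison in Python is lexicographic
        if p in D and (best is None or ns > best):
            best = ns
    if best is None:
        return None
    return tuple(z[d] - sum(n * e[d] for n, e in zip(best, dirs))
                 for d in range(dim))

# 2x2 domain scanned with directions (1,0) then (0,1):
D = {(i, j) for i in (1, 2) for j in (1, 2)}
origin = scan_min(D, [(1, 0), (0, 1)], (2, 2))   # the path origin (1, 1)
```

With only the trailing direction (0,1), scan_min returns the start of z's row, which is how the case analysis below distinguishes row starts from interior points.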

Scan( D, (ei) i∈{1,...,k}, b, d, g )

computes a multi-dimensional array v with the same shape as D. The values of v are given by the following definition:

∀ z ∈ D, v(z) =
   if z = min_D^{e1,...,ek}(z)      : g(z)
   else if z = min_D^{e2,...,ek}(z) : f(z, v(max_D^{e2,...,ek}(z - e1)))
   ...
   else if z = min_D^{ek}(z)        : f(z, v(max_D^{ek}(z - ek-1)))
   else                             : f(z, v(z - ek)).

The function f used in the definition is:
f = λ z x . b(x, d(z)),
where b is the binary operation of the scan and d its data. In a few words, this definition means that the propagation function f is applied following a precise path in the accumulation domain D. The path is determined by the directions of the scan, (ei) i∈{1,...,k}. Now, the domain Dg of the initial value can be defined as:

Dg = min_D^{e1,...,ek}( D ).

4.2   Scan Recognition

At this point, we are given a self-referencing equation:
z ∈ D : v(z) = Exp(v(I(z)), ...)     (2)
where the ellipsis denotes occurrences of other variables. In fact, there may be several occurrences of v. The first step is to examine all these occurrences. Equation (2) is tractable if all references to v are of the form v(z-d) where d is a unique integral vector. If this condition is fulfilled, d is the direction of the scan. It remains to find the operator of the scan and its data. This process will be presented here for the case d = 1. The generalization is just a matter of notation.

All expressions handled by a compiler are terms over a system of operators Ω and a system of basic terms A. Here the basic terms are vi-1, the other variables, and the constants of our language (integers, reals, Booleans, and so on). To each operator ω is associated an integer called its arity. The rules for constructing terms are three:
  1. A basic term is a term.
  2. If t1, ..., tn are terms, and if ω is an operator of arity n, then ω(t1, ..., tn) is a term; ω is the head of this new term, and t1, ..., tn are its arguments.
  3. There are no other terms.
It is not even necessary that Ω and A be finite. All we need are well-defined procedures for recognizing a basic term, recognizing an operator, and computing its arity.

A construction is a list of terms t1, ..., tm such that each ti is either a basic term, or is constructed by rule 2 from terms occurring earlier than ti in the list. We usually say that such a list is a construction of its last term.

The set of operators depends on the underlying programming language. It may include arithmetic operators (+, -, *, /, ...), Boolean operators, elementary functions (sin, cos, log, exp, ...), comparison operators, and so on. Well formed terms must also conform to type rules. We will suppose that these rules have been checked by a first pass of the compiler. It is an easy matter to verify that the system of recurrence equations associated to a well typed program is well typed, and stays well typed if subjected to forward substitution.

4.2.1   Simple pattern matching

Let t be the term associated to Exp in (2). The simplest possibility is to analyze the head of t and its arguments. Two trivial cases must be examined first. t may simply be vi-1; the recurrence is then a value propagation, and its solution is vi = v0. Similarly, if vi-1 does not occur in t, the solution is vi = t. In both of these situations, the recurrence has a trivial parallelization.

In the other cases, we must check that the head of t is binary, and that, of its arguments, one is vi-1 and there is no occurrence of vi-1 in the other one. The head of t must belong to a list of tractable operators. This is the simplest form of pattern matching.
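A sketch of such a matcher in Python, with terms encoded as nested tuples and the token 'v' standing for vi-1 (both encodings are our own, not the paper's):

```python
# Operators we accept as heads of a scan (an illustrative list).
ASSOCIATIVE = {'+', '*', 'and', 'or', 'min', 'max'}

def occurs(t, x):
    """Does x occur anywhere inside term t?"""
    if t == x:
        return True
    return isinstance(t, tuple) and any(occurs(a, x) for a in t[1:])

def match_scan(t):
    """Return (operator, data term) if t = op(v, d) or op(d, v) with
    op associative and no occurrence of v inside d, else None."""
    if not (isinstance(t, tuple) and len(t) == 3):
        return None          # head is not a binary operator
    op, left, right = t
    if op not in ASSOCIATIVE:
        return None
    if left == 'v' and not occurs(right, 'v'):
        return (op, right)
    if right == 'v' and not occurs(left, 'v'):
        return (op, left)
    return None

# v_i = v_{i-1} + a_i is recognized ...
ok = match_scan(('+', 'v', 'a'))
# ... but t = (1 + v_{i-1}) + a_i defeats simple matching:
ko = match_scan(('+', ('+', 1, 'v'), 'a'))
```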

The disadvantage of simple pattern matching is that it fails as soon as t becomes too complicated. For instance, it cannot do anything with t = (1 + vi-1) + ai. On the other hand, adding new operators is easy. It is interesting at this point to ask how to decide that an operator is associative. The following procedure seems the most natural one.

Let us put the basic equation (2) into the form:
vi = fi(vi-1).     (3)
Since function composition is always associative, this can be rewritten [4] as:

g0 = l x. x,     (4)
gi = fi ∘ gi-1,     (5)
vi = gi(v0).     (6)

Since ∘ is associative, we may hope to compute the gi by a scan, and then to compute all the vi in parallel. But this is a mere formal manipulation, without any practical interest, unless the complexity of gi stays bounded as the computation proceeds. In practical terms, this means that the functions fi must belong to a family which can be described by a few parameters, and that this family must be closed under function composition.

As an example, consider the case where fi is a polynomial. Since the composition of a polynomial of degree m with a polynomial of degree n has degree mn, the family is closed only in the case m ≤ 1, n ≤ 1.

Another example is the family of homographic functions:
fabcd(x) = (ax + b) / (cx + d),
which is closed under function composition, with interesting applications to tridiagonal solvers. It so happens, in fact, that if we associate to fabcd the matrix ( a b ; c d ), then function composition is associated to the matrix product.
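This correspondence is easy to check numerically; the sketch below (illustrative code of ours) composes two homographic functions and compares the result with the product of their matrices:

```python
def homographic(a, b, c, d):
    """The function x -> (ax + b)/(cx + d)."""
    return lambda x: (a * x + b) / (c * x + d)

def matmul2(m, n):
    """Product of two 2x2 matrices given as pairs of rows."""
    (a, b), (c, d) = m
    (e, f), (g, h) = n
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))

f = homographic(1, 2, 3, 4)      # f(x) = (x + 2)/(3x + 4)
g = homographic(5, 6, 7, 8)      # g(x) = (5x + 6)/(7x + 8)
(a, b), (c, d) = matmul2(((1, 2), (3, 4)), ((5, 6), (7, 8)))
h = homographic(a, b, c, d)
# f(g(x)) == h(x) wherever the denominators are nonzero
```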

It would be interesting to systematically explore closed families of functions, since each one is the basis of an efficient parallel algorithm.

4.2.2   Partial Normalization

The solution for finding the Scan operator in the recurrence
vi = (1 + vi-1) + ai
is simply to rearrange t as vi-1 + (1 + ai), using the associativity and commutativity of +. What we would like to have is a normalization algorithm which would convert all equal terms to the same normal form. Pattern matching would then be done on this normal form. Unfortunately, such a normalization algorithm does not exist as soon as Ω becomes complicated enough. A heuristic solution is to use partial normalization algorithms, i.e. normalizations which take into account only a subset of Ω.

Let O ⊂ Ω and let t be a term. The O-skeleton of t is a construction of t in which only operators from O are used, and in which the basic terms are terms from A or terms whose heads do not come from O. Partial normalization can handle only terms in whose skeleton the elementary terms either are vi-1 or do not include vi-1. Let us suppose that O has a normalization procedure. We can then normalize the skeleton of t, treating its elementary terms as baggage, and perform pattern matching on the resulting normal form.

Let us take as O the set {+, -, *} with the usual properties: associativity and commutativity of + and *, the rule of signs, distributivity of * with respect to +, and the familiar rules of arithmetic (2 + 2 = 4 and so on). If t is tractable for this set of operators, its normal form is a polynomial in vi-1 whose coefficients are combinations of its elementary terms. As we have seen earlier, such a polynomial is associated to a scan if it is linear:
vi = ai vi-1 + bi .
There are two simple cases: ai = 1, which gives a sum, and bi = 0, which gives a product. In the general case, we may compose two linear functions and show that the result is still linear:
a(cx + d) + b = (ac) x + (ad + b) .
This gives the associative companion function:

lin( (a, b), (c, d) ) = (ac, ad + b).
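A sketch of lin at work (our own code): because lin is associative, the coefficient pairs (ai, bi) can be combined in any bracketing, and applying the combined pair to v0 reproduces the sequential result:

```python
from functools import reduce

def lin(p, q):
    """Companion function: composing f(x)=ax+b after g(x)=cx+d
    gives a(cx+d)+b = (ac)x + (ad+b)."""
    a, b = p
    c, d = q
    return (a * c, a * d + b)

steps = [(2, 1), (3, 0), (1, 5)]   # v1=2*v0+1, v2=3*v1, v3=v2+5

# Sequential evaluation of the recurrence...
v = 1                              # v0 = 1
for a, b in steps:
    v = a * v + b

# ...agrees with the single combined function g3 = f3 o f2 o f1.
# Note the order: reversed(), since lin composes its first argument
# *after* its second.
a, b = reduce(lin, reversed(steps))
assert a * 1 + b == v
```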

In fact, since linear functions are a subfamily of the homographic functions which is closed under composition, this is a special case of an earlier observation.

4.2.3   Recurrences on finite domains

Let us suppose that the variable vi in (3) has a finite domain. It is clear that functions on a finite domain form a closed family. As a consequence, the recurrence (3) is always a scan in that case.

A function f on a finite domain { a1, ..., an } can be defined by a table of values { f(a1), ..., f(an) }. Computing f ∘ g simply consists in computing { f(g(a1)), ..., f(g(an)) }. This can be done from the tables of f and g, and is necessarily an associative operation. Hence the algorithm for computing the scan (3) when vi has a finite domain is:
  1. Compute the tables associated to each fi.
  2. Combine these tables by any tree reduction scheme to get the functions gi of (5).
  3. The results of the scan are given by:
    vi = gi(v0) .
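The three steps can be sketched in Python (our own code; accumulate stands in here for the tree reduction scheme, which is legitimate since any bracketing gives the same tables):

```python
from itertools import accumulate

def compose(f2, f1):
    """Table of f2 o f1 over the finite domain {0, ..., n-1}."""
    return tuple(f2[f1[i]] for i in range(len(f1)))

# Step 1: each fi as a table; Boolean domain {0, 1}, fi = (fi(0), fi(1))
fs = [(0, 1), (1, 1), (0, 0), (1, 0)]

# Step 2: gi = fi o g_{i-1}, combined by (here sequential) accumulation
gs = list(accumulate(fs, lambda g, f: compose(f, g)))

# Step 3: vi = gi(v0)
v0 = 0
vs = [g[v0] for g in gs]
```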
It is interesting to compare the parallel and sequential schemes for complexity. Let us suppose that the domain of the variables vi has n elements, that the recurrence has m steps, and that our parallel computer has P processors. Let us take the time for a table access as the unit, and suppose that the functions fi are initially given as tables. In that case the sequential time is simply m. The elementary step of the scan is an algorithm for computing the composition of two tables f1 and f2, giving f3:
    do i = 1,n
      f3(i) = f2(f1(i))
    end do
which takes 2n table accesses. The total time for the reduction is thus 2n(m/P + log2 P). The last step takes m/P time units, for a total of 2nm/P + 2n log2 P + m/P. In the case of a reduction, the last step takes only time 1, and we have to compare m to 2nm/P + 2n log2 P + 1. In both cases, and supposing that m is large enough that we can neglect the other terms, we have to compare m to 2mn/P. The conclusion is that the method is advantageous provided that the size of the finite domain is small compared to the number of processors.
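A quick numerical check of this comparison, with illustrative values of n, m and P (the numbers are ours, chosen only to show both regimes):

```python
from math import log2

def parallel_cost(n, m, P):
    """Parallel scan cost in table-access units: 2nm/P + 2n*log2(P) + m/P."""
    return 2 * n * m / P + 2 * n * log2(P) + m / P

m = 10**6
# Boolean domain (n = 2) on P = 64 processors: 2n << P, a clear win
boolean = parallel_cost(2, m, 64)    # 78149 accesses vs m = 10^6
# n = 64 on the same machine: 2n = 2P, slower than sequential
large = parallel_cost(64, m, 64)     # about 2.02 * 10^6 accesses
```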

In the case of a Boolean function, there are two truth values, which may be represented as usual by 0 and 1. The table associated to fi is [fi(0), fi(1)], and the composition operator is given by the algorithm above. It may be interesting to recognize special cases: if ∀ i : fi(0) = 0, then the scan is a conjunction, while if ∀ i : fi(1) = 1 it is a disjunction. This is worthwhile since most computers have very fast, hard-wired and and or operators. Similarly, if fi(0) = fi(1), then the scan is simply an assignment, while if fi(0) = 0 and fi(1) = 1, it is a propagation of v0. Note that these properties must be detected by a symbolic calculation at compile time; verifying them at run time is too late for reorganizing the program.
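These special cases can be detected directly on the tables; a sketch follows (the classifier name and the ordering of its tests are our own choices, with the identity checked first since it also satisfies fi(0) = 0 and fi(1) = 1):

```python
def classify(tables):
    """Classify a Boolean scan from its tables (fi(0), fi(1))."""
    if all(t == (0, 1) for t in tables):
        return 'propagation'            # every fi is the identity
    if all(t[0] == t[1] for t in tables):
        return 'assignment'             # fi ignores its argument
    if all(t[0] == 0 for t in tables):
        return 'conjunction'            # fi(0) = 0: an AND scan
    if all(t[1] == 1 for t in tables):
        return 'disjunction'            # fi(1) = 1: an OR scan
    return 'general'

kind = classify([(0, 0), (0, 1)])       # fi(0) = 0 throughout
```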

4.2.4   A possible organization for a scan recognizer

As the reader may have noticed, the methods we have presented overlap. The same Scan operator may be recognized by elementary pattern matching, or only after normalization, or as a computation on a finite domain. It seems best to apply the simplest recognizer (i.e. pattern matching) first. In case of failure, one distinguishes several cases according to the type of the scan variable: there should be a recognizer for numerical variables, another one for Booleans, and finally one for other finite types (like subranges in Pascal and enums in C).
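The proposed organization can be sketched as a dispatcher (all recognizer functions here are hypothetical placeholders passed in as parameters, since the paper does not fix their interfaces):

```python
def recognize_scan(recurrence, var_type,
                   simple_match, numeric_rec, boolean_rec, finite_rec):
    """Try the cheap recognizer first, then dispatch on the type
    of the scan variable; None means no scan was recognized."""
    result = simple_match(recurrence)
    if result is not None:
        return result
    if var_type == 'numeric':
        return numeric_rec(recurrence)   # partial normalization
    if var_type == 'boolean':
        return boolean_rec(recurrence)   # truth-table scan
    if var_type == 'finite':
        return finite_rec(recurrence)    # general finite domain
    return None
```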

Note that scan recognition can benefit from information obtained by static analysis of the program. For instance, an expression which is apparently a second degree polynomial may be proved linear by constant propagation. A variable which is declared as an integer may be found to have a finite domain by interval analysis. Conversely, the result of scan detection may enable the computation of the DFG of a hitherto intractable part of a program, by exhibiting induction variables or giving the values of the entries of an array which is used as a subscript. The organization of such iterative analyses is a very difficult problem.

