UNIVERSITY OF ILLINOIS LIBRARY Digitized by the Internet Archive in 2013 http://archive.org/details/controlstructure973panz / I !-*•/ r / 'jiLr Report No. UIUCDCS-R-79-973 opy >2 UILU-ENG 79 1724 CONTROL STRUCTURES FOR PARALLEL MACHINES by The ^b,^, 'fVof tffg Frank Dagen Panzica May 1979 NSF-0CA-MCS76-81 686-000042 Report No. UIUCDCS-R-79-973 CONTROL STRUCTURES FOR PARALLEL MACHINES by Frank Dagen Panzica May 1979 Department of Computer Science University of Illinois at Urbana-Champaign Urbana, Illinois 61801 This work was supported in part by the National Science Foundation under Grant No. US NSF MCS76-81686 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, May 1979. Ill ACKNOWLEDGMENT I wish to thank Duncan Lawrie for his help in catching 'off-by-one' errors ana for his guidance throughout this project. I would also like to thank John Wawrzynek for his suggestions and for the many ' bells-and-whistles' added to his program, which made my work easier. IV TABLE OF CONTENTS CHAPTER 1 1 INTRODUCTION 1 CHAPTER 2 6 CONTROL STRUCTURES 6 2.1 Introduction to Control Structures 6 2.1.1 Definition of Control Structures 6 2.1.2 Naming Conventions for Control Structures 8 2.2 Tradeoffs When Choosing Control Structures .... 13 2.3 A Canonical Set of Control Structures 19 2.3.1 Introduction 19 2.3.2 The Canonical Set '. . . 21 2.3.3 Sufficiency of the Canonical Set 25 2.4 Additional Control Structures 29 CHAPTER 3 34 RECURRENCE ALGORITHMS 34 3.1 Tutorial Example (MVMULT) 34 3.1.1 Introduction to MVMULT 34 3.1.2 Defining Parameters 35 3.1.3 The Algorithm 39 3.1.4 Completing the Algorithm 42 3.2 Multiple Column Sweeping 45 3.2.1 Introduction 45 3.2.2 Defining the Algorithm 47 3.2.3 Implementing the Algorithm 49 3.3 The CCLGL Algorithm 53 3.3.1 Introduction 53 3.3.2 Defining CCLGL . 54 3.3.3 Implementing the Algorithm 61 CHAPTER 4 65 STATISTICS 65 4.1 Introduction 65 4.2 Occurances of Control Structures in Algorithms 66 4.2.1 Introduction 66 4.2.2 CCLGL 67 4.2.3 ALGA 7 4.2.4 ALGB 70 4.2.5 ALGC 71 4.3 The Effect of Sneak Paths 72 4.4 The Slide Overlap Option 77 4.5 Comparing Algorithms with Various Resource Times 81 4.6 Comparing Actual and Theoretical Time Bounds .. 92 4.7 Future Studies 95 CHAPTER 5 96 CONCLUSION 96 REFERENCES 98 APPENDIX A 100 APPENDIX B 106 CHAPTER 1 INTRODUCTION Recurrences have long been known to exist in scientific Fortran programs. Algorithms for solving these recurrences have also been developed. These algorithms are designed to solve a given recurrence in the least time. In most cases the number of processors, p, needed for the least time solution is found by counting the number of processor steps needed in the operation requiring the largest number of these processor steps. Several solutions have also been developed that assume a given number of processors are available and then adapt an algorithm to never require more than this number. Both of these solutions are fine in the theoretical sense, however, other considerations must be explored before these algorithms can be useful in speeding up the execution time of a program. The purpose of this thesis is to adapt these solutions to a generalized parallel machine. The first problem encountered involves the scheduling of these algorithms. Memory conflicts, alignment network usage, device overlap, as well as processor utilization must be considered. The algorithms mentioned above have been judged strictly by the number of processor steps needed to complete the algorithm. Here a processor step may be any arithmetic operation, such as addition, multiplication, or division, regardless of differences in complexity and execution time. This approach fails to consider other resources an3 the interconnection between resources used when solving the recurrence. These resources must be considered if a recurrence algorithm is to be used on an actual machine. Several other parameters that are fixed by a given architecture must also be considered and will be mentioned later in this introduction. The second major problem with the algorithms as previously developed is the assumption of an unlimited number of available processors. In some cases this problem has been alleviated by the suggestion that if less processors exist, folding takes place. This assumption, in fact, seriously alters the times associated with each algorithm, and must be carefully considered when choosing the "best time." This thesis discusses the algorithms currently considered to be "best." These algorithms will be redefined to specifically consider the resource scheduling problems mentioned above. Machine architecture parameters such as sneak paths, register usage, and resource times will be considered. Finally, the advantages of representing recurrence algorithms as a combination of control structures will be discussed, and possible extension of this idea to micro-coded machine instructions will be mentioned. This paper will be concerned with a generalized machine architecture. This means that no assumptions related to the number of memories, the alignment network(s), the number of processors, or the times associated with these resources will be made. This was done in order to study the effects of altering various architecure parameters. Because of this decision, the interaction between resources must be considered dynamic. Overlap between resources cannot be determined for a given algorithm without considering the times for each resource. This problem was solved by the development of a program FAPGEN [11 . Using this program, overlap can be dynamically calculated and algorithms can be studied on the higher level of resource usage without being concerned with the specific times of each resource. While defining the theory of control structures, it became apparent that there exists the possibility of defining machine instructions to execute these control structures. This approach has numerous advantages in terms of efficiency and simplicity when actually executing these algorithms on a parallel machine. More will be mentioned concerning machine instructions, throughout this thesis, whenever applicable. Chapter 2 will discuss the fundamentals behind the use of control structures. Structures will be defined and a naming scheme will be explained. h canonical set of control structures will be diplayed to allow for flexibility in handling future algorithms. Next, a set of control structures occurring in the algorithms currently under study will be enumerated. Finally, future considerations for altering this set will be discussed. Chapter 3 deals with the use of these control structures in recurrence algorithms. First a tutorial example will be presented to illustrate the general idea. Next, several algorithms will be discussed in detail. Chapter 4 presents dynamic data concerning individual control structures. This data will be sampled with different machine architecture parameters and resource times. Both the number of repetitions of each control structure and the individual occurrences of each structure will be tabulated. Finally, structure usage will be considered on the basis of the likelihood of each algorithm being used for a given set of recurrence bounds; i.e., algorithms requiring the least time will be considered. CHAPTER 2 CONTROL STRUCTURES 2.1 Introduction to Control Structures 2.1.1 Definition of Control Structures The term control structure will be used to refer to a block of resource requests that are in some way linked together. There is no required restriction on the number or type of resources that can be linked together into one structure, however, for the purposes of this thesis, memory access, alignment network usage and the processor are the only resources considered. Each node in a control structure represents one resource request for each processor/memory unit. Thus, if there are four processors available, a multiply will be done in all four processors simultaneously, but will be signified in a control structure by a single II ir •> • Control structures have been designed to display resource requests in a tree-like representation. The links between nodes of this tree imply a data or control dependency between the resources represented by the respective nodes. Each structure contains two or more resources and is intended to include as many related resource requests as possible. A related resource request is present when the data used, i.e. fetched, aligned, added, etc., by one resource is then used by the succeeding nodes linked to it. Upper bounds on the number of resource nodes included in any one structure are determined by the largest iteration independent group of resource requests (see section 2.2). For the algorithms studied so far (see App.B) , thirteen is the maximum number of resource nodes included in any one structure. The control structures mentioned here and throughout the rest of this thesis are closly related to potential hardware or firmware implementation as micro-coded machine instructions. With this in mind, a generalized computer organization is assumed. This organization is a circular arrangement of the resources: memory - alignment network - processor - alignment network. (Note: there may be either a single alignment or two alignment networks) Provisions for sneak paths have been made, but sneak paths will be considered additional features of a system, not an inherent feature of the control structure. Thus, if the alignment network can be bypassed, assuming the appropriate sneak path is available, the feature is noted, but the control structure will still contain the additional resource node (see section 2.1.2). Throughout this thesis, control structures will be assumed to consist of the following resource nodes: memory access (fetch and store), alignment network(s) usage, and processor usage. The general format for control structures allows for inclusion of additional resource nodes if the need or desire exists. Generally, however, including resources other than those mentioned above is both awkward and potentially invalid. Register usage will not be specifically mentioned, however conservative estimates on the number of registers available have been adhered to consistantly . In general, three registers will be assumed, two for inputs into a processor, and one for the output from a processor. Any case requiring an unusual number of registers will be noted 2.1.2 Naming Conventions for Control Structures One of the primary purposes behind the control structures expounded in this thesis is to facilitate the incorporation of new algorithms at the machine instruction level. The control structures proposed here (see sections 2.3 and 2.4) have been found to be useful building blocks for describing and formalizing new recurrence algorithms. Developing a concise and unambiguous naming scheme was quickly found necessary. First a definition of terms: resource node - an elementary component of a control structure. Each node corresponds to one resource request (per processor) . F - memory fetch (uses the same resource specified by an S) . S - memory store (see F) . A - alignment network usage. Refers to both the input alignment network and the output network, if both exist. If an A-node is proceeded by an F-node or followed by a P-type node, it is assumed to refer 10 to the input alignment network (when one exists) . If an A-node is proceeded by a P-type node or followed by an S-node then it is assumed to refer to the output alignment network. P (or P-type) - resource node referring to a generalized processor step. Also used for null processor steps needed to allow for a generalized machine architecture. + - * / - addition, subtraction, multiplication and division. These are processor steps. U- - unary minus (as opposed to subtraction) . ? - nodes having this designation may be skipped if the appropriate 11 sneak path exists. - designates a backward data flow. The output of a node followed by this symbol is used as an input in a preceeding node. The naming of control structures consists of combining the terms defined above with each term corresponding to a resource node with an optional '?' added to designate a possible sneak path, or a '<' added to designate a backward dependency. The beginning of each name includes a count of the number of resource nodes in the control structure. Thus a typical structure could be referred to as '7FAFA+A7S'. This name refers to a structure having seven nodes. Two pieces of data are fetched from the memory and aligned into a processor (FAFA) . An addition takes place (+) , followed by aligning the processor output and then storing the answer in memory (AS) . The final align may be skipped if a sneak path, between the output of the processor and the memory, exists (A?). (Note: the node is included in the original count (7) whether or not the sneak path exists) The use of '<' can best be explained with an example. 12 A simple control structure "2A+" can be used to add two numbers. By adding '<', we have "2A+<" which can be thought of as a chain sum. On the first iteration, both inputs to the processor come from registers. In every iteration after the first one, one input is the output of the last iteration. The '<• appears immediately after the processor step that has as one of its inputs the output of a previous iteration. This notation is not completely unambigious, however some concessions must be made for managability . As with '?', '<• does not add to the resource count when naming the control structure. For strictly linear structures no ambiguity exists. However, for complex structures a more detailed convention is needed. Whenever possible, an alignment is associated with the corresponding memory access. Thus 'FA' refers to fetching and aligning the same piece of date, likewise 'AS' refers to the movement of the same data. Also, all inputs to a processor must be mentioned before the processor step, thus 'FAFA+' is used, not 'FA+FA'. Furthermore, a '?' always immediately follows the node which may be skipped due to sneak paths. Finally, a generalized machine architecture is assumed, thus a reference such as 'FAS' is illegal, since this implies bypassing the processor. The operation of 13 fetching, aligning and then storing data must be represented by 'FAP?A?S' (see section 2.2.1), where the 'P' refers to a null processor step and both the 'P' and the second 'A' may be skipped if sneak paths exist. 2.2 Tradeoffs When Choosing Control Structures A major consideration when presenting these control structures is their potential use as machine instructions. This results from the ability to decompose existing recurrence algorithms into a relatively small number of distinct control structures, with the potential speed-up of executing machine instructions as opposed to individual resource requests. The ease of expressing recurrence algorithms, as well as standard serial and vector operations, using control structures, and the ability to extend this set of control structures will also be considered (see section 2.3). Control structures throughout this thesis are purposefully left as general as possible relative to any specific machine. Because of this decision, several pairs of control structures may exist that differ only in that some resources may be skipped due to sneak paths present in one structure, but absent in the other. For a specific machine these structures could be one machine instruction 14 with the appropriate resources optionally included or excluded from execution. ?^nother consideration concerns identical control structures with different processor steps; i.e., "7FAFA+^S" and "7FAFA*A3". Here, once again, the two structures will be considered distinct throughout this thesis, except for defining the canonical set. As machine instructions, however, these structures may be distinct, or may be combined (with appropriate selection control) depending on the choosen architecture. Control structures become more efficient as they include more and more resource requests. Often there are several options available for representing a particular portion of an algorithm. Fig. 2-1 and fig. 2-2 show two possible representations of a vector add. In fig. 2-1 the vector add has been executed using three control structures, "FA", "FA", "+AS" and in fig. 2-2 only one control structure is needed. Two cases will illustrate the advantage of the approach used in fig. 2-2 as opposed to the approach used in fig. 2-1. If a structure (machine instruction) is only executed once, little is saved by the second approach. However, even in this case there is some potential savings. If the vector 15 Do i = 1 DOLOOP Do j = 1 , n Do i = 1 Do i = 1 S (fig. 2-1) 16 p ^ © \ ^ ^ © Do i = 1 , n (fig. 2-Z) 17 add is implemented as one machine instruction, the second F can be overlapped with the first align when defining the machine instruction. ?^s three instructions, an outside controller would be needed to overlap the first two instructions. Thus, even in this simple case, using one instruction reduces complexity. Even more savings are apparent when the vector add is executed more than once. As one instruction, overlap can be handled once (when defining the instruction) and thereafter the structure can be executed with maximal overlap. In the "7FAFA+AS" structure, the second F can be overlapped as mentioned above, but the first F can also be overlapped with the align out for all iterations after the first one. Thus, this approach has the advantage of maximizing overlap and requiring the fetching of only one machine instruction. The approach illustrated in fig. 2-1 once again suffers in comparison. Here, three instructions must be fetched for each iteration (they could be stored in a buffer, but this of course requires a sufficiently large buffer to handle the general case) . Overlap must be handled dynamically as mentioned above, and a looping structure must be set up. Thus, there are many advantages to making control structures as inclusive as possible. This philosophy will 18 be followed throughout this thesis, unless specifically noted. The process of determining how many distinct structures are present in a given algorithm is, however, open to modification. Currently, for reasons of flexibility and efficiency, control structures have been made as inclusive as possible. If two structues are executed sequentially, and each is executed only once, then these two structures are combined to form one new control structure. However, some structures formed by this approach are only used infrequently. An example of this is found in BINVERSE which finds the inverse of a lower-triangular, 2x2 matrix. In this algorithm, one structure is only executed if there are two processors available and the matrix is constant coefficient, another structure is executed in all other cases. This specialization can artifically increase the number of unique control structures (machine instructions) that are needed to execute a given set of algorithms. The easiest way to circumvent this is to divide these specialized structures into a set of more widely used structures (a canonical set of structures will be defined in the next section) . Chapter 4 will deal with the problem of deciding when such a decomposition is called for, by studying the 19 frequency of use and the number of repetitions of each structure, and thus deciding which structures are the most "useful". 2.3 A Canonical Set of Control Structures 2.3.1 Introduction Through the efforts of representing recurrence algorithms as a combination of control structures, many such structures have been defined. In addition, structures have been added to represent vector operations. These sources have resulted in approximately fifty defined control structures. While no attempt at claiming this includes all possible structures is .nade, this set has been sufficient for describing at least two new algorithms [2] and [3] which have been proposed after much of the work for this thesis had been completed. Section 2.3.2 will present a subset of the structures already defined which can then be used to form any of the other structures currently defined. This is not intended to replace these compound structures, but rather to present a canonical set for inclusion of future structure once the addition of new structures becomes impossible or extremely difficult; i.e., after the machine has been built. 20 As mentioned above, ideally structures should include as many resource requests as possible. This approach is fine when designing a machine or adding a new algorithm, however it is not always feasible to consider all possible combinations of resource requests. In order to demonstrate the flexibility of the structures defined so far, a canonical set will be defined in the next section. The canonical set, as defined, is not intended to replace larger structures. The purpose of this set is to demonstrate the mathmatical completeness of this set of control structures as a cover of the desired structures. Since this set will be shown to exist, it is possible to define any new structure as a combination of the existing structures. The optimal case is to add a new structure whenever a new structure is needed, but this will not always be practical. The next best approach is to combine structures already defined. Because of the considerations demonstrated in figs. 2-1 and 2-2, it is always desirable to use as few (and as large) structures as possible. For the purpose of defining a canonical set, we will restrict our discussion to control structures currently defined. The set F,A,P,S can, of course, generate any control structure (every control structure is defined as a combination of these resources) . These are not currently 21 defined structures, and will not be considered as a canonical set (see the next section) . The canonical set presented in the next section will be shown to handle all cases and thus replace the need for the individual resource requests mentioned above. Justification of the canonical set will be discussed in section 2.3.3. 2.3.2 The Canonical Set As mentioned earlier, the structures currently defined are designed for a generalized machine. For purposes of defining a canonical set, some distinctions between structures will be ignored. The first simplification that will be made is to ignore sneak paths. Sneak paths result from the physical constraints placed on the hardware of a given machine. When the control structures are thought of as machine instructions, the presence or absence of a sneak path doesn't seriously alter that specific instruction. Instead, sneak path constraints can be thought of as externally controlled and not a fundamental characteristic of the instruction. This view results in considering the two structures "2FA" and "2FA?" as identical. If this is not desired, the canonical set 22 presented can be expanded to include all possible sneak paths by looking at the possible sneak paths present in the canonical set. For the purpose of defining a canonical set, all process operations will be considered generalized processor steps and will be designated by a P. This eliminates any distinction between two structures, based only on a difference in processor steps. Once the canonical set has been defined, this simplification can be reevaluated. If a machine instruction is designed flexibly enough to handle choosing the specific processor step after the general instruction has been choosen, then this canonical set will be sufficient. If this is not the case, then the set can be expanded to include an example with each individual processor operation. Processor steps modified by a "<" will be considered distinct. This relates to the different machine instructions needed to handle this case as opposed to the same control structure without "<", Now we will discuss the set which can cover all defined structures, that is, a canonical set. First, the obvious elements of the canonical set are the smallest structures defined. Since these structures must be covered and aren't 23 comprised of smaller structures, they are included in the canonical set. Once the considerations of the above paragraphs have been applied, four unique two-element structures remain: "FA" "AS" "AP" "AP<" "FA" and "AS" are easily recognized as standard, frequently used control structures. However, "AP" and "AP<" need some explanation. Here the P stands for any processor step (as mentioned above) . Normally processor steps require two inputs, so in these structures, one input is a register that has to be aligned (no F so the data can't come from memory). The other input, for "2AP", can either come directly from a register, or be an element fetched from memory using "FA" which has already been mentioned. For "2AP<" the second input is the output of the processor step. An obvious extension to the above are "FAP" and "FAP<". Here, one input to the processor is fixed as coming from memory, and the other input is optionally from memory or 24 from a register for "FAP", and from the processor, as mentioned above, for "FAP<". These structure also appear in the required set of control structures, so no new structure need be defined. f The only other unique three-element structure is "PAS". The output of the processor in this structure is stored in memory. No restrictions on the inputs to the processor are specified at this stage. Four-element structures provide two additional members to the canonical set. The four-element structures are "FAPP" and "FAPP<". The second processor step in "FAPP" receives one input from the processor operation executed immediately before it, and receives the other input from an unspecified source. In "FAPP<", the other input to the second P is the output from the processor step during the previous iteration. These structures both complete the canonical set and bring up an important question; i.e., if "FAP" and "FAPP" are included, what about "FAPPP", qr "FAPPP<", etc.? This question will be discussed in section 2.3.3. 25 2.3.3 Sufficiency of the Canonical Set First, the sufficiency of the canonical set can be easily shown for the structures defined so far. An example will illustrate this point. A vector add is an elementary operation that can be described using a control structure. As one structure, this operation is simply "FAFA+AS". This can easily be shown to be composed of three elements of the above canonical set, that is, "FA", "FA", and "PAS". As it turns out, this structure can also be represented using "FA", "FAP", and "AS". Finally, the structure can be decomposed using an existing structure not contained in the canonical set as follows: "FA" and "FAPAS". The above illustrates the process necessary to show the sufficiency of this canonical set for representing currently needed structures. Showing the decomposition of the other defined structures will be left for the reader. Now to handle the problem presented in the last section, "Does the set cover all possible structures?" The answer is not really. First, we will show three solutions to this problem, and then the set already described will be defended. Following are the three alternative canonical sets: 26 1) "F' 2) "FA" 3) "FA" AS' 'AS' AP' 'AP' <» M-> / " II n n x' H AP< AP< "PAP" "FAP<" PAS' Note: in sets 1 and 2, the structure "P" covers both the case where P has a backward data dependence, and when it doesn' t . Set 1 represents all resource requests, and thus this set must be a covering for the set of control structures under consideration. The disadvantage of this set is its inconsistancy with the intention behind control structures as mentioned throughout this thesis. Control structures are important for the efficiency associated with combining resource requests. Since none of these elements are currently defined structures, making this a canonical set 27 would require adding these elementary structures. This approach detracts from the desired results, and will thus not be considered further. Set 2 has the advantage of solving the problem of "FAPP" as well as being flexible in allowing for unforseen structures. The disadvantage of this set is its failure to fully utilize the benefits of the control structure concept, since the fourth element of the set, P, reverts back to the idea that each resource request must be handled separately. Also, P, is not currenty used as an individual structure in any of the algorithms under study, thus it must be defined strictly for its use in the canonical set. Set 3 is the canonical set previously defined, minus the eigth element, "FAPP". "FAPP" has been dropped to avoid the question of if "FAPP", why not "FAPPP"? A special case must be defined to represent "FAPP" which we know is a "needed" structure. Refering back to section 2.3.1, it will be remembered that we are ignoring sneak paths. If instead, we consider a sneak path to exist for "AP" (remember "A?P") then "FAPP" can consist of a combination of the two structures "FAP" an "AP", with the sneak path always on in "AP". The same applies for "FAPP<", etc. The above approaches "solve" the problem, but fail to 28 fully appreciate the control structures as presented here. "FAPP" is an extremely important structure (see section 2.4) and should not require a special case to define it. Also, the entire flavor of the high level nature of control structures is jeopardized if "P" is allowed to be a structure. A compromise seems justified. For all structures currently defined, the following set forms a minimal covering. 4) "FA" "AS" 'AP' "AP<" "FAP" PAS" " FAPP" "FAPP<" "FAPPP" is not included because it doesn't come up. The above set has withstood the addition of two new agorithms and several algorithm modifications and is still 29 valid. There appears to be no logical need for structures like "FAPPP". Finally, if such a structure is needed in the future, two solutions exist. One, use an arguement similar to the one used in critiqueing set 3, or two, define a new structure and add it to the canonical set. The first approach preserves the integrity of the given canonical set and can be justified if the new structure is infrequently used, while the second solution is justified if the new structure is used often. If the structure does appear often it is better for it to exist as a single entity, as opposed to a collection of structures, in order to best take advantage of the efficiency of the control structure concept. More remarks will be presented on the issues concerning the last thought in the next section. 2.4 Additional Control Structures As mentioned in the above section, all existing control structures can be formed from a combination of the elements contained in the canonical set. While this is necessary for generality, this idea does not use the fundamental idea behind the control structure concept. This idea is that there are only a limited number of unique control structures found in practice - no matter how complex the application. 30 These structures are bounded by iteration counts and loop limits, not by artifically selected restrictions. The small number of control structures needed reflects the nature of the algorithms and does not reflect the architectural design of a given computer. These facts combine to suggest the expedience of executing some given set of control structures as machine instructions. The previous section presented a set of minimal control structures and demonstrated their universality. This section will present some more complex structures that occur in several algorithm descriptions. Chapter 4 will look at the frequency of use for these control structures as further motivation for extending the idea of control structures to the definition of machine instructions. Several structures seem to play an important role in defining new algorithms. This section will look at two major categories of structures. Using the simplifications mentioned in the last section (no sneak paths and the use of a generalized processor step, P) the two structures discussed here will be "FAFAPAS" and "FAFAPFAPAS" . First, we will discuss "FAFAPAS". This basic structure occurs frequently in various related forms - "FAP", "FAFAP" and "FAPAS", as well as the already mentioned "FAFAPAS". The only difference between these structures is the memory 31 usage. Some of the inputs to the processor come directly from memory (every occurence of "FA") and some from registers previously assigned the desired value. Also, the result of the processor step is either routed to memory ("AS") or temporarily stored in a register. The interest in this structure relates to its ability to represent vector operations. As mentioned before, a vector add (for example) is represented in control structures as "FAFA+AS". Structures of this type, that is, diads, can thus be represented using a single control structure. Diads occur frequently in programs and can be handled by control structures or machine instructions having two inputs. Triads, machine instructions requiring three inputs, also occur frequently in recurrence algorithms and matrix operations, and will be discussed next. The other general structure worthy of note is "FAFAPFAPAS" and its simplified forms - "FAPP", "FAFAPP", and "FAPFAPAS". Once again, the difference between these structures relates to memory access. An interesting feature concerning these structures involves the processor steps involved. In all cases, the first P is a multiply and the second processor step is either an addition or a subtraction. This fact denotes a 32 specific function occuring whenever a member of this group is present. This structure represents the row-column multiplication present in a matrix-vector or a matrix-matrix multiply. The presence of this structure is thus closely related to matrix operations. Even more important, however, is the fact that an algorithm containing matrix operations can easily be expressed at the control structure/machine instruction level by including this structure. The third class of additional structures are all those which include nodes having backward dependencies, i.e., those with "<". These structures are related to the two classes above, except for the data dependency. Generally, these structures occur whenever a structure is executed more than once, and ends with a processor step, as opposed to a store. That is, "FAP" is fine if it is executed only once, but if it is executed more often than that, the output has to go some place useful. The structure "FAP<" then becomes the obvious solution. One more note concerning the existing control structures before we go on to discuss their use in recurrence algorithms. This comment concerns a class of structures not previously mentioned. This class includes all structures explicitly showing more than two inputs into a processor; i.e., "FAFAFAFAP". These structures assign 33 data to only half of the array of processor with any one memory access. Thus, the first "FA" supplies data to half the processors (usually either the left or right half of the processors, or all even processors, etc.). The next "FA" supplies data to the same location in the the other half of the processors, etc. Now we will move on and describe how the control structures defined here can be used to facilitate the definition and execution of recurrence algorithms. Chapter 3 will present some algorithms, and Chapter 4 will look at the frequency of use of these structures in the algorithms when they are being executed. Appendix A gives a pictorial description of all defined structures. 34 CHAPTER 3 RECURRENCE ALGORITHMS 3.1 Tutorial Example (MVMULT) 3.1.1 Introduction to MVMULT For the algorithm discussed here (MVMULT) and all others mentioned in this thesis, some prior familiarity with the algorithm is assumed. The task of representing an algorithm with a limited number of control structures (as proposed in this thesis) must require at least this background proficiency. Continuing beyond this basic understanding, this section will attempt to explain, step by step, a procedure for representing an algorithm using control structures. Matrix-vector multiplication (MVMULT) was chosen as a tutorial example for two primary reasons. One, matrix-vector multiplication is a widely understood algorithm with which the reader is assumed to be familiar. Also, matrix-vector multiplication is an extensively used operation in the parallel recurrence solvers currently under consideration. This fact reflects the desirability of 35 expressing algorithms in matrix notation and should be an important consideration in future algorithms. (Note: see CCLGL for an algorithm almost entirely composed of matrix-vector multiplies) The matrix-vector algorithm used here is designed to utilize the available processors. The algorithm is executed by partitioning the matrix and vector as in fig. 3-1, so that each partition contains P (or less) elements. 3.1.2 Defining Parameters Several factors must be taken into consideration when preparing the formalization of an algorithm. Machine architecture parameters can be known values or left as variables. The first class of parameters to be considered are the high level parameters of the algorithm - in this case, p (the number of processors available), Alpha (one dimension of the matrix) , and Beta (the other dimension of the matrix). For a recurrence algorithm, the important parameters become p, N (the size of the recurrence) , and M (the bandwidth) . Once these bounds have been established, a high level description of the algorithm can be completed. For the example of MVMULT, the following is representative of a 36 A iK V 1 >6 U 1) then the j-loop ("6FAFA*+<") is only done once and the k-loop is 45 never done. By recombining , the algorithm can be written as one structure;o i.e., "8FAFA*+AS". The structure, for this case, can then be done with an iteration of alpha/p and with all "DOLOOPS" eliminated. The complete algorithm appears in fig. 3-5. DOLOOP, MAP, IMAP, ENDO, and DEFINE are functions defined in [1] . These functions take structures and determine execution times. CEILQ is a function that returns the ceiling of the quotient. BND has already been defined, and CL0G2 returns the ceiling of the log base two of the expression. 3.2 Multiple Column Sweeping 3.2.1 Introduction Column Sweeping is a classical algorithm for solving recurrences. This is due in part to its simplicity; i.e., solve for x then substitute and solve for x , etc. This method takes n-1 steps and requires m-1 processors at each step. For P = 0(n) and with a large bandwidth, m, this algorithm is quite efficient. However, since most practical cases involve small values of m, some modification of the algorithm will be needed to handle these cases effectively. TW MVMULT: Proc (Alpha, Beta, P, Jptr) ; Call DOLOOP ( 1 ) ; bound = CEIL (alpha/p) ; temp = bound * beta ; if bound > 1 then call MAP ( ' 8FAFA*+AS' , temp) ; else do ; bound2 = CEIL (beta/ BND ( (p/alpha) ,beta) ; call MAP ( '6FAFA*+<' ,bound2) ; bound3 = CL0G2 (BND ((p/alpha), beta)) ; if bound3> then do ; call DEPEND (6,1) ; call MAP ('2A+<', bound3) ; call DEPEND (2,1) ; end ; else call DEPEND (6,1) ; call IMAP (•2AS') ; end ; call mw (tptr) ; end MVMULT ; (fig. 3-5) 47 3.2.2 Defining the Algorithm Start by looking at the banded matrix (recurrence) as subdivided into smaller matrices, some lower triangular, the A's, and some upper triangular, the B's (see fig. 3-6). Now, instead of just knowing the value of x , as is generally assumed in the classical case, the matrix A is taken as the known value, then the Column Sweeping algorithm can be written in the following matrix notation. (1) X • = f • + B-x . + A.x . ^- JL A A'l 4f At (2) A.x . = f • + B.x . X ^ A' * -* ' (3) x^- = a'! (f . + B.x . ) Assuming the blocksize, Q, is chosen to maximize the use of the processors available (since the final stage requires multipling a QxQ matrix, P = 0(Q*Q) would be reasonable, see the next section), then this calculation will require fewer steps than the classical version. This could considerably speed up the calculation. Unfortunately, equation (3) requires knowing A ^* for every A. This, in general, would require time consuming calculations and defeat the purpose of the modification. However, two special cases do exist. ^8 C ir (XV C C > ti\ c ^ dX (fig. 3-6) 49 If the inverse of A • is known for every i then the problem has been solved. One such case occurs with constant coefficient recurrences. Here all the A's are identical, thus once the initial A • is found, all of the inverses of A are known. The other special case is when A is a 2x2 matrix. Here the inverse of A can be found with minimal effort and thus does not seriously invalidate the modified algorithm. Efforts have been made recently toward handling the variable coefficient case for blocks larger than 2x2. The analysis of this case is similar to the methods described, and thus will not be mentioned further here. In the next section the implementation of the CCLGL algorithm will be discussed, 3.2.3 Implementing the Algorithm For constant coefficient matrices the algorithm is very straightforward (see fig. 3-7). First assume the inverse of A has been found by some means. Then the algorithm becomes simply a call to MVMULT for the calculation B*x^-_,, a call to VVADD (vector-vector add) for f^ + B*x^-_^ and a call to MVMULT for A*. *(f.+ Bx. .). Since B is upper triangular (see 50 n fx l H- tf f. ^i set x^ = ^i " ^*^i-1 ^ A*X. :^ A»x_. = ^i * B'^^i-i ^i -1 = A ■ ♦ (^i * B*x^_,) (fig. 3-7) Q B «/-, (fig. 3-8) 51 fig, 3-8) the first matrix-vector product will be with a (mxm) matrix and a (mxl) vector. The two (Qxl) vectors will then be added and finally a (QxQ) matrix will be multiplied by a (Qxl) vector. Since the size of the recurrence, n, need not be an even multiple of Q, a final stage will be needed to handle the last few rows. Thus the algorithm will sweep FLOOR (n/Q) times followed by a final sweep to pickup the final n - FLOOR (n/Q)*Q rows. For the non-Toeplitz case (see fig. 3-9) , where Q = 2, one additional stage is added. This stage calculates the inverse of a 2x2 matrix. Since this structure hasn't been mentioned yet, some details are in order. To find the inverse of a lower triangular 2x2 matrix one needs the following results: 1/a, 1/d and the product -c/ad. With 3 or more processors one can simultaneously divide 1/a, 1/d, and c/a. Follow this with multiplying c/a * 1/d. Negate the product and the inverse has been found (see fig. 3-10). If a = d (constant coefficient) then 2 processors will suffice. Otherwise, or if p=l, additional division and multiplication steps will have to be added. The classical Column Sweep algorithm will not be discussed here, but note that if p < m, folding must take place (Note: folding throughout this thesis refers to the steps required to execute a segment of code requiring P 53 (fig. 3-9) cu c rl -_£ X (fig. 3-10) 53 processors, when only p processors are availabe and p < P. Folding normally will consist of executing a part of an algorithm CEIL (P/p) -times in order to execute the required number or steps. In some cases, this folding step might also require additional storage to hold temporary variables normally saved in registers) , 3.3 The CCLGL Algorithm 3.3.1 Introduction Recurrence algorithms historically have been designed to solve full or banded recurrences. In practice, recurrences found in FORTRAN are often constant coefficient recurrences. Based on the obvious assumption that constant coefficient systems are less complex, it is justified to assume that algorithms designed specifically for constant coefficient systems will be less involved and quicker in execution time. CCLGL is a modification of the algorithm first proposed in [41. The algorithm, as modified, is applicable for any combination of the bounds on n (recurrence size), m (recurrence bandwidth) , and p (number of processors) , although best performance is obtained with P > n. The algorithm differs from the original in two major respects. 54 One, the storage allocation is less complex, and two, after an appropriate time (see below) the algorithm reverts to multiple column sweeping (see section 3.2). The first change results in fewer memory accesses with a corresponding time savings as well as a decrease in the number of processors needed. The second change is necessitated by noticing that for some values of n, m, and p, the CCLGL algorithm will take more time than the generalized version appearing in [5]. By switching to multiple column sweeping, the timings for this algorithm are consistantly better over a wider variety of parameters. The next sections give a more detailed analysis of this algorithm and an organization on the control structure level will be presented. 3.3.2 Defining CCLGL The following is an algorithm to solve constant coefficient recurrences with a bandwidth, m >= 2 (including the main diagonal). The algorithm will use P > 0(n) and will be refered to as CCLGL. The recurrence we wish to solve can be written as: (1) Tx = f => X = T* f 55 Here T is a lower triangular, Toeplitz (constant coefficient), banded matrix (nxn) , x and f are vectors (nxl) . Since T is constant coefficient, it can be decomposed as follows: T = Two theorems will be useful in solving (1) Theorem 1: -I If T is Toeplitz, then T is Toeplitz This is an immediate consequence of T being lower triangular and constant coefficient. Theorem 2: If T = , then T Here, the L triangles are obvious (remember, L is 56 Toeplitz) . The -LGL section results from considering the matrix operations involved. Remember, we are solving x = T f. Since T* is Toeplitz (constant coefficient) we only need to solve for the first column of T . The system can thus be decomposed as in fig. 3-11. The size of L doubles at each iteration; i.e., -I - _ -• . . •' 'I L. = 2x2 ; L^ = 4x4 ; L^ = 8x8 etc. The following equations result: ^x = LVG,L;f,+ L,f^ < = ^f, = -L;'G^L-;f^ + L^f/ etc. The general form becomes: (i > 1 and n = 2 ) / -I -• -I / -♦ x. = -L.. G., L.. f.+ L.f. = L. f' (nxl) (nxn) (nxn) (nxn) (nxl) (nxn) (nxl) (nxn) (nxl) 57 1. I H (fig. 3-11) 58 Since X . = L. f . , 4,-1 X'l * -I -I ' -L G . X . + L. f . Thus the solution resolves to 3 matrix-vector multiplies and one vector-vector add. Note: the temptation 'I ' exists to simultaneously multiply G.x.. and L-f. , however, this requires twice as many processors and most importantly, creates potential memory conflicts by requiring access to 4 arrays simultaneously. Thus, this slight (potential) speed-up would not be constructive. Finding L requires a simple 2x2 matrix inversion (even easier since L is Toeplitz) , Also, since L is Toeplitz, we only need the first column and thus L can be considered an (nxl) vector in the following calculations. The algorithm reduces to: I Find the inverse of L II III DO i = 2 , log n 'I L.e £py/fi •L. G. L. *0^m> Jier^m, e. = L. e -L. G. L. e *-» A't x-l I 59 -I -• . 4-/ . The size of L. G- L. e is 2 . Since T has a bandwidth 4,~l A,->t 4>'l I of m, eventually G^*., becomes upper triangular. The premultiplication by L^-.^ can also be reduced, in this case to a (2'*''x m) matrix times a (mxl) vector. Thus the algorithm requires the following matrix-vector products: (mxm) * (mxl) (2*''xm) * (mxl) (2'*''x2''''') * (2'''xl) The most processor intensive step is the final calculation of L"'(GL') which requires n*m/2 processors. The final stage (III) requires P = 0(n*n) for optimal time, but of course can be folded. For large n, the parallel time can exceed the serial time, thus a modification is needed. By looking at the T matrix as decomposed into two kinds of regions (A and B) (see fig. 3-6), the algorithm can be rewritten as: Stage I a) Solve Xj = fj + A^ x^ (using the above algorithm) .« b) Form X, = L f, 60 Stage II DO i = 2, CEIL (n/Q) X • = f • + Bx . . + Ax • => X'= L (f. + Bx.) ^ A, A-> End Stage III B is upper triangular and A is (QxQ) For optimal performance, Q should be choosen as large as possible. This keeps the algorithm in Stage I for the longest time. On the other hand, Stage I becomes inefficient when L is no longer reasonably full (Q > m) and when the individual calculations in Stage I require more processors than are available (p < Q*Q) . Stage II is just a block version of the classical column sweep algorithm which was discussed in the previous section. To summarize the algorithm description, find values for L , where L is a (2*x2'') matrix, until the operations become 61 inefficient, then switch to multiple column sweeping to complete the solution. The next section will discuss translating the algorithm into a description involving control structures. 3.3.3 Implementing the ?^lgorithm Certain functions are used quite frequently by the algorithms discussed in this thesis. To avoid duplication and reduce unneeded complexity, MVMULT, CLSWP (both discussed earlier) and VVADD (vector-vector add) will be considered macros when defining the CCLGL algorithm. One final macro, BINVERS will also be used. This macro will refer to the calculation of the inverse of a banded constant coefficient 2x2 matrix, as discussed in the previous section. For completeness, INVERSE will invert a general 2x2 matrix. Due to the definitions of the above macro's and the matrix representation of the CCLGL algorithm, a pictorial representation of the algorithm will be on a macro level (see fig. 3-12) . Since all control structures used in this algorithm have already been defined in the above macros, no new 62 structures need be added. The algorithm can be directly . read from fig. 3-12 and is shown in fig. 3-13. This algorithm thus illustrates the ease of adding additional algorithms once a set of control structures and macros has been defined. INVERSE (compute the inverse of a lower-triangular, 2x2, constant coefficient matrix ) J DO '^ILE EFFICIENT (GL-^ matrix-vector multiplication L"''(GL-1) matrix-vector multiplication DO UNTIL DONE multiple column sweep 63 (fig. 3-12) 6^ CGLGL: Proc (N, M, P, Tptr) ; call DOLOOP (1) ; Q = COMPUTEQ (N, M, P) ; /• finds 'best' time to switch to column sweep */ call BINVERS (P, Pointer) ; /* inverse of 2x2 matrix */ do i = 2 to CL0G2 (N) ; ibyi = ibyi * 2 ; /♦ double block size for every iteration */ if ibyi>Q then do ; /* finish with column sweep ♦/ call CLSWP (Q, N, M, P, Pointer) ; goto LEAVE ; end ; else if ibyi < = M then do ; call MVMULT (ibyi, ibyi, P, Pointer) ; call MVMULT (ibyi, ibyi, P, Pointer) ; end ; else do ; call MVMULT (M, M, P, Pointer) ; call MVMULT (ibyi, M, P, Pointer) ; end ; end ; /* ends Do Loop */ LEAVE: call MVMULT (N, N, P, Pointer) ; /* final results */ call ENDO (Tptr) ; end GCLGL ; (fig. 3-13) 65 CHAPTER 4 STATISTICS 4.1 Introduction In the previous chapters we presented the concept of control structures as a useful tool for analyzing algorithms for parallel machines. In this chapter we will look at several algorithms in detail. The emphasis of this study will be on the performance of the algorithms with respect to control structure usage. For this chapter we will be looking at four algorithms: CCLGL (mentioned above), and ALGA, ALGB , and ALGC (see [6]). CCLGL is strictly for constant coefficient recurrences, the other three are for banded systems. ALGA is designed for P <= n, ALGB for P = 0(n*m/2) and ALGC for P = 0(m*m*n). The set of resource times used throughout this chapter (except for section 4.5) will be unit times with the following values. For memory access, memory store, and both alignment in and alignment out, the time required for these resources is one time unit. For the processor steps, addition and subtraction and multiplication, we will use a 66 resource time of two time units. For division, three time units will be used, and for a unary minus, one time unit will be the resource time. Finally, we will use one time unit for a null processor step. This set of times was chosen to represent a relatively typical set of resource times. Processor steps will usually take longer than data transfer steps. Also, division is a very time consuming processor step while both unary minus and a null processor operation are 'quick' processor steps. No options, such as the slide option or the turning on of sneak paths, will be used without specifically mentioning this fact. 4.2 Occurences of Control Structures in Algorithms 4.2.1 Introduction For these sections we will be interested in the frequency of use of selected control structures. We will look at which structures are used frequently, and how this affects the final times for these algorithms. First we will look at the algorithms separately, and then we will combine these results and look at the significance for recurrence algorithms in general. 67 For the analysis used in this section, we will be varying the values of three parameters. These parameters are n,m, and p. Here, p, as you will remember, refers to the number of processors available. In order to limit the number of runs needed, we have selected six representive values of p: 8,16,32,64,256,1024. These values have been chosen to represent a large cross section of processor values, ranging from a small number of processors (8), to a reasonably large number of available processors (1024). The parameter n refers to the size of the recurrence. Once again we will restrict the number of values studied to a representive sample. Here we will use the following values: 10,20,50,100,500. This sample covers small ' recurrences (10, 20, etc.) as well as an example of a large system (500) (Note: we are not assuming that N is a power of two) . The final parameter considered will be m, the bandwidth of the recurrence. Since most recurrences in practice will have small bandwidths, thus we have concentrated our efforts on small bandwidths, that is, m = 1,2,3. For a representation of large bandwidth recurrences, m = SQRT(N) and m = n - 1 have been choosen. 4.2.2 CCLGL 68 For small p (3) , over 60% (approaches 99% for large n) of the processor time is spent during the last step of the algorithm. That is, most of the time is spent in the last matrix vector product, where the matrix is (NxN) . While for p < 0(n*n), folding takes place, for p > n the proportional time spent on this last step decreases dramatically. This speed up associated with p > n stems from the fact that with this many processors available, at least one row of the matrix can be done at a time (see section 3.3). Two interesting exceptions to the above analysis can be found. For small p, significantly less time (proportionally) is spent in this last matrix product for systems having large bandwidths. This is not due to a reduction in the time spent on the last step, but rather reflects the additional time spent on the other phases of the algorithm for large values of m. The other exception follows from the same line of thought presented in the previous paragraph. This time we will look at the other end of the spectrum. For large p, a significant proportion of the total time for the algorithm is once again spent in the last matrix-vector product. This time thus reflects the efficiency of the algorithm as a whole, when a large number of processors are available. In fact, for the case n = 590, m = 499 (the largest system 69 studied) , the total time for the entire algorithm excluding the last step, was less than the time spent executing that last step. A major interest in this algorithm stems from the choice of Q, the switchover point to column sweeping. Currently, the determination of Q requires Q > m. For some set of parameters, this bound switches the algorithm to column sweeping earlier than is optimal. One example of this is for p = 256, where the n = 100, m = 99 system can be executed is less time than the system where n = 100, m = 10 or where n = 100 and m = 3. In both of these cases, the algorithm switches to column sweeping, while for m = 99 the algorithm never does switch. Similar cases arise for other values of p. In these cases the amount of processor time increased while the total time decreased, compared to the the systems where column sweeping was never needed. These facts indicate that switching to column sweeping should be held off if the algorithm can be completed soon after the switch is currently called for. An additional criteria for Q, to minimize this inefficiency, might be to consider q > CEIL(N/CONSTANT) , where the value of the constant can be determined in further studies. 70 4.2.3 ALGA ALGA as mentioned earlier is designed for p <= n. Since part of the algorithm is dependent of log (p) , large valaes of p can sometimes cause the algorithm to take longer, for the same recurrence, than when fewer processors are available. However there are several structures ('5FAFA*-<' and '5FAP?A?S' to name two) whose execution count is proportional to the value (n*m/p) . Thus, too few processors can also slow down this algorithm. For small m, p = 0(n) gives the best results since (n*m/p) then has a small value. For large values of m, the best results are obtained with p = 0(2*n) since with this bound the two conflicting factors mentioned above are best balanced . 4.2.4 ALGB ALGB is designed for p = 0(n*m/2). When executing this algorithm, performance is very sensitive to this bound. Basically, the entire time for the algorithm must be multiplied by CEIL ( (N*M) / (2*P) ) . For p >= (n*m/p) the times for this algorithm remain consistantly small. However, if p < (n*m/p) , the algorithm becomes extremely slow. 71 Ignoring the folding, an interesting situation occurs as the value of m increases. Due to one bound based on log (n/m) , the algorithm actually speeds up for large m. This effect is only noticable if no folding takes place. If folding takes place due to increasing the bandwidth, then this becomes a second order effect. 4.2.5 ALGC ALGC is extremely dependent on m. Times for m = 2 are roughly twice those for m = 1, and m = 3 times are approximately twice those of m = 2, etc. ALGC can be described in two parts. Part one is the initialization of the system, and part two is the actual execution phase. Part one is null for m = 1, thus accounting for the good times associated with this bandwidth. For m = 2, folding takes place for p < n. For p > n, only a few simple instructions need be executed to initialize the system, thus justifying the small times for this bandwidth. For m > 2, folding takes place if p < (m*m*m) . Since a relatively large number of statements must be executed in order to initialize systems with a large bandwidth, the fact that folding takes place whenever the number of processors is less than m*m*m adds up to quite a few statements being executed to initialize systems with 72 m > 2. Stage two is a relatively simple set of executable structures (no more than three) , with each structure executed a relatively small number of times - at least if there is no folding. Folding occurs in this stage for p < (m*m*n) , thus once again the algorithm is extremely sensitive to large m. 4.3 The Effect of Sneak Paths Sneak paths exist to bypass the established order of memory - alignment - processor - alignment - memory. Two of the four algorithms studied show no significant change when sneak paths around all resources are allowed. CCLFL as an algorithm never uses a structure that allows for sneak paths, and thus the times for this algorithm do not change when sneak paths exist. ALGC does exhibit some time savings when sneak paths are present. This algorithm does contain some structures having sneak paths, however, the time saved by allowing sneak paths to exist is normally less than one percent of the total time. This result is due to two factors. One, the number of times structures containing sneak paths are executed compared to the number of times structures without 73 sneak paths are executed is insignificant. Two, because of overlap within structures, many resources that are skipped, due to sneak paths, are normally not included in the total time calculation. Thus, when their resources are skipped, no execution time is saved. ALGA has several structures with resources that can be skipped due to sneak paths. Approximately a 19% time savings is realized by allowing the required sneak paths to exist. Some of this savings is due to the lack of processor operations within the structures containing these sneak paths ('2FA?', '2A?3', etc.). If sneak paths and the slide overlap options (see the next section) are available, some of the time savings due to sneak paths would be absorbed due by the effects of the slide option. ALGB can be analyzed along the same lines as ALGA. While, in this algorithm, many structures containing sneak paths could be absorbed with the slide option, one factor prevents this from being as important as it is for ALGA. For ALGB, a large number of the structures containing sneak paths are memory to memory transfers ('FA?P?A?S' and 'FAP?A?S') which will not be totally absorbed by the slide option as mentioned for ALGA. Fig. 4-1 displays the results from comparing the 7k algorithms assuming first no sneak paths, and then assuming all sneak paths are available. The numbers shown represent the percentage of the total time for the algorithms with sneak paths, compared to the total time without sneak paths. 75 % = T(all sneak paths on ) / T(no sneak paths on) CCLGL ALGA ALGB ALGC p=16 p=32 p=16 p=32 p=16 p=32 p=16 p=32 N=10 M=l 1.00 1.00 .90 .90 .84 .84 1.00 1.00 M=2 1.00 1.90 .38 .89 .88 .88 .98 .97 M=3 1.00 1.00 .92 .93 .90 .90 .96 .93 M=9 1.00 1.00 .96 .94 .84 .85 .99 .98 N=20 M=l 1.00 1.00 .91 .90 .84 .84 1.00 1.00 M=2 1.00 1.00 .89 .90 .87 .86 .99 .98 M=3 1.00 1.00 .92 .93 .88 .90 .98 .97 M=4 1.00 1.00 .92 .93 .88 .89 .99 .98 M=19 1.00 1.00 .98 .97 .84 .84 1.00 1.00 (fig. 4-1) 76 N=50 (fig. 4-1 continued) M=l 1.00 1.00 .89 .91 .83 .84 1.00 1.00 M=2 1.00 1.00 .88 .90 .87 .87 .99 .99 M=3 1.00 1.00 M=7 1.00 1.00 M=49 1.00 1.00 N=100 M=l 1.00 1.00 M=2 1.00 1.00 M=3 1.00 1.00 M=1'3 1.00 1.00 .91 .93 .88 .83 .88 .89 .88 .89 .90 .92 .96 .96 M=l 1.00 1.00 yi^2 1.00 1.00 M=3 1.00 1.00 M=22 1.00 1.00 M=499 1.00 1.00 .37 .88 .87 .83 .90 .90 .98 .98 1.00 1.00 .99 .99 .93 .95 .88 .89 .99 .99 .99 .99 .83 .83 1.00 1.00 .83 .83 1.00 1.00 .87 .87 1.00 1.00 .88 .88 1.00 1.00 .90 .90 1.00 1.00 M=99 1.00 1.00 1.00 1.00 .83 .83 1.00 1.00 N=500 .83 .83 1.00 1.00 ,87 .87 1.00 1.00 .88 .88 1.00 1.00 .92 .92 1.00 1.00 .80 .80 1.00 1.00 77 4.4 The Slide Overlap Option The slide overlap option is described in [1], The effect of this option is to merge structures based on data dependencies. The two key phrases here are structures and data dependencies. The more structures the more places for structures to slide together. The more data dependencies present, the more restrictions on when two structures can slide together, and thus the smaller the time savings due to this option. Algorithms ALGA and ALGB require relatively few structures (repeated several times) . Because of this, very little times savings appear due to the slide option. CCLGL and ALGC both require more structures and thus some savings from merging structures becomes evident. Actually for all of the algorithms, this option has little effect. The reasons behind this have not been completely explored, but consider the number of structures required for the algorithm requiring the most structures. Even in this case, the number of structures used is under one hundred. Considering that the maximum time savings from a slide is the time it takes to execute one iteration of a structure, the maximum potential time savings from the slide option is small. 78 Fig. 4-2 displays the results of compairing the algorithms with and without the slide option. The numbers represent the percentage of total time allowing the slide option, divided by total time without the slide option. 79 % = T(slide option) / T(butt option) CCLGL ALGA ALGB ALGC p=16 p=32 p=15 p=32 p=16 p=32 p=16 p=32 N=10 M=l 1.00 .95 1.00 1.^50 1.00 1.00 1.00 1.00 M=2 1.00 .96 1.00 1.00 1.00 1.00 .89 .90 M=3 1.00 .96 1.00 1.00 1.00 1.00 .94 .93 M=9 1.00 .94 1.00 1.00 1.00 1.00 .99 .99 N=20 M=l 1.00 .99 1.00 1.00 1.00 1.00 1.00 1.00 M=2 1.00 .99 1.00 1.00 1.00 1.00 .92 .89 M=3 1.00 .99 1.00 1.00 1.00 1.00 .96 .95 M=4 1.00 .99 1.00 1.00 1.00 1.00 .97 .95 M=19 1.00 .98 1.00 1.00 1.00 1.00 1.00 1.00 (fig. 4-2) 80 N=50 (fig. 4-2 continued) M=l 1.00 .97 1.00 1.00 1.00 1.00 1.00 1.30 M=2 1.00 .98 1.00 1.00 1.00 1.00 .96 .93 M=3 1.00 .98 1.00 1.00 1.00 1.00 .98 .97 M=7 1.00 .95 1.00 1.00 1.00 1.00 1.00 .99 M=49 1.00 .99 1.00 1.00 1.00 1.00 1.00 .99 N=100 M=l 1.03 .98 1.00 1.00 1.00 1.00 1.00 1.0! M=2 1.00 .98 1.00 1.00 1.00 1.00 98 .96 M=3 1.00 .99 1.00 1.00 1.00 1.00 .99 .98 M=10 1.00 .99 1.00 1.00 1.00 1.00 1.00 1.00 M=99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 N=500 M=l 1.00 .99 1.00 1.00 1.00 1.00 1.00 1.00 ^4=2 1.00 .99 1.00 1.00 1.00 1.00 .99 .99 M=3 1.00 .99 1.00 1.00 1.00 1.00 1.00 1.00 M=22 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 M=499 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 81 4.5 Comparing Algorithms with Various Resource Times For this section we will be studying the effect of varying resource times, Time-A is the set of resource times mentioned in the previous sections. Time-B has every resource taking the same time, that is, one time unit. Time-C is a set of times with memory and alignment time equal to one, and all processor steps, except for the null processor step (one time unit) , will require three time units to be executed. By studying fig. 4-4 through fig. 4-7, the obvious observation is that as the processor time increases, the percentage of the total time spent executing processor operations increases. The only minor deviation from this is that although the processor time was increased linearly (from Time-B (1) to Time-A (app. 2) to Time-C (3)) the percentage of processor time to total time increased at a slower rate. The reason for this can be traced to the fact that increasing the time of processor steps only increase potential overlap within structures. Since this study was done without the slide option, structures not containing a processor step will take the same time to execute no matter what the processor times are. Among the algorithms, CCLGL has the highest percentage 82 of processor time vs. total time. This is due to two reasons. First, an3 most importantly, CCLGL is an algorithm designed for constant coefficient recurrences. By taking advantage of this fact, many memory and alignment steps may be skipped, leaving the same number of processor steps. The second reason for the high concentration of processor steps is the efficiency of the overlap within the structures used while executing this algorithm. Since most of the execution time for this algorithm results from calls to MVMULT, and most of the structures in MVMULT are very close to processor bound, the above mentioned result is expected . An interesting trend is noticed as the bandwidth increase for a fixed recurrence size (N) . Generally, as the bandwidth increases, the percentage of processor time increases. This is probably due to the increase in the number of times each structure is executed. Since overlap within a structure is optimal, the more time spent executing structures with significant overlap, the closer the total and processor times will coincide. One final observtion can be made by observing the trend associated with Time-C (processor time of three time units) . Here, as the number of operations increases, the percentage 83 of time spent with the processor active increases and eventually approaches 100%. At this point, the possibility of an algorithm becoming processor bound exists. To avoid this, there should not be too large of a disparity between processor times and data transfer times. The numbers shown in the following figures are the percentage of processor time vs. the total time for the algorithm. 84 COMPARING PROCESSOR TIME VS. TOTAL TIME for VARIOUS RESOURCE TIMES (CCLGL) TIME-A TIME-B TIME-C p=16 p=32 p=16 p=32 p=16 p=32 N=10 M=l .56 .53 .42 .38 .65 .60 M=2 .58 .54 .43 .39 .66 .62 M=3 .58 .55 .44 .41 .66 .62 M=9 .64 .54 .44 .41 .70 .60 N=20 M=l .67 .66 .42 .51 .73 .73 M=2 .67 .56 .44 .51 .74 .73 M=3 .67 .66 .45 .52 .73 .72 M=4 .67 .66 .45 .52 .73 .72 M=19 .75 .72 .55 .60 .86 .76 (fig. 4-3) 85 N=50 M=l M=2 M=3 M=7 M=49 N=100 M=l M=2 M=3 M=10 M=99 N=500 (fig. 4-3 continued) .79 .77 .79 .76 .79 .76 .86 .75 .79 .80 86 86 86 91 79 .86 .86 .86 .95 .80 .53 .52 .53 .52 .53 .53 .61 .54 51 .58 57 .59 57 .58 58 .59 67 .64 50 .52 .84 .84 81 .81 .83 .80 .88 .79 .96 .91 90 .89 90 .89 89 .89 96 99 .92 .97 M=l .96 .96 M=2 .96 .96 M=3 .96 .96 M=22 .80 .99 M=499 .80 .80 64 .64 .97 .97 64 .64 .97 .97 64 .64 .97 .97 50 .68 1.00 .99 50 .51 1.00 1.00 86 COMPARING PROCESSOR TIME VS. TOTAL TIME for VARIOUS RESOURCE TIMES (ALGA) TIME-A TIME-B TIME-C p=16 p=32 p=16 p=32 p=lo p=32 N=10 M=l .36 .36 .24 .24 .47 .47 M=2 .37 .36 .24 .24 .49 .53 M=3 .44 .44 .27 .27 .60 .62 M=9 .68 .67 .49 .41 .75 .78 N=20 M=l .37 .36 .24 .24 .48 .47 M=2 .41 .37 .26 .24 .52 .50 M=3 .48 .45 .29 .27 .62 .61 M=4 .54 .50 .34 .31 .66 .65 M=19 .83 .81 .67 .65 .87 .86 (fig. 4-4) 87 N=50 M=l M=2 M=3 M=7 M=49 N=100 M=l M=2 (fig. 4-4 continued) M=l M=2 M=3 M=22 M=499 44 .37 45 .41 53 .48 71 91 .67 .93 51 .43 43 .44 M=3 .56 .52 M=10 .79 .78 M=99 .94 .95 N=500 • 69 .60 51 .50 59 .58 88 .89 36 .38 .28 .24 28 .26 33 .29 49 .43 82 .84 43 38 75 37 56 55 64 .48 .54 .64 79 .79 94 .95 32 .37 .65 .55 29 .27 .57 .55 35 .32 .65 .65 61 .56 .84 .84 88 .90 .96 .97 .38 .87 .76 .30 .58 .53 .37 .67 .66 .77 .91 .92 .36 1.00 1.00 88 COMPARING PROCESSOR TIME VS. TOTAL TIME for VARIOUS RESOURCE TIMES (ALGB) TIME-A TIME-B TIME-C p=15 p=32 p=16 p=32 p=16 p=32 N=13 M=l .32 .32 .22 .22 .40 .40 M=2 .43 .43 .26 .26 .48 .48 M=3 .51 .51 .32 .32 .56 .56 M=9 .77 .74 .51 .50 .92 .88 N=20 M=l .32 .32 .22 .22 .40 .40 M=2 .46 .43 .29 .27 .53 .49 M=3 .56 .52 .36 .33 .63 .58 M=4 .64 .62 .42 .41 .73 .70 M=19 .82 .81 .54 .54 .98 .97 (fig. 4-5) 89 N=50 M=l M=2 M=3 M=7 M=49 N=100 M=l M=2 M=3 M=10 M=99 N=500 M=l M=2 M=3 M=22 M=499 (fig. 4-5 continued) 34 .32 49 .46 58 .57 75 83 .75 .83 36 .36 50 .49 59 .58 79 .79 83 .83 23 .22 .42 .40 30 .29 .56 .52 38 .37 .66 .64 52 .52 .85 .84 55 .55 .99 .99 24 .23 .44 .41 31 .30 .57 .55 38 .37 .66 .65 57 .57 .88 .88 55 .55 1.00 1.00 36 .36 .25 .24 .45 .45 49 .49 .33 .30 .56 .56 58 .58 .37 .37 .65 .65 86 .86 .67 .67 .94 .94 80 .80 .50 .50 1.00 1.00 90 for VARIOUS RESOURCE TIMES (ALGC) TIME-A TIME-B TIME-C p=16 p=32 p=16 p=32 p=16 p=32 M=l .56 .53 .22 .22 .46 .46 M=2 .58 .54 .26 .26 .57 .55 M=3 .58 .55 .35 .29 .68 .58 M=9 .64 .54 .35 .33 .77 .71 N=2g M=l .67 .66 .21 .22 .58 .46 M=2 .57 .66 .30 .26 .66 .58 M=3 .67 .66 .43 .38 .81 .72 M=4 .67 .66 .46 .42 .86 .79 M=19 .75 .72 .39 .38 .86 .83 (fig. 4-6) 91 N=50 M=l M=2 M=3 M=7 M=49 N=100 M=l M=2 M=3 M=10 M=99 N=500 M=l M=2 M=3 M=22 M=499 (fig. 4-6 continued) 79 .77 .79 .76 .79 .76 .86 .75 .79 .80 .86 .86 .86 .86 .86 .86 .91 .95 .79 .83 96 .96 96 .96 96 .96 80 80 .99 .80 .21 .21 38 .33 49 .47 57 .56 44 .44 26 .21 41 .38 52 .50 62 .62 50 .45 31 .29 44 .43 53 63 .53 .59 63 .58 80 .72 90 .86 94 .93 91 .90 70 .60 85 .80 93 .90 96 .95 98 .93 85 .80 91 .90 95 .94 49 .49 93 96 .90 .96 92 4.6 Comparing Actual ani Theoretical Time Bounds Several papers have mentioned theoretical bounds on the algorithms mentioned in this chapter. One such work, [5], gives the processor bounds for AL3C as: t = |log^n|(2 + riog^ml) - jlog^m |( llog^m 1 + l)/2 Fig. 4-3 shows the theoretical processor time, the actual processor time, and the total time for ALGC. For this study, we have used the resource time parameters refered to in section 4.5 as Time-B. This set of resource times allows one time unit for the execution of every resource request. The theoretical and actual processor times do not exactly coincide because of memory-memory transfers (•PAPAS'). Of course, the total time includes data transfers and thus is greater than either the theoretical or actual processor time. We have restricted this study to recurrences that can be solved uGing ALGC without folding. In order to include as many different recurrences as possible, we used p = 1324 to avoi^ folding. The entries in fig. 4-8 are the actual times and not percentages as displayed in previous figures. 93 N=10 M=l M=2 M=3 M=2 M=3 M=4 M=7 THEORETICAL PROCESSOR TIME ALGC P = 1024; p > m*m*n ACTUAL PROCESSOR TIME 11 13 M=4 13 M=7 14 N=20 M=l 10 14 17 17 19 8 14 16 16 18 10 17 20 20 23 (fig. 4-8) TOTAL TIME 36 53 60 60 75 45 64 20 20 88 94 n=53 M=l M=2 M=3 M=4 N=100 M=l M=2 M=3 N=253 M=l M=2 N=1030 ^1=1 12 17 21 21 14 20 25 16 23 12 20 28 24 14 23 32 16 26 54 75 96 84 63 86 108 72 97 20 20 93 (fig. 4-8 continued; 95 4.7 Future Studies The studies presented in this chapter are indicitive of the studies and considerations made possible by the concept of control structures. These studies are not intended to be a definitive study of the four algorithms we discussed. Many future studies can be made given the flexibility of these structures. The main area of interest illuminated by the studies done for this chapter is the effect of combining the premise behind the studies already made. We have already mentioned the potential interest in combining the sneak path and the slide overlap studies. Of further note are the cases of varying resource times with one or both of the options just mentioned. With processor time much greater than memory or alignment time, the slide option could potentially reduce the total time for the algorithms to almost that of just considering the processor time. Making memory and alignment time large will increase the improtance of both the slide option and the presence or absence of sneak paths. These and many other studies can be made to determine a 'best' set of resources to execute these structures. 95 CH?^PTER 5 CONCLUSION We have presented a concept for allocating resources at the machine instruction level. By allocating several resources as a single instruction, overlap between resources can be statically calculated at the time of defining a given instruction. This allows for optimal overlap within a control structure and minimizes the need for dynamic overlap calculations. /\s we studied several recurrence algorithms, it became apparent that very few unique control structures are sufficient for describing the algorithms currently under consideration. Thus, a limited number of control structures/machine instructions are capable of handling the known recurrence algorithms. The combination of the above two factors, limited number of control structures and their inherent efficiency, makes this concept of control structures potentially useful as a source of machine instructions for parallel machines. 97 We have shown the adaptability of this concept for handling current algorithms as well as potential future algorithms. We have also illustrated the ease by which algorithms can be described in terms of control structures and thus we illustrated the inherent simplicity of executing an algorithm composed of these structures. In addition to their usefulness as machine instructions, it was demonstrated that control structures make the analysis of algorithms easy to conceptualize. Thus control structures are well suited for studies to determine the best computer architecture to maximize the speed and efficiency of executing recurrence algorithms on parallel machines. 98 REFERENCES [1] J. C. Wawrzynek, "Scheduling and Simulation of Computation Graphs for Resource Computers," Master's Thesis, University of Illinois at Urbana, May, 1979. [21 D. D. Gajski, "An Algorithm for Solving Linear Recurrence Systems on Parallel and Pipelined Machines," Dept. of Computer Science, University of Illinois at Urbana, unpublished memo, November, 1973. [3] R. K. Montoye, "Parallel and Pipeline Solution of First Order Difference Problems by Odd Equation Elimination," Dept. of Computer Science, University of Illinois at Urbana, unpublished memo, August, 1978. [4] D. H. Lawrie, "Recurrence Implementation Notes - I, 'Constant Coefficient Algorithms'," Dept. of Computer Science, University of Illinois at Urbana, unpublished memo. File 168, August, 1977. 99 [5] S. C. Chen, D. J. Kuck, and A. H. Sameh, "Practical Parallel Band Triangular System Solvers," ACM transactions on Mathmatical Software, vol.4, no . 3 , Sept. 1978, pp. 270-277. [6] K. Y. Wen, "Interprocessor Connections - Capabilities, Exploitation and Effectiveness," PhD. Thesis, University of Illinois at Urbana, UIUCDCS-R-76-830, October, 1976. 100 ?^PPENDIX A Appendix A presents the set of control structures needed to execute the algorithms discussed in this thesis. First we show the control structures to graphically illustrate the relationship between the resources contained within each structure. To avoid duplication, we use here the notation introduced in the section that defined the canonical set. That is, a generalized processor step, P, is used, and sneak paths are ignored. Simply for ease in locating structures, the structures have been divided, as , per Chapter 2, into a canonical set and a set of additional structures. The rest of Appendix A lists the actual structures used and indicates which sturctures are needed for which algorithms. CAIIONICAL SET 101 ^FA 2AS F A A S 2AP 2AP< A A ir P 3FAP 3FAP< F A P 3PAS p i A F A ^FAPP itFAPP< F t p p F 4- A i h ADDITIONAL STRUCTURES 102 ^APAS A P 4^ A J. S ^FAPAS F i A J. P >l. A s ^FAFAP F F i 1 A A 6FAFAPP< F F 4r 4^ A A 7FAPASAS F 4. A 4. p >!. A 4. s A S ^PASAS ^^ A A i i S S 7FAFAPAS F F I 4^ A A I A i S SFAPFAPAi F 4. A N^ F i P A P 4. A 4. s 103 1 OFAFAPFAPAS 1 OFAFAFAFAPP< 1 1 FAFAFAFAPAS F F F F i i I ]r ^^ A I S 1 2FAFAPASAPPAS F F A A A^^A S 4- p 4- t 1 3FAFAPASFAPPAS \/ F F ,P A 4^ i s P t 104 CONTROL STRUCTURES AND WHERE THEY ARE USED The Structures ALGA ALGB ALGC CCLGL MVMULT 2AS 2FA 2A+< 2A+< 2A* 3FA+ 3U-AS 4FA*+< 4A+AS 5FA+AS 5FAFA* 5FAPA3 5FA*AS 5FAU-AS 5U-ASAS 105 5FA-AS 6FAFA*-< 6FAFA*+< 7FAFA*A+< 7FAFA+AS 7FAFA*AS 7FAFA/A3 7FAPASAS 8FA*FA+AS 10FAFAFAFA*-< 10FAFAFAFA*+< 10FAFA*FA-AS 10FAFA*FA+AS 11FAFAFAFA*AS 12FAFA/ASA*U-AS 13FAFA/ASFA*U-AS 106 APPENDIX B Appendix B presents the algorithms discussed in this thesis. CCLGL and MVMULT are discussed in Chapter 2 and thus will not be repeated here. This appendix will contain the three algorithms refered to as ALGA, ALGB , ALGC. For a more detailed description of these algorithms, see [51. The algorithms will be presented in a psydo-code version. The emphasis of this appendix is to show the conditions for the use of control structures. MAP signifies a call to FAPGEN ([1]) and designates a control structure and its repetition factor. DEPEND flags data dependencies (see Chapter 2) . CEIL is the ceiling function. DOLOOP and ENDO are used to form a loop around structures. 107 ALGA: Proc (n ,m,p , pointer ) ; call DOLOOP (1) ; bound = CEIL (n*m/p) - 1; call MAP ( '10FAFA*FA-AS' ,M) ; call MAP ( '5FAP?A?S' ,M) ; call DEPEND (5,1) ; if m = 1 then call MAP (' igFAFA*FA?-A?S ', BOUND) ; else do ; call DOLOOP (bound) ; call MAP ( '2FA?' ,1) ; call DEPEND (2,6) ; call MAP ( '6FAFA*-<' ,M) ; call DEPEND (6,1) ; call MAP ('2A?S' ,1) ; call ENDO (pointer) ; call TRANSPOSE (m, p, pointer ) ; /* an MxM matrix */ call MAP ( 'SFAPPPAPS' ,1) ; bound2 = LOG (P/(M) - 1 ; call DEPEND (5,1) ; call DOLOOP (bound2) ; /* partition memories */ bound3 = CEIL (m/2) ; call MAP ( '2FA' ,1) ; call DEPEND (10,1) ; call MAP ( '10FAFAFAFA*-< 'BOUNDS) ; 108 call DEPEND (10,1) ; call MAP ( '5FA+A7S' ,1) ; call DEPEND (5,1) ; call DOLOOP (bound3) ; call MAP ( '10FAFAFAFA*+<' ,M) ; call DEPEND (10,1) ; call MAP ( 'SU-ATSAPS' ,1) ; call ENDO (pointer) ; call ENDO (pointer) ; call DEPEND (5,1) ; call DOLOOP (1) ; /* partition memories */ call MAP ( ' 2FA' ,1) ; call DEPEND (2,10) ; bound4 = CEIL (m/2) ; call MAP ( • 10FAFAFA*-< ' ,B0UND4) ; call DEPEND (10,1) ; call MAP ( ' 5FA+A7S' ,1) ; call ENDO (pointer) ; /* ends stage 3 */ /* partition memory */ if remote. term then ; /* no need for last step */ else do ; bound5 = CEIL (n/p) -1 ; call DEPEND (5,1) ; if m = 1 then call MAP ( ' 13FAFA*FA?-A?S ' ,B0UND5) ; 109 else do ; call DOLOOP (bounds) ; call TRANSPOSE (m, p, pointer ) ; call DEPEND (5,1) ; call MAP (2FA?' ,1) ; call DEPEND (2,1) ; call MAP ( '5FAFA*-<' ,M) ; call DEPEND (6,1) ; call MAP ( '2A?S' ,1) ; call ENDO (pointer) ; /* ends stage 4 */ end ; end ; call ENDO (pointer) ; /* ends algorithm */ end ALGA ; 110 ALGB: Proc (n ,m,p, pointer ) ; fold = CEIL (n*m/(2*p)) ; call DOLOOP (1) ; call DOLOOP (2) ; /* stage 1 */ calll0 ('2FA',1) ; call DEPEND (2,8) ; temp = 2 * m * fold ; call MAP ( '10FAFA*FA?+A?S' ,TEMP) ; call DEPEND (10,1) ; call ENDO (pointer) ; /* ends initialization */ fold = CEIL (m*n/(2*p) ; bound = LOG (N/M) - 1 ; temp = bound * fold ; call DOLOOP (temp) ; call MAP ( '2FA?' ,1) ; call DEPEND (2,5) ; call MAP ('5FAFA*-<' ,M) ; call DEPEND (6,1) ; call MAP ( '5FAP?A?S' ,1) ; call DEPEND (5,1) ; call MAP ( •2AS' ,1) ; call DEPEND (2,1) ; call MAP ( '6FAFA*-<' ,M) ; call DEPEND (6,1) ; call MAP ( '5FAP?A?S' ,1) ; call DEPEND (5,1) ; Ill call MAP ( '2AS' ,1) ; call DEPEND (2,1) ; call ENDO (pointer) ; call DEPEND (7,1) ; call MAP ('2FA?' ,1) ; call DEPEND (2,5) ; if remote. term then ; /* can skip the rest */ else do ; call DOLOOP (fold) ; call MAP ( '6FAFA*-<' ,M) ; call DEPEND (6,1) ; call MAP ('7FAP?A?SAS' ,1) ; call ENDO (pointer) ; end ; call ENDO (pointer) ; /* ends this algorithms */ end ALGB ; 112 AL3C: Proc (n ,m,p, pointer) ; call Doloop (1) ; if m = 1 then ; /* no initialization needed */ else if m = 2 then do ; /* initialize */ fold = CEIL (n/p) ; if fold > 1 then do ; call MAP ( '7FAFA*AS' ,FOLO) ; call DEPEND (5,1) ; call MAP ( ' 3FA+<' ,FOLD) ; call DEPEND (3,1) ; call MAP ( ' 2A3' ,1) ; call DEPEND (2,1) ; end ; else do ; call MAP ( '7FAFA*A+<' ,FOLD) ; call DEPEND (7,1) ; end ; call MAP ( ' 5FA-AS' ,FOLD) ; call DEPEND (5,1) ; temp = 2 * fold ; call MAP ( ■ 5FAP?A?S' ,TEMP) ; call DEPEND (5,1) ; end ; /* ends case where in = 2 * / else do ; /* cases m > 2 */ q = 2 ; /* this is basically ALGFULL (see [6]) */ 113 bound = LOG (M) -2 ; do i = to bound ; fold = CEIL (m*q*q*2/p) ; call MAP ( '11FAFAFAFA*AS' ,FOLD) ; if i > then do ; fold = CEIL (m*m*q/p) * i ; call MAP ('7FAFA+AS' ,FOLD) ; end ; call DEPEND (7,1) ; fold= CEIL (m/p) ; call MAP ( '7FAFA?+A?S' ,FOLD) ; call DEPEND (7,1) ; fold = CEIL (m*m/p) ; call MAP ('5FAP?A?S' ,FOLD) ; call DEPEND (5,1) ; q = q * 2 ; /* begin next iteration */ end ; call MAP ('5FAP?A?S' ,1) ; /* an additional align */ call DEPEND (5,1) ; fold = CEIL (m*m*m/p) ; bound = LOG (M) ; temp = bound * fold ; if fold > 1 then do ; call MAP ( '4A+AS' ,TEMP) ; 114 call DEPEND (2,1) ; end ; else do ; call MAP ( ' 2A+<' ,TEMP) ; call DEPEND (2,1) ; end ; fold = CEIL (m*m/p) ; call MAP ( • 5FA?+A?S' ,FOLO) ; call DEPEND (5,1) ; end ; /* ends initialization step */ bound = LOG (N/M) ; call DOLOOP (bound) ; if remote. term /* no need for folds */ then do ; call MAP ( 'SFAFA*' ,1) ; call DEPEND (5,1) ; temp = LOG (M) ; if m = 1 then ; else do ; call MAP ( • 2A+<' ,TEMP) ; call DEPEND (2,1) ; end ; call MAP ( ' 5FA-AS' ,1) ; call DEPEND (5,1) ; end ; /* ends remote term case */ fold = CEIL (m*m*n/p) ; 115 call MAP ('7FAFA*AS' ,FOLD) ; call DEPEND (5,1) ; temp = fold * LOG (M) ; if m = 1 then ; else do ; call MAP ('4A+AS' ,TEMP) ; call DEPEND (2,1) ; end ; fold = CEIL (m*m*n/(2*p) ) ; call MAP ('5fa-as' ,fold) ; call DEPEND (5,1) ; end ; /* ends variable coefficient case */ call ENDO (pointer) ; /* ends stage 2 */ call ENDO (pointer) ; /* ends algorithm */ end ALGC ; BIBLIOGRAPHIC DATA SHEET 1. Report No. UIUCDCS-R-79-973 3. Recipient's Accession No. 4. Title and Subtitle Control Structures for Parallel Machines 5. Report Date May, 1979 7. Authot(s h ranFllagen Panzica 8* Performing Organization Rept. No- UIUCDCS-R-79-973 9. Performing Organization Name and Address University of Illinois at Urbana-Champaign Department of Computer Science Urbana, Illinois 61801 10. Project/Task/Work Unit No. 11. Contract /Grant No. US NSF MCS76-81686 12. Sponsoring Organization Name and Address National Science Foundation Washington, D. C. 13. Type of Report & Period Covered Master's Thesis 14. 15. Supplementary Notes 16. Abstracts Until recently, algorithms for solving recurrence have been judged by counting processor steps. The purpose of this report is to explore and adapt these algorithms for a generalized parallel machine. Machine architecture parameters such as sneak paths, register usage and resource times will be considered, along with potential scheduling problems. While considering these factors, the concept of control structures will be introduced. The potential of defining machine instructions to execute these control structures has been considered and will be mentioned throughout this report. 17. Key Words and Document Analysis. 17a. Descriptors Control structures Machine instructions Recurrence algorithms 17b. Identifiers/Open-Ended Terms 17c. COSATI Field/Group 18. Availability Statement Release Unlimited FORM NT(S-35 ( 10-70) 19. Security Class (This Report) -^ UNCLASSIFIED. 20. Security Class (This Page UNCLASSIFIED 21. No. of Pages 120 22. Price USCOMM-DC 40329-P7 1 F£5 2 ^