Report No. 340

A MODEL FOR LINEAR PROGRAMMING OPTIMIZATION OF I/O-BOUND PROGRAMS

by

David E. Gold

June, 1969

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the Advanced Research Projects Agency as administered by the Rome Air Development Center under Contract No. USAF 30(602)4144, and submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, June, 1969.

A Model for Linear Programming Optimization of I/O-Bound Programs

ABSTRACT

A model of a class of machines having periodically addressable secondary memory is derived. The problem of (timewise) optimization of programs consisting of many core-loads is considered with respect to the model. For a large class of programs, sufficient constraints are established to make the solution amenable to linear programming methods.

ACKNOWLEDGMENT

I wish to thank Professor D. J. Kuck for his suggestions and criticisms during the evolution of this study. I would also like to extend my thanks to the Department of Computer Science and the ILLIAC IV Project for their financial support. Finally, I am grateful to Mrs. Sharon Hardman for her excellent job in preparing this manuscript.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 History
   1.2 Description of the Algorithm
   1.3 Evaluation
2. PRELIMINARY
   2.1 Code Graph
   2.2 Basic Assumptions
3. SIMPLE COMPUTATION MODEL AND ESTABLISHMENT OF CONSTRAINTS
   3.1 Basic Model
   3.2 Memory Constraint
   3.3 Disk Timing Constraints
   3.4 Additional I/O Constraints
   3.5 Sequencing Constraints
       3.5.1 Standard
       3.5.2 Special Case: Temporary Variables
   3.6 I/O Contiguity Constraints
   3.7 System Constraints
   3.8 Criterion Function
4. BOUNDARY CONDITIONS
   4.1 Constraints Which Apply to All Nodes
   4.2 New Sequencing Constraints
5. OTHER CONSIDERATIONS
   5.1 Real Code
       5.1.1 Looping
       5.1.2 Branching
   5.2 Overlapping Nodes
   5.3 Reduction of Variables
APPENDIX
   A. SUMMARY OF CONSTRAINTS
   B. EXAMPLE

1. INTRODUCTION

1.1 History

As computers become faster and larger, the programs which run on them expand also. For an increasing proportion of programs (those manipulating very large amounts of data), a large amount of the "execution time" may not be spent executing. The speed at which these programs run is a function of the time it takes to move their necessary data between primary and secondary memory. It is to this class of problems (commonly called "I/O bound") that the method described here is directed.

The objective was to develop a model of a program together with a model of both the primary and secondary memories of the processor on which it is to be executed. The model must be suitable for analysis. In particular, the goal was to formulate a linear programming specification of the model such that, for most types of source code, the solution is an altered code which completes its execution in the shortest possible time and is computationally equivalent to the original program. It is obvious, however, that there can exist no algorithm for reading programmers' minds; this immediately precludes consideration of every program which is computationally equivalent to the original. On the other hand, a very large subset of these is obtained by considering the original code and permutations of its statements. Indeed, it is the reordering of statements alone which may allow data to be transferred at optimal or near-optimal times.
1.2 Description of the Algorithm

The algorithm presented here was developed for use on a machine (or, more precisely, a system) having one or more fixed-head disks. The method is, in fact, sufficiently general to handle any system with periodically accessible secondary memory (e.g., drums, delay lines). For obvious historical reasons, the word "disk" as used in this paper is synonymous with these more general secondary memory devices.

The algorithm first models a program--in sections--with a directed graph (called a code graph). This code graph represents the structure of the replacement statements in each section of code in such a way that reorderings of these statements, when beneficial for I/O reasons, are guaranteed not to affect the final calculated results of the code.

The execution of the program being modeled is quantized in the model. That is, the time axis is considered to be a series of small increments of time in which various operations (execution, input, output, etc.) may occur. Since there is no a priori knowledge of the optimal solution, all such operations are initially allowed. This is accomplished by considering a large number of binary integer variables as "flags" or indicators. If a particular variable is assigned the value "1" in a solution, the operation with which it is associated is performed in the time interval to which it corresponds. It is upon this set of variables that the constraints are placed, guaranteeing a solution which is physically realizable but not restrictive with respect to unusual use of the I/O facilities. Such techniques as keeping multiple copies of data on the disk and relocating data on the disk are therefore allowed in the final solution.

1.3 Evaluation

While the model presented in this paper yields too many variables to be adequately tested with present linear programming methods, several observations may nonetheless be made.
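One such observation concerns sheer problem size. The model uses six binary variables per node per time interval (6mn variables in all) and, as tallied in Appendix A, roughly 10mn + 3(m+n) constraints. A small sketch of this arithmetic (the function name is ours, purely illustrative):

```python
# Rough sizing of the 0-1 LP described in this report: 6 binary variables
# (compute, input-begin, input-continue, output-begin, output-continue,
# overlay) per node per time interval, and approximately 10mn + 3(m+n)
# constraints, the count derived in Appendix A.

def problem_size(m, n):
    """Return (number of binary variables, approximate number of constraints)
    for a code graph of n nodes quantized into m time intervals."""
    variables = 6 * m * n
    constraints = 10 * m * n + 3 * (m + n)
    return variables, constraints

# The small section of code treated in Appendix B: 12 nodes, 24 intervals.
print(problem_size(24, 12))   # -> (1728, 2988)
```

Even this toy example yields well over a thousand 0-1 variables, which makes clear why the model outruns the integer-programming codes of the day.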
There will be an obvious trade-off between the time taken to optimize a program (the execution time for an LP program to find a solution for the model derived here) and the time which it actually saves. At one end of the spectrum are those programs which spend a relatively small part of their time waiting for I/O. At the other end are programs in which the majority of the execution time is really I/O "wait" time. It is clearly toward the latter end that the model presented in this paper will be useful. At just what point in the spectrum it begins to become economical, only testing with one or more LP programs can determine. [It should also be mentioned that the ability to optimize code is highly dependent upon the freedom available in reordering statements of the code. In cases where little or no reordering is possible, while the model will yield constraints forcing a rapid solution, little optimization may be accomplished.]

2. PRELIMINARY

2.1 Code Graph

A code graph corresponding to a set of instructions is defined by the following algorithm. The code must have no transfers (i.e., it is assumed to be a sequential list of operations).

1) For each variable* name which has a second occurrence on the left of a replacement statement (that is, any variable which has its value assigned more than once in the code), do the following:

   a) rename the second such occurrence such that the new name is unique with respect to the variables already appearing in the code;

   b) rename all occurrences of the same variable which occur subsequent to the one changed to the same new name (this is done on the left side of replacement statements, if necessary, yielding new "second occurrences").

2) Repeat 1 until no more such occurrences remain.

3) Obtain a graph from this modified code by placing nodes in one-to-one correspondence with the replacement statements in the code. Each node is to be connected (by directed branches) to those nodes corresponding to the computation of the variables which appear on the right side of the statement to which it corresponds. The direction assigned to each branch is from predecessor to successor (intuitively, the order in which the computation must take place). Initial (non-computed) variables have their corresponding nodes placed such that they are not successors of any nodes.

* By "variable" here is meant the standard programming concept of variable as applied to large blocks of data (e.g., arrays, matrices).

Example: Consider the following segment of code (where no assumptions are made about the meaning of the operations +, -, *):

    A  <- B*C
    D  <- B+A-C
    A  <- A-D
    A  <- A*D

Step 1, first iteration, yields:

    A  <- B*C
    D  <- B+A-C
    A' <- A-D
    A' <- A*D

Step 1, second iteration:

    A   <- B*C
    D   <- B+A-C
    A'  <- A-D
    A'' <- A'*D

[Figure: the code graph corresponding to this example.]

With each node of this code graph is associated the execution time for the operation which corresponds to that node. It is assumed that all operations (for given data sizes) have a known execution time; this is then normalized so that it is represented as an integer multiple of the time intervals (to be defined later) chosen for the particular model.* This normalized time is denoted n_{c_i} for the i-th node.

* That is, the time intervals are chosen to be at least as short as the shortest execution time, so that all execution times may be represented as multiples of time intervals.

Note that a node may be considered as both an operation (calculation) on some data (which we assume to take a fixed amount of time) as well as the generation of some data. For ease of explanation, then, the word "node" will be used interchangeably for both calculations and produced data, where such use is unambiguous.

2.2 Basic Assumptions

The following simplifying assumptions are made. All calculated data is to be saved. (That is, no provision is made for recognizing purely
temporary intermediate results, so one copy of everything will eventually be stored on the disk. It will be shown later that the model can be refined to avoid this inefficiency.)

The disk is assumed to be infinite. In other words, the amount of data to be written on the disk is significantly smaller than the disk's capacity; hence the assumption that disk space is always available when and where it is required. The actual transmission time for each block is negligible compared to the rotation (cycle) time of the disk.

Primary memory, while fixed in size, is assumed to be capable of storing (holding) data in a form suitable for direct calculation. It may, however, be the case that for many programs it is not possible to use primary memory with perfect efficiency. This is somewhat analogous to having a number of differently dimensioned rectangles and attempting to place them (non-overlapping) in a larger rectangle, the area of which is equal to the combined areas of the smaller rectangles. The situation occurs here because the model does not consider the formatting of primary memory whereas, for large problems of the type being considered, its contents (data) must be highly structured. A reasonable approximation to real life can be obtained by considering primary memory as being smaller than it actually is. That is, a certain "bias" is introduced to account for the "holes" or unused primary memory due to the rigid data structure. This bias is best considered to be particular to each program and memory size. For the remainder of this paper, Memory_max or Mem_max will be taken to be the biased maximum useable primary memory. In other words, no storage formatting problems are considered. (Thus, it is possible to effect legitimate solutions to optimizing code by making the memory "look" smaller than it really is for the machine under consideration. This has the effect of allowing enough extra room to store arrays which do not naturally "fit together" in the restricted memory.)

3. SIMPLE COMPUTATION MODEL AND ESTABLISHMENT OF CONSTRAINTS

3.1 Basic Model

The model is a two-dimensional array of vectors.

[Figure: an array whose columns are numbered 1, 2, 3, ..., m, with time as the abscissa.]

Each column refers to a unique time interval. The intervals are of equal length; their common length is chosen so that every calculation time may be expressed as an integral number of intervals. m is the total number of such intervals and is fixed by obtaining any upper bound on the computation time for the entire section of code being considered.* Clearly, an upper bound on the running time is easily obtained. The time interval and m determine the amount of time we consider--not the time the program runs. Indeed, the program should execute in somewhat less time than this upper bound, the optimization criterion being that the last block to finish computing do so at the earliest possible time (the number of time intervals used is minimum).

* A perfectly legitimate upper bound on total computation time may be obtained by just considering the original code executed in sequential order with appropriate delays for I/O.

The ordinate of the matrix is node number (the nodes having been ordered arbitrarily). Each entry in the matrix is a vector having six elements. Thus, we have the following set of variables:

    x_{ijk}

where k is the time interval, 0 <= k <= m; j is the node number, 1 <= j <= n; and 1 <= i <= 6, corresponding to the actions

    i = 1  computation
    i = 2  input (begin)
    i = 3  input (continuing)
    i = 4  output (begin)
    i = 5  output (continuing)
    i = 6  overlay

Recalling the equivalence between node, computation and data, the meaning of the variables x_{ijk} is as follows: x_{1jk}
= 1 iff the j-th node is "computing" in the k-th time interval;

    x_{2jk} = 1 iff the j-th node (data) starts being input in the k-th time interval (that is, this data was not being input during the (k-1)-th time interval);

    x_{3jk} = 1 iff the j-th node is being input in the k-th time interval;

    x_{4jk} = 1 iff the j-th node starts being output in the k-th time interval;

    x_{5jk} = 1 iff the j-th node is being output in the k-th time interval;

    x_{6jk} = 1 iff the memory space occupied by the j-th node is overlaid (released, made available) immediately at the beginning of the k-th time interval.

In what follows, one or more of the subscripts may be omitted when our concern is only with the remaining one(s) and the meaning is clear. The various constraints on the variables will now be introduced.

3.2 Memory Constraint

The amount of available primary memory is fixed, and the amount of data contained therein must not exceed its capacity. This yields m constraints:

    Σ_{r=0}^{k} Σ_{j=1}^{n} ( x_{3jr}/p_j + x_{1jr}/n_{c_j} - x_{6jr} ) * (size of the j-th block) <= Memory_max

The above inequality must hold for all values of k, 1 <= k <= m, yielding m constraints. p_j is the number of time intervals required to transmit the j-th block (of data). The summation, then, yields 0 iff the j-th block is not in primary memory at time k (more accurately: at the k-th interval); otherwise, the result is the sum of all memory taken up by blocks in memory plus the fractional portion of any blocks in the process of being input. The model thus assumes that blocks of data are input in such a fashion that the amount of memory taken up by them is simply a ramp function over the time intervals in which they are being input.

Note that it is possible to substitute the simpler constraint

    Σ_{r=0}^{k} Σ_{j=1}^{n} ( x_{2jr} + x_{1jr}/n_{c_j} - x_{6jr} ) * (size of the j-th block) <= Memory_max

for the previous one. The difference is in the resolution of occupied memory: the latter constraint requires that room be available in memory for an entire block when its transmission begins. This is analogous to a square-wave description of memory usage.

3.3 Disk Timing Constraints

Here the fact that we are considering only fixed-head disks is used. We require that the time between when a block is output and subsequently input be a multiple of the disk revolution time.* (The reason is obvious: data cannot wander around on the disk--it must remain where it is put.) Let r be the (integral) number of time intervals corresponding to a single disk revolution. Then for each node j,

    x_{2jk} - x_{4j(k-lr)} <= 0    for some integer l >= 1

(for x_{2jk} = 1, some x_{4j(k-lr)} an integral number of revolutions earlier must be 1; for x_{2jk} = 0, we are not concerned with the values of x_{4jt} for integral numbers of revolutions earlier). Written as a linear constraint:

    for all j, for all k:  x_{2jk} - Σ_{t=s, s+r, ..., f} x_{4jt} <= 0,    s = k modulo r,  f = k - r

(note that r is a parameter which is constant for any particular system, allowing the generation of the above constraints).

* Note that the actual transmission time is assumed to be negligible. In cases where this is not so, the time between output and subsequent input is an integral number of disk rotations plus the time for two transmissions of the data. The constraint is then modified accordingly.

3.4 Additional I/O Constraints

The disk timing constraints are sufficient only for continuous data transfer. That is, they guarantee only that the initiations of input and output be "modulo the disk." Here additional constraints establish continuous transmission (let p_j be as defined previously):

    p_j * x_{2jk} - Σ_{i=k}^{k+p_j-1} x_{3ji} <= 0

(for x_{2jk} = 1, we start to input in the k-th time interval. Since there are p_j terms in the summation, there must be p_j continuous intervals of data transmission commencing with the k-th. For x_{2jk} = 0, nothing is constrained.)
There is also a similar set of constraints for output:

    p_j * x_{4jk} - Σ_{i=k}^{k+p_j-1} x_{5ji} <= 0

The above constraints do not disallow initiation of input or output more than once within any p_j intervals--which could be precluded with another set of constraints of the form

    Σ_{i=k}^{k+p_j-1} x_{2ji} <= 1

    Σ_{i=k}^{k+p_j-1} x_{4ji} <= 1

but, because of the restrictions on the allowable sequences of computation, input, output and overlay, these are redundant.

3.5 Sequencing Constraints

3.5.1 Standard

Let the letters C, I, O, D represent the operations (in primary memory) compute, input, output and overlay (destroy), respectively. For any node, then, the temporal sequence of operations associated with it can be expressed

    C O O* D (I O* D)*

where * is the Kleene star. In other words, any node must first be computed, then output some number (>= 1) of times and overlaid. This may then be followed by any number (possibly zero) of the following sequences: input, output zero or more times, overlay. All non-permitted sequences are disallowed by the constraints given in this section.

    for all k:  Σ_{l=0}^{k} ( x_{6jl} - x_{2jl} ) <= 1

    for all k:  Σ_{l=0}^{k} ( x_{2jl} - x_{6jl} ) <= 0

guarantee the proper relative sequencing of I and D [D(ID)*].

    for all k:  Σ_{l=0}^{k} ( x_{6jl} - x_{2jl} ) + x_{4jk} <= 1

requires that a block be (at least started to be) input before it may start being output. (Essentially, this establishes that O's appear only in the allowed sequence.) It is still necessary to require the computation and output (in that order) before the remainder of the sequence.

Let C_j be the integer formed by considering the (binary) variables x_{1j0} x_{1j1} ... x_{1jm} as an integer in binary notation. That is,

    C_j = Σ_{l=0}^{m} x_{1jl} * 2^{m-l}

Similarly,

    O_j = Σ_{l=0}^{m} x_{4jl} * 2^{m-l}

    D_j = Σ_{l=0}^{m} x_{6jl} * 2^{m-l}

Then we require at least one overlay:

    Σ_{l=0}^{m} x_{6jl} >= 1

at least one output preceding it:

    O_j >= D_j,  or  Σ_{l=0}^{m} ( x_{4jl} - x_{6jl} ) * 2^{m-l} >= 0

and computation preceding this:

    C_j >= O_j,  or  Σ_{l=0}^{m} ( x_{1jl} - x_{4jl} ) * 2^{m-l} >= 0

The last constraint, however, is not sufficient. It is still possible for computation to be unfinished when the first output begins. This is disallowed by forming the next constraints, viz.: let n_{c_j} be the number of time intervals for which the j-th node (computation) must take place. Then

    Σ_{l=0}^{m} x_{1jl} = n_{c_j}

and it is necessary to ensure that the "right-most," or latest, computation variable with value 1 still lies "to the left of," or before, the first output. The constraints are

    for all k, for all j:  x_{1jk} + Σ_{l=0}^{k} x_{4jl} <= 1

(for any x_{1jk} = 1, there can be no x_{4jl} = 1 for l <= k). The constraints corresponding to the inequality C_j >= O_j are implied by this set and may therefore be discarded.

3.5.2 Special Case: Temporary Variables

In the case of nodes which are only temporary variables, there is no reason to require that a copy of same ultimately be placed on the disk. The allowable sequences for such nodes are therefore

    C O* D (I O* D)*

Such an alteration in sequencing (no longer requiring an output) can be accomplished by substituting the inequality C_j >= D_j for the inequality O_j >= D_j. This yields the constraint

    Σ_{l=0}^{m} ( x_{1jl}/(2 n_{c_j}) - x_{6jl} ) * 2^{m-l} >= 0

where the 2 n_{c_j} in the denominator of the compute term guarantees that a node is not overlaid before it is completely computed.

3.6 I/O Contiguity Constraints

The constraints presented in this section perform two functions: i) the structure of the code graph is represented here, and ii) data which is needed for any calculation is forced to be entirely in primary memory before the calculation begins.

Definition: The requirements of a node j (or, more simply, the requirements of j) are the predecessors of j. Thus, in the example of Section 2.1, B and C are the only requirements of A; A is a requirement of D; etc.

We require that for any node j, when x_{1jk} = 1, every one of the requirements of j be entirely in memory. [Clearly the requirement that the temporal order of execution in the solution be consistent with the code graph is met if the previous requirement is (data may not exist before it is calculated).]
    x_{1jk} - Σ_{r=0}^{k} ( x_{3ir}/p_i + x_{1ir}/n_{c_i} - x_{6ir} ) <= 0    for all i which are requirements of j*

[The summation is seen to range between 0 and 1, as the portion of i in memory. The constraint simply guarantees that when j is being calculated (x_{1jk} = 1), the summation term must also be unity.]

* Except for those nodes i which are initial nodes.

3.7 System Constraints

It is here that constraints peculiar to the system being considered appear. Such items as multiprogramming, the number and interrelation of data channels, etc., are "specified" as linear constraints. It is not the purpose of this paper to anticipate the various system configurations one might consider. It is, rather, to indicate that appropriate constraints would be generated, and to consider one specific example (albeit a simple one).

Consider a machine which can transfer data over only one channel at a time. This gives the constraints

    Σ_{j=1}^{n} ( x_{3jk} + x_{5jk} ) <= 1

This machine is furthermore not allowed to perform more than one "calculation" (as that word is used in this paper) at a time:

    Σ_{j=1}^{n} x_{1jk} <= 1

There are assumed to be no further constraints, although it is obvious that many can be considered for various systems. Indeed, many differently configured systems may be described by suitably establishing the proper constraints here.

3.8 Criterion Function

Considering the constraints which require an overlay following a required output following a required computation for each node which is ultimately saved,* it suffices to further require only that the final (timewise) overlay occur at the minimum time. That is,

    minimize  Σ_{l=0}^{m} ( Σ_{j=1}^{n} x_{6jl} ) * 2^{l}

* That is, ignoring those nodes which are "intermediate data" and never output.

4. BOUNDARY CONDITIONS

The constraints which have been given are for nodes which are computed, that is, nodes corresponding to replacement statements in the original code. It behooves us now to provide suitable alterations to these constraints as they apply to "initial nodes." (Initial nodes are those corresponding to input data, which hence never get computed.) Each initial node may be thought of as having the same set of six variables associated with it as do the other nodes. The first, x_{1jk}, is always 0 for 0 <= k <= m, 1 <= j <= n. This is done to allow as many of the previous constraints as possible to hold.

4.1 Constraints Which Apply to All Nodes

The following constraints remain valid for initial nodes.

Memory constraints:

    Σ_{r=0}^{k} Σ_{j=1}^{n} ( x_{3jr}/p_j + x_{1jr}/n_{c_j} - x_{6jr} ) * S_j <= Mem_max

(noting that the second term is always zero for initial nodes and may be omitted);

Disk timing constraints:

    x_{2jk} - Σ_{t=s, s+r, ..., f} x_{4jt} <= 0

I/O contiguity constraints:

    p_j * x_{2jk} - Σ_{l=k}^{k+p_j-1} x_{3jl} <= 0

    p_j * x_{4jk} - Σ_{l=k}^{k+p_j-1} x_{5jl} <= 0

4.2 New Sequencing Constraints

It is necessary to change the sequencing constraints, since initial nodes must have different temporal sequences of basic operations than the other nodes. This is because i) initial nodes are not computed; ii) there is no reason to require a final copy on the disk, as we assume that there is a copy there to begin with; and iii) since the initial copy is assumed to originate on the disk, it is not possible to reference its location there relative to its first (non-existent) output. The allowable sequences, therefore, are

    (I O* D)(I O* D)*

Note that it will not be necessary to establish constraints requiring the initial (I O* D) portion. Requiring (I O* D)* only is sufficient, because the in-memory constraints will force the input of the block at least once. We first guarantee the proper relative sequencing of I and D:

    for all k:  Σ_{l=0}^{k} ( x_{6jl} - x_{2jl} ) <= 0

forces (ID)*;

    for all k:  Σ_{l=0}^{k} ( x_{6jl} - x_{2jl} ) + x_{4jk} <= 0

requires output to follow input (when it occurs). The remainder of the normal sequencing constraints are not applicable.

Note that the constraints in this section allow the initial nodes to be stored initially at any location on the disk, thus allowing full generality.
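All of the constraint families above are linear in the 0-1 variables x_{ijk}, so generating them mechanically is straightforward. A minimal sketch in modern notation (all function names are ours, not the report's) of how the variables might be flattened into LP columns and how two of the families could be emitted as coefficient rows:

```python
# Sketch: lay out the 0-1 variables x_{ijk} of Section 3.1 as LP columns and
# generate two constraint families as (coefficients, sense, rhs) rows.
# Illustrative only; a real generator would emit every family of Appendix A.

def var_index(i, j, k, n, m):
    """Flatten x_{ijk} (action i in 1..6, node j in 1..n, interval k in 0..m-1)
    into a single column index."""
    return ((i - 1) * n + (j - 1)) * m + k

def channel_constraints(n, m):
    """Section 3.7: at most one block transmitting in any interval:
       sum_j (x_{3jk} + x_{5jk}) <= 1  for every k."""
    rows = []
    for k in range(m):
        row = [0] * (6 * n * m)
        for j in range(1, n + 1):
            row[var_index(3, j, k, n, m)] = 1   # input continuing
            row[var_index(5, j, k, n, m)] = 1   # output continuing
        rows.append((row, "<=", 1))
    return rows

def compute_duration_constraints(n_c, n, m):
    """Section 3.5.1: each node j computes for exactly n_c[j-1] intervals:
       sum_k x_{1jk} = n_{c_j}."""
    rows = []
    for j in range(1, n + 1):
        row = [0] * (6 * n * m)
        for k in range(m):
            row[var_index(1, j, k, n, m)] = 1   # compute flag
        rows.append((row, "=", n_c[j - 1]))
    return rows

# Tiny instance: 2 nodes, 4 intervals.
rows = channel_constraints(2, 4) + compute_duration_constraints([1, 2], 2, 4)
print(len(rows))   # 4 channel rows + 2 duration rows = 6
```

The remaining families (memory, disk timing, contiguity, sequencing) would be produced the same way, one row per (j, k) pair as counted in Appendix A.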
5. OTHER CONSIDERATIONS

5.1 Real Code

The method previously described is restricted to linear (sections of) code. In order to show some potential utility, however, it is necessary to show applications to "non-linear" code.

5.1.1 Looping

A loop may be thought of as the concatenation of a section of linear code with itself as many times as the loop is executed. It is clearly impractical, though, to generate such "expanded code" and then proceed to optimize it. Another method will be given for "optimizing" loops which is at worst only slightly inefficient.

Consider the consequences of first optimizing a single iteration of a loop and then allowing the repetition (looping) of this "optimized" piece of code. The first time through, the code would be linear and the execution time would indeed be minimal (in the sense discussed in this paper). Upon finishing execution of this first iteration, execution must recommence at the "beginning" of the linear section of code. In general, the data needed to begin this execution will not be in an optimal location on the disk, thus requiring a wait due to "disk latency" of anywhere up to one revolution time. The remainder of this section of code will find the items it needs from the disk at precisely those locations which are most advantageous. This is because, after the initial wait, the computation of the second iteration of the loop is "in phase" with that of the first iteration. If the linear section (one complete iteration) is of moderate length, it is reasonable to expect that the optimization process has reduced the running time by at least several disk revolutions (there would be little justification for it otherwise). This being the case, the loss of as much as a single revolution per iteration is not outrageous--especially when we consider a latency of only one-half revolution on the average.
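The arithmetic behind this argument can be made explicit. A back-of-the-envelope sketch (the function and all figures are ours, purely illustrative; the q parameter anticipates repeating the loop body q times before optimizing, which amortizes the re-entry latency):

```python
# If optimizing one iteration saves s_revs disk revolutions, but re-entering
# the optimized body costs an average latency of half a revolution, the net
# saving per iteration is s_revs - 0.5.  Repeating the body q times before
# optimizing spreads that latency over q iterations: s_revs - 0.5/q.

def net_saving_per_iteration(s_revs, q=1):
    """Net revolutions saved per original iteration, assuming an average
    re-entry latency of half a revolution per optimized body."""
    return s_revs - 0.5 / q

print(net_saving_per_iteration(3.0))      # body optimized singly:  2.5
print(net_saving_per_iteration(3.0, 4))   # body repeated 4 times:  2.875
```

So long as the optimization saves several revolutions per iteration, the half-revolution re-entry penalty is minor, which is the point of the paragraph above.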
The procedure of concatenating a previously optimized linear section, then, will still yield greatly reduced running times compared with the original (unaltered) code, in spite of the fact that, in general, the execution of the loop is not truly optimal.

Further refinements to the running time of a loop may be obtained by concatenating the loop with itself some number of times (usually much less than the total number of iterations at run time) before optimizing. That a saving of time is effected is clear. As was previously explained, any loss of time due to disk latency (on the average, one-half a disk revolution) occurs when the optimized code is concatenated with itself. If this time loss can be made to occur once for every q iterations (as is the case when the original code is repeated q times and then optimized) instead of once every iteration, the time lost is reduced by a factor of q in the long run. This method would be particularly valuable for loops which take very little time to execute but are repeated often. Such short loops need only be repeated several times before optimization.

5.1.2 Branching

In the absence of detailed information on the frequency and ordering of the various possibilities at each conditional transfer, it seems impossible to perform general inter-section optimization. Tentative investigation in this area has not yielded any algorithms for obtaining an optimal solution.

It is, however, possible to improve on a brute-force section-by-section optimization. For example, in the case of a two-way conditional branch preceded by a section of linear code, the following "optimization" may be performed. The initial linear code and either one of the two branches are considered to be one section of linear code. (It may be necessary to add a dummy instruction at the branch point to account for the timing of the conditional test there.) Upon this "new" code is placed one additional restriction: while statements are allowed to be reordered for the normal optimization algorithm, reorderings over the boundary established at the branch point are prohibited. This may easily be done, for example, through the introduction of constraints requiring that every node in the initial section be computed before every node in the post-branch section. The second branch is considered by itself afterwards, except that its initial conditions have already been determined by the previous optimization. In other words, the placement of data in primary and secondary memory at the branch point (resulting from the first optimization) now forms the boundary conditions for the second branch and its optimization.

Several refinements of the above algorithm have been considered. Unfortunately, however, these may best be described as heuristic approaches, since there is no real experience on which to base their value.

5.2 Overlapping Nodes

In many problems (e.g., PDEs, weather codes) there is a set of basic operations which is carried out on an extremely large amount of data arranged on a grid. These operations require the values of points in local areas of the grid for each calculation at a point. Since it is necessary to partition these large grids into smaller sections, information must somehow be provided across the boundaries so created. This can be done within the confines of the code graph model as follows. The requirements for each node include all over-boundary data (as many as 8 additional nodes for a five-point finite difference method on a rectangular grid). Each calculation produces, besides the updated information for its section of the grid, additional separate (although redundant) pieces of data--each of which is a requirement for at least one other node.
The previous constraints are perturbed such that, while only one calculation is considered to have taken place, it produces a multiplicity of new nodes, each of which is thereafter independent of its siblings.

5.3 Reduction of Variables

The procedure, as thus far described, yields 6mn binary variables.* Besides generating a large number, this "brute force" method ignores information useful for reducing the number of variables. It is possible, though, to eliminate a large number of the variables easily. Consider, for example, the node marked A' in the earlier example. Since the nodes A, B, C and D must all have been calculated before A' begins, it is pointless to consider the computation of A' "early" in the solution. Indeed, it is unnecessary to consider the compute variables (x_{1jk}) for all time intervals before the time it takes to compute all the predecessors of a given node. The required nodes are easily obtained by merely tracing up the graph (against the arrows) from the node under consideration; that is, all predecessors and their predecessors, recursively. It should also be clear that all the other variables (input, output, overlay) are forced to be 0 for these "early" time intervals in which the compute variables are also 0.

* Recalling, m = number of time intervals, n = number of nodes.

In a similar fashion, variables may be removed from "late" time intervals, with two exceptions: i) since no a priori knowledge of the optimal running time exists, the time intervals in question are those counted backwards from the maximum time; and ii) input, output, and overlay variables are not set to 0, as the compute variables are, in these regions.

There appears to be no analytic way of relating the saving (removal) of variables to the graph structure (upon which, indeed, it is highly dependent). Preliminary results indicate, however, savings of the order of magnitude of one-half the total number of variables (6mn).
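The tracing step described above can be sketched directly. The code below (names ours; it uses the four-statement example of Section 2.1, with each operation taking one interval and B, C initial) computes, for a node, the sum of the compute times of all its transitive predecessors, which is a lower bound on its earliest compute interval when only one calculation may proceed at a time:

```python
# Sketch of the variable-elimination rule of Section 5.3: the compute
# variables x_{1jk} for node j may be fixed to zero for every interval k
# earlier than the total compute time of all of j's predecessors, found by
# tracing the code graph upward (all predecessors, recursively).

def earliest_start(j, preds, n_c):
    """Minimum interval at which node j can begin computing.
    preds: dict node -> list of immediate predecessors
    n_c:   dict node -> compute time in intervals (initial nodes: 0)."""
    seen = set()
    def ancestors(v):
        for p in preds.get(v, []):
            if p not in seen:
                seen.add(p)
                ancestors(p)     # recurse: predecessors of predecessors
    ancestors(j)
    return sum(n_c[p] for p in seen)

# The example of Section 2.1: A <- B*C, D <- B+A-C, A' <- A-D, A'' <- A'*D.
preds = {"A": ["B", "C"], "D": ["B", "A", "C"],
         "A'": ["A", "D"], "A''": ["A'", "D"]}
n_c = {"B": 0, "C": 0, "A": 1, "D": 1, "A'": 1, "A''": 1}
print(earliest_start("A''", preds, n_c))   # A', D, A (B, C free) -> 3
```

Every variable x_{ijk} for node A'' with k below this bound can then be dropped from the LP before it is ever generated.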
It should be noted that the programs for which the preceding process yields great reductions of variables are those whose code graphs have many levels. Similarly, programs yielding code graphs with few levels do not result in large reductions of variables. (Intuitively, long code graphs are those in which there are long chains of nodes dependent on their predecessors, and hence many nodes which are constrained to start computing late, or to finish computing early.)

APPENDIX A

SUMMARY OF CONSTRAINTS

Memory:

    SUM[i=0..k] SUM[j=1..n] ( x_{1ji} + x_{4ji}/c_j - x_{3ji} ) (size of jth block)  <=  Mem_max ,    0 <= k <= m

    Number of Constraints: m

Disk Timing:

    x_{2jk} - SUM[t = s, s+r, ..., f] x_{4jt}  <=  0 ,    s = k modulo r ,  f = k - r

    Number of Constraints: mn

Continuous Transmission:

    [two constraint families, each summing over the p_j consecutive intervals i = k, ..., k + p_j - 1, requiring that the transmission of block j occupy consecutive intervals]

    Number of Constraints: mn each

Sequencing:

    All Nodes:

        SUM[i=1..k] ( x_{6ji} - x_{2ji} )  <=  1                          mn constraints
        SUM[i=1..k] ( x_{2ji} - x_{6ji} )  <=  0                          mn constraints
        SUM[i=1..k] ( x_{6ji} - x_{2ji} ) + x_{4j,k+1}  <=  1             mn constraints

    Non-Initial Nodes:

        SUM[i=1..m] x_{6ji}  <=  1                                        (standard)

        [temporary-variable form of the above; see Section 3.5.2]         (temporary)

        SUM[i=0..m] ( x_{6li}/c_l ) (m - i)  <=  0      for every node l which is a requirement of j

        SUM[i=0..m] x_{1ji} - n_{c_j}  =  0

        x_{1jk} + SUM[i=0..k] x_{4ji}  <=  1                              mn constraints

Contiguity:

    x_{1jk} - SUM[i=0..k] ( x_{1ji}/c_j + x_{3li} - x_{6li} )  <=  0      3mn constraints
    (one such constraint for each block l which is a requirement of node j)

System:

    SUM[j=1..n] x_{1jk}  <=  1                                            m constraints

    SUM[j=1..n] ( x_{3jk} + x_{5jk} )  <=  1                              m constraints

Total Number of Constraints: approximately 10mn + 3(m + n)

APPENDIX B

EXAMPLE

As an example of this method, let us consider the following sequence of replacement statements as a section of code which is to be optimized:
    F = f1(B,D)
    J = f2(A,F)
    C = f3(A,B)
    L = f4(C,J)
    E = f5(C,D)
    G = f6(B,E)
    H = f7(F,G)
    K = f8(F,H)
    I = f9(C,H)

where each of the "variables" A-L represents a block of data, their respective sizes being as follows (we will also associate block numbers 1-12 with variables A-L, respectively, as well as normalizing the available primary memory to unity):

    Size of block F = 1/8  of memory;  therefore, size of the 6th block  = 1/8
    Size of block J = 1/16 of memory;  therefore, size of the 10th block = 1/16
    Size of block C = 1/4  of memory;  therefore, size of the 3rd block  = 1/4
    Size of block L = 1/4  of memory;  therefore, size of the 12th block = 1/4
    Size of block E = 1/4  of memory;  therefore, size of the 5th block  = 1/4
    Size of block G = 1/8  of memory;  therefore, size of the 7th block  = 1/8
    Size of block H = 1/8  of memory;  therefore, size of the 8th block  = 1/8
    Size of block K = 1/16 of memory;  therefore, size of the 11th block = 1/16
    Size of block I = 1/16 of memory;  therefore, size of the 9th block  = 1/16

The code graph for this code is:

    [code graph figure]

and we will assign the following time intervals necessary to evaluate each of the (arbitrary) functions f1 - f9:

    Time for f1, f3, f7, f8, f9 = 1 interval;   therefore, n_{c_j} = 1 for j = 1,3,7,8,9
    Time for f2, f5, f6         = 2 intervals;  therefore, n_{c_j} = 2 for j = 2,5,6
    Time for f4                 = 3 intervals;  therefore, n_{c_4} = 3

where one revolution will be 4 intervals. In an effort to keep this example simple, all blocks are assumed to be transmitted between primary memory and secondary memory in a single time interval (p_j = 1, 1 <= j <= 12). We will further assume a maximum running time of 24 intervals (m = 24).

As an example of the flexibility available in specifying constraints for particular problems, we will make two alterations to the constraints as previously described. The first change deals with the constraints for initial conditions: it seems reasonable to allow the initial nodes to be in memory at the start (although it would not be reasonable to require that they be). This can be reflected by removing the constraint on the number of blocks which may be input during the 0th time interval.
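The dependencies in the statement list above fix the code graph. As an illustrative check (Python is used here purely for exposition; it is not part of the report's formulation), the depth of each computed block in the graph can be derived directly from the statements:

```python
# The nine replacement statements, as "result block -> required blocks".
deps = {
    'F': ['B', 'D'], 'J': ['A', 'F'], 'C': ['A', 'B'],
    'L': ['C', 'J'], 'E': ['C', 'D'], 'G': ['B', 'E'],
    'H': ['F', 'G'], 'K': ['F', 'H'], 'I': ['C', 'H'],
}

_levels = {}

def level(block):
    """Depth of a block in the code graph; the pure inputs A, B, D
    sit at level 0."""
    if block not in deps:
        return 0
    if block not in _levels:
        _levels[block] = 1 + max(level(r) for r in deps[block])
    return _levels[block]

# F and C depend only on inputs; K and I sit deepest in the graph.
print({b: level(b) for b in sorted(deps)})
```

These depths are exactly the chain lengths that the variable-reduction argument of Section 5.3 exploits.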
Thus, the solution may indicate that more than one block has been transmitted during the 0th time interval. This may in fact indicate that these blocks were loaded at the same time (just prior to run time), or it may reflect the situation in which these blocks were input (one at a time) earlier (intuitively, during intervals -1, -2, ...).

The second alteration is a minor one dealing with the use of memory. We will arbitrarily declare that memory can never be completely used. That is, the inequality "<=" in the memory constraint is replaced by the strict inequality "<", yielding:

    SUM[i=0..k] SUM[j=1..12] ( x_{1ji} + x_{4ji}/c_j - x_{3ji} ) S_j  <  1.0 ,    0 <= k <= 24

Sequencing (Non-Initial Nodes, j = 3 and 5 <= j <= 12):

    SUM[i=1..k] ( x_{6ji} - x_{2ji} )  <=  1                      1 <= k <= 24
    SUM[i=1..k] ( x_{2ji} - x_{6ji} )  <=  0                      1 <= k <= 24
    SUM[i=1..k] ( x_{6ji} - x_{2ji} ) + x_{4j,k+1}  <=  1         1 <= k <= 24
    SUM[i=1..24] x_{6ji}  <=  1
    SUM[i=0..24] x_{1ji} - n_{c_j}  =  0
    x_{1jk} + SUM[i=0..k] x_{4ji}  <=  1                          1 <= k <= 24

Contiguity (one constraint for each block l which is a requirement of block j):

    x_{1jk} - SUM[i=0..k] ( x_{1ji}/c_j + x_{3li} - x_{6li} )  <=  0      1 <= k <= 24

    j = 3:  l = 1,2        j = 5:  l = 3,4        j = 6:  l = 2,4
    j = 7:  l = 2,5        j = 8:  l = 6,7        j = 9:  l = 3,8
    j = 10: l = 1,6        j = 11: l = 6,8        j = 12: l = 3,10

System:

    SUM[j=1..12] x_{1jk}  <=  1                                   1 <= k <= 24
    SUM[j=1..12] ( x_{3jk} + x_{5jk} )  <=  1                     1 <= k <= 24
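The first alteration can be viewed as generating the per-interval transmission limit for every interval except the 0th. A sketch follows; the constraint representation and variable labels are illustrative assumptions, not the report's notation.

```python
# Sketch: emit the "at most one block input per interval" constraint
# rows, optionally skipping interval 0 so that any number of initial
# blocks may already be resident when the run begins.

def input_limit_constraints(m, n, relax_interval_zero=True):
    rows = []
    start = 1 if relax_interval_zero else 0
    for k in range(start, m + 1):
        # sum over blocks j of the interval-k input variable  <=  1
        rows.append({'vars': [('input', j, k) for j in range(1, n + 1)],
                     'rhs': 1})
    return rows

print(len(input_limit_constraints(24, 12)))          # 24 rows, intervals 1..24
print(len(input_limit_constraints(24, 12, False)))   # 25 rows, interval 0 included
```

Dropping the single interval-0 row is all that distinguishes the altered formulation from the standard one.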