Report No. UIUCDCS-R-78-945                                   UILU-ENG 78 1741
NSF-OCA-MCS77-27910-000036

IMPROVING THE PERFORMANCE OF VIRTUAL MEMORY COMPUTERS

by

Walid Abdul-Karim Abu-Sufah

November 1978

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF-MCS77-27910 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, November 1978.

ACKNOWLEDGMENT

Being the student of Professor David Kuck was for me a pleasant encounter of a special kind. I learned a lot from his approach to the solution of problems, from his deep insight into the relevant points of issues, and from his philosophy about things. His advice and guidance were of great value to me while conducting this research. I am grateful to him.

I thank Professor Duncan Lawrie, who was actively involved in different phases of this research. I also thank my friends and colleagues, Utpal Banerjee, Robert Kuhn, Bruce Leasure, David Padua, and Michael Wolfe for the interesting discussions which I had with them. Special thanks go to David Padua; he wrote the trace generator a couple of years ago and helped in getting it running to generate the traces for our experiments. Thanks are also due to Yonsook Kang; she implemented the clustering algorithm.

I would also like to thank Mrs. Vivian Alsip for her assistance throughout my stay at the Digital Computer Laboratory and for her help with the typing.

I owe special gratitude to my wife. Going with me through my graduate study was quite an experience for her. I thank her for her patience and love.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 Improving the Locality of Programs - Previous Work
       1.1.1 Programmer Implemented Locality Improvement Techniques
       1.1.2 Automatic or Semi-Automatic Locality Improvement Techniques
   1.2 Problems with Previous Work and Our Approach

2. FUNDAMENTAL CONCEPTS
   2.1 Criteria for Performance Evaluation
   2.2 Modeling Program Behavior
       2.2.1 Previous Work - Stochastic Models
       2.2.2 Previous Work - Deterministic Models
       2.2.3 Our Approach - Analysis of Program Behavior at the Symbolic Level
             2.2.3.1 The Elementary Loop Model
             2.2.3.2 Other Loops
   2.3 Summary

3. PROGRAM TRANSFORMATIONS
   3.1 Data Dependence Analysis
   3.2 Clustering of Assignment Statements Algorithm
       3.2.1 Definitions and Notations
       3.2.2 The Clustering Algorithm
   3.3 Fusion of Name Partitions
       3.3.1 The Usefulness of the Fusion Transformation
       3.3.2 Notation and Definitions
       3.3.3 Correctness of Fusing Two Name Partitions
       3.3.4 The Fusion Algorithm
   3.4 Scalar Transformations
       3.4.1 The PARAFRASE Compiler Scalar Transformations
       3.4.2 The Scalar Forward Substitution Transformation
             3.4.2.1 Correctness of the Forward Substitution Transformation
       3.4.3 Modifying the Scalar Expansion Transformation
       3.4.4 Choosing Between Scalar Expansion and Forward Substitution
   3.5 Distribution of Name Partitions
       3.5.1 Horizontal Distribution of Name Partitions
             3.5.1.1 The Horizontal Distribution Algorithm
             3.5.1.2 The Problem with Horizontally Distributing an NP with Multi-page Arrays
       3.5.2 Page Indexing and Vertical Distribution of Basic Name Partitions
             3.5.2.1 Vertical Distribution of Elementary NP's
             3.5.2.2 Vertical Distribution of Multi-nested Basic Name Partitions with Multi-dimensional Arrays
             3.5.2.3 Vertical Distribution of Basic NP's - the General Algorithm and Some Implementation Considerations
             3.5.2.4 The Correctness of the Page Indexing Transformation
       3.5.3 Transforming Nonbasic π-Blocks into Basic π-Blocks

4. EXPERIMENTAL RESULTS
   4.1 Measuring the Characteristics of Program Localities
       4.1.1 Localities in Segmented Systems
       4.1.2 Localities in Paged Systems
   4.2 Measuring the Performance Improvement of Paged Virtual Memory Systems - the Fixed Memory Allotment Case
       4.2.1 The Page Faults vs. Memory Allotment Results
       4.2.2 The Space-Time Cost vs. Memory Allotment Results
   4.3 Measuring the Performance Improvement of Paged Virtual Memory Systems - the Variable Memory Allotment Case
   4.4 Summary

5. CONCLUSIONS AND EXTENSIONS

REFERENCES

APPENDIX

VITA

1. INTRODUCTION

1.1 Improving the Locality of Programs - Previous Work

Since the early years of modern computing, people have realized that due to cost-speed tradeoffs, computer memories of very large overall capacity must be organized hierarchically. The introduction of memory hierarchies in computer systems created the problem of storage allocation of programs. At each moment during the execution of a program, the distribution of its information (code and data) among the levels of the memory hierarchy must be determined. The programmer was faced with the additional responsibility of manually solving this memory allocation problem. This was not an easy thing to do, especially with the introduction of high level languages which shielded programmers from the details of machines.

The idea of virtual memory systems was the solution to this problem. It provided an elegant way of achieving automatic storage allocation [KILB62], [SAYR69]. Since the evolution of the virtual memory concept in the early 1960s, a tremendous amount of research effort has gone into investigating the various aspects of virtual memory systems. Different methods of implementation were considered and contrasted: segmentation, paging, or paged segmentation. Moreover, different memory management algorithms were investigated. These are concerned with the fetch policy, which decides when an item of virtual memory (a page or a segment) is to be fetched to main memory; the placement policy, which decides where to place an item in main memory; and the replacement rule, which decides which item to replace if there is no space for the new item.
Both fixed and variable memory allotment policies were considered [BELA66], [DENN68], [CHU72]. People have used the number of item faults, the efficiency of main memory utilization, and the space-time product cost of a program to measure the performance of different memory management schemes. Principles of optimality have been defined in [BELA66], [PRIE76], and [BUDZ77]. The performances of different policies were measured by comparison to the performance of optimal policies. People often use reference string driven simulation techniques for their statistical measurements of the effects of varying memory allotment and page size on the performance of different policies. A survey of the work done in this area and some results can be found in [DENN70] and [KUCK70].

The central reason behind any success which a virtual memory system might achieve is the property of locality of reference which programs exhibit. Denning in [DENN72a] makes the following three statements to describe the locality of reference property of programs:

* During any time interval, a program distributes its references nonuniformly over its address space, some pages being favored over the others.

* The density of reference to a given page changes slowly in time, or the set of favored pages changes membership slowly.

* Two disjoint segments of the page reference string tend to be highly correlated when the interval between them is short, and tend to become uncorrelated as the interval between them increases.

It has been confirmed by early studies that the degree of locality of a program is the most important factor in its cost of execution in a virtual memory computer. Although one may not reduce the number of page faults generated by a program by more than 30 or 40 percent by changing the page replacement algorithm [BELA66], an improvement of a factor of 5 was achieved by improving the locality of programs [COME67]. Thus it was recognized that efforts should be directed to developing techniques to improve the locality of programs before executing them in virtual memory systems. This was an absolute necessity for certain kinds of programs, namely those processing large multi-page arrays.

There are two approaches to the problem of improving the locality of reference strings generated by programs. In the first approach the programmer is expected to follow certain rules and guidelines when coding the solution to different problems. In the second approach people tried to devise automatic or semi-automatic locality improvement techniques. In the following two sections we will discuss briefly the previous work done in these two areas. We will give illustrative examples and sample results. In Section 1.2 we will point out the deficiencies and problems in the previous work, present our philosophy and approach to the problem, and finally sketch the outline of this thesis.

1.1.1 Programmer Implemented Locality Improvement Techniques

It did not take long for people to realize that virtual memory computers did not relieve the programmer completely from worrying about the memory needs of a program. When programmers worked under the assumption that in a virtual memory computer they could get all the memory space they needed, the costs of running some programs were high [FINE66], [BRAW68], [GLAS65]. Several papers have been published to give programmers rules and guidelines when writing code to solve large problems in a virtual memory computer.
Some of these papers were oriented towards specific applications and problems; others were of a more general nature. Examples of the problem oriented work can be found in [BRAW70], [BOBR67], [DUBR72], and [ROGE73], which treat sorting, list processing, solution of eigenvalue problems, and the solution of linear equations, respectively. [McKE69], [MOLE72], and [ELSH74] are examples of papers which address the general problem of algorithms for large matrix programs in a paging environment. Moreover, manufacturers of virtual memory computer systems started to devote sections of manuals to helping programmers develop a programming style for virtual storage systems [IBM73].

A good representative of this approach to improving program locality is the work of Elshoff in [ELSH74]. He was concerned with the processing of multi-dimensional arrays in a paging environment. In particular he considered two-dimensional arrays which were assumed to be stored row-wise. An NxN matrix satisfied the relation N ≤ Z < N², where Z is the page size. Elshoff presented some rules to be used by programmers when writing code to solve matrix problems. He applied his individual rules and their combinations to two example programs, namely matrix transpose and matrix multiplication. He also derived analytical expressions for the number of generated page faults when executing under an LRU page replacement algorithm. Moreover, he executed the original programs and the improved programs on a dedicated machine. The matrices were square matrices of size 101x101, each spanning 20 pages of virtual space with a page size of 512 words. Figure 1-a and Table 1 show the results for the matrix transpose program. Figure 1-b and Table 2 show the results for the multiplication program.

Table 1. Results for the Matrix Transpose Program [ELSH74] (Memory Allotment 15K)

    Algorithm                Problem    System    Total     Elapsed   I/O Used
                             CPU        CPU       CPU       Time      Time
    Standard                   .819      9.900    10.719      77.5      66.8
    Combination of All
    Improvement Rules         1.110      1.408     2.518      11.0       8.5

Table 2. Results for the Matrix Multiply Program [ELSH74] (Memory Allotment 15K)

    Algorithm                Problem    System    Total     Elapsed   I/O Used
                             CPU        CPU       CPU       Time      Time
    Standard                 197.3     4493.9    4691.2    19460.    14768.4
    Combination of All
    Improvement Rules        222.7        6.9     229.6      252.       22.7

Units are seconds.

Figure 1-a. Comparison of Matrix Transpose Algorithms [ELSH74]. (Page faults vs. pages of real memory, 2 to 20 pages; the standard algorithm thrashes over this range while the improved algorithm does not.)

Figure 1-b. Comparison of Matrix Multiplication Algorithms [ELSH74]. (Page faults vs. pages of real memory for the standard algorithm, alternating direction, reordering loops, and alternating direction combined with reordering loops.)

There are two very important conclusions which one can draw by examining these figures and tables. The first is that programs which process large arrays of data can have very serious problems if executed in virtual memory computers. The second is that the amount of improvement attained by the suggested techniques is very significant.
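The flavor of Elshoff's results can be reproduced with a small trace-driven simulation. The following sketch is an illustration added here in modern notation, not part of the original 1978 experiments; the 101x101 size and 512-word pages are taken from the text above, while the LRU simulator and the traversal orders are assumptions. It counts LRU page faults for row-order versus column-order traversal of a row-wise-stored matrix:

    from collections import OrderedDict

    def lru_faults(trace, frames):
        """Count page faults for a page-reference trace under LRU."""
        resident = OrderedDict()               # insertion order = recency order
        faults = 0
        for page in trace:
            if page in resident:
                resident.move_to_end(page)     # re-reference: most recently used
            else:
                faults += 1
                if len(resident) == frames:
                    resident.popitem(last=False)   # evict least recently used
                resident[page] = None
        return faults

    N, Z = 101, 512                            # matrix order, page size (words)
    page = lambda i, j: (i * N + j) // Z       # row-wise storage: (i,j) -> page

    row_order = [page(i, j) for i in range(N) for j in range(N)]
    col_order = [page(i, j) for j in range(N) for i in range(N)]

    for frames in (4, 10, 20):
        print(frames, lru_faults(row_order, frames), lru_faults(col_order, frames))

Traversing in storage order faults exactly once per page (20 faults) regardless of allotment, while traversing against the storage order cycles through all 20 pages per column and faults heavily until the whole array fits in memory.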
1.1.2 Automatic or Semi-Automatic Locality Improvement Techniques

The main attractive feature of virtual memory systems is the automatic management of memory allocation. Hence the approach presented in the previous section seems to be a step backward, since the programmer is required to follow certain rules while programming for a virtual memory computer. Many of the programming guidelines are either problem oriented or cannot be applied in simple, direct ways to complex and large programs. Hence it seems that if anything is to be done to programs to improve their locality properties, it should be taken care of by the computer software system and not by the programmer.

Several people took this approach [COME67], [HATF71], [MASU74], and [FERR74]. All these researchers worked on what is called the 'pagination problem'. A program has a number of modules: main procedure, subroutines, and data blocks. Assuming that a page can hold more than one module, the pagination problem can be simply stated as trying to group these modules or blocks in pages such that the program generates a more local reference string when executed in a virtual memory computer. Thus the aim is to modify a program's layout in virtual space. This is called "program restructuring." If the program's modules are relocatable with respect to each other, this can be done by relinking the modules after changing the order in which they are presented to the linker; otherwise, changes in the source code and recompilation of some modules might be needed. Information about the dynamic behavior of the program is gathered during an information gathering run. This information is used to construct a non-directed restructuring graph for the program according to a particular restructuring algorithm. The nodes of the graph represent the modules of the program. The numerical labels of the edges represent the desirability that the nodes they connect be laid out together within the same page. After the restructuring graph is constructed, a clustering algorithm is used to obtain the new layout for the program from the graph. The clustering algorithm aims at "determining a linear arrangement of nodes (of the restructuring graph) in pages which maximize the vicinity of those pairs having the highest labels" [FERR76b].

The main difference between researchers in this area is the restructuring algorithm they used. Hatfield and Gerald introduced the nearness method for a restructuring algorithm [HATF71]. They argued that performance can be improved if consecutive blocks or modules in the block reference string generated by a program were grouped in the same page. Hence the label E_ij of the edge connecting nodes i and j in the restructuring graph is incremented by one every time block i is referenced directly after j, or block j is referenced directly after i. In their extension to the nearness method, Masuda, Shiota, Noguchi, and Ohki [MASU74] incremented E_ij if references to i, j are separated by some small distance in time.
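The nearness method is mechanical enough to state in a few lines. The sketch below is an added illustration (the block names are hypothetical; only the edge-labeling rule comes from [HATF71] as described above); it builds the labeled restructuring graph from a block reference string:

    from collections import defaultdict

    def nearness_graph(block_refs):
        """Edge labels E[{i,j}]: how often blocks i and j appear
        consecutively in the block reference string (nearness method)."""
        E = defaultdict(int)
        for prev, cur in zip(block_refs, block_refs[1:]):
            if prev != cur:                     # only transitions between blocks
                E[frozenset((prev, cur))] += 1
        return E

    refs = ['main', 'sub1', 'data1', 'sub1', 'main', 'sub2', 'data1', 'sub2']
    for edge, label in nearness_graph(refs).items():
        print(sorted(edge), label)

A clustering pass would then pack the highest-labeled pairs into common pages, which is exactly the layout problem described above.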
Thus he introduced different program tailoring algorithms for different memory management policies: the critical LRU restructuring algorithm for the LRU replacement policy, the critical work- ing set restructuring algorithm for the working set policy and so on. In the working set policy, for example, the block reference string and the knowledge of the window size T of the working set, W, (t,T), allow us to b identify the blocks which will be in memory at each reference of the string. The critical working set tailoring (restructuring) algorithm increments by 1 all the labels of the edges (in the restructuring graph) which connect a critically referenced block (a block which is not in W, (t,T)) to all the b nodes of the members of W, at the time the critical reference is issued. b Ferrari experimented by applying his algorithms to a collection of programs. Some of his experimental results are shown in Table 3. The cost of the restructuring algorithms in terms of computer time varies roughly linearly with the number of references in the string to be examined. The cost of the clustering algorithm (i.e., determining which nodes of the restructuring graph should be grouped in one page) increases less than quadratically with the number of nodes in the restructuring graph. One notices that the data collection which is needed for restructuring is expensive and difficult in today's systems. Restructuring was recently implemented on the SIRIS 8 10 co H 03 4J 3 OJ 6 •H M OJ £ 60 3 •H M 3 4-1 CJ 3 >-i 4J en a) PES e CO U 60 O l-i P-, 03 4-1 iH 3 CO ai PS AS -H U CO o M c (J o OJ CO N pH J-l -H 0) 4-1 3 CO CJ CO 3 a; -o SB « 05 U CO 4-1 pH 3 CO Ph (U cj 60 3 CO T3 PH OJ PS M O O -H a) o 60 C •H CM OO m E o s cO PS O Ph 3> PS I I I CN ^ (J\ H cm CM CN CM -^• m o vO 00 HHOH .-H iH iH rH iH 1 1 1 1 m oo on m i— I o r-~ oo 1 1 OO H 1 1 1 vjO CM O CM 1 1 cn CO iH rH rH O O CM 00 in cm LO vO CNI 00 CO o m m CM ^O rH rH O CM CM ON O iH CM CM CO rH CO rH H rH cm co rH CM v3- 00 1 CM 1 1 rH >H r-^ 1 CM CO rH CO 4J 4J 4-1 4-J 4J CU 01 0) 0J OJ 03 03 co 03 03 4J 4-> 4-1 4-1 01 OJ 0J OJ 60 60 60 60 60 on CO cn CO c c a 3 3 •H •H •H •rH •H 50 60 00 60 ^ Ai as AS AS 3 c 3 C U iH n u r4 •H •H •H •H O o o o O Ai Ai A M 3 3 3 3 3 u M r4 U o o o o TJ -3 T3 13 •3 3 3 3 3 OJ 0) OJ ai OJ rH rH rH r-\ rH 01 0) 0J 0) a §• a. a a >H M t-l »-i E S S 3 3 3 3 CO CO CO cO cO Ph Ph P-. Pn CO CO CO CO CO o 3j Ph PS HH rJ pH CO 4-J 4-1 4-1 4J 0J OJ 0) 0) CO 03 03 CO 60 60 60 60 c 3 3 3 •H •H •H •H a; AS AS A M U U M O o o o 3 3 3 3 0J CJJ OJ 0J U u s-i U 3 3 3 3 P-, PH Ph Ph U ,5 03 03 03 3 4-J 03 ••— \ 03 03 4-1 •H 0J CO OJ CO OJ CO CO CO CO CO CO 3j O CO CO CO CO (J M 3 Ds 3 5 3 s s 2 & ^ s PS Fn 5 g § § 3 O r< S r4 u *H cj u CO u CO CO hJ M CJ § ^1 s M 60 CO CO cO U u u CJ Ph 4-J 3 OJ OJ OJ O 0) 53 S5 2 OJ OS |H OJ •H )H o 4-1 •H T) OJ r4 OJ rH •H r4 60 o M a M OJ rH OJ rH •H - Her program T3 U a &• •H !- & 0J 0J E 0) E E 3 s- E 3 Vh rH o > 0) o O E o o o 3 E •H CJ •H 4J CJ •H o CJ M CJ -H 4-i cO a 4-1 03 4-J a o 4-J CJ >-i E 3 O >> 3 CO 3 4-1 3 cO 3 60 o CO CO CO CO o rH cO CO cO cj V. o CJ M H U •H CO U r-i M >H 4-1 u 4-1 OJ OJ 4-1 rH CJ U 3 4J rH CO Ph a >-l 4-1 rH M a 03 u E U Cu OJ o 3 •H O a tO o •H O (X ptj ..., d^ If P(t-l) = (P^t-1), P 2 (t-1), ..., P.(t-l), .... P (t-1)) and r = P.(t-l) then d = i. In other words r is at distance d in P(t-l). In the simple LRU model each distance is assigned a probability: P [d = i] = a. , 1 < i < n. 
The problem with this method is its expense and the fact that there is no obvious way of "perturbing" these measurements to model other strings. Empirically it was found that approximations to the a_i's can be derived from Belady's lifetime function [BELA69]:

    A_i = a_1 + a_2 + ... + a_i ≈ 1 - c*i^(-k).

In a time-varying version of the model the distance probabilities are allowed to change with time; the locality condition then becomes

    a_{1,t} ≥ a_{2,t} ≥ ... ≥ a_{n,t}   for all t,

but in general a_{i,t} ≠ a_{i,t'}. In a simplified analysis one would assume that there are two distributions of the distance probabilities. One represents the intraphase behavior and is biased toward the top of the stack. The second corresponds to phase transition behavior and is biased toward larger stack distances. A two-state Markov chain can be used to choose between the distance distributions. This is shown in Figure 2. In state 1 the intraphase distribution is used. In state 2 the phase transition distribution is used. 1-p is the probability of making a phase change and p is the probability of staying in the same phase. p >> q because programs do not spend much virtual time in phase transitions.

Figure 2. Two-State Markov Chain. (States 1 and 2 with cross-transition probabilities 1-p and 1-q.)

Although this two-distribution model exhibits the clustering of page faults and the phase transition phenomenon, it does not allow for changes in a program's locality set size. This requires a distribution for each locality set size and possibly more than one distribution to model phase transitions. The multiple-distribution model is complicated and impractical, and attempts at validating it have been unsuccessful. Other Markovian models have been discussed in the literature [SPIR77] and [SHED72]. There are several problems with many of them; mainly these are validation, complexity, and practicality problems. The stochastic approach to modeling program behavior seems to go in a vicious circle. If the proposed model is simple and practical, it is not accurate. On the other hand, if more accuracy is incorporated in a model, it becomes complex, impractical, and difficult to validate. We choose to end the discussion of stochastic models at this point and refer the reader interested in more details to [SPIR77].
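Operationally, the two-distribution model is a short generator. The following sketch is an added illustration of the model just described; the particular distributions and the values of p and q are invented for the example:

    import random

    # Distance distributions over stack positions 1..5 (illustrative values):
    intraphase = [0.6, 0.2, 0.1, 0.07, 0.03]   # biased toward the stack top
    transition = [0.05, 0.05, 0.1, 0.3, 0.5]   # biased toward large distances

    p, q = 0.99, 0.6    # probability of remaining in state 1 (resp. state 2)

    def synthetic_distances(length, seed=0):
        rng = random.Random(seed)
        state, out = 1, []
        for _ in range(length):
            dist = intraphase if state == 1 else transition
            out.append(rng.choices(range(1, 6), weights=dist)[0])
            stay = p if state == 1 else q
            if rng.random() > stay:            # leave the current state
                state = 2 if state == 1 else 1
        return out

    print(synthetic_distances(20))

Because p >> 1-p, the generator emits long runs of small distances (an intraphase period) punctuated by short bursts of large distances (a phase transition), which is exactly the clustered fault behavior the model is meant to capture.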
2.2.2 Previous Work - Deterministic Models

As was mentioned previously, the locality property is the central property of reference strings which everybody is trying to model. In all the literature dealing with stochastic models of program behavior, people talk about the locality property in a vague manner. People argue that at any moment of time t, there exists a set of favored pages which the program tends to reference for a long period of time. This set is called the locality set, and the time which the program spends referencing its member pages is called the residence time in the particular locality set [DENN72a]. Thus the program will go through a sequence of states S_1, S_2, ..., S_i, ..., S_k during its execution. A sequence of 2-tuples (L_1, T_1), (L_2, T_2), ..., (L_i, T_i), ..., (L_k, T_k) is associated with the sequence of states. In state S_i the program references the L_i locality set of pages for a duration of T_i.

People who worked on the development of the LRU stack model assumed that a program has n locality sets at any time, n being the depth of the stack. The I-th locality set consists of the I most recently used pages, 1 ≤ I ≤ n. "The true, or favored, locality set will then be the smallest set whose retention in memory leads to an acceptably low page fault rate" [SPIR76]. However, no method is provided to isolate one of the n localities as being the true locality set.

The work of Batson and Madison [BATS76a], [BATS76b], [BATS76c] is the only attempt found in the literature to date to provide a formal definition of a locality set and a method to isolate locality sets in a reference string. To cure the deficiencies of the simple LRU model, Batson and Madison extended the LRU stack to include two new ordered vectors. Thus, at each moment of time t, three ordered vectors are kept to describe the state of the reference string:

    P(t) = (P_1(t), P_2(t), ..., P_i(t), ..., P_n(t));
    a(t) = (a_1(t), a_2(t), ..., a_i(t), ..., a_n(t));
    T(t) = (T_1(t), T_2(t), ..., T_i(t), ..., T_n(t)).

P(t) is the LRU stack of segment identifiers as defined earlier.* a_i(t) is the time at which the segment in the i-th stack position was last referenced. T_i(t) is the time at which a reference was last made to a stack position greater than i. In other words, T_i(t) is the time after which the i top positions of the stack were occupied by members of S_i(t), where S_i(t) is the set of the i most recently referenced segments. At each time t, there is a hierarchy of sets S(t) = (S_1(t), S_2(t), ..., S_i(t), ..., S_n(t)); in this hierarchy S_i(t) ⊆ S_{i+1}(t). T_i(t) can be described as the formation time of S_i(t). Figure 3 shows a reference string and its P(30), a(30), and T(30) [BATS76a].

* Batson and Madison studied only segmented virtual memory systems. We will discuss the implications of this limitation later.

An activity set at time t is any set of segments in the LRU hierarchy in which every member of that set has been re-referenced since the set was formed. In terms of the a(t) and T(t) stacks, an activity set at time t, A_i(t), is any S_i(t) for which a_i(t) > T_i(t). At each instant during program execution, zero or more activity sets are recognized at various levels of the LRU hierarchy. Moreover, when a reference is made to any segment which is below a particular segment in the LRU stack, then this activity set (and any set above it) is terminated. A bounded locality interval (BLI) is defined as the 2-tuple consisting of an activity set and its lifetime, or residence at the top of the stack. The BLI's of the example reference string are shown in Figure 3 [BATS76a]. Notice the hierarchical structure of the BLI's.

Figure 3-a. A Reference String, Its LRU Stack, and BLI's [BATS76a]. (The string, over times 1 through 30, is g g g g e a f b c d a b c d d d c d d d a b c d d d c d d d; the figure shows the LRU stack contents at each time and the BLI's at three nesting levels.)

Figure 3-b. The P, S, a, and T Vectors at t = 30 for the String in Figure 3-a [BATS76a]. (At t = 30 the stack hierarchy is S_1(30) = {d}, S_2(30) = {d,c}, S_3(30) = {d,c,b}, S_4(30) = {d,c,b,a}, S_5(30) = {d,c,b,a,f}, S_6(30) = {d,c,b,a,f,e}, S_7(30) = {d,c,b,a,f,e,g}.)

In [BATS76a] algorithms are given to update the P(t), a(t), and T(t) stacks.
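The definitions above transcribe almost directly into executable form. The sketch below is an added illustration in modern notation, not a reproduction of the algorithms of [BATS76a]; it tracks the P, a, and T stacks and reports each BLI as (activity set, start, end):

    def bli_trace(refs):
        stack, last, T = [], {}, []   # LRU stack (top first), a-times, T-times
        birth, blis = {}, []          # open activity sets; completed BLI's
        for t, r in enumerate(refs, start=1):
            d = stack.index(r) + 1 if r in stack else len(stack) + 1
            # a reference below level l terminates A_l and every set above it
            for lvl in sorted(l for l in birth if l < d):
                blis.append((set(stack[:lvl]), birth.pop(lvl), t))
            if r in stack:
                stack.remove(r)
            else:
                T.append(0)           # a brand-new bottom level, never yet cut
            stack.insert(0, r)
            last[r] = t
            for i in range(d - 1):    # levels 1..d-1 were just cut at time t
                T[i] = t
            for i in range(1, len(stack) + 1):
                if i not in birth and last[stack[i - 1]] > T[i - 1]:
                    birth[i] = T[i - 1]        # A_i has just formed
        for lvl in sorted(birth):              # flush sets still active at the end
            blis.append((set(stack[:lvl]), birth[lvl], len(refs)))
        return blis

    for members, start, end in bli_trace("ggggeafbcdabcdddcdddabcdddcddd"):
        print(sorted(members), start, end)

Run on the Figure 3-a string, this produces the kind of nested hierarchy shown in the figure, including the flood of very short one-segment intervals that is complained about below.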
Also, experimental results concerning the characteristics of the BLI's are presented in [BATS76a] and [BATS76c]. In Chapter 4 of this thesis, we will discuss Batson's experimental results and the validity of their implications.

We have implemented Batson's algorithms and applied them to our collection of Fortran programs. We have correlated the BLI's with the syntactic structure of the programs and found several problems with the concept of bounded locality intervals. Some of these are:

1. As is mentioned in [BATS76a], the way BLI's are defined leads to identifying a tremendous number of very short BLI's. These BLI's have no indication of locality or any significance. They only add undue expense to generating the experimental data. Figure 4 shows a real example taken from one of our programs.

          DO 15 KK = 1,KMAX
            FD(KK) = FD(KK) + 273
            FE(KK) = FE(KK) + 1.E-3
            TTA(KK) = TTA(KK) + 273
       15   QW1(KK) = QW1(KK) + 1.E-3

Figure 4-a. An Example Loop

Figure 4-b. The BLI's of the Loop in Figure 4-a. (The level 1 BLI is ((QW1,TTA,FE,FD); 258), above a long row of very short level 2 BLI's.)

Figure 5. A Two-Loop Example and Its BLI's.

The problem which is illustrated by the example in Figure 5 could be cured if the definition of an activity set were modified. If an activity set were defined as any set of segments of the LRU hierarchy in which every member of that set has been re-referenced k times since the set was formed, k > 1, then we would have only one BLI covering the execution of loop 2 in the example of Figure 5. This modification would also reduce the number of very short BLI's. Although Batson in [BATS76b] mentions that Peter Denning did suggest this modification in the definition of an activity set to him, he did not modify the definition. The modification would increase the complexity and the expense of finding the BLI's in real traces of programs. Moreover, it is not obvious how one should choose k. The more important fact is that this suggestion does not really solve the problem of the confusion in interpreting the hierarchical structure of the BLI's. This is illustrated in the next example.

3. BLI's have an inconsistent correlation to the syntactic structure of programs. For example, the existence of a hierarchy of BLI's is a necessary but not a sufficient condition for the existence of a nested loop in the source program. A nested loop will generate a multilevel hierarchical BLI structure. The existence of a multilevel BLI structure, however, can be due to other reasons. In Figure 6-a, the loop is doubly nested. This loop generates a two-level BLI structure.

          DO 1 I = 1,100
            A(I) = B(I) * C(I)
            DO 1 J = 1,100
              D(I,J) = D(I,J) * A(I)
        1 CONTINUE

Figure 6-a. A Doubly Nested Loop and Its BLI's. (Level 1: ((A,B,C,D); 30300); level 2: a sequence of BLI's ((A,D); 300), one per outer iteration.)

In the first loop of Figure 6-b, the arrays A, B, C, D, and E are referenced. A subset of this array set, namely A, B, C, and D, is referenced in the second loop.
The array E is referenced in the third consecutive loop.

          DO 10 I = 1,100
            A(I) = B(I) * C(I)
            D(I) = B(I) * E(I)
            C(I) = E(I) ** 2
       10 CONTINUE
          DO 20 I = 1,100
            B(I) = A(I) - C(I) * D(I)
       20 CONTINUE
          DO 30 I = 1,100
            E(I) = ...
       30 CONTINUE

Figure 6-b. Consecutive Loops and Their BLI's. (Level 1: ((A,B,C,D,E); 1300); level 2: ((A,B,C,D); 400) followed by ((E); 100).)

For this situation, we also have a hierarchical BLI structure. This structure is misleading. The first loop is not reflected in any BLI. The level one BLI hints at the existence of a locality of size 5 during the execution of the three loops. This is of course not true. In this situation, we really have three localities. The first one is of size 5, its members are A, B, C, D, and E, and it covers only the first loop. This is followed by a locality of size 4, its members are A, B, C, and D, and it covers the second loop. The last locality is of size 1, it contains the E array, and it covers the third loop. Denning's suggestion would not change the problem with the BLI's in this example.

4. There is no simple, obvious way of isolating the major phases of execution of a program from its BLI's. In other words, it is not obvious how to get the sequence of 2-tuples (L_1,T_1), (L_2,T_2), ..., (L_i,T_i), ... for a program from its BLI's. In [BATS76a], level one BLI's of 10 milliseconds or greater duration are taken to be the major phases of execution. Our examples in Figures 5 and 6-b illustrate situations where level one BLI's give erroneous information. In Figure 6-a, the program spends most of its time referencing the arrays of the level 2 BLI's. To avoid these problems, a procedure is suggested in [BATS76b] to determine a pathway through the BLI hierarchy such that the space-time cost of executing the given program is minimized. The BLI's which are included in this pathway are taken to define the major phases of execution. The procedure suggested in [BATS76b] does not really minimize the space-time cost. The correct algorithms for minimizing the space-time cost of running a program were developed by Budzinski in [BUDZ77]. These algorithms are complex and expensive. Moreover, the localities of a program are supposed to be machine independent, while in the approaches of [BATS76b] and [BUDZ77] the minimum space-time product depends on machine parameters such as the mean time needed to transfer a segment (or a page) from secondary to primary storage.

From the previous discussion, it is clear that the locality set isolation problem has not really been solved. In the next section we present our different approach and solution to the problems presented in Sections 2.2.1 and 2.2.2.

2.2.3 Our Approach - Analysis of Program Behavior at the Symbolic Level

We think that there are two main reasons for the difficulties which people faced when trying to come up with satisfactory models of program behavior. The first reason is due to the approach taken in attacking this problem. Traditionally, people took reference strings generated by programs to be the observed phenomenon of interest. Thus, for them a program serves only the purpose of generating a reference string, after which it can be ignored. However, the center of concern should really be the program itself and not the reference string. There is almost nothing important in a reference string which is not reflected in the source program. Thus, our approach will be to study and analyze programs at the source level.
Although the complexity of programs was probably the main reason why people avoided studying programs at the source level, one can overcome this difficulty by recognizing that scientific programs have few basic structures. One can start by studying the simplest structure and then move to more elaborate ones. As it turns out, a clear understanding of simple structures can be extended rather easily to more complex ones. For more discussion of program analysis at the source level see [BATS76c].

The second reason for the difficulties of modeling program behavior is due to the programs themselves. As we will demonstrate later in this chapter, programs as written by people do not behave well in a paging environment.

We will adopt the following strategy in our study. First, we will develop a model for an ideal program. In developing this model we will discuss the important characteristics of such an ideal program. Next we show that it is possible to find some programs in the real world which follow this model. However, we will give examples of other programs which, as written by people, do not follow this model. In Chapter 3 we develop automatic transformation algorithms which can be used to force most programs to follow this model. Moreover, these transformations reduce the cost of executing programs in virtual memory computers. Thus the transformations make programs behave better (they will be easier to model and manage) and cost less to execute.

In our study we will separate data and code pages. For programs with large data aggregates, code paging is trivial compared to data paging. We are mainly concerned with the data paging problem. Moreover, we will ignore references to scalars. These same assumptions were made in [BATS76a]-[BATS76c].

Most scientific production programs are written in Fortran. Moreover, there is good reason to believe that versions of Fortran will continue to evolve and exist for a long time to come. Hence, without loss of generality, we will use examples of Fortran-like programs and structures. Throughout this thesis we assume that paging is done on demand. In other words, we assume that there is no overlap between the CPU and I/O activities of the same program.

In Section 2.2.3.1 we develop our model of the program with the ideal behavior. In the same section we define elementary loops and show that such loops follow the model of the ideal program. Hence, we will call our model the elementary loop model (ELM). In Section 2.2.3.2 we will give examples of programs which do not follow the ELM. In the same section we will mention those transformations of Chapter 3 which cure the specific problems of different programs.

2.2.3.1 The Elementary Loop Model

What is the ideal behavior of a program in a paged system? Ideally a program will need only a small fraction of its virtual space to be present in main memory. With this small memory allotment, the mean time between page faults, MTBPF, will be large. Moreover, the program will make effective use of the main memory page frames allotted
If programs have cyclic behavior in which they go through alternating periods of clustered I/O and CPU activities then the scheduling and other problems become much easier. The OS CPU time will be decreased. The description given in the previous paragraph is that of a program which can be modeled by the ideal program model. Let us now define one kind of loops which follow this model. Definition 1 ; An elementary loop is an ordered set of assignment state- ments preceded by one DO control statement. The variables referenced in the loop are one-dimensional arrays and possibly scalars. The sub- scripts of the array variables are linear functions of the index vari- able. In the subscript expressions, all the index variables have the same coefficient. As an example of the behavior of an elementary loop let us discuss the behavior of the following program. Program 1. DO S 1=1, N S A(I) = 2*1+3 5 2 C(I) = B(I)**2-4*C(I) 5 3 D(I) = C(I)/A(I) Let Z be the number of words in a page, N>>Z, and K = [N/Zl. There are 42 four arrays referenced in Program 1: A, B, C, and D. Each array occupies K pages of virtual space. Let us denote the ith page of A by a(i). Thus A will span the virtual pages a(l), a(2), ..., a(i), . .., a(K) . Similar notation will be used for the pages of the other arrays. The total virtual space of these arrays is 4*K pages. In a non-virtual memory computer this program will need 4*K pages of main memory to run. If this amount of main memory is not available, the programmer must take care of transferring parts of his arrays between secondary and main memory such that the program will run in less than the total virtual space. In a virtual memory computer, however, the operating system will automatically take care of this problem. The operating system need only assign 4 pages to this program and the program will run in an optimum way under demand paging. It will have the minimum number of page faults, or I/O transfers between secondary and main memory. Moreover, its space-time cost will be minimum. With 4 pages of main memory, the program will have 4 page faults when it starts execution in order to allocate a(l), b(l), c(l), and d(l). After this burst of I/O activity the loop will go through Z iterations without any I/O interrupts. The I/O interrupt-free CPU activity will last for 7*Z memory references. 7 is the number of array memory references per iteration of the loop. During the CPU activity period the MTBR to the pages of the program will be < 7 references. Thus the density of reference to these pages is high, Another burst or cluster of I/O activity will follow to allocate a(2), b(2), c(2), d(2) in main memory. In the next burst of CPU activity the loop index will go from Z + 1 to 2*Z and the duration of this CPU burst will be another 7*Z references. This oscillation or cycling between 43 bursts of I/O and CPU activity will continue through the lifetime of this program. In the Ith cycle the pages a(I), b(I), c(I), and d(I) will be allocated and then processed. The I/O burst time will be 4*T, T being the average time of servicing a page fault (measured in memory references) and the duration of the CPU burst will be 7*Z references. The cycle time, T , will be 4*T + 7*Z references. Thus the mean time c between the clusters of page faults is large, 4*T+7*Z. This behavior will be the same for the LRU, FIFO, or MIN replacement algorithms. The total number of page faults will be 4*K and the total space-time cost will be 4*K(4*T + 7*Z) . This is a well behaved program. 
In a multi- programming system, programs of this type will make the best use of the system. I/O and CPU bursts of different programs can be overlapped such that the I/O and CPU utilization will be maximized. The memory space will be saturated with different parts of different programs to maximize throughput. Such programs will run efficiently in virtual memory computers. To have such a nice performance, Program 1 needs 4 pages of main memory. If 3 or less pages are assigned to it we will have one or more page faults per iteration. The number of page faults will be very large, 0(N), instead of 0(K), where N is the number of words in an array while K is the number of pages spanned by the array. In addition to the large increase in I/O activity, the program will lose the nice property of clustered page faults or bursts of I/O activity. The use- ful CPU activity will be constantly interrupted by page faults. The performance of the virtual memory system will collapse under such conditions . 44 For every elementary loop , there is a critical memory allotment which is needed in order to avoid performance collapse. In the case of Program 1 this number is 4. In general we will denote this number by m . o The behavior of Program 1 and similar programs can be nicely modeled by the sequence of the 2-tuples: (L v T 1 ), (L 2 , T 2 ), ..., (L ± , T ± ), ..., (L k , T fc ) L. = the ith locality set of pages T. = the residence time in this locality set of pages. For Program 1 L. = (a(i) , b(i), c(i), d(i)}, and T. = 4*T + 7*Z. The size of L., |l. , is equal to m which is constant at 4 for 1 ' l ■ o all i. Moreover, T. is the same for all i and is equal to the eye le time , T , as discussed previously. Note that the phases of execution of this program have been easily identified. Since the behavior of an elementary loop follows precisely the model of the ideal program, we will denote the ideal program model by the elementary loop model (ELM) . Note that the ELM and an elementary loop are two different things. An elementary loop was defined in Definition 1. The ELM is the model of the ideal program. An elementary loop is an ideal program and it can be modeled using the ELM. Other loops, however, can also be modeled by the ELM. The following are the necessary conditions which must hold for a given loop so that it can be modeled by the ELM: The Critical Memory Allotment, m = (# of different array names in the loop); 2.1 45 The Cycle Time, T = (R *c + m *T), where 2.2 C X, o Rp = # of occurrances of array names in the loop, c = integer constant (# of iterations per cycle) ; Mean Virtual Time Between Clusters of Page Faults, MTBPF = 0(R *c); 2.3 Mean Virtual Time Between References to a page, MTBR = 0(R £ ). 2.4 Equations 2.1-2.4 are the definition of the ELM. Before proceeding any further, let us generalize an observation which we made concerning the execution of Program 1 to all elementary loops . Theorem 1 : Given an elementary loop L, let m = the number of different array names o referenced in the loop . R p = the number of array references per iteration of the loop. T = the average page fault service time. K = the number of pages spanned by each array referenced in the loop. With m page frames, the cost of executing the loop will be the same whether the replacement algorithm used is the LRU, FIFO, or Belady's MIN algorithm. The cycle time is given by: T = R * c + m * T, where c = Z/(the coefficient of the ex- o index variable in the subscript expressions of the array variables). 
Before proceeding any further, let us generalize an observation which we made concerning the execution of Program 1 to all elementary loops.

Theorem 1: Given an elementary loop L, let

    m_o = the number of different array names referenced in the loop,
    R_L = the number of array references per iteration of the loop,
    T   = the average page fault service time,
    K   = the number of pages spanned by each array referenced in the loop.

With m_o page frames, the cost of executing the loop will be the same whether the replacement algorithm used is LRU, FIFO, or Belady's MIN algorithm. The cycle time is given by

    T_c = R_L*c + m_o*T,

where c = Z/(the coefficient of the index variable in the subscript expressions of the array variables). The space-time cost is given by

    ST = T_c * m_o * K.

Proof: When the execution of an elementary loop is started, m_o different pages will be referenced. If m_o page frames are allotted to the loop, all three replacement algorithms will allocate these page frames to the first locality set of pages. In other words, the pages referenced in the first cycle of execution will be allocated space in main memory. From our previous discussion in this section, the loop will have a cyclic behavior. We will use induction to prove our theorem. First, we show that the three replacement algorithms replace the set of pages referenced in the first cycle by those referenced in the second cycle. Second, given that the pages referenced in the (I-1)-th cycle are in memory when the I-th cycle is started, we will show that the three algorithms replace these pages by the pages referenced in the I-th cycle.

When the first references to the pages of the second cycle are made, the MIN algorithm will replace pages referenced in the first cycle. This is because the forward distance of all these pages is infinite. Similarly, the LRU and FIFO algorithms will replace pages of the first cycle, though not necessarily in the same order. Thus, all these algorithms will produce m_o page faults and the second cycle duration will be R_L*c + m_o*T. Note that c is the number of loop iterations per cycle.

If we assume that the pages of the (I-1)-th cycle are in memory when the execution of the I-th cycle starts, then by an argument similar to the one presented in the previous paragraph, we conclude that the three algorithms will replace the pages of the (I-1)-th cycle by those of the I-th cycle. Hence, in general the cycle time is given by R_L*c + m_o*T, and the total space-time cost is given by (R_L*c + m_o*T)*m_o*K.   Q.E.D.

The important point which Theorem 1 makes is that the performance of elementary loops is not affected by the replacement algorithm used. It is totally determined by the amount of memory allotted. Note that Theorem 1 does not hold for the least frequently used replacement algorithm.

Although elementary loops are not a non-existent species in real programs, very often more complex loops will be encountered. Some of these can still be modeled by the ELM. Others, however, cannot. In the next section we will discuss some examples. In Chapter 3 we will present two types of compile time optimizing transformations. The first type will be used to force any loop to behave such that it can be modeled using the ELM. The second type will be used to improve the cost of execution of loops; namely, to reduce the value of m_o, the number of I/O transfers, and the space-time cost.

2.2.3.2 Other Loops

In this section we show examples of loops which are not elementary. Our examples fall into three categories. In the first category, the loops are not elementary but their behavior follows the ELM. Moreover,
The existence of large multi-dimensional arrays in a loop can easily cause problems in a virtual memory system. Let us first give an example in which multidimensional arrays cause no problem and the behavior of the program can still be modeled by the ELM. Consider the following loop : Program 2. DO DO S 1 I - 1,N S J - 1,N S 1 A(J,I) = B(J,I) + C(J,I) In Fortran two-dimensional arrays are stored column-wise. In all our examples and analysis, we will consider large arrays which 2 satisfy the condition N <_ Z < N . If each of the arrays of Program 2 spans K pages, then a close examination of the program will show that it can indeed be modeled by the ELM. For Program 2 the MTBR is 0(R.) and m = # different array names = 3 o MTBPF =T =Z*R +T*m c 36 o = Z*3 + T*3 ST = space-time cost/cycle =3* (3*Z+T*3) 49 The following program, however, cannot be modeled by the ELM: Program 3. DO S I = 1,N DO S J = 1,N S 1 A(I,J) = B(I,J) * C(I,J) To make the analysis simple let N = Z. We will make this assumption through the rest of this chapter. Each column of a matrix will span one page. The different number of array names here is still 3. With three page frames, however, three page faults will be generated per iteration of the inner-most loop. There is no clustering of page faults, i.e. CPU and I/O activities will be interleaved. Consequently, the system will suffer performance collapse. This loop needs all its virtual space to be allotted in main memory in order to generate the minimum number of page faults and to minimize its space-time cost. The reason behind the difficulty with Program 3 is that the array elements are not being referenced in the order in which they were stored. If Program 3 was written in PL1, in which multi-dimensional arrays are stored row-wise, the problem would disappear. Tn PL1, however, Program 2 will have a problem. Thus it is obvious that for multi- dimensional arrays, the storage scheme and the pattern of reference are important in determining the behavior of a loop. This is what all of Elshoff's paper was about [ELSH74]; matching the pattern of reference to the storage scheme. In [McKE69] three storage schemes of multi- dimensional arrays were compared: row-wise, column-wise, and submatrix storage. If RZ = /Z, then in the submatrix storage scheme an (nxn) two- dimensional matrix will be divided into square submatrices of size 50 o (RZ x RZ) as shown in Figure 7. If N = [n/RZl then there will be N of these submatrices. Each submatrix is stored in a page. An m- dimensional array with the dimensions D. x D~ x D,, x ... x D will be 12 3m stored in D. x D. x . . . x D planes. Each plane will contain D., x D„ 3 4 m r 12 array elements. There will be TD /RZl rows of pages and TD /RZl columns of pages in each plane. Hence each plane will have TD /RZl*[D /RZl pages The element of the array with the subscripts d,, d_, . .., d will 12 m belong to the { (fa /RZl-l)*rD 2 /RZl + Td^RZl + (d 3 -l)* p /RZl*rD 2 /RZl + (d.-l)*[D /RZ1*[D /RZ1*D. + ... + (d -l)*[D n /RZl*fD /RZl * D * D. * 4123 ml2 34 ... *D m _ 1 } page. In [McKE69] it is shown that matrix algorithms can be designed such that with the submatrix storage scheme, enormous reduction in the number of page faults relative to row-wise storage can be achieved. With 3 page frames and the submatrix storage scheme, Program 3 will have 3 page faults every RZ iterations of the inner-most loop. The duration of the interrupt-free CPU activity will be 3*RZ . 
This is not as good as the performance in Program 2 where the CPU burst time was 3*Z references long. Moreover, we still cannot use the ELM to model the behavior of program 3 even if the submatrix storage scheme is used to store the arrays. The problem here is that we will not reference all the elements involved in the calculation of each page while the page is in main memory. In Program 3 all the Z elements of a page will be referenced in the calculation while only RZ elements will be referenced every time the page is in main memory. Thus a given page will be transferred RZ times between secondary and main memory. In effect what 51 «■ RZ ■+ A RZ Page-1 Page-2 . . . Page-N Page- N+l Page^ N Figure 7. A Two-Dimensional Array Stored by the Submatrix Scheme 52 we are saying is that although the MTBPF for Program 3 is better with submatrix storage as compared with column storage (3*RZ compared to 3) it is still not as good as it is for Program 2 (3*Z). In Chapter 3 the page indexing transformation will be intro- duced to cure the problems of multi-dimensional arrays. This is de- signed to transform a program such that all words of a page involved in a calculation will be referenced while the page is in main memory. We will adopt the submatrix storage scheme because of its inherent ad- vantages as presented in [McKE69] . (ii) Mixing of arrays of different dimensions in a loop. The performance of a loop can be affected in different ways when arrays of different dimensions are referenced. Consider the fol- lowing example: Program 4-a. DO 3 J = 1,N DO 3 I = 1,N T(I,J) = .5 * DELT + TTA(I) 3 continue Since the elements of the two-dimensional array are referenced in the order in which they are stored, column-wise, the two-dimensional array represents no problem. This loop can be modeled by the ELM because equations 2.1 - 2.4 are satisfied. Namely, we have m = 0(# different array names) = 2 o MTBR = 0(R £ ) = 2 MTBPF = 0(R *Z) = 2*Z T = 0(R *Z + m *T) = 0(2*Z + 2*T) C X, o 53 Because of the existence of the one-dimensional array, T c is not fixed through the execution of the program. In the first cycle two page faults will occur because t(l) and tta(l) must be allocated. Thus T = 2*Z + 2*T. In the following cycles, however, only the t page will be replaced. Thus the steady state cycle time , T , is given by 2*Z + T. Theorem 1 does not hold for this loop be- cause the cycle time is not constant althrough the lifetime of the program. In other situations, Theorem 1, will not hold for different reasons. For example, the following loop will not have identical per- formance under LRU and MIN replacement algorithms. Program 4-b. DO 3 J = 1,N DO 3 I = 1,N T(I,J) = T(I,J) + .5*TTA(J) 3 continue The reference string generated during the two iterations: (J = j-1, I = N) and (J = j , I = 1) is the following: ...,t(j-l), tta(l), t(j-l), t(j), tta(l), t(j),... With 2 page frames under LRU, tta(l) will be replaced at the 4th reference to allocate t ( j ) . MIN will replace t(j-l). Thus under LRU, T = 3*Z + 2*T while under MIN T = 3*Z + T . Note, however, that cs cs this loop can still be modeled by the ELM because equations 2.1 to 2.4 are satisfied. In the previous two examples, mixing arrays of different dimensions in a loop did not present severe problems. Both loops could 54 be modeled by the ELM although Theorem 1 does not hold for them. Their behavior is asymptotic to the behavior of elementary loops, (iii) Loops with assignment statements at different nest levels. Consider the following program: Program 5. 
(iii) Loops with assignment statements at different nest levels.

Consider the following program:

Program 5.
             DO 3 J = 1,N
               PT(J) = TTA(J)
               DO 3 I = 1,N
                 T(I,J) = .5 * DELT + TTA(J)
        3    CONTINUE

With 3 page frames, this loop will have T_c = (2*Z + 2) + T, which is O(R_L*Z + m_o*T). There is, however, an obvious waste of the space-time resource. The PT page is referenced only once during a cycle time. In other words, the N references made to PT are uniformly distributed through the execution time of the loop. This is reflected by the MTBR to the PT page, which is O(N) instead of O(R_L). Hence this loop cannot be modeled by the ELM. The loop distribution transformation presented in Chapter 3 will cure this problem.

As another example of a loop with a large MTBR, consider the following program:

Program 6.
             DO 10 I = 1,N
               A1 = W(I,1) * X(I,1)
               DO 10 J = 1,N
                 A2 = WW * G(J,I)
                 Y(J,I) = Y(J,I) + (A1-A2)/DZ
                 A1 = A2
       10    CONTINUE

Here, with 4 page frames, T_c will be (3*Z + 2 + 2*T). The MTBR for the W and X pages is O(N) and not O(R_L). Hence the ELM will not hold. A combination of the scalar expansion technique and loop distribution will handle the situation of this loop. This will also be discussed in Chapter 3.
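The defect in Programs 5 and 6 is visible in a per-page MTBR measurement. The sketch below is an added illustration (N = Z = 64 is an arbitrary illustrative size, and only the array references of Program 5 are traced):

    def mtbr(trace):
        """Mean virtual time between successive references, per page."""
        last, gaps = {}, {}
        for t, page in enumerate(trace):
            if page in last:
                gaps.setdefault(page, []).append(t - last[page])
            last[page] = t
        return {p: sum(g) / len(g) for p, g in gaps.items()}

    Z = N = 64                 # one page per column when N = Z, as assumed above
    trace = []
    for J in range(N):         # Program 5: PT(J) = TTA(J), then the inner loop
        trace += [('TTA', 0), ('PT', 0)]       # one PT reference per cycle
        for I in range(N):
            trace += [('TTA', 0), ('T', J)]    # column J of T is page J

    m = mtbr(trace)
    print(m[('PT', 0)], m[('TTA', 0)], m[('T', 0)])

The PT page comes back with a gap of O(N) while TTA and the current T page are re-referenced every couple of references, which is exactly the violation of equation 2.4 described above.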
The loop distribution transformation will transform loops such that either the resulting loops are free of IF statements or all the assignment statements within the loop are affected by the IF statements such that they must be left in the same loop. In real programs, the number of statements and arrays in the latter type of loops is small, and hence the variations in the parameters of the ELM for these loops are small.

2.3 Summary

In this chapter previous stochastic and deterministic models of program behavior were discussed. The difficulty of developing a simple, accurate model of program behavior is due to the fact that programs as written by people are not well behaved from a paging system point of view. The concept of the elementary loop model, ELM, was developed and the parameters of this model were discussed. Examples of programs which do not follow this model were presented. In the next chapter compiler transformations will be designed to cure such problems as those illustrated by the examples. Other transformations will aim at improving the ELM parameters of a given program. Thus, after applying the transformations of Chapter 3 to programs, they will be simple to model and cheap to run in a virtual memory computer.

3. PROGRAM TRANSFORMATIONS

A large portion of the early work in program analysis and transformations was motivated by the development of high speed parallel and vector machines like the ILLIAC IV, CDC STAR, and TI ASC around the turn of the decade. For these supercomputers, and the more recent ones, the Cray-1 and the Burroughs Scientific Processor, the need for a vectorizing compiler is definite. The enormous computational power of these machines cannot be widely utilized by the general scientific community of users unless people can use ordinary high level languages to write programs for these machines. Moreover, there is an obvious need to be able to run the large amount of existing software, which was originally written for serial machines, on the new machines.

For the last few years research has been conducted at the University of Illinois to solve these problems. The problem of transforming ordinary serial programs to run on parallel and vector machines has been investigated and the results have been very good. A large software package called the PARAFRASE compiler evolved with the progress of these investigations. The PARAFRASE compiler takes an ordinary serial Fortran program and uses different compiler transformations to expose the inherent parallelism of the program [LEAS76], [WOLF78]. Pseudo-code is generated and used to find the resulting speedup if the program were executed on parallel machines compared to serial machines.

The theme of this thesis is the enhancement of the performance of virtual memory computers. In this chapter we present program transformations to achieve this goal. These are intended to be optimizing compiler transformations which are tailored to cure the problems of large programs in virtual memory computers. Each transformation will serve one or both of two purposes. The first aim is to make programs follow the ELM, and the second is to improve the parameters of this model for a given program. A transformation aimed at the first goal is a fix-up transformation. A transformation aimed at the second goal is an enhancement transformation. Several of the concepts and transformations developed for speeding the execution of programs on parallel machines will be useful to us either with or without modifications.
Thus we will use some of the transformations implemented in the PARAFRASE compiler, modify some, and introduce some new ones. We will think of the transformations, and present them, as source-to-source transformations. Our description of the transformations which were developed originally for parallel program execution will be very brief. We will present the modified and the new transformations in more detail.

The flow chart shown in Figure 8 gives an overview of the general transformation process. This flow chart is intended to help the reader of this chapter understand the relationship between the different transformations and their relative order. Going back to examining this flow chart while reading this chapter will clarify the purpose and the logic behind the different transformations.

      Input Fortran Program
        -> Preliminary Transformations
        -> Clustering (Generation of Name Partitions)
        -> Scalar Transformations
        -> Data Dependence Analysis
        -> Fusion of Name Partitions
        -> Identifying the π-Blocks of Each Name Partition
        -> Transforming Any Nonbasic π-Blocks to Basic π-Blocks

Figure 8. An Outline of the Transformation Process

In the preliminary transformations stage we apply (without modification) the following set of transformations which are currently implemented in PARAFRASE [WOLF78]:
(i) DO Loop Normalization,
(ii) IF Pattern Matching,
(iii) Scalar Renaming,
(iv) Induction Variable Substitution and Subscript Cleaning,
(v) Type-A IF Removal from DO Loops.
These transformations are aimed at breaking data dependences and simplifying the control structure of the program. We will not discuss these transformations any further and refer the interested reader to [WOLF78].

Basic to the analysis of programs and the development of transformations is the concept of data dependence. A brief discussion of this concept and related definitions will be presented in Section 3.1 and is based on [KUCK78], [TOWL76], and [BANE76].

In Sections 3.2 through 3.5 we discuss the rest of the transformations. In general we will present in each section some necessary definitions, some examples to illustrate the usefulness of the particular transformation, the transformation algorithm, and, if needed, some tests to check for the correctness of the transformation. We will try to strike a balance between formal and informal definitions of the transformations. A very formal definition leads to complex notations which explain unimportant details. Although we will present the transformations as separate entities, the intent is that all those relevant to a program segment will be applied.

3.1 Data Dependence Analysis

The set of input variables of an assignment statement S, IN(S), is the collection of variables appearing to the right of the assignment symbol. The output variable of S, OUT(S), is the variable which is assigned a value as a result of executing statement S. The output variable appears to the left of the assignment symbol. When S is executed, each member of IN(S) is fetched from memory at least once and the output variable is stored in memory.
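As a concrete illustration (a hypothetical statement of ours, not one from the text):

C     A hypothetical assignment statement S:
S     A(I) = B(I)*X + C(I)
C     Here IN(S) = (B(I), X, C(I)) and OUT(S) = A(I).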
Outside loops, an assignment statement S_q is said to be data dependent on another assignment statement S_p if IN(S_q) ∩ OUT(S_p) = x ≠ ∅ and the value computed in S_p for x is used in IN(S_q). We denote this by S_p => S_q. If we have x = OUT(S_p), x ∈ IN(S_q), and the value of x computed in S_p is not used in S_q, then S_p is data antidependent on S_q. Antidependence is denoted by S_q =/> S_p. S_q is said to be data output dependent on S_p, denoted S_p =#> S_q, if x = OUT(S_p) = OUT(S_q) and the value calculated in S_q is stored in x after the value which is calculated in S_p.

If x = OUT(S_p) is a scalar variable, then testing for dependence between S_p and a statement S_q is simple and only involves searching for the name x in IN(S_q) and OUT(S_q) and finding the order of execution of S_q relative to S_p. If x is an array element, then the values of its subscripts in S_p and S_q should be identical in order for a dependence to exist.

The definition of dependence relations can be extended to cover statements in loops. Let us use S_p(i_1, i_2, ..., i_d) to denote the instance of statement S_p during the particular iteration when I_1 = i_1, I_2 = i_2, ..., I_d = i_d, where I_1, I_2, ..., I_d are the index variables of the loop. Let x = OUT(S_p(k_1, k_2, ..., k_d)). If we use the notation S_p θ S_q to denote that S_p is executed before S_q, then we have [KUCK78]:

1) S_p => S_q if x ∈ IN(S_q(l_1, l_2, ..., l_d)) and S_p(k_1, k_2, ..., k_d) θ S_q(l_1, l_2, ..., l_d);
2) S_q =/> S_p if x ∈ IN(S_q(l_1, l_2, ..., l_d)) and S_q(l_1, l_2, ..., l_d) θ S_p(k_1, k_2, ..., k_d);
3) S_p =#> S_q if x = OUT(S_q(l_1, l_2, ..., l_d)) and S_p(k_1, k_2, ..., k_d) θ S_q(l_1, l_2, ..., l_d).

Testing for dependence between statements within loops can be done by unrolling the loop and listing each statement for each iteration of the loop. Each statement can be checked against the following statements for data dependence as described earlier. This testing procedure is lengthy and expensive. Tests for data dependence can be performed without actual unrolling of the loop. For array variables this involves testing the subscript expressions over the set of values which the index variables can take. In [BANE76] sufficient and necessary conditions for dependence are derived for index expressions that are linear functions of one index variable. For the rare case when the subscript expression is more complex or the subscripts are array elements, data dependence is usually assumed.

To simplify the testing procedures, it is assumed in [BANE76] that the subscript expressions are functions only of the index variables. Moreover, the increment of an index variable between one iteration and the next is assumed to be 1. In [WOLF78] several transformations are described to ensure that index variables and subscript expressions satisfy these conditions.

In the previous discussion it was implicitly assumed that the loops are IF-free. In [TOWL76] procedures for removing IF statements from the scope of loops are described. Some types of IFs cannot be removed, and in such situations it is currently assumed in the PARAFRASE compiler that all statements in the loop are interdependent. Research to improve the treatment of IFs is still going on.

The data dependence relations between statements in a block of assignment statements or in a loop can be represented by a data dependence graph G. Each assignment statement S is represented by a node in the graph. If S_i => S_j, we draw a directed arc of the type -> from the node representing S_i to the S_j node.
An arc of the form -0-> is drawn from S_i to S_j if S_i =#> S_j, and an arc of the type -/-> is drawn from S_i to S_j if S_i =/> S_j. Figure 9 shows a loop and its data dependence graph. Note the cycle in the graph. In general, a cycle can exist in a graph if there are two statements S_p and S_q such that the relations S_p Δ S_q and S_q Δ S_p are both true. The relation S_x Δ S_y is defined by:

      S_x (dependence operator) S_i1 (dependence operator) ... (dependence operator) S_in (dependence operator) S_y,  n ≥ 0.

The dependence operator can be any of =>, =/>, or =#>. The Δ relation can be used to partition the nodes of a data dependence graph into a set of node partitions. Two nodes representing statements S_k and S_l are in the same node partition, called a π-block, if and only if S_k Δ S_l and S_l Δ S_k. In other words, all the nodes which are in a cycle of the graph belong to the same π-block. A node which is not in a cycle is a π-block by itself. Later in this chapter an algorithm will be presented to distribute the control of a loop on its π-blocks.

      DO 1 I = 2, 3
S1    A(I) = B(I-1)*3 + C(I)
S2    C(I) = A(I+1)*3
S3    B(I) = C(I)*A(I) + B(I)
1     CONTINUE

Its unrolled version:

S11   A(2) = B(1)*3 + C(2)
S21   C(2) = A(3)*3
S31   B(2) = C(2)*A(2) + B(2)
S12   A(3) = B(2)*3 + C(3)
S22   C(3) = A(4)*3
S32   B(3) = C(3)*A(3) + B(3)

Figure 9. A Loop, Its Unrolled Version, and Its Data Dependence Graph

3.2 Clustering of Assignment Statements Algorithm

Programmers tend to group in the same loop different assignment statements which perform similar operations on different sets of arrays. Very obvious examples of such loops are initialization loops, where different arrays of similar dimensions are initialized. This situation can also occur in loops where much more sophisticated calculations are performed. Examples of these loops are those performing similar calculations on the real and imaginary parts of complex arrays.

The clustering transformation is designed to separate the set of statements inside a loop into several subsets such that in each subset a different group of arrays will be referenced. Each subset thus formed is called a name partition (NP). The transformation is applied to the loops of the program one at a time. The aim is to reduce the memory requirements of the program.

3.2.1 Definitions and Notations

Before describing the algorithm we make some definitions. For a particular loop L, let S_L be the ordered set of assignment statements controlled by L. For a statement S_i, i is the ordering number. The set A(L) = (a_1, a_2, ..., a_j, ..., a_M) is the set of arrays referenced in L. If α is a subset of S_L, then the set of arrays referenced in α is denoted by A(α).

Definition 3.1 The name partitions of the loop L are a set (NP_1, NP_2, ..., NP_k) of subsets of S_L with the following properties:

(i) S_L = NP_1 ∪ NP_2 ∪ ... ∪ NP_k;
(ii) NP_i ∩ NP_j = ∅ for all 1 ≤ i, j ≤ k, i ≠ j;
(iii) A(NP_i) ∩ A(NP_j) = ∅ for all 1 ≤ i, j ≤ k, i ≠ j;
(iv) A(L) = A(NP_1) ∪ A(NP_2) ∪ ... ∪ A(NP_k);
(v) if S_i ∈ NP_q and S_j ∈ NP_l, q ≠ l, then there is no data dependence or data antidependence between S_i and S_j due to scalar variables.

3.2.2 The Clustering Algorithm

It is obvious from Definition 3.1 that the control of a loop L can be distributed over its NP's. The order of execution of the resulting loops will be arbitrary.
The NP's of a loop can be found by constructing an undirected clustering graph according to the following algorithm:

(i) Corresponding to each assignment statement, draw a node and label it with the label of the statement.
(ii) For each array a_i referenced in the loop, make a list, La_i, of the statements in which a_i is referenced.
(iii) Take every list formed in (ii) and travel through the nodes representing the statements in the list. When moving from one node to the next, draw an undirected arc if no such arc already exists because of a previous list.
(iv) Draw an arc, if one was not already drawn, between the nodes of any two assignment statements if there is a data dependence or antidependence between the two statements due to a scalar variable.
(v) Divide the nodes of the graph into clusters. Each cluster will represent one NP and will contain the maximum number of connected nodes. Thus every pair of nodes in a cluster will be connected either directly or through other nodes which belong to the same cluster.

Figure 10 shows an example of applying the clustering algorithm to a loop. We note that the worst case complexity of the clustering algorithm is O(number of statements of the loop * number of variables referenced in the loop).

We now elaborate on the usefulness of the clustering transformation in reducing the cost of execution of multi-NP loops. If the original loop was assigned a number of page frames equal to its critical memory allotment, then one needs to assign to the transformed program only the maximum of the critical memory allotments of the resulting NP's. With this memory allotment the amount of I/O transfers will be the same for the original and transformed programs. Thus the space-time cost of the program will also be reduced by almost the same amount as its space requirement. This is true because the increase in the CPU time due to the additional control statements of the transformed program is not significant. One can establish a bound on the reduction of the space and the space-time cost. This is expressed in the following theorem:

Theorem 3.1 The upper bound on the improvement in the space requirement and the space-time cost of a loop due to the clustering transformation is a factor of K, where K is the number of name partitions generated by the clustering algorithm.

      DO 20 J = 1, NY1
      DO 10 I = 1, NX
S1    QVT1 = QV(I,J) + TS*QV1(I,J)
S2    QCT1 = QC(I,J) + TS*QC1(I,J)
S3    QV(I,J) = QV1(I,J)
S4    QC(I,J) = QC1(I,J)
S5    QV1(I,J) = QVT1
S6    QC1(I,J) = QCT1
10    CONTINUE
S7    QV(NX,J) = QV(1,J)
S8    QC(NX,J) = QC(1,J)
S9    QV(NXP,J) = QV(2,J)
S10   QV(NX+2,J) = QV(3,J)
S11   QC(NXP,J) = QC(2,J)
S12   QC(NX+2,J) = QC(3,J)
20    CONTINUE

      LQV  = (S1, S3, S5, S7, S9, S10)
      LQV1 = (S1, S3, S5)
      LQC  = (S2, S4, S6, S8, S11, S12)
      LQC1 = (S2, S4, S6)

      NP1 = (S1, S3, S5, S7, S9, S10)
      NP2 = (S2, S4, S6, S8, S11, S12)

Figure 10. A Loop and Its Clustering Graph

      DO 201 J = 1, NY1
      DO 101 I = 1, NX
S1    QVT1 = QV(I,J) + TS*QV1(I,J)
S3    QV(I,J) = QV1(I,J)
S5    QV1(I,J) = QVT1
101   CONTINUE
S7    QV(NX,J) = QV(1,J)
S9    QV(NXP,J) = QV(2,J)
S10   QV(NX+2,J) = QV(3,J)
201   CONTINUE
      DO 202 J = 1, NY1
      DO 102 I = 1, NX
S2    QCT1 = QC(I,J) + TS*QC1(I,J)
S4    QC(I,J) = QC1(I,J)
S6    QC1(I,J) = QCT1
102   CONTINUE
S8    QC(NX,J) = QC(1,J)
S11   QC(NXP,J) = QC(2,J)
S12   QC(NX+2,J) = QC(3,J)
202   CONTINUE

Figure 11. Distributing the Control of the Loop in Fig. 10 on Its NP's
Proof: The critical memory requirement of the original program, m_oL, is O(number of different array names in the loop). For the transformed program the critical memory requirement, m_o, is determined by the maximum of the number of array names in the different resulting NP's. If K is the number of NP's, then the smallest value which m_o can take is (m_oL/K). Since the clustering algorithm does not change the I/O time, the space-time cost will also drop by a factor of K.

Figure 11 shows the loop of Figure 10 with the control distributed on its NP's. Obviously the space and space-time costs are reduced by a factor of 2.

3.3 Fusion of Name Partitions

3.3.1 The Usefulness of the Fusion Transformation

The aim of this transformation is to reduce I/O time without increasing the memory requirements of a program. This is achieved by combining in one name partition several name partitions from different loops. The memory requirements of the combined name partition will not exceed the maximum memory needs of the individual NP's. As an example, consider the following loops taken from a Fast Fourier Transform program:

Program 9-a.
      DO 6 K = K1, N2N, NDISP
      KPNG = K + NG
S1    CR(KPNG) = CR(K) - STOUTR(K)
S2    CI(KPNG) = CI(K) - STOUTI(K)
6     CONTINUE
      DO 8 K = K1, N2N, NDISP
S3    CR(K) = CR(K) + STOUTR(K)
S4    CI(K) = CI(K) + STOUTI(K)
8     CONTINUE

Using the clustering algorithm we get two NP's from the first loop: NP_11 = (S1) and NP_12 = (S2). We also get two NP's from the second loop: NP_21 = (S3) and NP_22 = (S4). If we distribute the loop control on the NP's we get the following program:

Program 9-b.
      DO 61 K = K1, N2N, NDISP
61    CR(K + NG) = CR(K) - STOUTR(K)
      DO 62 K = K1, N2N, NDISP
62    CI(K + NG) = CI(K) - STOUTI(K)
      DO 81 K = K1, N2N, NDISP
81    CR(K) = CR(K) + STOUTR(K)
      DO 82 K = K1, N2N, NDISP
82    CI(K) = CI(K) + STOUTI(K)

The critical memory allotment of the first loop in the original program is four page frames, and the total number of page faults is 4*K, where K is the number of pages spanned by each array. The second loop has similar memory requirements and number of page faults. Thus the original program can execute in four page frames, the total number of page faults is 8*K, and the total space-time cost is 32*K. After applying the clustering transformation, Program 9-b needs only two page frames to execute, without changing the number of page faults. Thus with clustering we have achieved an improvement of a factor of two in the memory requirement and the space-time cost, without increasing the I/O time.

If we examine the arrays being referenced in the NP's, we find that A(NP_11) = A(NP_21) = (CR, STOUTR) and A(NP_12) = A(NP_22) = (CI, STOUTI). Moreover, the loop structure of NP_11 and NP_21 is identical; the nesting levels, the starting values of the index variables, the increment values, and the upper bounds of the index set values are all identical. We also notice that there are no data dependences between NP_21 and NP_12. Thus it is valid to combine NP_11 and NP_21 in one name partition, NP_1. By similar arguments we can combine NP_12 and NP_22 in a single name partition, NP_2. Hence after NP fusion the program will be transformed to the following:

Program 9-c.
      DO 1 K = K1, N2N, NDISP
      CR(K + NG) = CR(K) - STOUTR(K)
1     CR(K) = CR(K) + STOUTR(K)
      DO 2 K = K1, N2N, NDISP
      CI(K + NG) = CI(K) - STOUTI(K)
2     CI(K) = CI(K) + STOUTI(K)

The memory requirement of Program 9-c is the same as that of Program 9-b, namely, two page frames.
Program 9-c, however, will produce fewer page faults: a total of 4*K page faults, compared to 8*K page faults for the clustered and the original programs. Table 4 compares the memory, I/O, and space-time costs of Programs 9-a, 9-b, and 9-c. We note that by using NP fusion on the clustered program we have improved the memory requirement, I/O, and space-time cost of the original program by factors of 2, 2, and 4 respectively.

Table 4. Resource Requirements of Programs 9-a, 9-b, 9-c

                                   Critical Memory   Total Number of   Space-Time
                                   Allotment         Page Faults       Cost
  Original Program                 4                 8*K               32*K
  Clustered Program                2                 8*K               16*K
  Fusion Applied to the
  Clustered Program                2                 4*K                8*K

3.3.2 Notation and Definitions

Having illustrated the usefulness of the fusion transformation, we now discuss some definitions relevant to the general fusion algorithm. The program is divided into a set of basic blocks. A basic block is defined as a section of code with only one point of entry and one point of exit. It contains a sequence of loops and possibly groups of assignment statements outside the loops. The fusion algorithm is applied to one basic block at a time. This is preceded by applying the clustering algorithm to the loops of the basic block.

Let the number of loops in a basic block be m, m ≥ 1. For loop L_k, 1 ≤ k ≤ m, let n_k be the number of its name partitions, n_k ≥ 1. These are denoted by NP_k1, NP_k2, ..., NP_kn_k. The set of arrays referenced in NP_ki is denoted by A(NP_ki).

Although the NP's of one loop are by definition data independent, dependence relations can exist between NP's from different loops of a basic block of code. A name partition NP_ki is data dependent on another name partition NP_qj (k ≠ q) if and only if there exists at least one statement in NP_ki which is data dependent on a statement in NP_qj. We denote this by NP_qj => NP_ki. Similarly, a data antidependence or a data output dependence can exist between the name partitions NP_ki and NP_qj if and only if there exists at least one statement in NP_ki which is data antidependent or data output dependent on a statement in NP_qj. If NP_ki is data antidependent on NP_qj, then this is denoted by NP_ki =/> NP_qj. NP_qj =#> NP_ki means that NP_ki is data output dependent on NP_qj.

3.3.3 Correctness of Fusing Two Name Partitions

Before presenting the fusion procedure, let us discuss the question of the correctness of fusing two NP's, NP_ki and NP_qj (k < q). When we fuse the two NP's we add the set of statements of NP_qj to those of NP_ki, i.e., NP_ki = NP_ki ∪ NP_qj. The fusion of the two NP's will be valid if the following conditions are satisfied.

(A) The control structure of NP_ki and NP_qj is identical. This means that the index variable sets and the nesting structure for the two NP's are identical.

(B) If (NP_l, NP_l+1, ..., NP_l+g) is the set of NP's between NP_ki and NP_qj, then there is no data dependence, antidependence, or output dependence between NP_qj and any NP in this set. Moreover, there is no dependence between any assignment statement in NP_qj and any assignment statement which occurs outside NP's and between NP_ki and NP_qj.

We now present the general fusion algorithm. Again, we have m loops in a basic block of code (L_1, L_2, ..., L_m). Each loop L_k has n_k name partitions, (NP_k1, NP_k2, ..., NP_kn_k).
If the fusion can be done, replace A(NP ^) by A(NP j ,)L)A(NP ) and NP by NP, UNP . Decrement n. and eliminate NP... from the set of NP's of loop i. If n. = o then decrement m. (iii) Repeat step (ii) by considering NP and NP . . , j = 2, ..., n . (iv) If i = m go to step (v) else increment i and go to step (ii) . (v) If £ = n, go to step (vi) else increment 2, and go to step (ii). (vi) If k = m exit, else increment k and go to step (ii) . We note that the complexity of this algorithm is 0( (total number 2 of NP's in the basic code block) ). 3.4 Scalar Transformations Programmers usually introduce assignment statements with scalar output variables inside loops for different reasons. A scalar variable can be used as a temporary to hold the value of an expression which is common to several assignment statements. Sometimes the right-hand side expression of an assignment statement is very long and programmers pre- fer to divide the expression into parts to improve the readability of the program. Every part is assigned to a scalar variable and these are used in the right-hand side expression of the assignment statement. In another 79 possibility the assignment statement to the scalar variable can be a recurrence. 3.4.1 The PARAFRASE Compiler Scalar Transformations As will be shown in the next section, distributing the loop con- trol of an NP on its TT-blocks can be used to reduce the amount of memory required to execute the NP. In the PARAFRASE compiler several techniques are used to remove the arcs in the data dependence graph of a loop which are due to assignment statements to scalar variables. This will simplify the graph and reduce the number of statements included in a tt -block. This is useful to us because, the smaller the number of statements in the tt -blocks of an NP, the smaller the amount of memory which is needed for its execution. Of the techniques used in the PARAFRASE compiler to break data dependences due to scalars we use (without modification) the scalar renaming, induction variable substitution and forward substitution of right-hand sides of assignments statements to scalars which are used in subscript expressions. The dead code elimination pass will eliminate the assignment statements to those scalars treated by these techniques. In the PARAFRASE compiler all scalars which cannot be handled by the previous three techniques will be expanded into array variables. Figure 12~a shows an example program and its data dependence graph. Notice the cycle in the graph. In Figure 12-b the scalar has been expanded into an array of size N and thus the cycle in the dependence graph has dis- appeared. The distribution algorithm which will be presented in the next section can be used to distribute the loop of the program in Figure 12-b. The program in 12-a is undistributable. 80 "3 10 DO 10 I = 1, N T = A(I) - E(I) A(I) - B(I)*C(I) B(I) = T + F(I)/D(I) CONTINUE Figure 12 -a. A Loop Including an Assignment Statement to a Scalar Variable and Its Data Dependence Graph. 3 10 DO 10 I « 1, N T'(I) = A(I) - E(I) A(I) = B(I)*C(I) B(I) = T'(I) + F(I)/D(I) CONTINUE Figure 12 -b. The Loop of Figure 12-a After Expanding the Scalar and Its New Data Dependence Graph. 81 3.4.2 The Scalar Forward Substitution Transformation Figure 13-a shows another example program and its data dependence graph. The cycles in the graph are again due to an assignment statement to the scalar variable T. One can still use the scalar expansion technique to simplify the data dependence graph of the program in Figure 13-a. 
In fact, this is what is done in PARAFRASE. However, for this example the right-hand side of S can be forward substituted in S_ and S„. S.. can then be eliminated from the loop. This is shown in Figure 13-b. In PARAFRASE, this technique is not used because redundant computation might be intro- duced. This is the case in the loop of Figure 13-a because the scalar T is used in the right-hand side of two statements. Since PARAFRASE was written to speedup program execution, forward substitution would not be a suitable transformation. When people are compiling for parallel or pipelined machines they are not worried too much about the increase of the memory requirements of a transformed program if it can be executed on a parallel machine much faster than the original program on a serial machine. In this thesis we are concerned with compiler transformations for serial virtual memory computers. We are interested in a modified version of the PARAFRASE loop distribution transformation. In the next section we will describe our distribution transformation, the vertical distribution algorithm. Hence we are also interested in techniques to break data dependences in a loop which are introduced by assignment statements to scalar variables. How- ever, we are concerned here with the memory requirement of the program and its I/O activity. 82 3 10 DO 10 I = 1, N T = A(I)*C(I) D(I) - D(I)**2 - T**.5 F(I) = T*(A(I) - C(I)) + F(I)/C(I) CONTINUE Figure 13-a. Another loop with an Assignment Statement to a Scalar Variable and Its Data Dependence Graph. 10 DO 10 I - 1, N D(I) -D(I)**2- (A(I)*C(I))**.5 F(I) - (A(I)*C(I))*(A(I) - C(I)) + F(I)/C(I) CONTINUE © © Figure 13-b. The Use of Forward Substitution to Simplify the Data Dependence Graph of Fig. 13-a. 83 Our approach will be to use the forward substitution technique in some situations and a modified version of the scalar expansion technique in other situations. Shortly we will give some rules to be used in decid- ing what to do for every specific case. Before presenting these rules we make one observation and then explain our modification to the scalar expansion transformation. 3.4.2.1 Correctness of the Forward Substitution Transformation We note that the scalar expansion transformation can be applied to any scalar output variable of any assignment statement. This transfor- mation is always correct as long as all references to the expanded scalar are replaced by references to appropriate elements of the resulting array. The details of the scalar expansion algorithm can be found in [WOLF78]. The forward substitution transformation on the other hand, cannot be used in all cases. For example it cannot be applied to the program in Figure 12-a. To address the correctness of the forward substitution transforma- tion we make the following definitions. Definition 3.2 If the output variable of an assignment statement is a scalar variable x, then this statement is called the source statement of th e scalar x, S . The set of arrays referenced in S is denoted by AS . _x J x x Definition 3.3 A destination statement of a scalar variable x, D , is an x assignment statement which is data dependent on S . In other words S =* X X D . We denote the set of array referenced in D by AD . 
The necessary and sufficient condition for the correctness of the forward substitution transformation can now be stated as follows: If the source statement of a scalar variable x, S_x, is not a recurrence, then its right-hand side expression can be forward substituted in a destination statement of x, D_x, if and only if there is no statement executed after S_x and before D_x which is antidependent on S_x. If this condition is satisfied, then none of the input variables of S_x changes its value before the execution of D_x, and the substitution will be valid.

3.4.3 Modifying the Scalar Expansion Transformation

When a scalar variable is expanded into an array in the PARAFRASE compiler, a different element of the array is used for every iteration of the loop. Thus, for example, in Figure 12-b the expansion array, T', will be of size N. For execution on a parallel machine the loop can be distributed as shown in Figure 14-a. The distributed loop can be executed in 4 time steps on a parallel machine with N processors (we assume that performing any arithmetic operation takes one time step). On a serial machine, the program of Figure 12-a takes 4*N time steps to be executed. Thus a speedup of a factor of N has been achieved by distributing the loop.

In Section 3.5.1 we will show that although this kind of distribution, which we will call horizontal distribution, can result in reducing m_o of a loop, it may increase the I/O activity and possibly the space-time cost of the execution. In the same section we will modify the distribution algorithm to avoid any increase in the I/O activity and to ensure a reduction in the space-time cost. We will call our modified distribution algorithm vertical distribution. Figure 14-b shows the vertically distributed version of the loop of Figure 12-a.

      DO S1 I = 1, N
S1    T'(I) = A(I) - E(I)
      DO S2 I = 1, N
S2    A(I) = B(I)*C(I)
      DO S3 I = 1, N
S3    B(I) = T'(I) + F(I)/D(I)

Figure 14-a. The Loop of Figure 12-b After Applying the Horizontal Distribution Transformation

      DO 10 IP = 1, ⌈N/Z⌉
      ILB = 1 + (IP-1)*Z
      IUB = MIN(IP*Z, N)
      DO S1 I = ILB, IUB
S1    T'(MOD(I,Z) + 1) = A(I) - E(I)
      DO S2 I = ILB, IUB
S2    A(I) = B(I)*C(I)
      DO S3 I = ILB, IUB
S3    B(I) = T'(MOD(I,Z) + 1) + F(I)/D(I)
10    CONTINUE

Figure 14-b. The Vertically Distributed Version of the Loop in Figure 12-b

By using vertical distribution, the size of the expanded scalar need only be Z words, one page size. The expression MOD(I,Z) + 1 is used as a subscript expression for the expanded array. Thus, with 4 page frames, the execution of the program in Figure 14-b starts by using the first page of A and the first page of E to compute one page of T' in S1. In S2 the first page of B and the first page of C are used to modify the first page of A. In S3 the page of T' is used with the first page of F and the first page of D to write into the first page of B. In the second iteration of the outermost loop, the IP loop, the second pages of the arrays A, B, C, D, E, and F will be processed. However, the same page of T' can again be utilized to hold the temporary Z values computed in S1 to be used in S3. This will be true for all iterations of the IP loop. Thus the difference between expanding scalars in parallel machine transformations and in virtual memory computer transformations is that the size of the expansion array in the latter case is less than or equal to one page size. As mentioned before, the details of the expansion algorithm are found in [WOLF78].
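The wraparound of this subscript is easy to check by hand (our own illustration, for a small page size Z = 8):

C     I          :  1  2  3  4  5  6  7  8  9 10 ...
C     MOD(I,8)+1 :  2  3  4  5  6  7  8  1  2  3 ...
C     Each block of Z consecutive iterations touches every element
C     of the one-page expansion array T' exactly once, so a single
C     page can be reused for all iterations of the IP loop.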
We use the same algorithm except for reducing the size of the expansion array. In Figure 15 we show another example program and its vertically distributed version. Note that we have expanded the output scalar variable of statement S1 into an appropriate one-page array.

3.4.4 Choosing Between Scalar Expansion and Forward Substitution

When the control of an NP is vertically distributed on its π-blocks, the critical memory allotment, m_o, for each π-block will be roughly equal to the number of arrays referenced in it. Expanding the scalar output variable of an assignment statement into an array will increase m_o of the π-block containing this assignment statement by one page frame. Moreover, all references to the scalar variable in other π-blocks must be replaced by references to the appropriate elements of the expansion array. Thus, the m_o's for these π-blocks will also be increased. For example, in Figure 14-b scalar expansion has increased the number of arrays referenced in both statements S1 and S3. However, by vertical distribution, which is possible in Figure 14-b because of scalar expansion, m_o is equal to 4, instead of 6 for the original loop in Figure 12-a.

      DO 10 I = 1, N
      DO 10 J = 1, N
S1    T = A(I,J)
S2    A(I,J) = A(J,I) + B(I,J)
S3    A(J,I) = T + C(I,J)
10    CONTINUE

Figure 15-a. An Example Loop

      DO 10 IP = 1, ⌈N/RZ⌉
      ILB = 1 + (IP-1)*RZ
      IUB = IP*RZ
      DO 10 JP = 1, ⌈N/RZ⌉
      JLB = 1 + (JP-1)*RZ
      JUB = JP*RZ
      DO S1 I = ILB, IUB
      DO S1 J = JLB, JUB
S1    T'(MOD(I,RZ) + 1, MOD(J,RZ) + 1) = A(I,J)
      DO S2 I = ILB, IUB
      DO S2 J = JLB, JUB
S2    A(I,J) = A(J,I) + B(I,J)
      DO S3 I = ILB, IUB
      DO S3 J = JLB, JUB
S3    A(J,I) = T'(MOD(I,RZ) + 1, MOD(J,RZ) + 1) + C(I,J)
10    CONTINUE

Figure 15-b. The Vertically Distributed Version of the Loop in Figure 15-a

If forward substitution is possible and if AS_x ⊆ AD_x, then substituting the right-hand side expression of S_x in D_x will not increase m_o of D_x. If this can be done for all the destination statements of x, S_x can be eliminated. Otherwise x must be expanded into an array, and references to x in those statements for which AS_x ⊄ AD_x must be replaced by references to appropriate elements of the expansion array.

From the previous discussion we conclude that scalar expansion should not be done unless it is incorrect to use the forward substitution transformation, or AS_x ⊄ AD_x for some of the destination statements of x.

3.5 Distribution of Name Partitions

By applying the clustering and the fusion transformations to a program we expect to reduce its I/O activity, space requirement, and space-time cost. At this point of the transformation process, the different NP's of a basic block of code in the program will reference different sets of arrays. In a particular NP, however, it is not necessary that all the arrays of the NP will be referenced in each of its statements, or even by any single statement. Thus it is intuitive that by distributing the control of an NP on its π-blocks, its space requirement can be reduced to roughly the maximum number of arrays referenced in any of its π-blocks, instead of the total number of arrays referenced in the NP.

In Section 3.5.1 we will present the distribution algorithm as currently implemented in the PARAFRASE compiler. In the same section we will differentiate between basic and nonbasic π-blocks. As mentioned previously, although this kind of distribution, the horizontal distribution, reduces m_o of an NP, it might increase its I/O activity and space-time cost.
We will discuss an example to illustrate this point. For NP's with basic π-blocks we describe the vertical distribution algorithm in Section 3.5.2. This is the horizontal distribution algorithm modified by the page indexing transformation. Vertical distribution reduces m_o of an NP but does not increase its I/O activity. In Section 3.5.2.1 we describe the algorithm when used for elementary loops. In Section 3.5.2.2 we discuss the algorithm when applied to multinested loops in which multi-dimensional arrays are referenced. In the same section we illustrate the use of the page indexing transformation in matching the pattern of reference of multi-dimensional arrays to their storage scheme. The general vertical distribution algorithm is presented in Section 3.5.2.3. Some implementation issues will also be considered in the same section. In Section 3.5.2.4 we present two theorems to be used in testing the correctness of applying the page indexing transformation. Transforming NP's with nonbasic π-blocks is discussed in Section 3.5.3.

3.5.1 Horizontal Distribution of Name Partitions

We apply the horizontal distribution algorithm [KUCK78] to all NP's in which the arrays of the NP are not all referenced in every statement of the NP. If none of the arrays referenced in the NP is a multi-page array, horizontal distribution will be the last transformation applied to the NP. Otherwise, the method of distributing the control of the NP on its π-blocks will be modified using the page indexing transformation, as will be described in the next section.

3.5.1.1 The Horizontal Distribution Algorithm

(i) By analyzing the subscript expressions and the index set for each index variable of the NP, construct its data dependence graph.
(ii) Identify the π-blocks of the NP as defined in Section 3.1. We define the following partial ordering relations between two π-blocks, π_i and π_j:
    (a) π_i > π_j if and only if there exist S_k ∈ π_i and S_l ∈ π_j such that S_k => S_l;
    (b) π_i > π_j if and only if there exist S_k ∈ π_i and S_l ∈ π_j such that S_k =#> S_l;
    (c) π_j > π_i if and only if there exist S_k ∈ π_i and S_l ∈ π_j such that S_l =/> S_k.
We order the π-blocks of the NP according to these three relations. Note that the resulting ordering is not unique.
(iii) Distribute the NP control on its ordered π-blocks.

Figure 16 shows an NP, its data dependence graph, and its horizontally distributed version.

3.5.1.2 The Problem with Horizontally Distributing an NP with Multi-page Arrays

If multi-page arrays are referenced in different π-blocks of an NP, then the number of page transfers will be increased if the NP is horizontally distributed. As an example, consider the program in Figure 17-a. The critical memory allotment for this NP, m_o, is equal to 3, and the total number of page faults (using the LRU replacement algorithm) is 3*⌈N/Z⌉. In the distributed NP of Figure 17-b, m_o is reduced to 2 page frames. However, the total number of page faults is increased to 6*⌈N/Z⌉. The space-time cost is increased by a factor of (2*6*⌈N/Z⌉)/(3*3*⌈N/Z⌉) = 4/3.

In the undistributed NP, statements S1, S2, and S3 will issue all their references to a particular page of the A array while this page is in main memory. In the horizontally distributed version, statement S1 will issue its references to an A page, then this page will be replaced. The same page will be reloaded into main memory when it is referenced by S2 and again when it is referenced by S3.
Similarly, a B page will be loaded twice, once when it is referenced in S2 and again when it is referenced in S3. Note that the distributed program would have no problems if the size of each array were one page or less. Before curing the increased I/O problem by adding the page indexing step to the horizontal distribution algorithm, let us differentiate between basic and nonbasic π-blocks.

(a) The NP:

      DO S10 J = 1, NY1
      DO S5 I = 1, NX
S1    QVT1'(I,J) = QV(I,J) + TS*QV1(I,J)
S3    QV(I,J) = QV1(I,J)
S5    QV1(I,J) = QVT1'(I,J)
S7    QV(NX,J) = QV(1,J)
S9    QV(NXP,J) = QV(2,J)
S10   QV(NX+2,J) = QV(3,J)

(b) The Dependence Graph

(c) The Distributed NP:

      DO S1 J = 1, NY1
      DO S1 I = 1, NX
S1    QVT1'(I,J) = QV(I,J) + TS*QV1(I,J)
      DO S3 J = 1, NY1
      DO S3 I = 1, NX
S3    QV(I,J) = QV1(I,J)
      DO S5 J = 1, NY1
      DO S5 I = 1, NX
S5    QV1(I,J) = QVT1'(I,J)
      DO S7 J = 1, NY1
S7    QV(NX,J) = QV(1,J)
      DO S9 J = 1, NY1
S9    QV(NXP,J) = QV(2,J)
      DO S10 J = 1, NY1
S10   QV(NX+2,J) = QV(3,J)

Figure 16. A Name Partition and Its Horizontally Distributed Version

      DO S3 I = 1, N
S1    C(I) = C(I) - A(I)
S2    A(I) = 4*A(I)*B(I) - 2
S3    B(I) = B(I)*A(I) + B(I)

Figure 17-a. A Loop Referencing Multi-page Arrays

      DO S1 I = 1, N
S1    C(I) = C(I) - A(I)
      DO S2 I = 1, N
S2    A(I) = 4*A(I)*B(I) - 2
      DO S3 I = 1, N
S3    B(I) = B(I)*A(I) + B(I)

Figure 17-b. Horizontally Distributing the Loop of Figure 17-a

Definition 3.4 A basic π-block is a π-block in which all the statements are at the same nest depth level. Some of the statements of a nonbasic π-block will fall at different nest depth levels.

The vertical distribution transformation handles NP's which have only basic π-blocks. Such NP's are called basic NP's.

3.5.2 Page Indexing and Vertical Distribution of Basic Name Partitions

In the following subsections we will often need to refer to a set of consecutive integers. We now define a function, INT, which will denote such a set. We also give a formal definition of a basic NP.

Definition 3.5 Let w and k be two integers, w > 0. The function INT(w,k) will denote the set of consecutive integers {(k-1)*w+1, (k-1)*w+2, ..., (k-1)*w+w-1, k*w}.

Definition 3.6 A basic NP, BNP, is denoted by

      BNP = (I_1 ← σ_1, I_2 ← σ_2, ..., I_d ← σ_d)(B_1, B_2, ..., B_n)

where I_j is an NP index, σ_j is an ordered index set, and B_j is a basic π-block or another BNP.

In some cases index variables of an NP are never used in subscript expressions of arrays. They are used as some kind of counters. We wish to differentiate between the DO statements associated with such index variables and those associated with index variables used in subscript expressions.

Definition 3.7 A type-A DO statement has an index variable which is used in some subscript expression in an NP. If the index variable of a DO statement is never used in a subscript expression, then such a DO statement is of type B. In Figure 18, DJ and DI are type-A DO statements. DIJ is of type B.

3.5.2.1 Vertical Distribution of Elementary NP's

By definition, an elementary NP has one DO statement and no multi-dimensional arrays. The NP in Figure 17-a is an example of an elementary NP. Let σ be the ordered index set. Let I_min be the smallest integer in σ and I_max be its maximum integer. If I_min ∈ INT(Z,k_min) and I_max ∈ INT(Z,k_max), then vertical distribution of the elementary NP means executing its first π-block, π_1, for the ordered index set σ ∩ INT(Z,k_min), then executing π_2 for the same set, and so on until the last π-block is executed for this set of values of the index variable.
The same process is repeated for the ordered index set σ ∩ INT(Z,k_min+1). We keep iterating until we execute all the π-blocks for the last subset of the index variable set, namely σ ∩ INT(Z,k_max).

Figure 19 shows the vertically distributed version of the NP in Figure 17-a. Vertical distribution is achieved by adding a set of statements called the page indexing statement set. In Figure 19 these are the ADD1I, ADD2I, and ADD3I statements. ADD1I is the paging DO statement. Its scope includes all the π-blocks in the NP. Statement ADD2I defines the lower bound of σ ∩ INT(Z,IP) for all the values of IP: 1, 2, ..., ⌈N/Z⌉. Similarly, ADD3I defines the upper bound of σ ∩ INT(Z,IP). We will refer to statements ADD2I and ADD3I as the lower bound and upper bound definition statements of the index variable I.

DIJ   DO 10 IJ = 1, 3
      PK(1) = 1. - G*DZ/(2.*PT(1)*QVO(1))
DJ    DO 10 J = 1, NY1
      PK(J) = PK(J-1)*CP*QVO(J)
DI    DO 10 I = 1, NX1
      QV(I,J) = HUM(J)*QVS
      QV1(I,J) = QV(I,J)
10    CONTINUE

Figure 18. A Loop with Type-A and Type-B DO Statements

ADD1I DO 10 IP = 1, ⌈N/Z⌉
ADD2I ILB = 1 + (IP-1)*Z
ADD3I IUB = MIN(IP*Z, N)
      DO S1 I = ILB, IUB
S1    C(I) = C(I) - A(I)
      DO S2 I = ILB, IUB
S2    A(I) = 4*A(I)*B(I) - 2
      DO S3 I = ILB, IUB
S3    B(I) = B(I)*A(I) + B(I)
10    CONTINUE

Figure 19. The Vertically Distributed Version of the NP in Figure 17-a

Note that the vertically distributed program in Figure 19 does not have the increased I/O problem of the horizontally distributed program in Figure 17-b. With two page frames, two page faults will occur when execution is started, to allocate one page each of the arrays A and C. This is followed by a burst of CPU activity during which the S1 loop will be executed for Z iterations. A page fault will occur when a B page replaces the C page as the execution of the S2 loop is started. The execution of this loop will last for 3*Z memory references. The same A and B pages will be used when the S3 loop is executed for 4*Z references. Next the value of IP is incremented and a new execution cycle is started. Thus the number of page faults per cycle is 3 and the total number of page faults is 3*⌈N/Z⌉. This is equal to the number of faults for the undistributed program. Since m_o was decreased from 3 to 2, the space-time cost was also decreased by the same factor, namely, 3/2. Table 5 compares the space, I/O time, and the space-time cost of the program in Figure 17-a, its horizontally distributed version, and its vertically distributed version.

Table 5. Resource Requirements of the Program in Figure 17-a, Its Horizontally Distributed Version, and Its Vertically Distributed Version

                                  Critical Memory   Total Number of   Space-Time
                                  Allotment         Page Faults       Cost
  Original Program                3                 3*⌈N/Z⌉            9*⌈N/Z⌉
  After Horizontal Distribution   2                 6*⌈N/Z⌉           12*⌈N/Z⌉
  After Vertical Distribution     2                 3*⌈N/Z⌉            6*⌈N/Z⌉

3.5.2.2 Vertical Distribution of Multi-nested Basic Name Partitions with Multi-dimensional Arrays

As was mentioned in Chapter 2, we will adopt the submatrix storage scheme to store multi-dimensional arrays. We start this section by illustrating the usefulness of the page indexing transformation in matching the pattern of reference of multi-dimensional arrays to their storage scheme. We then describe using the page indexing transformation to vertically distribute multi-nested basic NP's which reference multi-dimensional arrays. Consider the following matrix addition program:
Program 10-a.
      DO 10 I = 1, N
      DO 10 J = 1, N
10    A(I,J) = B(J,I) + C(I,J)

Although the behavior of this program is improved by storing the arrays using the submatrix scheme rather than row-wise or column-wise, the MTBPF is still lower than predicted by the ELM. According to the ELM the MTBPF is 3*Z. For Program 10-a the MTBPF is 3*RZ, where RZ = √Z. The page indexing transformation will make the MTBPF equal to 3*Z. The transformed program is shown below.

Program 10-b.
ADD1I DO 10 IP = 1, ⌈N/RZ⌉
ADD2I ILB = 1 + (IP-1)*RZ
ADD3I IUB = MIN(IP*RZ, N)
ADD1J DO 10 JP = 1, ⌈N/RZ⌉
We illustrate the vertical distribution procedure by considering the following example: Program 11-a. DI DO 10 I = 1, N DJ DO 10 J = 1, N s i C(I,J) = DK DO 10 K = 1, N S 2 C(I,J) = C(I,J) + A(I,K)*B(K,J) 10 CONTINUE There are two ir-blocks in this program tt = {S }, and tt = {S }. There are three type-A DO statements DI, DJ, and DK. The scope of DI and DJ includes both tt and tt . The scope of DK includes only tt 2< Thus the 101 scope of the paging DO loops in the vertically distributed version of the program (ADD1I and ADD1J in the program below) will include tt and tt . The scope of the paging loop in the statement set replacing DK will include only S . The vertically distributed version of Program 11-a is as follows: Program 11-b . ADD1I ADD2I ADD3I ADD1J ADD2J ADD3J DO 10 IP = 1, [N/RZl ILB = 1 + (IP-1)*RZ IUB = MIN(IP*RZ,N) DO 10 JP = 1, [N/RZ] JLB = 1 + (JP-1)*RZ JUB = MIN(JP*RZ,N) DO S I = ILB, IUB DO S J = JLB, JUB s l C(I,J) = ADD1K DO 10 KP=l,[N/RZl ADD2K KLB = 1 + (KP-1)*RZ ADD3K KUB = MIN(KP*RZ,N) 2 10 DO S I = ILB, IUB DO S J = JLB, JUB DO S K = KLB, KUB C(I,J) = C(I,J) + A(I,K)*B(K,J) CONTINUE Note that in this program a page of the C array will be initial- ized in tt then the same page will be referenced in tt . Hence with verti- cal distribution , a page which is referenced in several TT-blocks will not leave memory until it has been used in all these TT-blocks. 102 3.5.2.3 Vertical Distribution of Basic NP's - the General Algorithm and Some Implementation Considerations After introducing the concept of vertical distribution by examples in the previous two sections we now present the general algorithm. (i) Construct the data dependence graph and identify the TT-blocks of the NP as described in Section 3.5.1.1 • (ii) Start with the outmost type-A DO statement. Replace it by an appropriate page indexing statement set. The scope of the paging loop is the same as the scope of the replaced DO state- ment, (iii) Enclose each Tr-block which was within the scope of the replaced DO statement by a DO statement using the same index variable. The upper and lower bounds of the index set are as defined in the added page indexing statement set. The increment is the same as in the replaced DO statement, (iv) Repeat (ii) and (iii) for the next outermost type-A DO state- ment. This process continues until all type-A DO statements have been replaced. The control of all type-B DO statements will be distributed on the relevant TT-blocks as done in the horizontal distribution algorithm. We note that the added complexity of the distribution algorithm due to page indexing is (// of DO statements in the NP) . In all the examples discussed in the previous sections all the subscript expressions were linear functions of one index variable, i.e., of the form a*index variable + 3. Moreover, for these examples the coef- ficient of the index variable, a, was the same for all the subscript 103 expressions and it was equal to 1. 3 was equal to zero in all expressions. If 3 ^ for some expressions, we will still use the same implementation techniques as illustrated in the examples. If a ^ 1 but it was the same number, c, for all subscript expressions, our implementation method can be modified slightly to accomodate such cases. This is illustrated in the following example. Program 12-a . DO 1 I = 1, N S A(3I) = B(3I)*3 S 2 D(3I) = B(3I-l)/3 1 CONTINUE The vertically distributed version of this program is shown below Program 12-b . 
ADD1I DO 1 IP = 1, Tn/ lZ/3 J 1 ADD2I ILB = 1 + (IP-1)*IZ/3J ADD3I IUB = MIN(IP*lZ/3J,N) DO S I = ILB, IUB S A(3I) = B(3I)*3 DO S I = ILB, IUB S 2 D(3I) = B(3I-l)/3 1 CONTINUE We note that IZ/3J is the number iterations which is spent by Program 12-a referencing one page of A, one page of B, and one page of D. Thus [N/IZ/3J1 is total number of pages of A referenced. Similarly the same number of pages of the B and C arrays are referenced. Program 12-b 104 will have [N/[Z/3J1 cycles. In each cycle 2*LZ/3J references will be made to two pages of A and B in the S.. loop. This is followed by 2* LZ/3 J references made to the same B page and a D page in the S_ loop. In general, if the coefficient of all the index variables in all the subscript expressions was the number c, Z should be replaced by LZ/cJ in the added statement set (or RZ should be replaced by [RZ/cJ when multi- dimensional arrays are involved) . If the coefficient of the index vari- ables were not the same for all subscript expressions, we use their mini- mum, c . . Thus Z will be replaced by IZ/c . I in the added statement mxn r ' l mxn J set. Such cases , where the subscript expressions are more complex functions of one or more index variables ,are of little practical interest and hence we will not discuss such cases any further. Before leaving this section we remark that the lower bound of the added paging DO statement was equal to 1 in all our examples. This is not true in the general case. If the lower bound of the index set of the re- placed DO statement was I . and I . eINT(Z,k . ) then the lower bound of the min min mm paging index set will be k . (in the case where multi-dimensional arrays mm are involved I . cINT(RZ,k . )). Note that the added lower bound definition mm mm statement should be adjusted to make ILB = I . when IP = k . . mm mm 3.5.2.4 The Correctness of the Page Indexing Transformation The correctness of the horizontal distribution algorithm is obvious from the definition of data dependences and TT-blocks. When page indexing is used to achieve vertical distribution, the order of referencing elements of multi-dimensional arrays in 7T-blocks is different from the order of their reference as specified in the undistributed program. Thus we need to establish some necessary and sufficient conditions which can be used 105 to test whether the page indexing transformation is valid. We will illus- trate the problem by considering the following example. Program 13-a. DO 1 I = 1, 48 DO 1 J = 1, 48 5 1 A(I,J) = B(I,J)*2 5 2 C(I,J) = A(I-1, J+l)/2 1 CONTINUE In this program there is one dependence relation, namely S is data dependent on S . Thus there will be no cycles in the data depend- ence graph and the program can be horizontally distributed as shown below. Program 13-b . DO S I = 1, 48 DO S J = 1, 48 Sj A(I,J) = B(I,J)*2 DO S 2 I = 1, 48 DO S J = 1, 48 s 2 C(I,J) = Att-1, J+D/2 For a page size of 64 words we get the following program if we apply page indexing to Program 13-b. Program 13-c ADD1I DO 10 IP = 1, 6 ADD2I ILB = 1 + (IP-1)*8 ADD3I IUB = IP*8 106 ADD1J DO 10 JP = 1, 6 ADD2J JLB = 1 + (JP-1)*8 ADD3J JUB = JP*8 DO S I - ILB, IUB DO S J - JLB, JUB 5 1 A(I,J) = B(I,J)*2 DO S 2 I = ILB, IUB DO S J = JLB, JUB 5 2 C(I,J) = A(I-1, J+l)/2 Program 13-c will produce erroneous results. To see this consider for example the value assigned to C(2, 8) in S . On the right-hand side of S the value of A(l, 9) is used in computing C(2, 8). In Programs 13-a and 13-b this value of A(l, 9) will be computed in S . 
3.5.2.4 The Correctness of the Page Indexing Transformation

The correctness of the horizontal distribution algorithm is obvious from the definition of data dependences and π-blocks. When page indexing is used to achieve vertical distribution, however, the order of referencing elements of multi-dimensional arrays in π-blocks is different from the order of their reference as specified in the undistributed program. Thus we need to establish some necessary and sufficient conditions which can be used to test whether the page indexing transformation is valid. We will illustrate the problem by considering the following example.

Program 13-a.

           DO 1 I = 1, 48
           DO 1 J = 1, 48
   S1         A(I,J) = B(I,J)*2
   S2         C(I,J) = A(I-1, J+1)/2
   1       CONTINUE

In this program there is one dependence relation, namely S2 is data dependent on S1. Thus there will be no cycles in the data dependence graph, and the program can be horizontally distributed as shown below.

Program 13-b.

           DO S1 I = 1, 48
           DO S1 J = 1, 48
   S1         A(I,J) = B(I,J)*2
           DO S2 I = 1, 48
           DO S2 J = 1, 48
   S2         C(I,J) = A(I-1, J+1)/2

For a page size of 64 words we get the following program if we apply page indexing to Program 13-b.

Program 13-c.

   ADD1I   DO 10 IP = 1, 6
   ADD2I      ILB = 1 + (IP-1)*8
   ADD3I      IUB = IP*8
   ADD1J   DO 10 JP = 1, 6
   ADD2J      JLB = 1 + (JP-1)*8
   ADD3J      JUB = JP*8
           DO S1 I = ILB, IUB
           DO S1 J = JLB, JUB
   S1         A(I,J) = B(I,J)*2
           DO S2 I = ILB, IUB
           DO S2 J = JLB, JUB
   S2         C(I,J) = A(I-1, J+1)/2

Program 13-c will produce erroneous results. To see this, consider for example the value assigned to C(2,8) in S2. On the right-hand side of S2 the value of A(1,9) is used in computing C(2,8). In Programs 13-a and 13-b this value of A(1,9) is computed in S1 before it is used. In Program 13-c the value of A(1,9) used to compute C(2,8) is an old value; i.e., when the assignment to C(2,8) is made, the new value computed in S1 for A(1,9) has not yet been stored in A(1,9). Hence Program 13-a cannot be vertically distributed.

To simplify our discussion of this subject we will consider only a basic π-block with only one assignment statement. This will be of the form:

Program 14-a.

           DO S I1 = 1, N
           DO S I2 = 1, N
   S          A(F1(I1), F2(I2)) = A(f1(I1), f2(I2)) + <an expression not containing references to A>

where F1(I1) and f1(I1) are linear functions of I1, and similarly F2(I2) and f2(I2) are linear functions of I2. At the end of this section we will discuss extending our analysis and theorems to cover more general cases. The problem here is to find necessary and sufficient conditions for the correctness of page indexing Program 14-a, i.e., we want to test whether the following program will produce results identical to those produced by Program 14-a:

Program 14-b.

           DO S IP1 = 1, ⌈N/RZ⌉
              ILB1 = 1 + (IP1-1)*RZ
              IUB1 = MIN(IP1*RZ,N)
           DO S IP2 = 1, ⌈N/RZ⌉
              ILB2 = 1 + (IP2-1)*RZ
              IUB2 = MIN(IP2*RZ,N)
           DO S I1 = ILB1, IUB1
           DO S I2 = ILB2, IUB2
   S          A(F1(I1), F2(I2)) = A(f1(I1), f2(I2)) + ...

Figure 20-a shows the I1×I2 plane. Each point (i1, i2) in this plane can be associated with the execution of the statement S when I1 = i1 and I2 = i2. One can imagine a cursor that moves from one point to another in the I1×I2 plane as S is executed with the index variables taking the values of the coordinates of the first point, then executed with the index variables taking the values of the coordinates of the second point, etc. Thus the cursor will trace a particular curve in the I1×I2 plane during the execution of S (actually it will visit discrete points on the curve).

[Figure 20-a. The curve traced by the cursor in the I1×I2 plane when Program 14-a is executed - figure not reproduced in this scan.]

[Figure 20-b. The curve traced in the I1×I2 plane when Program 14-b is executed - figure not reproduced in this scan.]

Figure 20-a shows the curve traced by the cursor when Program 14-a is executed; in the figure, N = 8. If the cursor passes through the point P with the coordinates (i1, i2) before the point P' with the coordinates (i1', i2'), we will say that P precedes P' and denote this by P < P' or (i1, i2) < (i1', i2'). Figure 20-b shows the curve traced by the cursor when Program 14-b is executed; in the figure, RZ = 2.

According to the execution sequencing of S in Program 14-a, if OUT(S(i1', i2')) ∈ IN(S(i1, i2)) and (i1', i2') < (i1, i2), then there is a dependence vector V = (i1, i2)(i1', i2') from point P' to point P in the I1×I2 plane. In general, if there are references to several different elements of A on the right-hand side of S, there might be several dependence vectors from several points P', P'', P''', ... to the point P. The points P', P'', ... are called the source points of these dependence vectors and the point P is the destination point. The page indexing transformation will be correct if and only if, for all computed points, the cursor passes through all the dependence source points of each given point before it passes through the point itself.

As an example consider the program:

Program 15.

           DO 10 I1 = 1, 4
           DO 10 I2 = 1, 4
   10         A(I1+1, I2) = A(5-I1, 5-I2)

Table 6 lists the points visited by the cursor, their coordinates in the I1×I2 plane, OUT(S(i1, i2)), and IN(S(i1, i2)).
Table 6. The Points on the Execution Trace of Program 15.

   Point   Coordinates   OUT(S(i1,i2))   IN(S(i1,i2))
   P1      (1, 1)        a(2,1)          a(4,4)
   P2      (1, 2)        a(2,2)          a(4,3)
   P3      (1, 3)        a(2,3)          a(4,2)
   P4      (1, 4)        a(2,4)          a(4,1)
   P5      (2, 1)        a(3,1)          a(3,4)
   P6      (2, 2)        a(3,2)          a(3,3)
   P7      (2, 3)        a(3,3)          a(3,2)
   P8      (2, 4)        a(3,4)          a(3,1)
   P9      (3, 1)        a(4,1)          a(2,4)
   P10     (3, 2)        a(4,2)          a(2,3)
   P11     (3, 3)        a(4,3)          a(2,2)
   P12     (3, 4)        a(4,4)          a(2,1)
   P13     (4, 1)        a(5,1)          a(1,4)
   P14     (4, 2)        a(5,2)          a(1,3)
   P15     (4, 3)        a(5,3)          a(1,2)
   P16     (4, 4)        a(5,4)          a(1,1)

Examining the table we find the following dependence vectors:

   {(3,4), (1,1)}   {(3,3), (1,2)}   {(3,2), (1,3)}
   {(3,1), (1,4)}   {(2,4), (2,1)}   {(2,3), (2,2)}

[Figure 21. Dependence vectors for Program 15 - figure not reproduced in this scan.]

Figure 21 shows these dependence vectors in the I1×I2 plane. If this program is page indexed for a page size of 4, the cursor will visit the points of the I1×I2 plane in the following order:

   P1, P2, P5, P6, P3, P4, P7, P8, P9, P10, P13, P14, P11, P12, P15, P16

We note that the source point of every dependence vector is visited before its destination point. Thus the page indexing transformation is valid for a page size of 4. The transformation will not be valid, however, for a page size of 9; in this case P9 will be visited before P4.
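The check just performed can be done mechanically once the dependence vectors are in hand. The program below is a small sketch of such a checker (our coding; it is not part of the thesis's implementation). Its key observation is that the cursor of the page-indexed program visits points in an order that is lexicographic in (page row, page column, I1, I2), which holds even when RZ does not divide N.

      PROGRAM CHKVEC
C     Brute-force check of the page indexing validity criterion for
C     the six dependence vectors of Program 15 (source -> destination),
C     with paging blocks of RZ x RZ for RZ = 2 and RZ = 3.
      INTEGER SRC(2,6), DST(2,6), K, RZ
      LOGICAL PRECED, OK
      DATA SRC /1,1, 1,2, 1,3, 1,4, 2,1, 2,2/
      DATA DST /3,4, 3,3, 3,2, 3,1, 2,4, 2,3/
      DO 20 RZ = 2, 3
         OK = .TRUE.
         DO 10 K = 1, 6
            OK = OK .AND.
     1           PRECED(SRC(1,K), SRC(2,K), DST(1,K), DST(2,K), RZ)
   10    CONTINUE
         WRITE (*, *) 'RZ =', RZ, '  valid =', OK
   20 CONTINUE
      END

      LOGICAL FUNCTION PRECED(I1, J1, I2, J2, RZ)
C     .TRUE. if the page-indexed cursor visits (I1,J1) before (I2,J2)
      INTEGER I1, J1, I2, J2, RZ, IP1, JP1, IP2, JP2
      IP1 = (I1 - 1) / RZ
      JP1 = (J1 - 1) / RZ
      IP2 = (I2 - 1) / RZ
      JP2 = (J2 - 1) / RZ
      IF (IP1 .NE. IP2) THEN
         PRECED = IP1 .LT. IP2
      ELSE IF (JP1 .NE. JP2) THEN
         PRECED = JP1 .LT. JP2
      ELSE IF (I1 .NE. I2) THEN
         PRECED = I1 .LT. I2
      ELSE
         PRECED = J1 .LT. J2
      END IF
      RETURN
      END

Run on the vectors above, the sketch reports valid for RZ = 2 (page size 4) and invalid for RZ = 3 (page size 9), reproducing the conclusion of the example.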
We present next a theorem to be used in testing the validity of the page indexing transformation for all page sizes.

Theorem 3.2. For the program:

           DO S I1 = 1, N
           DO S I2 = 1, N
   S          A(F1(I1), F2(I2)) = A(f1(I1), f2(I2)) + <an expression not containing references to A>

let F1, f1 be linear functions of I1 and F2, f2 be linear functions of I2. Moreover, let

   Y1 = {F1(1), F1(2), ..., F1(N)}      y1 = {f1(1), f1(2), ..., f1(N)}
   Y2 = {F2(1), F2(2), ..., F2(N)}      y2 = {f2(1), f2(2), ..., f2(N)}

Then the page indexing transformation cannot be applied to this program if and only if both of the following conditions are true:

C1: T1 = Y1 ∩ y1 ≠ ∅, i.e.,

   T1 = {F1(k11), F1(k12), ..., F1(k1m)} = {f1(k21), f1(k22), ..., f1(k2m)},

and there exist k1p and k2p, 1 ≤ p ≤ m, such that k1p < k2p. Note that F1(k1p) = f1(k2p) ∈ T1.

C2: T2 = Y2 ∩ y2 ≠ ∅, i.e.,

   T2 = {F2(j11), F2(j12), ..., F2(j1ℓ)} = {f2(j21), f2(j22), ..., f2(j2ℓ)},

and there exist j1q and j2q, 1 ≤ q ≤ ℓ, such that j1q > j2q. Note that F2(j1q) = f2(j2q) ∈ T2.

Proof: The theorem states that the combined condition C = C1·C2 is a necessary and sufficient condition for the page indexing transformation not to be valid. This is equivalent to saying that the page indexing transformation is valid if and only if C1 is not true or C2 is not true.

If T1 is an empty set, then C1 is not true. Note that since F1 and f1 are functions only of I1, the program will write in a particular row of A using only points from a single row. Thus if T1 = ∅, then when the program is writing in a row of A it will read values from points in a row which was never (and will never be) written into. Thus there will be no data dependence vectors between any two points of the I1×I2 plane. This means that the cursor can visit the points of the I1×I2 plane in any order, and hence the page indexing transformation will be valid.

C2 will not be satisfied if T2 is empty. By an argument similar to the one presented in the previous paragraph, if T2 = ∅ there will be no data dependence vectors between any two points of the I1×I2 plane, and the transformation will be valid.

If T1 ≠ ∅ and T2 ≠ ∅, then dependence vectors may exist. Consider Figure 22. When the cursor is at the point P = (k2p, j2q) (i.e., the program is assigning a value to A(F1(k2p), F2(j2q))), the value of A(f1(k2p), f2(j2q)) will be used on the right-hand side of S. If f1(k2p) ∈ T1 and f2(j2q) ∈ T2, then there must exist k1p such that F1(k1p) = f1(k2p) and j1q such that F2(j1q) = f2(j2q). Thus there will exist a vector from the point P' = (k1p, j1q) to the point P = (k2p, j2q). Let θ be the angle between the vector drawn from P' to P and the I2 direction. As shown in Figure 22, θ can take any value between 0° and 360°.

[Figure 22. Dependence vectors in the I1×I2 plane - figure not reproduced in this scan.]

From our previous description of the manner in which the cursor travels in the I1×I2 plane when page indexing is used (see Figure 20-b), we conclude that the transformation will be valid for all page sizes if and only if 0° ≤ θ ≤ 90° or 180° ≤ θ ≤ 360°. In other words, the transformation will not be valid if and only if 90° < θ < 180°. For 90° < θ < 180°, sin θ > 0 and hence k2p − k1p > 0; moreover, cos θ < 0 and hence j2q − j1q < 0. Q.E.D.

Since F1 and f1 are linear functions of I1, and similarly F2 and f2 are linear functions of I2, the following theorem can be used to test whether condition C1 or C2 of Theorem 3.2 is satisfied.

Theorem 3.3 [BANE78]: Given the two functions f(I) = α + aI and g(J) = β + bJ, where α, β, a, b are integer constants (with a, b, and a − b nonzero) and I is an integer variable such that 1 ≤ I ≤ N, the two sets {f(1), ..., f(N)} and {g(1), ..., g(N)} intersect, with at least two integers i, j such that f(i) = g(j) and i < j, if and only if the following conditions are satisfied:

(A) gcd(a, b) = d divides β − α; and
(B) ⌈max U(i0, j0)⌉ ≤ ⌊min V(i0, j0)⌋

where

(i) gcd(a, b) is the greatest common divisor of a and b;
(ii) (i = i0, j = j0) is any solution to the equation ai − bj = β − α;
(iii) the two sets U = U(i0, j0) and V = V(i0, j0) are defined as follows:

   (1 − i0)*d/b is in U if b > 0, in V if b < 0;
   (N − i0)*d/b is in U if b < 0, in V if b > 0;
   (1 − j0)*d/a is in U if a > 0, in V if a < 0;
   (N − j0)*d/a is in U if a < 0, in V if a > 0;
   (i0 − j0 + 1)*d/(a−b) is in U if a > b, in V if a < b.

Proof: see [BANE78].

We now illustrate the use of Theorem 3.3 in testing C1 and C2 of Theorem 3.2. Consider the following program:

           DO S I1 = 1, 9
           DO S I2 = 1, 9
   S          A(2*I1-1, I2+2) = A(I1+1, 10-I2)

We first check whether C1 is true. Thus we test whether the two functions f(I) = 2I − 1 = aI + α and g(J) = J + 1 = bJ + β intersect and whether there are some i and j such that f(i) = g(j) and i < j. Here gcd(a, b) = 1 and β − α = 2, so gcd(a, b) divides β − α. A particular solution to the equation 2i − j = 2 is i0 = 10 and j0 = 18. U is the set {−9, −8.5, −7} and V is the set {−1, −4.5}; ⌈max U(i0, j0)⌉ = −7 and ⌊min V(i0, j0)⌋ = −5. Hence condition (B) is satisfied, and C1 is true of this program.

Now we test whether C2 also holds. Thus we test whether the two functions f(I) = −I + 10 = aI + α and g(J) = J + 2 = bJ + β intersect at some I = i and J = j such that i < j. Here we have α − β = 8 and gcd(a, b) = 1, so condition (A) is satisfied. A particular solution to the equation −i − j = −8 is i0 = 2 and j0 = 6. Thus we have:

   U = {−1, −3},       ⌈max U(i0, j0)⌉ = −1
   V = {7, 5, 3/2},    ⌊min V(i0, j0)⌋ = 1

Hence condition (B) is also satisfied and C2 holds for this program. Since both C1 and C2 are true, page indexing cannot be applied to this program.
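Theorem 3.3 lends itself to direct coding. The sketch below is our own (names and all), assuming a, b, and a − b nonzero as in the theorem; it finds a particular solution (i0, j0) with the extended Euclidean algorithm and then applies conditions (A) and (B). The caller supplies the two subscript functions in the roles required by C1 or C2.

      PROGRAM TDRIVE
C     The two tests of the worked example above (N = 9).
      LOGICAL INTSCT
      WRITE (*, *) 'C1 holds:', INTSCT(-1, 2, 1, 1, 9)
      WRITE (*, *) 'C2 holds:', INTSCT(10, -1, 2, 1, 9)
      END

      LOGICAL FUNCTION INTSCT(ALPHA, A, BETA, B, N)
C     .TRUE. if f(I) = ALPHA + A*I and g(J) = BETA + B*J take a
C     common value f(i) = g(j) with 1 <= i < j <= N (Theorem 3.3).
      INTEGER ALPHA, A, BETA, B, N
      INTEGER G, X, Y, D, I0, J0, K, ICEIL, IFLOOR
      DOUBLE PRECISION T(5), UMAX, VMIN
      LOGICAL INU(5)
      INTSCT = .FALSE.
C     condition (A): d = gcd(A,B) must divide BETA - ALPHA
      CALL EGCD(A, -B, G, X, Y)
      D = IABS(G)
      IF (MOD(BETA - ALPHA, D) .NE. 0) RETURN
C     particular solution of A*i - B*j = BETA - ALPHA
      I0 = X * ((BETA - ALPHA) / G)
      J0 = Y * ((BETA - ALPHA) / G)
C     the five candidate bounds and their U/V assignment
      T(1) = DBLE(D) * DBLE(1 - I0) / DBLE(B)
      INU(1) = B .GT. 0
      T(2) = DBLE(D) * DBLE(N - I0) / DBLE(B)
      INU(2) = B .LT. 0
      T(3) = DBLE(D) * DBLE(1 - J0) / DBLE(A)
      INU(3) = A .GT. 0
      T(4) = DBLE(D) * DBLE(N - J0) / DBLE(A)
      INU(4) = A .LT. 0
      T(5) = DBLE(D) * DBLE(I0 - J0 + 1) / DBLE(A - B)
      INU(5) = A .GT. B
      UMAX = -1.0D30
      VMIN = 1.0D30
      DO 10 K = 1, 5
         IF (INU(K) .AND. T(K) .GT. UMAX) UMAX = T(K)
         IF (.NOT. INU(K) .AND. T(K) .LT. VMIN) VMIN = T(K)
   10 CONTINUE
C     condition (B): ceil(max U) <= floor(min V)
      INTSCT = ICEIL(UMAX) .LE. IFLOOR(VMIN)
      RETURN
      END

      SUBROUTINE EGCD(A, B, G, X, Y)
C     extended Euclid: returns G, X, Y with A*X + B*Y = G and
C     |G| = gcd(|A|, |B|)
      INTEGER A, B, G, X, Y, R0, R1, X0, X1, Y0, Y1, Q, T
      R0 = A
      R1 = B
      X0 = 1
      X1 = 0
      Y0 = 0
      Y1 = 1
   10 IF (R1 .NE. 0) THEN
         Q = R0 / R1
         T = R0 - Q*R1
         R0 = R1
         R1 = T
         T = X0 - Q*X1
         X0 = X1
         X1 = T
         T = Y0 - Q*Y1
         Y0 = Y1
         Y1 = T
         GO TO 10
      END IF
      G = R0
      X = X0
      Y = Y0
      RETURN
      END

      INTEGER FUNCTION ICEIL(X)
      DOUBLE PRECISION X
      ICEIL = INT(X)
      IF (DBLE(ICEIL) .LT. X) ICEIL = ICEIL + 1
      RETURN
      END

      INTEGER FUNCTION IFLOOR(X)
      DOUBLE PRECISION X
      IFLOOR = INT(X)
      IF (DBLE(IFLOOR) .GT. X) IFLOOR = IFLOOR - 1
      RETURN
      END

On the worked example, both calls return .TRUE., reproducing the conclusion that C1 and C2 hold. Any particular solution (i0, j0) gives the same answer, so the one produced by EGCD need not match the one chosen in the text.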
In Theorem 3.3 we assumed that a ≠ 0, b ≠ 0, and a − b ≠ 0. The conditions to be tested are simple if these assumptions do not hold. For example, if a ≠ 0 and b = 0, then the two functions f(I) = aI + α and g(J) = β will intersect if and only if (β − α)/a is an integer between 1 and N. In the case where a = b ≠ 0, the two functions will intersect if (α − β)/b is an integer; in this case the two functions f(I) = bI + α and g(J) = bJ + β intersect at the points (I = i, J = i + (α − β)/b), i = 1, 2, ..., N − (α − β)/b.

If different elements of the array A are referenced on the right-hand side of the statement S in the NP under consideration, then we use Theorems 3.2 and 3.3 to determine whether C1 and C2 hold between the subscript expressions of the output variable A(F1(I1), F2(I2)) and the subscript expressions of every reference to a different element of A on the right-hand side of S. If the π-block has more than one statement, then we do the testing between the set of multi-dimensional array output variables and all references to different elements of these arrays in the set of input variables of the π-block. Note that for the π-block:

           DO Sm I1 = 1, N1
           DO Sm I2 = 1, N2
   S1
   S2
   ...
   Sm

the set of output variables is

   ∪ (k = 1 to m) OUT(Sk(I1, I2))

and the set of input variables is given by:

   ∪ (k = 1 to m) [ IN(Sk(I1, I2)) − ∪ (l = 1 to k−1) OUT(Sl(I1, I2)) ]

If the basic NP has several π-blocks, we must do the testing between the set of multi-dimensional output variables of the NP and their occurrences in its set of input variables.

3.5.3 Transforming Nonbasic π-Blocks into Basic π-Blocks

The page indexing algorithm does not achieve its goals if applied to nonbasic π-blocks. As an example consider the following program:

Program 16-a.

   DI      DO S2 I = 1, N
   S1         B(I,1) = A(I,1)**.5
   DJ      DO S2 J = 1, N
   S2         A(I+1, J) = B(I,J) + C(I,J)

If we apply the page indexing algorithm as described in the previous section to Program 16-a, we get the following program:

Program 16-b.

           DO 10 IP = 1, ⌈N/RZ⌉
           ILB = 1 + (IP-1)*RZ
           IUB = MIN(IP*RZ, N)
           DO 10 I = ILB, IUB
   S1         B(I,1) = A(I,1)**.5
           DO 10 JP = 1, ⌈N/RZ⌉
           JLB = 1 + (JP-1)*RZ
           JUB = MIN(JP*RZ, N)
           DO 10 J = JLB, JUB
   S2         A(I+1,J) = B(I,J) + C(I,J)
   10      CONTINUE

We note that the index sequencing of Program 16-b is identical to that of Program 16-a. The advantage of page indexing, i.e., making the maximum number of references to a page while it is in main memory, is not achieved. Any nonbasic π-block, however, can be changed to a basic one by expanding the scope of some of its DO statements to make all assignment statements fall at the same nest depth level. Of course, some of these statements must now be executed conditionally. For example, the scope of the DJ statement in Program 16-a can be expanded to include S1. The resulting basic π-block is shown below.

Program 16-c.

   DI      DO S2 I = 1, N
   DJ      DO S2 J = 1, N
   S1         IF (J.EQ.1) B(I,1) = A(I,1)**.5
   S2         A(I+1,J) = B(I,J) + C(I,J)

The page indexing transformation will now be effective. This is shown below.

Program 16-d.

           DO S2 IP = 1, ⌈N/RZ⌉
           ILB = 1 + (IP-1)*8
           IUB = MIN(IP*8,N)
           DO S2 JP = 1, ⌈N/RZ⌉
           JLB = 1 + (JP-1)*8
           JUB = MIN(JP*8,N)
           DO S2 I = ILB, IUB
           DO S2 J = JLB, JUB
   S1         IF (J.EQ.JLB.AND.JP.EQ.1) B(I,1) = A(I,1)**.5
   S2         A(I+1,J) = B(I,J) + C(I,J)

Note the modification in the IF statement.
We now discuss a general algorithm to transform any π-block structure into a basic structure. Let the set of DO statements in the π-block be Dπ = {DI1, DI2, ..., DI_ND}, ND ≥ 1, and denote the set of corresponding index variables by {I1, I2, ..., I_ND}. For each DO statement DI_i, denote the lower bound of its index variable set by L_i and the upper bound by U_i. Let the set of non-DO statements in the π-block be Sπ = {S1, S2, ..., Sm}, m ≥ 1. For each S_i let

   DB_i = {the set of DO statements that precede S_i and whose scope does not include S_i}
        = {DI_bi,1, DI_bi,2, ..., DI_bi,k(i)},   0 ≤ k(i) ≤ ND.

Moreover, let

   DA_i = {the set of DO statements that follow S_i}
        = {DI_ai,1, DI_ai,2, ..., DI_ai,s(i)},   0 ≤ s(i) ≤ ND.

Then the π-block can be transformed to the form:

   DI1
   DI2
   ...
   DI_ND
   S1 . C1
   S2 . C2
   ...
   Sm . Cm

where C_i is a Boolean variable which controls the execution of S_i: if C_i is true then S_i is executed, else it is not. C_i is given by:

   C_i = {(I_bi,1 = U_bi,1) .AND. (I_bi,2 = U_bi,2) .AND. ... .AND. (I_bi,k(i) = U_bi,k(i))
          .AND. (I_ai,1 = L_ai,1) .AND. (I_ai,2 = L_ai,2) .AND. ... .AND. (I_ai,s(i) = L_ai,s(i))}

To illustrate the application of this algorithm, consider the Gaussian elimination program shown below:

Program 17-a.

   DI1     DO S2 I1 = 1, N-1
   DI2     DO S1 I2 = (I1+1), N
   S1         A(I2,I1) = A(I2,I1)/A(I1,I1)
   DI3     DO S2 I3 = (I1+1), N
   DI4     DO S2 I4 = (I1+1), N
   S2         A(I4,I3) = A(I4,I3) - A(I4,I1)*A(I1,I3)

Here we have:

   Dπ = {DI1, DI2, DI3, DI4}     Sπ = {S1, S2}
   DB1 = ∅                       DA1 = {DI3, DI4}
   DB2 = {DI2}                   DA2 = ∅
   C1 = I3.EQ.(I1+1).AND.I4.EQ.(I1+1)
   C2 = I2.EQ.N

Thus the corresponding basic π-block is as follows:

Program 17-b.

   DI1     DO S2 I1 = 1, N-1
   DI2     DO S2 I2 = (I1+1), N
   DI3     DO S2 I3 = (I1+1), N
   DI4     DO S2 I4 = (I1+1), N
   S1         IF (I3.EQ.(I1+1).AND.I4.EQ.(I1+1)) A(I2,I1) = A(I2,I1)/A(I1,I1)
   S2         IF (I2.EQ.N) A(I4,I3) = A(I4,I3) - A(I4,I1)*A(I1,I3)

We note that this algorithm will introduce a large amount of control instructions when the π-block is executed. This excessive control can be reduced by fusing some of the loops in the π-block, whenever possible, before expanding their scopes. Note that at this point in the transformation process we know the data dependences in the π-block, and thus checking for the validity of loop fusion is a trivial additional expense. The combined loop expansion-fusion transformation can be applied to Program 17-a in the following steps:

(Expand DI3)

Program 17-c.

   DI1     DO S2 I1 = 1, N-1
   DI3     DO S2 I3 = (I1+1), N
              IF (I3.EQ.I1+1)
   DI2        DO S1 I2 = (I1+1), N
   S1            A(I2,I1) = A(I2,I1)/A(I1,I1)
   DI4     DO S2 I4 = (I1+1), N
   S2         A(I4,I3) = A(I4,I3) - A(I4,I1)*A(I1,I3)

(Fuse DI2 and DI4)

Program 17-d.

   DI1     DO S2 I1 = 1, N-1
   DI3     DO S2 I3 = (I1+1), N
   DI2     DO S2 I2 = (I1+1), N
   S1         IF (I3.EQ.I1+1) A(I2,I1) = A(I2,I1)/A(I1,I1)
   S2         A(I2,I3) = A(I2,I3) - A(I2,I1)*A(I1,I3)

Thus, in general, the nonbasic to basic π-block transformation consists of a series of loop expansion and fusion steps. One starts by trying to fuse loops in the given π-block. This is followed by expanding the scope of the farthest reaching DO statement (if we associate a CONTINUE statement with each DO statement and number these CONTINUE statements sequentially, then the farthest reaching loop is the one associated with the CONTINUE statement with the largest label). This process of fusion followed by expansion is continued until a basic π structure is reached. Note that to expand a loop we use the algorithm presented previously in this section.
For Program 17-d the page indexing transformation can now be applied as shown below. (This is a legal Fortran version. Also note that we have substituted K for I1, J for I3, and I for I2.)

Program 17-e.

           RZ = Z ** .5
           NP = ⌈N/RZ⌉
           DO S2 KP = 1, NP
           KLB = 1 + (KP - 1) * RZ
           DO S2 JP = KP, NP
           JLB = 1 + (JP - 1) * RZ
           JUB = JP * RZ
           DO S2 IP = KP, NP
           ILB = 1 + (IP - 1) * RZ
           IUB = IP * RZ
           IF (IP.EQ.KP) KUB = KP * RZ - 1
           IF (IP.NE.KP) KUB = KP * RZ
           DO S2 K = KLB, KUB
           IF (IP.EQ.KP) ILB = K + 1
           IF (JP.EQ.KP) JLB = K + 1
           DO S2 J = JLB, JUB
           DO S2 I = ILB, IUB
   S1         IF (J.EQ.JLB.AND.JP.EQ.KP) A(I,K) = A(I,K)/A(K,K)
   S2         IF (J.LE.JUB) A(I,J) = A(I,J) - A(I,K) * A(K,J)

4. EXPERIMENTAL RESULTS

The aim of this chapter is to provide some preliminary experimental evidence of the usefulness of the transformations presented in Chapter Three. We will also discuss some experiments which we performed to investigate the concept of bounded locality intervals [BATS76a] and the correlation between a program's syntactic structure and its BLI's.

We have chosen 17 Fortran IV programs to experiment with. There were two reasons to select programs written in Fortran and not in other languages. First, there is a large number of Fortran programs of all kinds available for experimentation. Second, the current version of the PARAFRASE compiler accepts only Fortran programs. We think of the transformations presented in Chapter Three as modifications and extensions to some of the transformations already existing in the PARAFRASE compiler, in addition to some new ones which are specifically aimed at enhancing the performance of virtual memory systems.

Eleven of our programs were chosen from a collection of programs which we got from different national laboratories. In the other six programs we coded some standard matrix algorithms. In selecting the eleven programs we followed two guidelines. First, we wanted a set of programs which was fairly representative of various numerical Fortran programs; we wanted the complexity of the calculations performed in the programs to vary from simple or merely data-movement operations to complex computations. Second, we eliminated any programs which have relatively small memory requirements. We required that each of the chosen programs have a virtual address space of more than twenty pages.

We have chosen the page size to be 256 bytes (64 words). For our purposes, the choice of the page size is not critical. We are trying to demonstrate that programs which reference multi-page arrays, irrespective of the size of one page, can be transformed to behave better in a paged virtual memory environment. At the end of this chapter we will discuss the effect of varying the page size on our results. We will show that the effectiveness of our transformations is rather independent of the page size; for our purposes, what matters is not the absolute size of pages and arrays but their relative sizes. Since we are mostly interested in programs which have a large virtual space (these are the programs which usually can have disastrous behavior in virtual memory machines), a page size of 256 bytes seemed suitable to ensure that our collection of programs has large space requirements. As mentioned earlier, we will return to this subject in much more detail at the end of this chapter.
Table 7 shows a brief description of the programs used in our experiments. The total number of source cards (excluding comments) is 1598. The total number of DO statements is 200.

[Table 7 (a brief description of the 17 test programs) is not legible in this scan.]

We generate the trace of a program using the arrangement shown in Figure 23. The input Fortran program is passed through the scanner of the PARAFRASE compiler and the IBM Fortran IV G1 level 2.0 compiler. The output of the Fortran compiler is a listing showing every statement of the source program and the portion of the object code associated with it. We examine this output and make a list of the statement numbers of those statements which must be executed by the trace generator. These include any statements which calculate index variables, loop bounds, or conditions of logical IF statements.

[Figure 23 (the arrangement used to generate program traces) is not reproduced in this scan.]

The trace generator receives the output of the Fortran compiler, the program description tables from the PARAFRASE scanner, and a control input which includes the list of statements to be executed, a specification of the storage scheme of multi-dimensional arrays (storage by rows, columns, or square blocks), the page size in words, necessary values for some variables used in the input program, and branching probabilities to be used in those IF statements for which the test condition cannot be evaluated by the current trace generator. Thus the trace generator simulates a partial execution of the input program which is sufficient to get an accurate trace of array references. The branching probabilities and the values to be given to variables are chosen with the help of the documentation of the input program or by personal communication with the people who supplied the program. On two occasions we had to eliminate a loop in a program.
We eliminated the following loop from the Fast Fourier Transform program, FOURTR:

           DO S6 I = 1, N2N, 2
   S1         IIN1 = 1 + REVERS(I)
   S2         IIN2 = 1 + REVERS(I+1)
   S3         CR(I) = INR(IIN1) + INR(IIN2)
   S4         CI(I) = INI(IIN1) + INI(IIN2)
   S5         CR(I+1) = INR(IIN1) - INR(IIN2)
   S6         CI(I+1) = INI(IIN1) - INI(IIN2)

This had to be done because the current trace generator cannot evaluate statements S1 and S2, which is necessary to calculate some subscripts in statements S3, S4, S5, and S6. The current trace generator does not evaluate expressions if they contain array elements.

On the other occasion we eliminated a loop from the TWOWAY program. This loop contained 211 statements with several inner loops at different nest depth levels. Analyzing this program with this loop included exceeded the capabilities of the current PARAFRASE compiler.

Our original plan for the experiments was to apply one transformation at a time to each of our programs in order to measure the contribution of each transformation to the total achieved improvement. We decided to abandon this plan for the time being due to the enormous amount of results which would be generated. Thus we applied all the transformations possible to a given program in order to achieve the best possible improvement. We used a mixture of automatic and manual means for applying the transformations. The data dependence relations were analyzed automatically. Part of the transformations were already implemented in the PARAFRASE compiler. The clustering transformation has been added to PARAFRASE, and work is continuing to add the rest of the transformations. To obtain our current results, whenever we had to, we applied the transformations manually. We would like to emphasize that we look at the experimental results reported here as preliminary results. We decided that initially it is important to get a feeling for the amount of improvement which can be achieved in the behavior of real programs by transforming a few programs, using automatic and manual means, and examining the results, rather than waiting to fully automate the transformations before generating any results. We feel that our preliminary results serve as the green light which signals that the investment of effort in automating all our techniques is a safe investment.

In Table 8 we compare some of the characteristics of the original and transformed programs. This table is meant to give a feeling for the worst possible cost of transforming a program; we will explain as our discussion progresses why this is the worst cost of the transformations.

[Table 8 (characteristics of the original and transformed programs) is not legible in this scan.]

For 6 of the 17 programs the number of pages referenced in the transformed program exceeds the number in the original program. This is due to the scalar expansion transformation. We note that the maximum increase is 5 pages. We also notice an increase in the number of array references for those programs where scalar expansion was used. This is not a real increase in the number of memory references to data words in the transformed program. These extra memory references reported for the transformed program are also made in the original program, but to scalar variables; for the original programs these references were simply not counted, because we only count references to array elements.

The increase shown in the number of source statements in the transformed programs is not really accurate. It is an overestimate.
The reason for this is that our current trace generator is not very smart, and in many cases we had to insert redundant statements to make the tracer do what it is supposed to do. For example, the current tracer cannot evaluate ILB in the following statement:

   IF (KP.EQ.1) ILB = K+1

To achieve this assignment to ILB we do the following:

           IF (KP.NE.1) GO TO 1
           ILB = K + 1
   1       ...

Moreover, our tracer does not evaluate functions. Thus to make the assignment:

   IUB = MIN(N,IP*Z)

we do the following:

           IUB = IP*Z
           IF (IUB.LE.N) GO TO 2
           IUB = N
   2       ...

These and other inefficiencies in our tracer also lead to an overestimation of the increase in the number of instructions executed in the transformed programs. The more pronounced increase in the number of executed instructions for programs CD, LUD, and GE is mainly due to the nonbasic to basic π-block transformation. Our current implementation of this transformation introduces an appreciable amount of control instructions. Further effort needs to be made to improve the implementation of this transformation; in Chapter 5 we make some suggestions concerning this point.

Our experiments fall into three categories. In the first, we implemented the algorithms described in [BATS76a] to find the BLI's of our programs and their transformed versions. The purpose of these experiments is to investigate the validity of the BLI concept in defining the localities of a program. Moreover, we wanted to compare the characteristics of the localities found in a program to those found in its transformed version. We also wanted to compare our findings to the experimental results reported in [BATS76a]. We will discuss all these issues in Section 4.1.

In the second category of experiments we simulated the local LRU memory management algorithm and generated the page-faults vs. memory allotment and the space-time cost vs. memory allotment curves of every program and its transformed version. The purpose is to compare the cost of executing original and transformed programs under LRU. We have chosen the LRU algorithm because it is known to be the best among the heuristic replacement algorithms and because most of the existing virtual memory machines use some sort of an LRU algorithm for memory management [SCHE73], [JONE72]. The results of these simulations are discussed in Section 4.2.
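The thesis does not show its simulator, but the standard way to obtain the whole page-faults vs. memory allotment curve from a single pass over a reference trace is the LRU stack-distance method, which exploits LRU's inclusion property: a reference at stack distance d faults exactly for allotments m < d, and a first reference faults at every allotment. The following is a minimal sketch of that method in our own coding; the names, I/O format, and the MAXP limit are assumptions.

      PROGRAM LRUSIM
C     One-pass LRU stack simulation: reads one page number per input
C     line and prints f(m), the page fault count, for every memory
C     allotment m.  Assumes at most MAXP distinct pages.
      INTEGER MAXP
      PARAMETER (MAXP = 512)
      INTEGER STACK(MAXP), HIST(MAXP)
      INTEGER DEPTH, COLD, I, J, POS, PAGE, NF
      DEPTH = 0
      COLD = 0
      DO 10 I = 1, MAXP
         HIST(I) = 0
   10 CONTINUE
   20 READ (*, *, END = 60) PAGE
C     locate PAGE in the LRU stack (position 1 = most recent)
      POS = 0
      DO 30 I = 1, DEPTH
         IF (STACK(I) .EQ. PAGE) POS = I
   30 CONTINUE
      IF (POS .EQ. 0) THEN
C        first reference: a fault at every allotment
         COLD = COLD + 1
         DEPTH = DEPTH + 1
         POS = DEPTH
      ELSE
C        hit at stack distance POS: faults only for m < POS
         HIST(POS) = HIST(POS) + 1
      END IF
C     pull PAGE to the top of the stack
      DO 40 J = POS, 2, -1
         STACK(J) = STACK(J-1)
   40 CONTINUE
      STACK(1) = PAGE
      GO TO 20
C     f(m) = cold misses + references at stack distance > m
   60 DO 80 I = 1, DEPTH
         NF = COLD
         DO 70 J = I + 1, DEPTH
            NF = NF + HIST(J)
   70    CONTINUE
         WRITE (*, *) 'm =', I, '   page faults =', NF
   80 CONTINUE
      END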
The third category of experiments is designed to investigate the important question of whether there are any merits to using variable memory allotment policies, as compared to fixed memory allotment policies, for the memory management of transformed programs. We have chosen to use the working set management policy as a representative of variable memory policies [DENN68]. We compared the space-time cost for the transformed programs under the LRU and working set policies. For several programs we encountered the real memory-fault rate and parameter-real memory anomalies as described in [FRAN78]. This point and the LRU-working set comparison will be discussed in Section 4.3.

In Section 4.4 we summarize the implications of our findings and investigate the sensitivity of our results to the page size.

4.1 Measuring the Characteristics of Program Localities

To measure the characteristics of program localities one has first to identify these localities. This can easily be done for the transformed versions of our collection of programs because they follow the ELM. In a transformed program, whenever a π-block is being executed, the reference string will stay within one locality interval. The MTBR to every page of this locality is small, O(R_P), where R_P is the number of array references made per iteration of the innermost loop of the π-block.
We collected data about the size and duration of localities of a program by carefully examining its BLI's when generated under a paged system assumption. For the untransf ormed programs we collected data about the number of pages referenced in BLI's that correspond to outer- most loops. Our findings are discussed in Section 4.1.2. 4.1.1 Localities in Segmented Systems Because of the kind of segmented system we have assumed in this section, we do not include any data from programs CD, FLR, GE, LUD, MATMUL, and MATTRP . Including data from these programs would have biased our findings towards localities of small sizes. In each of the programs MATTRP, LUD, CD, and GE only one segment is referenced. In FLR two 139 segments are referenced and in MATMUL three segments are referenced. In selecting programs for his experiments Batson rejected any programs which reference less than six arrays . In Table 9 we compare some of the characteristics of his programs and our programs (excluding the six previously mentioned). We note that our programs have, on the average, fewer arrays. Hence the locality of our untransformed programs is slightly better than Batson' s programs. Thus the improvement results which will be reported are on the conservative side. The results would have been even better if we had Batson 's collection or programs with more arrays. This fact is emphasized by Figure 25 which will be dis- cussed shortly. Table 9. Comparing Some Characteristics of Our Programs and those Used by Batson and Madison. Our Programs Batson and Madison Programs Number of Arrays Referenced in a Program Minimum 6 6 Average 24.3 26.1 Maximum 57 127 Size of the Reference Strings Minimum 11 152 5 459 Average 71 651 42 857 Maximum 236 027 102 227 140 Before discussing our results we make one more remark. Our programs were transformed with the assumption that they will run in a paged system with a page size of 256 bytes. However, it is not diffi- cult to deduce the characteristics of the localities if the programs were transformed to run on a variable segment size system. The only thing we have to do is to eliminate the effect of the page indexing transformation on the generated data. For example consider the following program: Program 18-a. DO 10 I = 1,16 DO 10 J = 1,16 A(I,J) = B(I,J) + C(I,J) 10 D(I,J) = B(I,J)/2 For a variable segment size virtual memory system this program will be transformed as follows: Program 18-b. DO 101 I = 1,16 DO 101 J = 1,16 101 A(I,J) = B(I,J) + C(I,J) DO 102 I = 1,16 DO 102 J = 1,16 102 D(I,J) = B(I,J)/2 The resulting BLI structure is shown in Figure 24-a. We have two localities. The first includes segments A, B, and C and lasts for 768 memory references. The second locality includes segments B and D and lasts for 512 memory references. If the loop of program 18-a was in one 141 CM rH m •^ B < oo U oo 60 O H Ph J-l o PQ CM 60 •rl 00 00 o o I 00 B cd u 60 O Pd M O PQ I CM (U 3 60 •H ft. CM CT\ O PQ < 0) ►J CM QJ > 142 of our collection of programs, it would have been vertically distributed as shown below: Program 18-c. DO 10 IP - 1,2 ILB = 1 + (IP-1)*8 IUB = IP*8 DO 10 JP = 1,2 JLB = 1 + (JP-1)*8 JUB = JP*8 DO 101 I = ILB, IUB DO 101 J = JLB, JUB 101 A(I,J) = B(I,J) +C(I,J) DO 102 I = ILB, IUB DO 102 J = JLB, JUB 102 D(I,J) = B(I,J)/2 10 CONTINUE The BLI's of this program are shown in Figure 24 -b. 
It can easily be seen that the localities for the segmented case in Figure 24-a can be found from those in Figure 24-b by lumping into one locality all the BLI's which have the members A, B, C. In this way we get the locality in Figure 24-a with the members A, B, and C and with the duration 192*4 = 768. Similarly, we get a locality of duration 128*4 = 512 with the members B and D by simply lumping into one locality all the BLI's of Figure 24-b in which these arrays are referenced.

Figure 25 shows the characteristics of localities for the transformed programs. In the 11 programs a total of 756 121 references were made, and 753 859 of these references were made when the programs were executing within localities. These make up 99.7% of the total number of references. Part of the remaining 0.3% of the references were made outside loops; the other part can be attributed to the fact that the BLI method is not exact in finding the duration of a locality. As shown in the figure, more than 48% of the references were made while the transformed programs were executing within localities of size 2 or less. More than 97% of the references were made within localities of size 5 or less.

[Figure 25 (the percentage of array references made in localities or loops of a given number of segments or less, for the transformed programs, the original programs, and Batson and Madison's programs) is not reproduced in this scan.]

In Figure 25 we also show data for our untransformed programs and Batson's programs [BATS76a]. For Batson's programs the figure shows the distribution of array references over level-one BLI's of different sizes. For our untransformed programs the data represents the distribution of array references over BLI's which correspond to outermost loops. If we accept the argument that the data of our untransformed programs and Batson's data do not represent very different things, then one can deduce from the figure that our programs are more local than Batson's. While 45% of references are issued in level-one BLI's of size less than or equal to 5 segments in Batson's programs, almost 70% of the references in our programs are made in loops with 5 or fewer arrays. Thus, as was mentioned earlier, our reported improvement results are on the conservative side, because untransformed programs can be less local.

One can get an intuitive idea about the improvement achieved by our transformations by comparing the data of the original and transformed programs in Figure 25. Because of the assumptions made when the data was generated (one segment per array), the improvements which we see here drastically underestimate the power of the transformations.
For the untransf ormed programs only 85% of the references are made in outermost loops with 14 or less arrays. While almost 50% of the refer- ences in the transformed programs are made in localities of size 2 or less, only 30% of the references in the original programs are made in loops with 4 or less arrays. 4.1.2 Localities in Paged Systems The general intuitive impression one gets from examining Figure 25 is that the locality of untransformed programs is not really that bad under the assumptions of the previous section. Almost 80% of the ref- erences are made in loops with 6 or less segments (arrays). More than 98% of the references are made in loops with 15 or less segments. Since the number of segments in our programs varied between 6 and 57 with an average of 25.3, then their locality is good. One can arrive at similar conclusions from examining the data representing Batson's programs. Virtual memory systems, however, face their serious problems when they execute programs for which the assumptions of the previous section do not hold. Batson's programs were selected from the daily work load of the University of Virginia computing center. They were executed on the B5500 computer which supports a variable segment size virtual memory system. The segment size can take values between 1 and 1023 words. Since, in his programs, there was a one-to-one correspondence between array names and segment identifiers, none of the programs had an array 146 larger than 1023 words. Although in our programs there are many arrays which are larger than 1023 words, we still assumed that each array will occupy one segment when we generated the data of the previous section . We were interested in investigating the BLI concept and in finding a lower bound in some sense on the improvements achieved by our transforma- tion techniques. When multi-segment or multi-page arrays are referenced in pro- grams, their degree of locality becomes drastically low. This is because, in general, there is no one-to-one correspondence between the number of array names referenced per iteration of a loop and the number of pages referenced. In [ELSH74] it was shown that in a paged system the locality of a matrix multiplication program which makes references only to 3 array names can be improved drastically by using some rules in accessing the elements of these multi-page arrays. Batson in [BATS76b] points out that the implications of his measurements of program localities do not apply to paged systems. We quote, "Thus it seems clear that major phases, with relatively small activity sets, span the major part of the execution epochs of programs. This phenomenon, otherwise known as locality of reference, is the raison d'etre for the successful operation of symbolically- segmented virtual memory systems. Its implications for paged virtual memory systems are less promising, since there is no correspondence in general between pages and symbolic segments ." As we have mentioned in Chapters 2 and 3, our transformations serve two purposes. First, they make all loops behave like elementary loops for which the number of pages referenced is highly correlated to the number of array names. Thus for transformed programs there will be a 147 one-to-one correspondence between array names and pages referenced. Second, the transformations will reduce the cost of executing programs in a paged system, namely the space-time cost, the number of page faults, and the amount of memory allotment required. Figure 26 supports our argument. 
We have generated the BLI's of our 17 original programs and their transformed versions. Here we assume a paged system with a page size of 64 words (256 bytes). For the transformed programs the data in the figure represents the percentage of array references made while the programs executed with locality sizes of a particular number of pages or less. We got this data by careful cor- relation of the generated BLI's to the source programs. For the untrans- formed programs the data represents the percentage of references made while the programs executed in BLI's of sizes equal to or less than a particular number of pages. The BLI's correspond to outermost loops. In the figure we see that for the transformed programs more than 71% of the 1 483 921 array reference were made in localities of size 3 pages or less. 83% were made in localities of size 5 pages or less and more than 97% of the references were made in localities of size 8 pages or less. If we compare Figures 25 and 26 we find that the locality of the transformed programs is comparable for both paged and segmented sys- tems. The only noticable difference is that the percentage of references made in localities of 3 pages is higher than the percentage made in localities of 3 segments. This is because the results shown in Figure 26 include data from the six programs which we excluded from our experiments in the previous section. The transformed versions of five of these pro- grams (CD,FLR,GE,MATNUL, and MATTRR) issue their references in localities UJ Z o ™ UJ — u a S ■j " s. ^ o 148 < QQ u o o in o ^^^^ c/> S "* «■«■■■! ^^^^^^^^^ < K O O < K ^^^^^^^^ Q. O K a. ^^^^^^^^ Q UJ 2 K _i < O u. z < IT " f— ^U 1 1 -* / cr> CD id m ro cm o o o o o O O o O o 0> 00 r- ID ifi 5f ro CVJ CO ai CO PL, m , the f curves approach their asymptotic values with Kl L small slopes. Since page fault curves are in general not smooth curves, i.e. the slopes change abruptly, we cannot choose a particular slope to find the exact location of the knee point for each curve. For example saying that the knee point is the point at which the slope of the curve is 135° would not work. Examining the space-time curves for the trans- formed programs, we noticed that the memory allotment at the absolute minimum space-time cost points can be used to identify the knee points in the page fault curves. If we denote the memory allotment at the minimum space-time cost point of a transformed program by m , then we choose to take hl = m . Using this method of finding the value of iil was 156 successful in locating the steep drop regions in the page fault curves. Although m and m have the same value for each transformed program, we wish to use two symbols to emphasize the distinction between the dis- cussion of mono and multiprogrammed systems. Let us now restate what we are trying to do. We want to compare the memory allotment needed in the machine executing the untransformed programs to the memory needed in the machine exeuting the transformed programs while both machines operate at the same level of performance. Here we have to decide on the levels of performance to be used in making the comparisons. We will make two sets of comparisons. In the first set we take the performance level achieved by the transformed programs at m = m to be the comparison level. In other words we will compare m, and m , where f (iil ) < f (m . ) (the less than sign is used because the ckt t k.t ckt f curves are not continuous curves). 
Although m_kt and m_ot have the same value for each transformed program, we wish to use two symbols to emphasize the distinction between the discussions of mono- and multiprogrammed systems.

Let us now restate what we are trying to do. We want to compare the memory allotment needed in the machine executing the untransformed programs to the memory needed in the machine executing the transformed programs while both machines operate at the same level of performance. Here we have to decide on the levels of performance to be used in making the comparisons. We will make two sets of comparisons. In the first set we take the performance level achieved by the transformed programs at m = m_kt to be the comparison level. In other words, we will compare m_kt and m_ckt, where f(m_ckt) ≤ f_t(m_kt) (the inequality is used because the f curves are not continuous curves). Thus, m_ckt is the memory allotment needed by the untransformed program to generate no more than f_t(m_kt) page faults. This type of comparison shows the value of the transformations for each program individually, because m_kt is in general different for different programs.

In the second set of comparisons we are more interested in the improvements across the programs from the OS point of view. In other words, if the machine of the transformed programs has only 4 page frames to allot to each of these programs, then it is interesting to know the number of page frames needed by the untransformed-programs machine to achieve the same level of performance. We will do this comparison with 4, 6, and 8 page frames.

Table 11 shows the results of the first set of comparisons (DP denotes the number of distinct pages referenced by a program). We note that m_kt ranged from 1 to 8 with an average of 4.53; the median is 5. Thus on the average only .0542 of the virtual space of a program needs to be in primary memory to achieve an average LP of .541. The average number of page frames needed in the untransformed-programs machine to achieve identical levels of performance is 40.53. This number varies between a minimum of 5 and a maximum of 176 page frames; the median is 31. On the average, 11.20 times more page frames are needed in the untransformed-programs machine than in the transformed-programs machine. Note that the paged machine running untransformed programs needs on the average .309 of the virtual space of programs in primary memory. This factor of 3.24 reduction in the memory needed, achieved by the introduction of paging to nonpaged systems, is surpassed by the reduction in the memory needed by the transformed-programs machine relative to the untransformed-programs machine (an average factor of 11.20 compared to an average of 3.24), where both machines are paged.

Table 11. Memory Requirements of Transformed and Original Programs at Similar Performance Levels - the Transformed Programs' Knee Points Level.

   Program    m_kt   m_kt/DP   LP_t(m_kt)   m_ckt   m_ckt/DP   m_ckt/m_kt
   ADVECT      6     .0265      .229          31      .137        5.17
   BASE        5     .0167      .817          38      .127        7.60
   BIGEN       2     .0052      .877           5      .013        2.50
   CD          3     .1429      .231          11      .537        3.67
   DISPERSE    3     .0041      .762          60      .082       20.00
   FIELD       8     .1538      .853          18      .346        2.25
   FLR         2     .0870      .821           7      .304        3.5
   FOURTR      6     .0468      .133          65      .508       10.8
   GE          3     .0833      .229          35      .972       11.67
   INIT        1     .0041     1.00           64      .267       64
   LUD         6     .1667      .231          22      .611        3.67
   MAIN        5     .0252      .240          26      .131        5.2
   MAMOCO      6     .0068      .360          30      .034        5
   MATMUL      3     .0400      .273          34      .453       11.3
   MATTRP      2     .0800     1.00            9      .360        4.5
   PAPUAL      8     .0056      .989         176      .124       22
   TWOWAY      8     .0283      .115          56      .199        7

   MIN.        1     .0041      .115           5      .013        2.25
   AVG.        4.53  .0542      .541          40.53   .309       11.20
   MED.        5     .0283      .360          31      .267        5.2
   MAX.        8     .1538     1.00          176      .972       22

Tables 12, 13, and 14 show our second set of comparisons. In these tables we use m_c4, m_c6, and m_c8 to denote the memory allotments needed by the untransformed programs to generate no more than f_t(4), f_t(6), and f_t(8) page faults, respectively. With 4 page frames the transformed-programs machine will have on the average an LP of .382 with a median of .244. In Table 12 we note that the untransformed-programs machine needs on the average 29.35 page frames to achieve the same level of performance, with a median of 12.00 page frames. Thus the transformed-programs machine achieves an average factor of 7.34 reduction in the memory required to achieve this level of performance (the median is 3.00). Note that, on the average, the untransformed-programs machine is achieving a factor of 26.25 saving in primary memory compared to an unpaged machine; the transformed-programs machine is achieving a factor of 74.40.
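The matching rule used throughout these tables reduces to a table scan: given the untransformed program's fault counts and the target fault level of the transformed program, take the smallest allotment whose fault count does not exceed the target. A hypothetical sketch with our names and sample data:

      PROGRAM MDEMO
      INTEGER F(8), MATCHM
C     hypothetical fault counts f(m) of an untransformed program
      DATA F /5000, 3000, 1800, 900, 500, 260, 120, 80/
C     target of 300 faults, e.g. f_t at the knee of its transform
      WRITE (*, *) 'matching allotment =', MATCHM(F, 8, 300)
      END

      INTEGER FUNCTION MATCHM(F, NMAX, NFT)
C     least allotment m with F(m) <= NFT; NMAX if none qualifies
      INTEGER NMAX, F(NMAX), NFT, M
      MATCHM = NMAX
      DO 10 M = 1, NMAX
         IF (F(M) .LE. NFT) THEN
            MATCHM = M
            RETURN
         END IF
   10 CONTINUE
      RETURN
      END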
Table 12. Memory Requirements of Transformed and Original Programs at Similar Performance Levels - the Transformed Programs' 4-Page Level.

   Program    LP_t(4)   m_c4    m_c4/4   DP/4     DP/m_c4
   ADVECT      .022       14     3.50     56.50    16.14
   BASE        .444       38     9.50     75.00     7.89
   BIGEN      1.00         6     1.50     96.25    64.17
   CD          .244       11     2.75      5.25     1.91
   DISPERSE    .763       60    15.00    183.50    12.23
   FIELD       .369       11     2.75     13.00     4.73
   FLR         .885        7     1.75      5.75     3.29
   FOURTR      .003        5     1.25     32.00    25.6
   GE          .234       35     8.75      9.00     1.03
   INIT       1.00        64    16.00     61.25     3.83
   LUD         .0001       1      .25      9.00    36
   MAIN        .071       14     3.50     49.50    14.14
   MAMOCO      .0101       4     1.00    218.75   218.75
   MATMUL      .272       34     8.50     18.75     2.21
   MATTRP     1.00         9     2.25      6.25     2.78
   PAPUAL      .165      174    43.50    354.50     8.15
   TWOWAY      .0102      12     3.00     70.50    23.50

   MIN.        .0001       1      .25      5.25     1.03
   AVG.        .382       29.35  7.34     74.40    26.25
   MED.        .244       12.00  3.00     49.50     8.15
   MAX.       1.00       174    43.50    354.50   218.75

Table 13 shows similar data when the transformed-programs machine allots 6 page frames to all programs. The average LP is .499 (the median is .281). The untransformed-programs machine needs on the average 38.18 pages to achieve this level of performance, which is an average factor of 6.36 more than the memory needed by the transformed-programs machine. On the average, the untransformed-programs machine is achieving a factor of 10.25 saving in primary memory (compared to a nonpaged machine) while the transformed-programs machine is achieving a factor of 49.52.

Table 13. Memory Requirements of Transformed and Original Programs at Similar Performance Levels - the Transformed Programs' 6-Page Level.

   Program    LP_t(6)   m_c6    m_c6/6   DP/6     DP/m_c6
   ADVECT      .229       31     5.17     37.67     7.29
   BASE        .833       38     6.33     50.00     7.89
   BIGEN      1.00         6     1.00     64.17    64.17
   CD          .280       11     1.83      3.50     1.97
   DISPERSE    .885       64    10.67    122.33    11.47
   FIELD       .571       14     2.33      8.67     3.71
   FLR         .962        7     1.17      3.83     3.29
   FOURTR      .133       65    10.83     21.33     1.97
   GE          .246       35     5.83      6.00     1.03
   INIT       1.00        64    10.67     40.83     3.83
   LUD         .237       22     3.67      6.00     1.64
   MAIN        .281       28     4.67     33.00     7.07
   MAMOCO      .367       30     5.00    145.83    29.17
   MATMUL      .272       34     5.67     12.50     2.21
   MATTRP     1.00         9     1.50      4.17     2.78
   PAPUAL      .165      174    29.00    236.33     8.15
   TWOWAY      .039       17     2.83     47.00    16.59

   MIN.        .039        6     1.00      3.50     1.03
   AVG.        .499       38.18  6.36     49.52    10.25
   MED.        .281       30.00  5.00     33.00     3.83
   MAX.       1.00       174    29.00    236.33    64.17

Table 14 shows the data when 8 page frames are allotted to all the transformed programs. The average LP is .592 (.469 median). The average memory needed by the untransformed programs is 41.82 page frames, an average factor of 5.23 more than 8.

Table 14. Memory Requirements of the Transformed and Original Programs at Similar Performance Levels - the Transformed Programs' 8-Page Level.

   Program    LP_t(8)   m_c8    m_c8/8   DP/8     DP/m_c8
   ADVECT      .245       31     3.88     28.25     7.29
   BASE        .855       38     4.75     37.50     7.89
   BIGEN      1.00         8     1.00     48.13    48.13
   CD          .349       11     1.38      2.63     1.91
   DISPERSE    .893       64     8.00     91.75    11.47
   FIELD       .855       19     2.38      6.50     2.74
   FLR        1.00         8     1.00      2.88     2.88
   FOURTR      .155       65     8.13     16.00     1.97
   GE          .275       35     4.38      4.50     1.03
   INIT       1.00        64     8.00     30.63     3.83
   LUD         .234       22     2.75      4.50     1.64
   MAIN        .313       38     4.75     24.75     5.21
   MAMOCO      .469       30     3.75    109.38    29.17
   MATMUL      .272       34     4.25      9.38     2.21
   MATTRP     1.00         9     1.13      3.13     2.78
   PAPUAL      .99       176    22.00    177.25     8.06
   TWOWAY      .115       59     7.38     35.25     4.78

   MIN.        .115        8     1.00      2.63     1.03
   AVG.        .592       41.82  5.23     37.20     8.47
   MED.        .469       34.00  4.25     28.25     3.83
   MAX.       1.00       176    22.00    177.25    48.13

Thus from Tables 10 through 14 it is clear that with few page frames (4 to 8) the transformed programs have a much lower rate of page faulting (on the average, a factor of 19.9 lower).
To achieve similar levels of page faulting, the untransformed programs need on the average a factor of 5.23 to 7.34 more memory (on the average 29.35 to 41.82 page frames, compared with 4 to 8 page frames for the transformed programs).

4.2.2 The Space-Time Cost vs. Memory Allotment Results

As discussed in Chapter 2, the throughput of a multiprogrammed machine is inversely proportional to the average space-time cost of execution of programs. Thus the concern here is to reduce the space-time cost of programs. Moreover, one would like to reduce the amount of memory allotted to each program, because this will improve the degree of multiprogramming.

In this section we compare the space-time cost of executing untransformed and transformed programs. Here we assume that the OS uses the local LRU replacement algorithm and a fixed memory allotment policy. In other words, when a program is executed it is assigned a fixed amount of memory; when the program generates a page fault, the OS will replace, if necessary, one of the pages of the same program. In later sections of this chapter we will discuss the implications of our results when the OS uses different memory management strategies.

Traditionally, people have used the number of memory references made by a program to measure the time spent by the CPU to execute the program. If we denote this number by R, then the space-time cost of executing a program under our assumptions is given by:

   Space-Time Cost = m * (R + f(m) * T)                          (4.1)

where m is the number of page frames allotted to the program, f(m) is the number of page faults, and T is the average page fault service time (in memory references). With the same m, the space-time cost of the transformed version of the program is given by:

   (Space-Time Cost)_t = m * (R + f_t(m) * T)                    (4.2)

We note that equations 4.1 and 4.2 have a common term, m*R. If we plot the curves representing these equations (versus memory allotment), then this term is a common bias to both curves. The bias term of the space-time cost of a program is independent of its degree of locality; the locality of the program affects only the non-bias term. Thus, to compare the improvement in the locality of programs, one needs to compare only the non-biased space-time costs of the original and transformed programs. This is in some way analogous to measuring the voltage gain of an amplifier by the ratio of the AC output voltage to the AC input voltage.

In the Appendix we show the space-time cost curves for our programs after removing the bias terms. These curves are also independent of the value of T, the page fault service time. We have normalized these curves by making T equal to one unit of time, i.e., one unit of the space-time cost is equal to a page frame times a page fault service time. Thus the curves represent m*f(m) and m*f_t(m) for the original and transformed programs, respectively. We denote these two functions by ST(m) and ST_t(m). Note that the difference between ST(m1) and ST_t(m2) is equal to the difference between the total values of the space-time costs when m1 = m2. However, if m1 > m2, then ST(m1) − ST_t(m2) is less than the difference between the total values of the space-time costs. This is because the bias term, m*R, increases as m is increased and hence is greater for ST(m1) than for ST_t(m2). Thus the comparisons which we will make shortly are on the conservative side (we will be comparing ST(m1) and ST_t(m2) with either m1 = m2 or m1 > m2); in other words, our results would have been better if we had plotted the total values of the space-time cost functions.
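The conservativeness argument can be written out explicitly. Restoring the bias terms of equations 4.1 and 4.2, the gap between the total costs at allotments $m_1 \ge m_2$ is

$$m_1\bigl(R + f(m_1)\,T\bigr) \;-\; m_2\bigl(R + f_t(m_2)\,T\bigr) \;=\; (m_1 - m_2)\,R \;+\; \bigl[\,m_1 f(m_1) - m_2 f_t(m_2)\,\bigr]\,T,$$

so with $T = 1$ the plotted difference $ST(m_1) - ST_t(m_2) = m_1 f(m_1) - m_2 f_t(m_2)$ understates the total-cost difference by exactly the bias gap $(m_1 - m_2)R$, which is positive whenever $m_1 > m_2$ and zero when $m_1 = m_2$.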
In the rest of this thesis, unless otherwise specified, we use the term space-time cost to mean the total space-time cost minus the bias term. Thus for the original programs the space-time cost will be given by the ST(m) function and for the transformed programs by the ST_t(m) function.

Both the ST and ST_t curves have absolute minima. We will use M_o to denote the memory allotment at the minimum point of the ST curve. Similarly, we use M_ot to denote the memory allotment at the minimum point of the ST_t curve.

Table 15 shows the M_o's for all our programs. We note that M_o ranges between 1 and 67 with an average of 24.8 and a median of 24. There are 6 programs with M_o < 10, 4 programs with 10 < M_o ≤ 30, and 7 programs with M_o > 30. In each of these three sets of programs M_o is spread over the range of the set: in the first range M_o takes the values 1, 1, 6, 6, 8, and 9; in the second, the values are 13, 20, 24, and 28; in the third, the values are 31, 32, 36, 39, 41, 60, and 67. Thus the first important observation we make is that the M_o's of the original programs are well scattered over a wide range.

Another important observation is that the ST curves are not well behaved for m < M_o (see the Appendix). For some parts of this memory range ST increases with m, for others it decreases; moreover, sudden jumps in the value of ST are often encountered. In other words, the ST curves wiggle, going up and down, for m < M_o. For m > M_o the ST functions increase rather linearly with m. Since M_o is scattered over a wide range, it is impossible to choose a narrow band of memory allotments in which all programs will run efficiently, i.e. with ST values close to ST(M_o).

In Table 15 we also show the ratios M_o/DP and ST(M_o)/DP^2, where DP is the number of distinct pages referenced. These are intended to give a feeling for the potential advantage that paged virtual memory machines have over non-virtual memory machines. If a program is allotted a number of page frames equal to its M_o, then on the average it will be using only .303 of the memory it needs in a non-virtual memory machine, and its space-time cost will be only .388 of the cost in the non-virtual memory machine.

Table 15. Characteristics of the Minimum Space-Time Cost Points of the Original Programs.

Program      M_o    M_o/DP    ST(M_o)/DP^2
ADVECT        32     .1416       .2218
BASE          39     .1300       .1300
BIGEN          6     .0156       .0156
CD            13     .6190       .6485
DISPERSE       1     .0013       .0186
FIELD         20     .3846       .4215
FLR            8     .3478       .3478
FOURTR        67     .5234       .6583
GE            36     1.000       1.000
INIT           6     .0244       .0846
LUD           24     .6667       .6587
MAIN          28     .1414       .4900
MAMOCO        31     .0354       .0357
MATMUL        41     .5467       .5467
MATTRP         9     .3600       .3600
PAPUAL         1     .0007       .0167
TWOWAY        60     .2127       .9431
MIN.           1     .0007       .0156
AVG.        24.8     .303        .388
MAX.          67     1.0         1.0
MED.          24     .2127       .3600

The space-time cost curves of the transformed programs have a much better behavior. The minimum points of the ST_t curves occur at memory allotments which fall in a much narrower band. Table 16 shows the M_ot's of our programs. We note that all the transformed programs have 1 ≤ M_ot ≤ 8. There are 3 programs with M_ot = 8, 4 with M_ot = 6, 2 with M_ot = 5, 4 with M_ot = 3, 3 with M_ot = 2, and one program with M_ot = 1. The average M_ot is 4.53 and the median is 5. The implications of the difference in the ranges of M_o and M_ot and in the behavior of the ST and ST_t curves will be discussed shortly.
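Both M_o and M_ot are the minimizers of measured curves; because ST wiggles below M_o, the minimum must be taken over the whole allotment range rather than at the first local dip. A minimal sketch of this step (ours; the names are hypothetical):

! Our sketch: locate the global minimizer of m*f(m), taking the
! fault curve produced by lru_faults above as input.  Local dips
! below M_o are not trusted; the whole range is scanned.
      integer function st_argmin(f, npmax)
        implicit none
        integer, intent(in) :: npmax
        integer, intent(in) :: f(npmax)
        integer :: m
        real :: st, best                 ! m*f(m) can be large
        st_argmin = 1
        best = real(f(1))
        do m = 2, npmax
           st = real(m) * real(f(m))
           if (st < best) then
              best = st
              st_argmin = m
           end if
        end do
      end function st_argmin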
In Table 16 we also show M_ot/DP and ST_t(M_ot)/DP^2. On the average, when a transformed program is allotted a number of page frames equal to its M_ot, it will be using only .0542 of its virtual space (which is the same as the virtual space of the untransformed program) and it will cost only .1822 of its cost on a non-virtual memory machine.

Table 16. Characteristics of the Minimum Space-Time Cost Points of the Transformed Programs.

Program     M_ot    M_ot/DP    ST_t(M_ot)/DP^2
ADVECT        6      .0265        .1157
BASE          5      .0167        .0203
BIGEN         2      .0052        .0059
CD            3      .1429        .6190
DISPERSE      3      .0041        .0054
FIELD         8      .1538        .1804
FLR           2      .0870        .1059
FOURTR        6      .0468        .5315
GE            3      .0833        .3634
INIT          1      .0041        .0041
LUD           6      .1667        .722
MAIN          5      .0252        .1052
MAMOCO        6      .0068        .0190
MATMUL        3      .0400        .1467
MATTRP        2      .0800        .080
PAPUAL        8      .0056        .0057
TWOWAY        8      .0283        .2467
MIN.          1      .0041        .0041
AVG.       4.53      .0542        .1822
MAX.          8      .1667        .722
MED.          5      .0283        .1059

Table 17 compares the optimum ST and ST_t points. On the average, an untransformed program needs 5.66 times more primary memory to achieve its minimum space-time cost. Moreover, the minimum cost of an untransformed program is on the average 4.04 times the minimum cost of the transformed program. Note that if the untransformed program were allotted M_ot page frames, it would cost on the average 29.84 times more than the transformed program.

Table 17. Comparing the Minimum Space-Time Cost Points of the Original and Transformed Programs.

Program     M_o/M_ot    ST(M_ot)/ST_t(M_ot)    ST(M_o)/ST_t(M_ot)
ADVECT        5.3             36.18                  1.917
BASE          7.8             54.00                  6.376
BIGEN         3               40.49                  2.361
CD            4.3             46.32                  1.047
DISPERSE       .3              9.41                  3.477
FIELD         2.5             38.72                  2.336
FLR           4                7.5                   3.286
FOURTR       11.17            40.25                  1.872
GE           12               53.73                  2.751
INIT          6               42.29                 20.74
LUD           4               34.97                   .949
MAIN          5.6              7.32                  4.656
MAMOCO        5.2              3.52                  1.881
MATMUL       13.7             58.97                  3.727
MATTRP        4.5              7.72                  4.5
PAPUAL         .125            7.87                  2.923
TWOWAY        7.5             17.46                  3.837
MIN.           .125            3.52                  1.05
AVG.          5.66            29.84                  4.04
MAX.         13.7             58.97                 20.74
MED.          5.2             36.18                  2.92

Although comparing the optimum ST and ST_t points does serve the purpose of showing the effectiveness of our transformations in improving the behavior of programs and reducing their execution costs, it is still more interesting to make comparisons under more practical assumptions. The point is that an OS has no means of determining the values of M_o or M_ot, and hence we cannot expect an untransformed program to run with M_o page frames or a transformed program to run with M_ot page frames. Thus the comparison at the optimum ST and ST_t points is probably only of academic interest.

Although we do not wish at this point to discuss particular existing OS's, we want to make some comparisons under assumptions which are closer to what happens in the real world. We will make two sets of comparisons. In the first set we compare ST to ST_t when both the transformed and untransformed programs are given similar memory allotments (4 ≤ m ≤ 8). This type of comparison will show the reduction in the space-time cost which our transformations achieve if the OS uses the policy of allotting a small fixed number of page frames to all programs. In the second set of comparisons we show that on the average, the cost of a transformed program allotted a number of page frames in the range 4 to 8 is much less (an order of magnitude) than the cost of the untransformed program, even if the latter is allotted a number of page frames from a much larger range (12 ≤ m ≤ 48). Here we will be comparing ST_t at m = 4, 6, and 8 to ST at memory allotments in the range 12 ≤ m ≤ 48 with an increment of 4 page frames.
Since at a fixed memory allotment m_a we have:

    ST(m_a) / ST_t(m_a) = [m_a * f(m_a)] / [m_a * f_t(m_a)] = f(m_a) / f_t(m_a)

the results of comparing ST to ST_t at similar memory allotments in the range 4 ≤ m ≤ 8 are identical to those shown in Table 10. Thus all our previous discussion about the improvements in page faults for this memory range applies directly to the improvements achieved in the space-time cost. Hence, on the average the transformed programs will have 19.9 times less space-time cost than the untransformed programs when all programs are assigned a fixed memory allotment in the range 4 to 8 page frames.

Tables 18, 19, and 20 show our second set of comparisons. In Table 18-a we show for all our programs the ratio ST(m)/ST_t(4), where 12 ≤ m ≤ 48. Note that we do not make the comparison for a program at any m which is greater than the DP of the program. We observe that for most programs and for most memory allotments we have ST_t(4) < ST(m). This is not true for program ADVECT with 32 ≤ m ≤ 48. This is because for ADVECT M_o = 32 and M_ot = 6. When some more memory is given to the transformed version of ADVECT (6 or 8 page frames), ST_t will be less than ST(m) for any 12 ≤ m ≤ 48 (Tables 19-a and 20-a). Similar remarks apply to program MAMOCO. In Table 20-a we note that programs CD and LUD are the only two programs for which ST_t(8) is greater than ST(m) for some m, 12 ≤ m ≤ 48. The ratio ST/ST_t improves as the transformed versions of these two programs are given fewer pages. This is because M_ot is 3 for CD and 6 for LUD.

From Tables 18-a, 19-a, and 20-a it seems that an OS can use the simple rule of allocating 4 pages to the transformed programs with relatively small DP (say less than 100 or 75 page frames) and 8 page frames to those with larger DP's. In this case the transformed programs will (in almost all cases) cost less to execute than the original programs no matter how much memory is assigned to the untransformed programs. (Note that it is not our purpose here to determine the exact values of such numbers as 4 page frames for programs with DP < 100, otherwise 8 page frames. More programs and more detailed studies are needed in order to determine such numbers. However, using statistics available from large collections of Fortran programs and arguments about the number of statements in a π-block and the number of operands per statement, we are inclined to believe that our numbers are close to being accurate.)

Tables 18-a, 19-a, and 20-a. The Ratios ST(m)/ST_t(4), ST(m)/ST_t(6), and ST(m)/ST_t(8) for 12 ≤ m ≤ 48. [The bodies of these tables are illegible in the source.]

Tables 18-b, 19-b, and 20-b give some statistics about Tables 18-a, 19-a, and 20-a respectively.
With 4 page frames, a transformed program will have on the average between 6.09 and 10.8 times less space-time product than the untransformed program when the latter is executed with a memory allotment in the range 12 ≤ m ≤ 48. The memory reduction is between a factor of 3 and 12 with an average of 7.5 and a median of 7.5. Note that the median of the reduction in the space-time cost ranges between 3.54 and 8.30 with an average of 5.32. The average of the averages of the improvement in the space-time cost is 8.81.

In Table 19-b, with 6 page frames the average improvement in the space-time cost ranges between 8.94 and 12.88 with an average of 11.49. The median of the improvement ranges between 4.42 and 7.78 with an average of 6.31. The reduction in memory ranges between a factor of 2.00 and 8.00 with an average and a median of 5.00.

In Table 20-b the transformed programs are assigned 8 pages. The average reduction of the space-time cost ranges between 8.39 and 13.68 with an average of 11.73. The median of the improvement ranges between 4.54 and 7.45 with an average of 6.03. The reduction in memory ranges between 1.50 and 6.00 with an average and a median of 3.75.

Tables 18-b, 19-b, and 20-b. Statistics on the Ratios in Tables 18-a, 19-a, and 20-a. [The bodies of these tables are illegible in the source.]

From Tables 18-b, 19-b and 20-b one can say that when transformed programs are executed with memory allotments in the range 4 to 8 pages, they have less space-time cost than the untransformed programs by an average factor of 10.68 (this is the average of 8.81, 11.49, and 11.73). To achieve this improvement, a transformed program will be executing with a memory which is on the average 5.42 times less than the memory allotted to the untransformed program. Thus in a multiprogrammed system our program transformations can potentially result in an order of magnitude improvement in the throughput, with an increase in the degree of multiprogramming of more than a factor of 5.

4.3 Measuring the Performance Improvement of Paged Virtual Memory Systems - the Variable Memory Allotment Case

Most existing virtual memory multiprogrammed systems use memory management policies that vary the memory allotted to a program during its execution. Here we choose the working set policy to represent variable memory allotment policies [DENN68]; other policies are variations of and approximations to the working set policy. Our interest is in finding the effect of our transformations on the space-time cost of executing programs under the working set memory management policy.

Several studies have shown that variable memory allotment policies are superior to fixed memory allotment policies like the LRU [CHU72], [COFF72], [DENN75]. The main reason behind the superiority of the variable memory allotment policies is that the main memory requirement of a program may change drastically during its execution. While fixed memory allotment policies assign to a program the same amount of memory during its entire execution time, variable memory allotment policies try to adapt the memory allotted to a program to the changing sizes of its locality sets.

The working set policy keeps in memory the pages referenced during the previous τ references. This set of pages is called the working set and is denoted at time t by W(t,τ); τ is the window size.
The size of the working set at time t is denoted by w(t,τ).

From the results of our experiments reported in Section 4.1, it is obvious that the changes in the sizes of the locality sets of a transformed program are much smaller than those in an untransformed program. Hence it is interesting to see whether the working set policy is any better than the LRU policy for the transformed programs. For untransformed programs, it seems that enough previous work has been done to show that variable memory allotment policies are better; more work along these lines seems insignificant. Thus our interest is to compare the space-time cost of the transformed programs under the LRU and the working set (WS) policies.

Under the LRU policy one can plot the space-time cost as a function of memory allotment. Under the WS policy, however, the memory allotted to a program, i.e. its working set size w(t,τ), varies during its execution. Thus in order to make a comparison to the space-time cost under LRU, one needs to calculate the average memory allotted to the program during its execution under the WS policy. With a given window size τ, a program trace of length R references will generate f_w(τ) page faults. Let w_i(t_i,τ) be the working set size when the ith page fault occurs, 1 ≤ i ≤ f_w(τ). Then if we denote the page fault service time by T, the average memory allotted to the program is given by:

    M(τ) = [ Σ(t=1..R) w(t,τ) + T * Σ(i=1..f_w(τ)) w_i(t_i,τ) ] / [ R + T * f_w(τ) ]

By varying τ one is supposed to get different values of M(τ) and ST_w(τ), the space-time cost under WS, and hence make a plot of ST_w(τ) versus M(τ) which can then be compared to the space-time cost curve under LRU.

When collecting data for the ST_w(τ) curves we found that several programs exhibited anomalous behavior under WS. Recently Franklin, Graham, and Gupta have discovered, by experimentation, anomalies with the page fault frequency replacement algorithm [FRAN78]. In the same paper they pointed out that for some reference strings and some τ's the WS policy can also behave anomalously. They called these anomalies the parameter (τ)-real memory and real memory-fault rate anomalies, and constructed a short reference string to illustrate them for the WS policy. These are the same anomalies that we found experimentally for some of our transformed programs. Namely, the parameter of the working set policy, τ, did not have a consistent relation to the average real memory allotted to a program. One expects the average memory allotment to be a nondecreasing function of τ: in other words, given τ_1 and τ_2, if τ_2 > τ_1 then it is expected that M(τ_2) ≥ M(τ_1).
Similarly, one expects the number of page faults generated under WS to be nonincreasing with the average allotted memory, i.e. if M(τ_2) > M(τ_1) then it is expected that f_w(τ_2) ≤ f_w(τ_1). That the WS policy should possess these properties is essential if one is to control the performance of a multiprogrammed system by changing the parameter τ. As it is put in [FRAN78]: "...Load control is attempted by varying the paging algorithm parameter. A load control based on an anomalous performance measure may be unstable because a change of given sign in the parameter need not produce changes of corresponding sign in the controlled variable."

For several of our programs we noticed that for some τ_2 > τ_1 we get M(τ_2) < M(τ_1). This is the parameter-average real memory allotment anomaly. Moreover, for M(τ_1) > M(τ_2) we noticed that f_w(τ_1) > f_w(τ_2). This is the average real memory allotment-page fault anomaly.

To find the average real memory allotment we had to choose a value for T, the page fault service time. For a page size of 64 words, we have chosen to use three different values of T: 32 references, 320 references, and 3200 references (1/2 page size, 5 page sizes, and 50 page sizes). The 3200 value seems to reflect a 64-word page fault service time between disc and primary memory. The 32 references seem to reflect a 64-word page fault service time between an interleaved primary memory and a fast cache memory. Page fault service times between CCD's and primary memory seem to fall between these two extremes [JULI78].

Since our main aim was to compare the space-time cost of the transformed programs under LRU and WS, we have chosen values of τ, the window size, in different ranges and with different increments so as to get M(τ) in the relevant range of the LRU space-time curve for each program. Generally speaking, we used 2 ≤ τ ≤ 8 with an increment of 1 to give us M(τ) in the range 1 ≤ M(τ) ≤ 5, and we used τ ≥ 16 with increments of 8, 32, 64, or 128 to give us M(τ) > 5. The selection of the initial value of τ and its increment was tuned for every program to cover the range of M(τ) of interest.

Table 21 shows the anomalous behavior of WS which we discovered in 5 of our 17 transformed programs. M_1(τ), M_2(τ), and M_3(τ) are the average allotted memory with the three values of the page transfer time used: 32, 320, and 3200 references respectively.

Table 21. Anomalous Behavior of the WS Policy in 5 of the 17 Transformed Programs. [The body of this table is illegible in the source.]

We notice that there is a significant difference between f_tw(τ_1) and f_tw(τ_2). Thus, depending on the page fault service time, when the value of τ is increased from τ_1 to τ_2, the reduction in the number of page faults might be big enough to make the drop in the space-time integral greater than the drop in time. Thus the average memory allotment will be decreased rather than increased. In general, if τ is increased from τ_1 to τ_2, then in order for the anomaly to exist we must have:

    [ Σ(t=1..R) w(t,τ_1) + T * Σ(i=1..f_w(τ_1)) w_i(t_i,τ_1) ] / [ R + T * f_w(τ_1) ]
        > [ Σ(t=1..R) w(t,τ_2) + T * Σ(i=1..f_w(τ_2)) w_i(t_i,τ_2) ] / [ R + T * f_w(τ_2) ]

Thus the existence of the anomaly depends on the program, τ_1, τ_2, and T.
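The following is a minimal sketch of how M(τ) and f_w(τ) can be measured from a trace (our reconstruction, with hypothetical names; note that it handles one value of τ per scan, which is precisely what makes these experiments costly, as discussed below):

! Our reconstruction: one scan of the trace gives f_w(tau) and M(tau)
! for a single window size tau and page fault service time t_serv.
! Pages are assumed to be numbered 1..npmax; we count the faulting
! page in w_i, a detail the thesis does not fix.
      real function ws_avg_memory(trace, nref, npmax, tau, t_serv)
        implicit none
        integer, intent(in) :: nref, npmax, tau, t_serv
        integer, intent(in) :: trace(nref)
        integer :: cnt(npmax)            ! occurrences inside the window
        integer :: wsize, nfault, i, p, q
        real    :: stsum
        logical :: fault
        cnt = 0; wsize = 0; nfault = 0; stsum = 0.0
        do i = 1, nref
           p = trace(i)
           fault = (cnt(p) == 0)         ! p is not in W(t-1,tau)
           if (fault) then
              nfault = nfault + 1
              wsize  = wsize + 1
           end if
           cnt(p) = cnt(p) + 1
           if (i > tau) then             ! oldest reference leaves window
              q = trace(i - tau)
              cnt(q) = cnt(q) - 1
              if (cnt(q) == 0) wsize = wsize - 1
           end if
           stsum = stsum + real(wsize)   ! the w(t,tau) term
           if (fault) stsum = stsum + real(t_serv) * real(wsize)
        end do                           ! faults add T * w_i(t_i,tau)
        ws_avg_memory = stsum / (real(nref) + real(t_serv) * real(nfault))
      end function ws_avg_memory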
We do not see an obvious way of explaining the dependence of the anomaly on each individual one of these factors; the four factors interact to produce the anomaly. In [FRAN78] an argument was presented to support a theory that when the anomaly occurs for a given program, τ_1, and τ_2, there exists a crossover value of T = T_c such that the anomaly will occur for all T > T_c. Our experiments have shown that this theory is not valid. For example, in programs CD and FIELD the anomaly occurs for T = 32 and T = 320 but it does not occur for T = 3200.

For all our transformed programs we noted that the anomaly either does not exist or it occurs at values of M(τ) which are less than M_ot, the memory allotment at the minimum space-time point under LRU. We found that for all those programs which are anomaly-free there was no difference between the space-time cost under LRU and WS in any memory range and for the three values chosen for T. For the 5 programs which exhibited the anomalous behavior, there was no difference between the space-time cost under LRU and WS for memory allotments greater than M_ot. For memory allotments less than M_ot the anomaly existed and no comparison can really be made. Note that when we say there was no difference between the cost under LRU and WS, we mean that one cannot really draw two different curves to represent the LRU and WS space-time cost functions. In Figures 27, 28, and 29 we show the space-time cost for two programs which have the anomaly (CD and BASE) and for one program which is anomaly-free (MATMUL). We will not show curves for any more programs because they do not reveal any additional interesting information.

Because of our observation that the anomaly occurred at values of memory allotments less than M_ot (which might be interpreted by some people to mean that the anomaly only shows for some programs when they are thrashing, whatever the definition of thrashing might be), we did some more experimentation to see whether this is always true. We generated the space-time cost functions under the WS policy for 7 of our untransformed programs, namely ADVECT, BASE, BIGEN, DISPERSE, FOURTR, INIT, and PAPUAL. The anomaly showed in 3 of these programs: INIT, DISPERSE, and FOURTR. For program INIT the anomaly occurred at memory allotments below and above M_o (for INIT M_o = 6). For program DISPERSE the anomaly occurred at memory allotments greater than M_o = 1. For the FOURTR program the anomaly occurred at memory allotments less than M_o = 67. As a matter of fact we did not check whether it also occurs at allotments greater than 67. (Remember that these experiments are very costly because the trace has to be scanned once for every value of τ. We could not find in the literature any algorithm for calculating the real average memory allotments for different τ's in one scan of the trace. Moreover, from our own investigation of this matter we reached the conclusion that one needs to save so much information when going through the trace to calculate the real average memory for different τ's that it is probably cheaper and much simpler to go through the trace several times. To locate anomalies one ideally needs to start at τ = 1 and increase it by increments of 1. This is really a prohibitive expense even for short traces. Most probably, this is the reason why the working set anomaly was not discovered for more than ten years after the introduction of the working set policy [DENN68]. Most probably this is also why nobody else has tried to investigate this anomaly in real programs to date.)

(Figures 27 through 29 are not reproducible from the scanned source. Each plots the space-time cost, in pages-references, versus pages of real memory, with one curve for LRU and one for WS.)

Figure 27-a, -b, -c. The Space-Time Cost of Program BASE (Transformed), T = 32, 320, and 3200 References.
Figure 28-a, -b, -c. The Space-Time Cost of Program CD (Transformed), T = 32, 320, and 3200 References.
Figure 29-a, -b, -c. The Space-Time Cost of Program MATMUL (Transformed), T = 32, 320, and 3200 References.
Table 22 summarizes our findings concerning the anomalies in the untransformed versions of programs INIT, DISPERSE, and FOURTR.

Note that in our previous conclusion that there was no difference between the space-time cost of executing a program under the LRU or the WS policies, we are using the average behavior of the program under WS to make the comparison. In fact, it should be clear that the LRU policy is a better policy for transformed programs. If one plots the memory allotted to a transformed program as its execution progresses in real time, the WS curve will have sharp peaks whenever the program changes localities. The LRU curve, however, stays at the same level for the entire execution time of the program. Although the WS sharp peaks are usually short, they can still cause serious problems in a multiprogrammed system. If no free page frames are available when such an excessive demand for memory occurs, other programs may be deactivated. In [SMIT76] reducing the seriousness of this problem is approached by making the WS policy more elaborate and introducing a second parameter for the policy.
Table 22. The WS Anomalies in the Untransformed Versions of Programs INIT, DISPERSE, and FOURTR. [The body of this table is illegible in the source.]

4.4 Summary

[The opening of this section is illegible in the source; the text resumes below.]

For m ≥ m_o, the number of page faults f is O(K), where K is the number of pages per array. For m < m_o, f is O(the number of words per array). When a loop is allotted m < m_o we will say that the loop is thrashing. Thus, to simplify the discussion, if we assume only one-dimensional arrays of size N, then for m ≥ m_o, f is O(N/Z), and for m < m_o, f is O(N) (see Chapter 2 for more details about these points).

We consider different possibilities. In the first case let us see what happens when the page size Z is increased without increasing the array sizes, i.e. N. In the second case we find out the effects of increasing N without increasing Z. In the last case both N and Z are increased. In all cases we are interested in the programs only as long as N > Z; otherwise their memory requirements are relatively small and they are not of concern to us.

When the array sizes are not changed and the page size is increased, then by extending our previous discussion from the behavior of loops to the behavior of a transformed program, we do not expect M_ot to change significantly, and for most programs it will not change at all. To see why M_ot is not expected to change, let us remember that M_ot is the memory allotment at the minimum space-time cost of the program. When Z is increased, the reduction in the number of page faults generated by each loop in the program (when it is not thrashing) is proportional to the reduction in the number of pages spanned by the arrays of the loop. Thus although the m_o's of the individual loops are not expected to change, the relative contribution of each loop to the total space-time cost might change. This will happen if the relative changes in the numbers of page faults generated by the loops are not the same. However, since most of a transformed program's time is spent in localities (π-blocks) with five array names or so, the changes in M_ot, if they ever occur, will be very small. In other words, M_ot for any of our programs will always be less than 8 and mostly around 5, irrespective of the page size or the array sizes.

Since as Z is increased the number of pages spanned by each array will decrease, DP, the number of distinct pages referenced by each program, will decrease. Hence the asymptotic values of the page fault curves of both the transformed and untransformed programs will drop. Thus the values of f_t(m) for m > M_ot will decrease. For m < M_ot, f_t(m) is not expected to change much because the program will be thrashing. This is also true for f(m) of the untransformed program at m < M_o. Thus in the memory range M_ot ≤ m < M_o our results will improve. This is because, as mentioned previously, in this range f_t(m) is decreased while f(m) will not drop much. We do expect, however, a drop in M_o which is more appreciable than the change in M_ot. Note that for some untransformed programs M_o might not change, or might change only slightly, depending on how well the program is behaving. Thus the general conclusion is that when the page size is increased, the difference between the f_t(m) and f(m) curves in the region M_ot ≤ m < M_o is expected to increase (or at least not to decrease), while the width of this region might in general decrease. Similar remarks apply to the ST(m) and ST_t(m) curves.
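As a concrete illustration (the numbers here are our own, not measurements from the thesis): a loop sweeping a one-dimensional array of N = 16384 words with an adequate allotment (m ≥ m_o) faults roughly once per page, so

    f ≈ N/Z = 16384/64  = 256 faults at Z = 64, but
    f ≈ N/Z = 16384/256 = 64  faults at Z = 256,

while below m_o nearly every one of the N references can fault, essentially independently of Z. This is why increasing Z helps f_t(m) above M_ot but does little for either program where it is thrashing.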
To check the validity of our arguments we have changed the page size (without changing the array sizes) and obtained the page faults and space-time cost data for 4 of our programs: BIGEN, FIELD, MAMOCO, and TWOWAY. The results were in agreement with our expectations. As an example we show in Figures 30 and 31 the page faults and space-time cost curves with a page size of 256 words for program MAMOCO and its transformed version. We also show the curves for a page size of 64 words. Note that M_o has dropped from 31 to 17. In BIGEN, with similar changes in page size (from 64 to 256), M_o did not change; the untransformed program of BIGEN is much better behaved than MAMOCO. Also, for BIGEN M_ot did not change, while in MAMOCO M_ot increased from 6 to 8 (though ST_t(6) and ST_t(8) for Z = 64 are not very different). Note the increase in the improvement in the page faults and space-time cost when Z was increased to 256.

Figure 30. The Page Faults Curves for Program MAMOCO. [Not reproducible; it plots page faults versus pages of main memory for the original and transformed programs at Z = 64 and Z = 256.]

Figure 31. The Space-Time Cost Curves for Program MAMOCO. [Not reproducible; it plots the space-time cost, in pages-page faults, versus pages of main memory for the same four cases.]

The conclusion we reached in the previous paragraph is really relevant to the validity of our results under the worst possible conditions, namely Z increasing without any increase in the array sizes. A more realistic approach would be to allow both Z and the array sizes to increase. As we have indicated previously, the sizes of arrays can easily grow by a factor of 10 for some of our programs. This is comparable to, or even in many cases more than, the expected growth of the page size.

If the sizes of the arrays grow more than the page size, our results will be improved, and depending on the program, the improvement can be drastic. By an argument similar to the one we made previously, M_ot is not expected to change much. M_o, however, will increase. Thus the range of memory allotments which is of concern to us (M_ot ≤ m ≤ M_o) will widen. In this memory range the page faults of the untransformed program will increase in a manner which is roughly proportional to the increase in the number of words per array. The page faults of the transformed program, however, will increase in a manner which is roughly proportional to the increase in the number of pages per array. In other words, if we have only one-dimensional arrays in a program, the page faults of the untransformed program, f(m), in the range M_ot ≤ m < M_o are in the best case O(N), while f_t is O(N/Z). Thus if the array sizes grow faster than the page size, the region of improvement will widen (M_ot ≤ m < M_o) and the degree of improvement will increase. If the page size is increased more than the array sizes, then we have the situation discussed in the previous paragraph.

What happens if both the page size and the array sizes are increased such that the number of pages per array stays the same? In this case it is easy to see that neither M_o nor M_ot will change. Moreover, f_t(m) in the range M_ot ≤ m < M_o will not change. However, f(m) in this range will increase in a manner which is roughly comparable to the increase in the number of words per array. Hence our results will be improved.
We believe that this case, where both the array sizes and the page size grow in a comparable way, represents the most realistic situation as far as existing virtual memory machines, and the programs which cause problems for these machines, are concerned. To check our conclusions for this latter case we have changed the page size and the array sizes of 6 of our programs such that the number of pages per array stays unchanged. These programs are: CD, FLR, GE, LUD, MATMUL, and MATTRP. Our experimental findings agreed precisely with our expectations. As an example we show in Figures 32 and 33 the curves for program MATMUL.

Figure 32. The Page Faults Curves for Program MATMUL. [Not reproducible; it plots page faults versus pages of main memory for our original and transformed programs at Z = 64 and Z = 512, for Elshoff's original program at Z = 512, and for Elshoff's combination of all his rules at Z = 512.]

Figure 33. The Space-Time Cost Curves for Program MATMUL. [Not reproducible; it plots the space-time cost, in pages-page faults, versus pages of main memory for the same cases as Figure 32.]

For this matrix multiply program the page sizes are 64 words and 512 words. In both cases each two-dimensional array in the program spanned 25 pages; thus DP in both cases is 75. For Z = 64 the dimensions of the arrays were 40x40. When we increased Z to 512 we chose the dimensions of the arrays to be 101x101. These dimensions were chosen particularly to be identical to those used by Elshoff for the same program in [ELSH74]. This is because we wanted to compare our results in Figure 32 to the best results obtained by Elshoff when he used all his rules to improve the locality of the same matrix multiplication program. However, this choice of the array dimensions reduces the improvement of our results as Z is changed from 64 to 512. From this point of view it would have been more fair to choose the dimensions to be 110x110. This is because with Z = 64 and array dimensions of 40x40, all points of the 25 pages of each array are referenced (remember we are using the submatrix storage scheme). With Z = 512 and 101x101 arrays, only 79.7% of the words in the 25 pages of an array will be referenced; with 110x110 arrays, 94.5% of the words in the 25 pages of an array will be referenced. Since f(m) for M_ot ≤ m < M_o increases with the number of words referenced, while f_t(m) in this range depends on the number of pages referenced, changing the array dimensions from 101x101 to 110x110 would have left f_t(m) unchanged and would have increased f(m) by more than 14.8% (94.5% - 79.7%).

We note that, as expected, the curves of the transformed program are identical for Z = 64 and Z = 512. For the untransformed program, the number of page faults and the space-time cost increased when Z was increased. The increase in Z is a factor of 8. For m < 10, f(m) is increased by a factor of 6.17 (for 110x110 arrays the increase in f(m) would be greater than 7.3). Thus the increases in f(m) and Z are comparable in this memory range. We note that the difference between the f(m) curves decreases as the memory allotment is increased. For m > M_o = 41, f(m) is independent of the page size.
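For reference, the submatrix storage scheme mentioned above places each r-by-c block of a two-dimensional array on its own page, with r*c = Z; for Z = 64 one may take r = c = 8, so a 40x40 array occupies 5x5 = 25 pages. A minimal sketch of the element-to-page mapping (our illustration; the thesis does not give this formula explicitly):

! Our illustration of submatrix (block) storage: element (i,j) lives
! on the page holding its r-by-c block.  nbcols is the number of
! blocks per block-row, e.g. 5 for a 40x40 array with 8x8 blocks.
      integer function page_of(i, j, r, c, nbcols)
        implicit none
        integer, intent(in) :: i, j        ! element indices, from 1
        integer, intent(in) :: r, c        ! block dimensions, r*c = Z
        integer, intent(in) :: nbcols
        page_of = ((i - 1) / r) * nbcols + (j - 1) / c + 1
      end function page_of

Under row-major storage a column sweep of a large array can touch a new page on almost every reference, while under block storage both row and column sweeps revisit the same page many times before leaving it.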
The data for the Elshoff curves was obtained from [ELSH74] (in that paper there is no data for m > 20 pages). We observe that our original program produced fewer page faults than Elshoff's original program (for 3 ≤ m ≤ 10 the reduction factor is 2, and for 12 ≤ m ≤ 20 it is 66.7). We achieved this improvement simply by storing multi-dimensional arrays using the submatrix storage scheme; Elshoff, however, coded his program in PL/1, which stores multi-dimensional arrays by rows. Comparing the curve of our transformed program to the curve of the program using the combination of all Elshoff's rules, we note that our automatic transformation techniques (combined with the submatrix storage scheme) are as powerful as Elshoff's rules (for m < 16 our transformed program produces even fewer page faults).

5. CONCLUSIONS AND EXTENSIONS

We hope that this thesis has been successful in drawing the attention of computer manufacturers and scientists to the fact that compilers should use special transformations when compiling for virtual memory computers. It is very frustrating to find that existing compilers do not make any distinction between compiling for a virtual memory machine and compiling for a non-virtual memory machine.

Although in the last decade a tremendous number of papers have been written about virtual memory systems, the behavior and control of these systems are still not well understood. We believe that this is due to the approach taken by many researchers, in which programs were treated as black boxes that generate reference strings. More effort needs to be dedicated to studying what is in these boxes, namely the programs themselves. In this thesis we have shown that programs, as written by people, do not behave well in a virtual memory environment. We have also shown that simple compiler transformations can force programs to behave well (and hence be easy to model and manage) and cost less to execute.

We would like to use the rest of this final chapter to suggest some points for future research. We will discuss three main issues. First, we discuss possible improvements of some of the transformations of Chapter Three. Second, we raise some questions concerning the implications of our results for the memory hierarchy design problem. Third, we point out the importance of extending our techniques to non-numeric programs (e.g., Cobol programs).

Of all the transformations presented in Chapter Three, the nonbasic to basic π-block transformation seems to be the most costly. The algorithm used in this transformation is simple; however, the number of control instructions executed in the transformed program is increased drastically (for program LUD the increase is almost an order of magnitude; see Table 8). A more elaborate algorithm can be used to apply the page indexing transformation to a nonbasic π-block without first transforming it to a basic π-block. In what follows we illustrate this technique, the nonbasic π-block breaking technique, by applying it to Program 16-a of Section 3.5.3.

By definition, the statements of a nonbasic π-block fall at different nest depth levels. The general idea here is to identify the values of the different index variables which cause the recurrence in the π-block and solve the π-block for these values first. Then we will be left with a basic π-block. Consider Program 16-a, which is repeated below.

Program 16-a.
      DO S2 I = 1,N
S1      B(I,1) = A(I,1) ** .5
        DO S2 J = 1,N
S2        A(I+1,J) = B(I,J) + C(I,J)

By examining the data dependences in this program we find that the recurrence occurs when J = 1 (i.e., the dependence arcs going from S1 to S2 and from S2 to S1 are due to the fact that J takes the value 1; thus if J never took the value 1 there would be no recurrence). Hence, this nonbasic π-block can be divided into two basic ones as follows:

Program 16-e.

      DO S21 I = 1,N
S11     B(I,1) = A(I,1) ** .5
S21     A(I+1,1) = B(I,1) + C(I,1)
      DO S22 I = 1,N
        DO S22 J = 2,N
S22       A(I+1,J) = B(I,J) + C(I,J)

This program can now be page indexed as follows:

Program 16-f.

      DO S22 IP = 1, ⌈N/RZ⌉
        ILB = 1 + (IP-1)*RZ
        IUB = MIN(IP*RZ, N)
        DO S21 I = ILB, IUB
S11       B(I,1) = A(I,1) ** .5
S21       A(I+1,1) = B(I,1) + C(I,1)
        DO S22 JP = 1, ⌈N/RZ⌉
          JLB = MAX(2, 1 + (JP-1)*RZ)
          JUB = MIN(JP*RZ, N)
          DO S22 I = ILB, IUB
            DO S22 J = JLB, JUB
S22           A(I+1,J) = B(I,J) + C(I,J)

We have used this concept of breaking nonbasic recurrences in programs CD, GE, and LUD. We obtained the same curves of page faults and space-time cost versus memory allotment as before (for program LUD we got better results here because loop fusion is not used as it was in the nonbasic to basic transformation; M_ot is reduced from 6 to 3). Table 23 compares the number of instructions executed when using the recurrence breaking technique and the nonbasic to basic π-block transformation. The advantages of the recurrence breaking technique are obvious. However, more work needs to be done to determine the complexity of this technique and its implementation problems.

Table 23. Comparing the Two Techniques of Transforming Nonbasic π-Blocks.

                           Number of Instructions Executed
Program    Original      Nonbasic to Basic π-Block    Recurrence Breaking
           Program       Transformation Used          Transformation Used
CD          234211             2202748                      295547
GE          494314             1619039                      567741
LUD         507543             2247035                      676576

Another transformation technique which needs further investigation is one we used in the Fast Fourier Transform program, FOURTR. Basically, what we did can be illustrated by the following example.

Program 19-a.

      DO S I = 1,N1
        DO S J = I,N2,DELT
S         A(J) = B(J) + C(J)

In this program, if DELT > N1 then its locality can be improved by transforming it as follows (the mean time between references to the same page will be smaller):

Program 19-b.

      DO S I = 1, ⌈(N2-1)/DELT⌉
        JLB = 1 + (I-1)*DELT
        JUB = N1 + (I-1)*DELT
        DO S J = JLB, JUB
S         A(J) = B(J) + C(J)

We have chosen not to discuss this technique in Chapter Three because we encountered this situation only once in the programs we examined. More work needs to be done to investigate how important this case is and to develop the needed general transformation algorithm.

Before leaving the subject of improving the transformations, we want to mention that some of the rules we adopted in some transformations were rather strict. For example, to fuse two NP's we required that their control structures be identical. This rule does not have to be so rigid. Loops of slightly different control structure can be fused if the difference in the control structure is taken care of by appropriate statements (IF statements, for example). Thus the loop fusion transformation might need some tuning.
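As a small illustration of our own (not an example from the thesis), two loops whose lower bounds differ can be fused by guarding the extra iterations with a logical IF. The loops

      DO S1 I = 1,N
S1      A(I) = B(I) + C(I)
      DO S2 I = 2,N
S2      D(I) = A(I-1) * E(I)

can be fused into

      DO S2 I = 1,N
S1      A(I) = B(I) + C(I)
S2      IF (I .GE. 2) D(I) = A(I-1) * E(I)

The fusion is legal here because A(I-1) has already been computed when iteration I of the fused loop executes; the IF merely absorbs the difference in the loop bounds.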
The second area which has great potential for further research is investigating the implications of our results for the memory hierarchy design problem. For example, pages of large sizes are currently favored over small pages because of the page fault service time overhead. However, the larger the page, the worse the internal fragmentation problem becomes [DENN70]. Currently, with CCD technology, people are building smart (expensive) controllers which reduce the latency time to zero. In [FULL78] and [SITE78], a cheaper approach is suggested which cuts the average latency time to about .1 of the rotation cycle. Thus it seems that the latency problem of the rotating paging devices is going to disappear one way or another. Hence, the page fault service time will be reduced. Since transformed programs have excellent behavior even with small page sizes, a reconsideration and re-evaluation of the best page size needs to be done. If small page sizes prove to be better, as we expect, this leads to a considerable reduction in the amount of physical primary memory needed in a machine.

This thesis invites an investigation of another important subject. In the last few years research has been going on at the University of Illinois to design transformations for enhancing the parallelism of ordinary programs so that they execute efficiently on parallel machines. Not much attention was given to the effect of these transformations on the memory space requirements and I/O activities of programs. The challenging question which we are raising here is: how can programs be transformed to run faster on a parallel machine which is supervised by a virtual memory operating system? When transforming programs for vector machines, the goal is to maximize the number of operations which can be executed simultaneously. The larger the number of data items which can be processed simultaneously, the higher the speedup achieved by a vector machine. In other words, parallel and pipelined machines are most effective when they process long vectors. This necessitates that these long vectors be accessible in main memory. From a paging operating system point of view, however, the goal is to minimize the space-time cost, the primary memory requirements, and the I/O activity of programs. In serial machines the success of virtual memory systems is based on the locality property, i.e., only a small portion (a small number of pages) of the data (and code) of a program needs to be in main memory at one time. The transformations presented in this thesis are aimed at enhancing this locality property. Thus it seems that our virtual memory enhancement transformations and the parallelism enhancement transformations are at odds. The parallelism transformations assume that all the elements of large arrays will be in main memory, while the virtual memory transformations are designed to make programs execute with as little data in main memory as possible! It is interesting to find out whether some compromise transformations can be designed to achieve both goals: enhancing the parallelism and the locality of programs.

Last, but not least, the design of transformations for improving the locality of nonnumeric programs (Cobol programs, for example) is another possible area for future research. This is important because the majority of machine cycles in the world are spent on such nonnumerically oriented calculations.

REFERENCES

[ARVI73] Arvind, R. Y. Kain, and E. Sadeh, "On Reference String Generation Processes," Proc. 4th ACM Symp. on Operating Systems Principles, October 1973, pp. 80-87.

[BABO77] Babonneau, J. Y., M. S. Achard, G. Morisset, and M. B. Mounajjed, "Automatic and General Solution to the Adaptation of Programs in a Paging Environment," Proc.
6th ACM Symposium on Operating Systems Principles, November 1977, pp. 109-116.

[BANE76] Banerjee, U., "Data Dependence in Ordinary Programs," Department of Computer Science, University of Illinois at Champaign-Urbana, Report No. 837, November 1976.

[BANE78] Banerjee, U., "Detection of Array Variables in Data Flow Analysis," in preparation.

[BATS76a] Batson, A. P. and A. W. Madison, "Characteristics of Program Localities," CACM, Vol. 19, No. 5, May 1976, pp. 285-294.

[BATS76b] Batson, A. P. and A. W. Madison, "Measurements of Major Locality Phases in Symbolic Reference Strings," Proc. International Symposium on Computer Performance Modeling, Measurement, and Evaluation, Cambridge, Mass., 1976, pp. 75-84.

[BATS76c] Batson, A. P., "Program Behavior at the Symbolic Level," Computer, November 1976, pp. 21-26.

[BELA66] Belady, L. A., "A Study of Replacement Algorithms for Virtual Storage Computers," IBM Systems J., Vol. 5, No. 2, 1966, pp. 78-101.

[BELA69] Belady, L. A. and C. J. Kuehner, "Dynamic Space Sharing in Computer Systems," CACM, Vol. 12, No. 5, May 1969, pp. 282-288.

[BOBR67] Bobrow, D. G. and D. L. Murphy, "Structure of a LISP System Using Two-Level Storage," CACM, Vol. 10, No. 3, March 1967, p. 155.

[BRAW68] Brawn, B. and F. Gustavson, "Program Behavior in a Paging Environment," AFIPS FJCC, Vol. 33, 1968, pp. 1019-1032.

[BRAW70] Brawn, B. and F. Gustavson, "Sorting in a Paging Environment," CACM, Vol. 13, No. 8, August 1970, p. 483.

[BUDZ77] Budzinski, R. L., "Dynamic Memory Allocation for a Virtual Memory Computer," Coordinated Science Laboratory, University of Illinois at Champaign-Urbana, Report No. R-754, January 1977.

[CHU72] Chu, W. W. and H. Opderbeck, "The Page Fault Frequency Replacement Algorithm," AFIPS FJCC, Vol. 41, 1972, pp. 597-609.

[COME67] Comeau, L. W., "A Study of the Effect of User Program Optimization in a Paging System," Proc. ACM Symposium on Operating Systems Principles, Gatlinburg, Tenn., 1967.

[DENN68] Denning, P. J., "The Working Set Model for Program Behavior," CACM, Vol. 11, No. 5, May 1968, pp. 323-333.

[DENN70] Denning, P. J., "Virtual Memory," Computing Surveys, Vol. 2, No. 3, September 1970, pp. 153-189.

[DENN72a] Denning, P. J., "On Modeling Program Behavior," Proc. AFIPS SJCC, 1972, pp. 937-945.

[DENN72b] Denning, P. J. and J. R. Spirn, "Experiments with Program Localities," AFIPS FJCC, 1972, pp. 611-621.

[DENN75] Denning, P. J. and K. C. Kahn, "A Study of Program Locality and Lifetime Functions," Proc. 5th ACM Symposium on Operating System Principles, Austin, Texas, 1975, pp. 207-216.

[DUBR72] Dubrulle, A. A., "Solution of the Complete Symmetric Eigenvalue Problem in a Virtual Memory Environment," IBM JRD, November 1972, pp. 612-615.

[ELSH74] Elshoff, J. L., "Some Programming Techniques for Processing Multi-Dimensional Matrices in a Paging Environment," Proc. NCC, 1974, pp. 185-193.

[FERR74] Ferrari, D., "Improving Program Locality by Strategy-Oriented Restructuring," Information Processing 74 (Proc. IFIP Congress 74), North-Holland, Amsterdam, 1974, pp. 266-270.

[FERR75] Ferrari, D., "Tailoring Programs to Models of Program Behavior," IBM JRD, Vol. 19, No. 3, May 1975, pp. 244-251.

[FERR76a] Ferrari, D. and E. Lau, "An Experiment in Program Restructuring for Performance Enhancement," Proc. 2nd Int. Conf. on Software Engineering, San Francisco, Calif., October 1976.

[FERR76b] Ferrari, D., "The Improvement of Program Behavior," Computer, November 1976, pp. 39-47.

[FINE66] Fine, G. H., P. V. McIsaac, and C. W.
Jackson, "Dynamic Program Behavior under Paging," Proc. ACM 21st Nat. Conf. 1966, Thompson Book Co., Washington, D. C, pp. 223-228. [FRAN78] Franklin, M. A., G. S. Graham, and R. K. Gupta," Anomalies with Variable Partition Paging Algorithms," CACM, Vol. 21, No. 3, March 1978, pp. 232-236. [FULL78] Fuller, S. H. and P. F. McGehearty, "Minimizing Latency in CCD Memories," IEEETC, Vol. C-27, No. 3, March 1978, pp. 252-254. [GLAS65] Glaser, E. L. and J. B. Dennis, "The Structure of On-Line Information Processing Systems," Proc. Second Congress on Information Systems Sciences, 1965, pp. 5-14. [HATF71] Hatfield, D. J. and J. Gerald, "Program Restructuring for Virtual Memory," IBM Systems Journal, Vol. 10, No. 3, 1971, pp. 168-192. [IBM73] "Introduction to Virtual Storage in System/370," IBM Publica- tion GR20-4260-1, February 1973, pp. 50-51. [JONE72] Jones, P. D., "Implicit Storage Management in the Control Data STAR-100," COMPCON 72 Digest, 1972, pp. 5-7. [JULI78] Juliussen, J.E., "Bubbles and CCD Memories-Solid State Mass Storage," Proc. NCC, 1978, pp. 1067-1075. [KILB62] Kilburn, T., D. B. G. Edwards, M. J. Lanigan, and F. H. Sumner, "One-Level Storage System," IRE Transactions, EC-11- Vol. 2, April 1962, pp. 223-235. [KUCK70] Kuck, D. J. and D. H. Lawrie , "The Use and Performance of Memory Hierarchies: A Survey," Software Engineering, Vol. 1, 1970, pp. 45-77. [KUCK78] Kuck, D. J., "The Structure of Computers and Computation," John Wiley and Son Inc., 1978. [LEAS76] Leasure, B. R. , "Compiling Serial Languages for Parallel Machines," Department of Computer Science, University of Illinois at Champaign-Urbana, Report No. 805, November 1976. 217 [MASU74] Masuda, T., H. Shiota, K. Noguchi, and T. Ohki, "Optimi- zation of Program Organization by Cluster Analysis," Information Processing 74 (Proc. IFIP Congress 74), North- Holland, Amsterdam, 1974, pp. 261-265. [McKE69] McKeller, A. C. and E. G. Coffman, "The Organization of Matrices and Matrix Operations in a Paged Multiprogramming Environment," CACM, Vol. 12, No. 3, 1969, pp. 153-165. [MOLE72] Moler, C. B., "Matrix Computation with Fortran and Paging," CACM, Vol. 15, No. 4, 1972, p. 268. [PRIE76] Prieve, B. G. and R. S. Fabry, "VMIN-An Optimal Variable Space Replacement Algorithm," CACM, Vol. 19, No. 5, May 1976, pp. 295-297. [ROGE73] Rogers, L. D., "Optimal Paging Strategies and Stability Considerations for Solving Large Linear Systems," Ph.D. Thesis, University of Waterloo, Canada, 1973. [SAYR69] Sayre, D., "Is Automatic Folding of Programs Efficient Enough to Displace Manual?" CACM, Vol. 12, December 1969, pp. 656- 660. [SCHE73] Scherr, A. L., "The Design of IBM 0S/VS2 Release 2," AFIPS Conf. Proc, Vol. 42, 1973, pp. 387-394. [SHED72] Shedler, G. S. and C. Tung, "Locality in Page Reference Strings," SIAM J. Computing, Vol. 1, No. 3, September 1972, pp. 218-241. [SITE78] Sites, R. L., "Optimal Shift Strategy for a Block-Transfer CCD Memory," CACM, Vol. 21, No. 3, May 1978, pp. 423-425. [SMIT67] Smith, J. L., "Multiprogramming under a Page on Demand Strategy," CACM, Vol. 10, No. 10, October 1967, pp. 636- 646. [SMIT76] Smith, A. J., "A Modified Working Set Paging Algorithm," IEEETC, Vol. C-25, No . 9 , September 1976, pp. 907-914. [SPIR76] Spirn, J., "Distance String Models for Program Behavior," Computer, November 1976, pp. 14-20. [SPIR77] Spirn, J., "Program Behavior: Models and Measurements," Elsevier-North Holland, N. Y., 1977. 218 [TOWL76] Towle, R. 
A., "Control and Data Dependence for Program Transformations," Department of Computer Science, University of Illinois at Champ aign-Urb ana, Report No. 788, March 1976. [WOLF78] Wolfe, M. J., "Techniques for Improving the Inherent Parallelism in Programs," M. S. Thesis, Department of Computer Science, University of Illinois at Champaign- Urbana, February 1978. 219 APPENDIX In this appendix, we show the page faults and the space-time cost curves for our untrans formed and transformed programs. The replacement algorithm used is the LRU algorithm and the page size is 256 bytes. The space-time cost is measured in pages-page faults (see Section 4.2.2). 220 lxicr Original Program 3xi0' Transformed Program 10 15 20 25 Pages of Real Memory 30 35 Figure 34-a. The Page Faults Curves for Program ADVECT 221 5x10" Space-_ Time Cost 1X10- 1X10 5*10" Original Program 10 15 20 25 30 35 Pages of Real Memory Figure 34-b. The Space-Time Cost Curves for Program ADVECT 222 3xio 1x10 Page Faults 1X10 3 1*10 Original Program Transformed Program 10 15 20 25 i^\^i 1 -V 30 40 50 Figure 35-a. The Page Faults Curves for Program BASE 223 2xl(T r lxiO- Space-Time Cost 1X10 1X10- Trans formed Program -L -L J_ J_ 10 15 20 Pages of Real Memory J-A^l 1 25 30 40 50 Figure 35-b. The Space-Time Cost Curves for Program BASE 3x10 1x10 Page Faults ix 10" 3x10 224 Transformed Program Original Program Pages of Real Memory Figure 36-a. The Page Faults Curves for Program BIGEN 225 6x10 ^ Space- Time Cost 1x10- 7x10 Original Program lxlCT - 3 4 Pages of Real Memory Figure 36-b. The Space-Time Cost Curves for Program BIGEN lxiO Page Faults 1*10 4 lxlO" 1x10 226 Original Program Transformed Program 1x10 i i i i i i i i i i i i i i i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Pages of Real Memory Figure 37-a. The Page Faults Curves for Program CD lxlO" Space- Time Cost 1x10 1x10" Original Program 1x10 227 1 2 10 11 12 13 14 15 Pages of Real Memory Figure 37-b. The Space-Time Cost Curves for Program CD 228 2x10 r 4 1x10 Page Faults lxlO" 6x10 I Original Program ■-V 10 15 20 40 Pages of Real Memory 60 80 Figure 38-a. The Page Faults Curves for Program DISPERSE 229 lxiCT Space-Time Cost 1x10 2X10' Original Program Transformed Program _L -V 10 15 20 40 Pages of Real Memory 60 80 Figure 38-b. The Space-Time Cost Curves for Program DISPERSE lxlO Page Faults 1X10" 1X10 _ 4x10" — Transformed _L 10 15 Pages of Real Memory 20 30 Figure 39-a. The Page Faults Curves for Program FIELD 231 3x10 1*10 Space- Time Cost 1*10' 4x10 Original Program J_ Transformed Program ± 10 15 Pages of Real Memory 20 25 Figure 39-b. The Space-Time Cost Curves for Program FIELD 232 3x10" 1X10 - Page Faults 1x10 1X10" Original Program 3 4 5 Pages of Real Memory Figure 40-a. The Page Fault Curves for Program FLR 233 3x10" 1X10" Space- Time Cost 1x10 4x10 Original Program 3 4 5 6 Pages of Real Memory Figure 40-b. The Space-Time Cost Curves for Program FLR lxHT r Page Faults - 1x10 - 1x10 10 234 Original Program 20 30 40 Pages of Real Memory 50 60 70 Figure 41-a. The Page Faults Curves for Program FOURTR 235 lxlO Space- Time Cost 1x10" 1X10 4x10" 10 20 30 40 Pages of Real Memory 50 60 70 Figure 41-b. The Space-Time Cost Curves for Program FOURTR 1X10 ~ Page Faults lxlO 4 lxlO" 1X10 lxlO J Transformed Program -L -L X 10 15 20 25 Pages of Real Memory 236 30 35 Figure 42-a. The Page Faults Curves for Program GE 237 2x10" 1*10" Space-Time Cost 1x10 l x 10" 1X10' Original Program Transformed Program 10 15 20 25 Pages of Real Memory 30 35 Figure 42-b. 
[The figures that follow in the original report plot Page Faults (part a of each figure) and Space-Time Cost (part b) against Pages of Real Memory, each showing one curve for the original program and one for the transformed program. Only the captions are reproduced here.]

Figure 34-a. The Page Faults Curves for Program ADVECT
Figure 34-b. The Space-Time Cost Curves for Program ADVECT
Figure 35-a. The Page Faults Curves for Program BASE
Figure 35-b. The Space-Time Cost Curves for Program BASE
Figure 36-a. The Page Faults Curves for Program BIGEN
Figure 36-b. The Space-Time Cost Curves for Program BIGEN
Figure 37-a. The Page Faults Curves for Program CD
Figure 37-b. The Space-Time Cost Curves for Program CD
Figure 38-a. The Page Faults Curves for Program DISPERSE
Figure 38-b. The Space-Time Cost Curves for Program DISPERSE
Figure 39-a. The Page Faults Curves for Program FIELD
Figure 39-b. The Space-Time Cost Curves for Program FIELD
Figure 40-a. The Page Faults Curves for Program FLR
Figure 40-b. The Space-Time Cost Curves for Program FLR
Figure 41-a. The Page Faults Curves for Program FOURTR
Figure 41-b. The Space-Time Cost Curves for Program FOURTR
Figure 42-a. The Page Faults Curves for Program GE
Figure 42-b. The Space-Time Cost Curves for Program GE
Figure 43-a. The Page Faults Curves for Program INIT
Figure 43-b. The Space-Time Cost Curves for Program INIT
Figure 44-a. The Page Faults Curves for Program LUD
Figure 44-b. The Space-Time Cost Curves for Program LUD
Figure 45-a. The Page Faults Curves for Program MAIN
Figure 45-b. The Space-Time Cost Curves for Program MAIN
Figure 46-a. The Page Faults Curves for Program MAMOCO
Figure 46-b. The Space-Time Cost Curves for Program MAMOCO
Figure 47-a. The Page Faults Curves for Program MATMUL
Figure 47-b. The Space-Time Cost Curves for Program MATMUL
Figure 48-a. The Page Faults Curves for Program MATTRP
Figure 48-b. The Space-Time Cost Curves for Program MATTRP
Figure 49-a. The Page Faults Curves for Program PAPUAL
Figure 49-b. The Space-Time Cost Curves for Program PAPUAL
Figure 50-a. The Page Faults Curves for Program TWOWAY
Figure 50-b. The Space-Time Cost Curves for Program TWOWAY

VITA

Walid A. Abu-Sufah was born in Amman, Jordan, on the 1st of October 1949. In 1967 he was one of the top five among the fifteen thousand students who took the National General High School Examination in Jordan. Thereupon he received a United States Agency for International Development scholarship to study at the American University of Beirut, Lebanon. Throughout his undergraduate study he was on the Dean's Honor List. In 1972 he received his B.E. with distinction in electrical engineering. During the academic year 1972-1973 he was a visiting graduate student from the American University of Beirut to the University of Pittsburgh, PA, in the exchange program between the two universities.
At the University of Pittsburgh he taught an electronics laboratory course for senior students. During the academic year 1973-1974 he was a Teaching Assistant at the American University of Beirut, where he wrote his M.S. thesis on modifying the design of the Hewlett-Packard 3721A Correlator. From June 1974 to April 1975 he was with Geophysical Service International, a subsidiary of Texas Instruments in Dallas, TX. At TI he was involved in system maintenance and in the development of diagnostic programs for the TIMAP system. During the summers of 1972, 1973, and 1975 he worked for the Royal Scientific Society of Jordan. There he was involved in several projects, including the logic design for a laser character recognition machine, the design of a hybrid calculating unit for a speech intelligibility system, and laser distance meters.

From August 1975 until May 1977 he was a Teaching Assistant in the electrical engineering department of the University of Illinois at Urbana-Champaign. He has been a Research Assistant with the Digital Computer Laboratory since May 1977. He is a member of the ACM and the IEEE.

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-78-945
Title and Subtitle: Improving the Performance of Virtual Memory Computers
Report Date: November 1978
Author: Walid Abdul-Karim Abu-Sufah
Performing Organization: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
Contract/Grant No.: US NSF MCS77-27910
Sponsoring Organization: National Science Foundation, Washington, D. C.
Type of Report and Period Covered: Doctoral Dissertation

Abstract: A model for the ideal behavior of a program in a virtual memory system is developed. Algorithms for compiler transformations are designed to force Fortran-like programs to follow this model. The transformations serve the additional purpose of reducing the cost of execution of a program after it is made to follow the model. Preliminary experimental results show that transformed programs are easier to model, simpler to manage, and cheaper to execute. The results show that the transformations have the potential of improving the throughput of a multiprogrammed system by an order of magnitude and the degree of multiprogramming by a factor of five. Very simple memory management policies (like the LRU policy) seem to do as well as, and even better than, elaborate policies (like the Working Set policy) in managing transformed programs. Some measurements of the Working Set anomalies in Fortran programs are also discussed.

Key Words (Descriptors): Memory hierarchy design; Optimizing compilers; Program behavior; Program transformations; Virtual memory
Availability Statement: Release unlimited
Security Class: UNCLASSIFIED
No. of Pages: 260
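The abstract above contrasts simple fixed-allotment policies like LRU with the variable-allotment Working Set policy. As a companion to the LRU sketch in the Appendix, the following is a minimal sketch of the standard working-set rule: a page is resident exactly when it is among the pages touched by the theta most recent references, so the resident-set size varies as the program runs. The code and its window convention are an illustrative assumption, not the measurement tool used in Chapter 4.

    def working_set(trace, theta):
        # Working Set policy with window size theta: a page is resident
        # iff it is among the pages touched by the theta most recent
        # references.  Returns (page_faults, mean_resident_set_size).
        last_ref = {}          # page -> index of its most recent reference
        faults = 0
        total_size = 0
        for t, page in enumerate(trace):
            # Fault if the page was never seen, or if its last reference
            # has already slid out of the window.
            if page not in last_ref or t - last_ref[page] > theta:
                faults += 1
            last_ref[page] = t
            # Resident set after reference t: pages referenced at
            # times t - theta + 1 .. t.
            total_size += sum(1 for r in last_ref.values() if t - r < theta)
        return faults, total_size / len(trace)

    # Toy usage on the same kind of page-number trace as the LRU sketch.
    print(working_set([0, 1, 2, 0, 3, 0, 4, 2], theta=3))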