 
 Report No. UIUCDCS-R-83-1147
 
 UILU-ENG 83 1727 
 
 
 PROGRAM BEHAVIOR UNDER VAX/VMS 
 
 by 
 
 Walid Abu-Sufah and Roland L. Lee 
 
 November 1983 
 
 US NSF-MCS 83-00981 
 US DOE-AC02 81-ER10822 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 
 
 
PROGRAM BEHAVIOR UNDER VAX/VMS 
 
 by 
 
 Walid Abu-Sufah and Roland L. Lee 
 
 Department of Computer Science 
 
 University of Illinois at Urbana-Champaign 
 
 Urbana, Illinois 61801 
 
 This work was supported in part by the Department of Computer Science, University of 
 Illinois at Urbana-Champaign; and in part by the National Science Foundation under 
 Grant No. US NSF-MCS 83-00981, and the U.S. Dept. of Energy under Grant No. 
 DOE-AC02 81-ER10822.
 
ABSTRACT 
 
 Direct measurements on a VAX/VMS system reveal that program behavior has a 
 significant effect on the performance of this system. For a monoprogrammed batch 
 workload the turnaround time of a job can be reduced by up to 50% if its behavior is 
 improved. This is for jobs with virtual space that can fit in physical memory. For larger 
 jobs the improvement can reach a factor of 100. 
 
 In a multiprogramming batch environment improving the behavior of programs 
 increased the throughput of the system by up to 64% for balanced workloads, up to 
 400% for I/O bound workloads and up to 419% for mixes of balanced and I/O bound 
 workloads. Improving the program behavior also reduces the overhead time of 
 automatic memory management. This was measured to reach up to 83%. 
 
 This case study points towards the more general conclusion that program behavior 
 has a significant influence on computer system performance even with the abundance of 
 hardware resources available now and in the future. 
 
 Keywords: 
 
 program behavior, automatic memory management, system performance 
 
1. Introduction 
 
 Recent trends in implementing virtual memory operating systems in the real world 
 assume an abundance of hardware resources -- mainly physical memory and secondary 
 storage. With the drastic drop in the cost of memory systems over the last decade, this
 assumption is valid for minicomputers and more powerful machines [PoAg83], at least
 relative to the resources available in the late sixties or early seventies. System designers consider
 these hardware resources to be the primary constraints on system performance (for a 
 given CPU speed) [LeEc80]. 
 
 Early studies pointed out that another factor controlling the performance of virtual 
 memory computers is program behavior [BrGu68], [Denn68a], [KuLa70], [Elsh74], 
 [Ferr76]. However, in the sixties and seventies, the concern was to have a satisfactory 
 performance with the least possible physical memory on the machine. With the drastic 
 drop in memory cost one wonders if program behavior is of any significance at the 
 present or in the future. This paper is an attempt to answer this question. 
 
 Traditionally most empirical studies of program behavior would use trace-driven 
 simulations (for example see [A1HK80], [ALMY82], [HaPo83], [LaFe83]). A program of 
 interest is executed interpretively, a record is made of each of its memory references, 
 and the address trace that results is used to drive a simulator of a certain environment. 
 The simulation approach has many advantages; mainly exact reproducibility and ease in 
 changing the parameters of the environment being simulated. However, simulations have 
 a major drawback. It would be difficult to use trace-driven simulation to accurately 
 account for the effects of various aspects of modern and future computer systems in our 
 
study. Examples of these aspects are: multiprogramming, the execution of operating 
 system routines for a nontrivial percentage of CPU time, and I/O interference. Our 
 investigation is done by direct measurements on a real machine. A similar approach 
 was followed recently in a study to evaluate the performance of a cache system [Clar83]. 
 The study showed a noticeable difference between the performance of the cache systems
 as anticipated by simulation studies [Smit82] and those obtained by direct measurement 
 on a real machine. 
 
 The case study we use will be the DEC VAX/VMS system [Stre78], [LeEc80].
 Although the exact reproducibility of results using simulation will be lost in our 
 approach, the margin of error is small (we discuss this issue in more detail in section 2). 
 The advantage of our approach is that conclusions will be based on measurements of a 
 real machine with all the complexities of its architecture and operating system. Choosing
 a specific machine and its operating system may seem to limit the generality of our
 conclusions. This may be true for the quantitative parts of our conclusions; however, a
 case study is one legitimate way of exploring the effect of program behavior on system
 performance. This is so specifically because the designers of this system explicitly declare
 that they assume an abundance of hardware resources. This environment is suitable for
 exploring the answer to our question: what is the degree of influence which program 
 behavior still has on the performance of a system with an abundance of resources? 
 
 We performed experiments on the VAX/VMS using two simple programs. The 
 behavior of these programs can be easily changed through one simple transformation, 
 the loop interchange transformation. The improvement of the behavior of these pro- 
 grams varies from drastic to moderate. Due to the simplicity of the programs, we 
 
transformed them manually. However, the automation of this and other transforma- 
 tions has been implemented in the PARAFRASE system of the University of Illinois 
 [Leas76],[Wolf78],[AbKL79], [KKLW80], [AbKL81], [Wolf83]. 
 
 In Section Two we discuss our experimental process; the programs used, their 
 transformed versions, the performance measures used and the experiments conducted. In 
 Section Three we present and discuss the results. In Section Four we present our con- 
 cluding remarks concerning the questions raised in this paper. 
 
 2. The Experimental Environment 
 
 The following is a brief description of the environment in which these experiments 
 were performed. The computer used is a VAX 11/750 running version 3.3 of DEC's 
 operating system, VMS. The virtual memory page size of this system is 512 bytes and 
 the total number of pages of real memory in the system is 4096 pages. The operating 
 system itself occupies approximately 900 pages of main memory. VMS uses a local 
 FIFO replacement algorithm. 
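
 As a rough illustration of local FIFO replacement, the following Python sketch counts
 faults for a reference string when a process owns a fixed number of page frames. It is a
 simplified model and not the VMS implementation: the name `fifo_faults` and the fixed
 frame count are assumptions made for illustration, the frame count must be at least one,
 and the dynamic resizing and page caching described below are ignored.

```python
from collections import deque

def fifo_faults(references, frames):
    """Count page faults under FIFO replacement with a fixed number of frames."""
    resident = set()   # pages currently held by the process
    order = deque()    # arrival order of resident pages, oldest on the left
    faults = 0
    for page in references:
        if page in resident:
            continue                   # hit: FIFO does not reorder on a reference
        faults += 1
        if len(resident) == frames:
            victim = order.popleft()   # evict the page that arrived first
            resident.remove(victim)
        resident.add(page)
        order.append(page)
    return faults

# A cyclic sweep over 4 pages with only 3 frames faults on every reference.
sweep = [0, 1, 2, 3] * 3
print(fifo_faults(sweep, 3), fifo_faults(sweep, 4))
```

 With one frame fewer than the number of pages swept, every reference faults; with enough
 frames only the first touch of each page does. This is the kind of gap that a change in
 reference order can open or close.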
 
 Each user process on the system is given a certain set of main memory pages called 
 the resident set on which to execute (in DEC literature this is called the Working Set of 
 the process). VMS gives an initial resident set of size WSDEFAULT (this parameter 
 can be set by the user) to each process from which it dynamically changes the amount 
 of memory of a process in response to the process' paging rate and the amount of free 
 memory left in the entire system. If the paging rate is above a certain level, PFRATH,
 the operating system increases the size of the process' resident set by WSINC; if,
 however, the paging rate is below a certain level, PFRATL, the operating system decreases
 the amount of memory for the process by WSDEC. The size of the resident set
 of any user process is bounded above by the system parameter WSMAX. Additionally,
 each user process has its own upper limits to the size of its resident set. WSEXTENT is 
 the maximum possible resident set size for the process while WSQUOTA is the max- 
 imum guaranteed resident set size for the process (WSQUOTA must be less than or 
 equal to WSEXTENT). The resident set of a process may exceed its WSQUOTA only 
 when there are more than BORROWLIM number of free pages in the system. The size 
 of the virtual space (in pages) associated with each process is denoted by PVWS. The 
 maximum number of physical pages occupied by a process during its lifetime is denoted 
 by PWSS. 
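
 The adjustment rule described above can be sketched in Python as follows. This is a
 minimal model under stated assumptions, not the VMS code: the function name
 `adjust_resident_set`, the single-step interface, and the floor of zero pages on
 shrinking are hypothetical, and the constants are the values this installation reports
 in Table 1.

```python
# Values reported for this installation in Table 1.
PFRATH = 200      # faults per 10 seconds above which the resident set grows
PFRATL = 100      # faults per 10 seconds below which the resident set shrinks
WSINC = 150       # pages added on growth
WSDEC = 35        # pages removed on shrink
BORROWLIM = 300   # free pages required before a process may exceed WSQUOTA

def adjust_resident_set(size, faults_per_10s, free_pages,
                        wsquota, wsextent, wsmax):
    """Return the new resident set size after one adjustment step."""
    if faults_per_10s > PFRATH:
        # Growth past WSQUOTA is allowed only while free memory is plentiful.
        limit = wsextent if free_pages > BORROWLIM else wsquota
        limit = min(limit, wsmax)
        return max(size, min(size + WSINC, limit))
    if faults_per_10s < PFRATL:
        return max(size - WSDEC, 0)   # assumed floor; the text gives no minimum
    return size

# A heavily faulting process starting at WSDEFAULT = 250 pages grows by WSINC.
print(adjust_resident_set(250, 300, 1000, 1500, 1500, 4096))
```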
 
 In addition to the pages allocated to user processes, the operating system keeps a 
 certain amount of memory free in the free page list and the modified page list to act as 
 a page cache. When a page is released from the resident set of a process, it goes into
 the modified page list if it was modified (and thus requires a disk write) and into the
 free page list otherwise. The operating system
 keeps the size of the free page list above FREELIM pages and makes sure it is at least
 FREEGOAL pages large after each freeing of pages from user processes (freeing in
 response to a memory shortage). The maximum size of the modified page list is
 MPW_HILIMIT and the minimum size is MPW_LOLIMIT. If a process faults and the
 page is in either list, the page is returned to the process' resident set. Such a page fault
 is relatively inexpensive since it can be satisfied without a disk I/O request.
 Besides being part of a paging cache the modified page list acts as a staging buffer for 
 
the clustering of disk writes. This clustering serves to reduce the amount of disk I/O. 
 Pages from the modified page list are written out of memory in clusters of 
 MPW_WRTCLUSTER pages. For each page fault requiring a disk read, a cluster of
 PFCDEFAULT virtually contiguous pages is read into the faulting process' resident
 set.
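
 The distinction between faults served from the page cache and faults that need a disk
 read can be sketched as a toy Python model. The class name `PageCache`, its methods, and
 the labels "cheap" and "expensive" are illustrative assumptions; list size limits
 (FREELIM, FREEGOAL, the modified page list limits) and write clustering are deliberately
 left out.

```python
class PageCache:
    """Toy model of the free page list and the modified page list."""

    def __init__(self):
        self.free_list = set()
        self.modified_list = set()

    def release(self, page, modified):
        """A page leaves a resident set and enters one of the two lists."""
        if modified:
            self.modified_list.add(page)   # must be written to disk before reuse
        else:
            self.free_list.add(page)

    def fault(self, page):
        """Classify a fault: 'cheap' if served from a list, 'expensive' if not."""
        for cached in (self.free_list, self.modified_list):
            if page in cached:
                cached.remove(page)        # page returns to the resident set
                return "cheap"
        return "expensive"                 # requires a disk read

cache = PageCache()
cache.release(7, modified=True)
print(cache.fault(7), cache.fault(7))
```

 The second fault on the same page is expensive in this model because the page has
 already been handed back to the resident set and is no longer on either list.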
 
 VMS also swaps entire working sets between memory and disk. VMS checks the 
 nonresident executable queues to find the highest priority process to be swapped in. 
 Once a process is selected, the operating system must find enough free pages to hold the
 process' resident set. There are three ways to obtain these free pages: the first is to take
 them from the free list, the second is to do a disk write from the modified page list and
 thus free those pages, and the third is to swap out a process of lower or equal priority.
 Once swapped in, the process is guaranteed at least one quantum before it becomes eligible
 to be swapped out.
 
 For a more detailed description of the memory management in VMS see [LeLi82]. 
 Table 1 shows the values of the system parameters used in this installation. The values 
 assigned to these parameters are those used by the system manager of the site to suit 
 the workload of the machine. The site is a software house with a day workload consisting
 mainly of the interactive development of vectorizing compilers for supercomputers,
 while at night the production runs consist of compilation jobs. We were not free to
 vary the values of the system parameters nor did we intend to change them. We felt 
 that for the investigation we are doing in this paper, we should not be concerned with 
 tuning issues. The designers of the system do not advocate the idea of putting a lot of 
 effort in tuning the system. Instead, they advocate the use of the default values for the 
 
system parameters while adding more hardware resources whenever the workload out- 
 grows the system [DEC82]. This implies that with sufficient hardware resources, the sys- 
 tem performance is satisfactory with the default system parameter values. This paper 
 examines this claim from the point of view that considers the effect of program 
 behavior. Other researchers have shown that tuning an early version of this system 
 drastically improved its performance [Lazo79]. 
 
 Table 1. System Parameters

 Parameter         Value
 BORROWLIM         300 pages
 FREEGOAL          500 pages
 FREELIM           100 pages
 MPW_HILIMIT       500 pages
 MPW_LOLIMIT       100 pages
 MPW_WRTCLUSTER    96 pages
 PFCDEFAULT        32 pages
 PFRATH            200 faults/10 secs
 PFRATL            100 faults/10 secs
 QUANTUM           200 ms
 WSINC             150 pages
 WSDEC             35 pages
 
 The following two programs (coded in FORTRAN) were used for the experiments. 
 The first program, ADD, sums up the values of each row of a square matrix. The second 
 program, MAD, is a matrix addition of two square matrices. In these programs, the 
 matrices are referenced by rows. Additionally, we have transformed versions of each of 
 the programs called TADD and TMAD. The transformation applied to these programs is 
 loop interchange. Due to the loop interchange transformation, the matrices are referenced
 by columns. Each of these programs was compiled in eight versions,
 distinguished by the problem size. <PROG>1 is the version with a 128 by 128 matrix,
 <PROG>2 for 256 by 256, <PROG>3 for 384 by 384, up to <PROG>8 for 1024 by
 1024 where <PROG> is one of {ADD, MAD, TADD, TMAD}.
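
 The effect of loop interchange on paging can be illustrated with a small Python sketch.
 FORTRAN stores arrays by columns, so a row-wise traversal strides through memory while a
 column-wise traversal is sequential. The report does not list the FORTRAN source, so the
 loop nest below is only an assumed reconstruction of the reference order; the names
 `page_of` and `page_switches` are hypothetical, and the page size of 128 words is the
 figure the report itself uses.

```python
WORDS_PER_PAGE = 128   # 512-byte pages holding 4-byte words

def page_of(i, j, n):
    """Page holding element (i, j) of an n-by-n array stored by columns (0-based)."""
    return (j * n + i) // WORDS_PER_PAGE

def page_switches(n, by_rows):
    """Count references that touch a different page than the previous reference."""
    switches = 0
    last_page = None
    for outer in range(n):
        for inner in range(n):
            # by_rows: the inner loop walks along a row (column index varies fastest).
            i, j = (outer, inner) if by_rows else (inner, outer)
            page = page_of(i, j, n)
            if page != last_page:
                switches += 1
                last_page = page
    return switches

n = 128
print(page_switches(n, by_rows=True), page_switches(n, by_rows=False))
```

 For the 128 by 128 case every row-wise reference lands on a new page, while the
 column-wise order changes page only once per column. The count of page switches is not a
 fault count, but it indicates how differently the two orders stress the resident set.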
 
 The information about the resource usage of each of the programs is reported 
 through the accounting log files. This log contains the time at which the process ter- 
 minated, the number of I/O requests serviced, the number of page faults, the peak size 
 of the resident set during the execution of the process (PWSS), the peak virtual memory 
 space in pages (PVWS) allocated to the process, the elapsed CPU time, and the elapsed 
 real time. 
 
 The first experiment performed was to run each program in a monoprogramming
 batch mode. Each program was run once for each size of the data array both for the
 original version and for the transformed version. The elapsed time for each program
 was noted and the ratio of the elapsed times of the original and transformed versions
 was computed for each size. The purpose of this experiment is to illustrate the
 effectiveness of the transformations on programs that would normally be monoprogrammed
 on a VAX machine; programs with large array sizes seem to
 reflect such a workload. The processes created for this experiment have the following
 parameter settings: the WSDEFAULT is set to 250 pages, the WSQUOTA is set
 to 1500 pages, and the WSEXTENT is set to 1500 pages.
 
 The second experiment examines the effectiveness of the transformation on pro- 
 grams that would normally be run in a multiprogramming environment. The original 
 and transformed versions of the programs at a smaller data array size were used in these 
 experiments. Each version of the program was multiprogrammed with 
 
multiprogramming level (MPL) varying from one to six. MPL copies of the program 
 were started at the same time. The time per job at each of the MPL's was compared 
 between original and transformed versions of the program. We also multiprogrammed 
 the two programs ADD and MAD by submitting three copies of each to the system. A 
 similar run was done for the transformed programs. The processes created for this 
 experiment have the same parameter settings as those used for experiment one. 
 
 The third experiment attempts to measure the reduction of automatic memory 
 management cost due to improved program behavior. This is done by measuring the 
 time of running an original program versus a transformed program when physical 
 memory allocated to the process of the original program is greater than the virtual space 
 needed by the process. The original version of the program was run with
 WSDEFAULT = WSQUOTA = WSEXTENT set to a value greater than the PVWS of the
 program. The transformed program was run three times, once for each of the following
 conditions: (i) WSDEFAULT = WSQUOTA = WSEXTENT = PVWS, (ii)
 WSDEFAULT = WSQUOTA = WSEXTENT = PWSS, and (iii) restricted memory
 (WSDEFAULT = WSQUOTA = WSEXTENT = 213). ADD3 and its transformed counterpart
 were picked for these experiments. This is the largest version of ADD whose PVWS
 can fit entirely in physical memory.
 
 In order to make the conditions for all experiments as controllable as possible, the 
 experiments were run only when no other user processes were executing. However, some 
 measure of uncertainty is unavoidable. One of the reasons comes from the interference 
 caused by the execution of the operating system routines. The experiments were run 
 from a batch queue that operated when all users were logged off. Synchronization within 
 
 
 this queue was done in order to ensure that no other pending job would start until the
 current experiment completed. 
 
 Besides the operating system processes, there is a process that executes at the time 
 of the experiments that controls the submitting of jobs to be run. This process introduces
 a slight interference with the execution of the experiments. Also, the job controlling
 the simultaneous submitting of multiprogrammed jobs actually starts the jobs
 serially. The delays in submitting jobs and the interference of
 the job submitting program account for some of the slight perturbations in the results
 between repeated instances of the experiments. These differences have been computed in
 several reruns for several of the experiments and have been found to be of no significant 
 consequence. 
 
 3. Experimental Results 
 
 3.1. Experiment One 
 
 As mentioned in Section Two the purpose of this experiment is to investigate the 
 effect of program behavior on the performance of a monoprogrammed VAX/VMS sys- 
 tem. It is the recommendation of the designers of the VAX/VMS to run large produc- 
 tion jobs in a batch monoprogramming mode [DEC82]. In fact, these types of jobs are 
 executed in this manner. The effect of program behavior on the performance of the system
 in such an environment is probably best measured by its effect on the turnaround
 time of jobs. Other measures of interest are the PWSS and the number of page faults 
 generated. Now we discuss the results we obtained for programs ADD and MAD. 
 
 
 3.1.1. Program ADD 
 
 Table 2 shows the space statistics for this program for different array sizes. For the 
 version where the two dimensional array is 128 by 128 (version 1) the space occupied by 
 array elements is 129 pages (the page size in real words is 128, hence 128 pages are occu- 
 pied by the two dimensional array and one page by the result vector). For a matrix size 
 of 1024 by 1024 (version 8) we need 8*1024 + 8 = 8200 pages. Note that the code and 
 the scalar storage requirements are negligible. Hence, the 149 page difference between 
 the occupied virtual process space, PVWS, and its array data pages is the part of virtual 
 space allocated to the process for its control region (for user mode stack and other pro- 
 tected process specific data and code as well as for the stacks of higher access modes 
 such as the supervisor, executive, and kernel) [LeEc80]. This control space was measured 
 to be constant for all the programs in all the experiments at 149 pages. This is more 
 than 50 percent of the total process space for program ADD at array size of 128 by 128 
 and is independent of the problem size. 
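
 The page counts quoted above follow from simple arithmetic, which the Python sketch
 below reproduces for all eight versions of ADD. It assumes that the matrix and the
 result vector are each rounded up to whole pages; the function names are hypothetical,
 and the 149-page control region is the measured figure from the text, not a derived one.

```python
import math

WORDS_PER_PAGE = 128    # words per 512-byte page
CONTROL_PAGES = 149     # measured control region size reported in the text

def add_data_pages(n):
    """Pages for an n-by-n matrix plus an n-element result vector."""
    matrix_pages = math.ceil(n * n / WORDS_PER_PAGE)
    vector_pages = math.ceil(n / WORDS_PER_PAGE)
    return matrix_pages + vector_pages

def add_pvws(n):
    """Total virtual space: array data pages plus the control region."""
    return add_data_pages(n) + CONTROL_PAGES

# Versions 1 through 8 use matrix sizes 128, 256, ..., 1024.
for version, n in enumerate(range(128, 1025, 128), start=1):
    print(version, n, add_data_pages(n), add_pvws(n))
```

 The values printed for data pages and PVWS can be compared against the corresponding
 columns of Table 2.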
 
 From Table 2 we notice that for a given problem size the difference between the 
 PWSS for ADD and TADD is less than 1% for versions 1 through 4 and between 6% 
 and 21% for versions 5 through 8. Moreover, for some versions TADD.PWSS > 
 ADD.PWSS and for others ADD.PWSS > TADD.PWSS. This indicates a failure of the 
 automatic techniques implemented in VMS to take advantage of the extreme difference 
 between the behavior of ADD and TADD and the difference in their space requirements. 
 Optimally, TADD would only need a maximum of 16 (taken from analysis of version 8)
 array data pages at any given time during its execution for a total of 165 pages of process
 space. It is not within the scope of this paper to discuss why the mechanisms of
 
 
 VMS failed to distinguish between the behavior of the two programs. The important
 observation is that this failure occurred irrespective of whether the process space can fit
 in physical memory or not, and despite the fact that TADD's fault rate is much lower than
 ADD's fault rate. This is shown in Table 3.
 
 Table 3 also shows the turnaround time of each program and the ratio of turnaround
 times for the original and transformed programs. Note that the improvement in
 turnaround time is up to 50 percent for versions 1 to 4. For these programs the system
 can keep the pages of the user process in physical memory (in the process resident set, 
 free page list, and modified page list). For versions 5 to 8, the main memory of the 
 machine is not large enough to hold most of the virtual memory space of program ADD. 
 Disk page faults are now more frequent than for the smaller versions of ADD and the 
 improvement ratio jumps to around 400 percent. The table also shows the improvement 
 in the number of page faults. Since the number of page faults includes both cheap and
 expensive ones, we feel that the page fault ratio is less indicative of the improvement 
 than the ratio of turnaround times. Figure 1 shows a plot of this ratio for program 
 ADD versus the result vector length (problem size). Notice the step in the curve when 
 most of the process' virtual space does not fit in core. 
 
 Our conclusion from Table 3 is that the behavior of programs like ADD is still a 
 major factor in VAX/VMS performance in a monoprogrammed batch mode. Improving 
 the behavior of programs through compile time transformations can reduce the tur- 
 naround time by a factor of 4 without the need for expanding the main memory size. 
 Even if the main memory size was increased to adequately hold almost all of the virtual 
 process space, program transformation can reduce turnaround time by an appreciable 
 
 
 percentage (50% was measured). 
 
 3.1.2. Program MAD 
 
 This program adds two 2-dimensional matrices and stores the result in a third 
 matrix. The space needed for array elements is three times the space needed for ADD;
 however, the amount of arithmetic is the same. Thus, one can consider this program to
 be I/O (or paging) bound relative to ADD. Tables 4 and 5 show measurements for MAD 
 similar to Tables 2 and 3 for ADD. 
 
 Table 4 supports the conclusion reached in the previous section about the inability 
 of VMS to differentiate between MAD and a drastically better behaved TMAD. The 
 differences between the PWSS of the original and transformed programs are 
 insignificant. 
 
 As for the improvement in the turnaround time, we notice in Table 5 that it grows 
 rapidly from a factor of 3.3 to 64.88 as soon as the process virtual space is too large to 
 fit in core (versions 3 to 6). For programs that can fit in core the improvement is in the 
 range of 20%. Figure 2 illustrates the pronounced improvement that the transformed 
 version gives when the program does not fit in core. Note that the amount of additional
 memory needed to make this program fit in core is three times what is needed to fit program
 ADD in core. In contrast with the approach of drastically expanding memory to make
 running such a program practical on such a machine, the transformation approach
 allows a drastic improvement in system performance without the need for expanding the
 memory. This program should run with 167 pages with almost the same performance
 (for version 6). 
 
 
 3.2. Experiment Two 
 
 Most VAX/VMS systems are actually used in a multiprogramming interactive 
 environment for an appreciable percentage of the time. This kind of a workload would
 consist of program development, debugging, running system facilities such as the mailing
 system, and running interactive systems such as data base systems. In this section
 we try to study the effect of program behavior on system performance when the
 machine is multiprogrammed. For a workload with the two programs discussed in this
 paper, one would expect jobs used in a multiprogrammed environment to have
 smaller array sizes than those used in a monoprogrammed environment. One can think
 of the runs in the multiprogramming mode as test runs. When numerical programs pass 
 the test runs, they are scheduled for execution with realistically large problem sizes in a 
 batch monoprogrammed mode. 
 
 3.2.1. Program ADD 
 
 We chose ADD2 (256 by 256 matrix) for our multiprogramming experiment. The
 total virtual process space for this program is 663 pages, which is reasonable for such an
 environment; it is not trivial and at the same time not so large as to require
 monoprogramming.
 
 Table 6 shows the time per job for MPL varying from one to six for ADD and 
 TADD. The table also shows the total virtual space of user processes for each MPL. 
 
 Examining the time per job for ADD, we notice that the system is being multipro- 
 grammed in a rather optimal way. The system is operating at the maximal flat portion 
 of the throughput (and CPU utilization) versus degree of multiprogramming model 
 
 
 curve [DKLP76] (see Figure 3 and Figure 4). We see that the system is multiprogram- 
 ming six processes with a total of 4000 pages of virtual space while the time per job is 
 approximately equal to the time for monoprogramming the job. Table 3 shows that ver- 
 sion 5 of ADD (with PVWS = 3354 pages) is thrashing. From this we conclude that 
 multiprogramming is very effective under VAX/VMS. The same conclusion is reached by 
 examining the time per job for multiprogramming MPL copies of the transformed pro- 
 gram TADD. 
 
 Notice that the throughput of the system improves when the transformed workload 
 is run. The ratio of time/job of ADD to TADD ranges from 1.24 to 1.35 with an average 
 and median of 1.29. Table 7 and Figure 5 present the results for this experiment with
 version 3 of ADD. The improvements are more pronounced in this case. This workload 
 is an I/O-CPU balanced workload. Next we examine multiprogramming an I/O bound 
 workload. 
 
 3.2.2. Program MAD 
 
 As discussed earlier this program generates more page faults and does the same 
 amount of computation as ADD. Table 8 shows our findings for multiprogramming up 
 to six copies of MAD2 and TMAD2. 
 
 We notice that for the untransformed workload the system is thrashing [Denn68b]. 
 We are seeing the negative slope part of the throughput (CPU utilization) versus mul- 
 tiprogramming level function. The time per job increases from .201 minutes in the 
 monoprogramming case to .819 minutes at MPL=6. This is a factor of four. Improving 
 the behavior of the workload reduces the thrashing effect significantly. Figure 6 shows 
 
 
 that MAD2 is in the negative slope region of the model multiprogramming curve (Figure 
 3) while TMAD2 is at the flat region. The ratio of time per job at MPL = 6 to the 
 uniprogrammed case is 1.6 for TMAD2. Moreover, we see that the ratio of the time per 
 job for the original workload to that of the transformed workload ranges from 1.20
 to 3.82 with an average of 2.63 and median of 2.82. This is a very significant improvement
 in the throughput and utilization of the machine.
 
 3.2.3. Mix of ADD and MAD 
 
 In this experiment three copies of ADD5 and three copies of MAD3 were multipro- 
 grammed together. We repeated this experiment using transformed programs with three 
 copies of TADD5 and three copies of TMAD3. The total time to finish the original versions
 of the programs was 8.25 minutes while the total time to finish the transformed
 versions was 1.97 minutes, giving an improvement by a factor of 4.19. Notice that this
 improvement is greater than either of the improvements in independently monoprogramming
 TADD5 over ADD5 (3.63) or TMAD3 over MAD3 (3.30).
 
 3.3. Experiment Three 
 
 In this experiment we consider the question of whether program behavior affects 
 the overhead associated with automatic memory management. To be conservative we 
 picked the more balanced program, ADD, for this experiment rather than the paging 
 bound program MAD. To obtain results whose experimental error percentage is negligi- 
 ble, we chose the largest version of ADD with memory requirements that can fit in core. 
 This version of ADD is ADD3 with PVWS=1304 pages. 
 
 
 We ran ADD3 in a monoprogrammed batch mode with WSDEFAULT = 
 WSQUOTA = WSEXTENT = 1500 pages. Thus, it was allocated real space greater 
 than its logical space. Then we ran TADD3 with three different settings of these param- 
 eters. 
 
 (A) WSDEFAULT = WSQUOTA = WSEXTENT = 1500 PAGES 
 
 In this case, both the original and transformed programs were allocated the same 
 amount of real memory, enough to exceed their logical space size. The turnaround time 
 of TADD3 was .153 minutes while for ADD3 it was .280 minutes. The ratio of the two 
 times is 1.83. This is a very significant reduction in automatic memory management 
 overhead. 
 
 (B) WSDEFAULT = WSQUOTA = WSEXTENT = PVWS (1304 PAGES) 
 
 The real space allocated to TADD3 is exactly equal to its virtual space. The tur- 
 naround time was practically identical to that in the previous case and hence the 
 improvement was also identical. 
 
 (C) WSDEFAULT = WSQUOTA = WSEXTENT = 213 PAGES 
 
 In this case we reduce the memory allocation of TADD3 to 213 pages. This is 
 enough space to hold the 149 pages used by the system and a cluster of 32 pages when a 
 fault to the two dimensional array occurs plus another 32 pages when a fault to the 
 result vector occurs. The turnaround time was .155 minutes. The ratio of turnaround 
 time of the original program to this time is 1.81. Note that the space has been reduced
 by a factor of 1500/213 = 7.0. 
 
 
 From these results we note that the reduction in the automatic memory manage- 
 ment overhead due to improved program behavior is very significant. 
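
 The figures quoted for cases (A) and (C) can be re-derived with a few lines of Python.
 The sketch uses only numbers stated in this report (turnaround times, the 149-page
 control region, and the cluster size PFCDEFAULT); the variable names are illustrative
 assumptions.

```python
CONTROL_PAGES = 149     # control region size reported in Section 3.1.1
PFCDEFAULT = 32         # page fault cluster size from Table 1

# Case (C): control region plus one cluster for the matrix and one for the result vector.
restricted_pages = CONTROL_PAGES + 2 * PFCDEFAULT

add3_time = 0.280                        # minutes, ADD3 with 1500 pages
tadd3_time = {"A": 0.153, "C": 0.155}    # minutes, TADD3 in cases (A) and (C)

# Turnaround ratio of the original program to the transformed one, per case.
ratios = {case: round(add3_time / t, 2) for case, t in tadd3_time.items()}

# Memory reduction between the 1500-page allocation and the restricted one.
space_reduction = round(1500 / restricted_pages, 1)

print(restricted_pages, ratios, space_reduction)
```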
 
 4. Concluding Remarks 
 
 The experimental data presented in this paper shows that program behavior can 
 still have a major influence on computer system performance. The point that this paper
 makes is that program behavior cannot simply be ignored by system designers. It is certainly
 true that having an abundance of hardware resources -- large fast main memories
 and high bandwidth I/O -- helps to improve the performance of computer systems. However,
 these hardware resources do not eliminate the influence of program behavior.
 Moreover, user demands on memory space and processor speed of computer systems are
 always greater than what manufacturers supply. With time this demand is growing at a
 more rapid pace than the growth of hardware resources.
 
 The conclusions presented in this paper are based on a case study of a VAX
 11/750 system running VMS. Two simple example programs were used. Both do the same amount
 of arithmetic; however, one requires more space and has higher paging activity. The
 behavior of these programs can be drastically improved through simple compile time
 transformations. In a production, monoprogrammed, batch environment, the turnaround
 time of untransformed programs can be 1.5 times the turnaround time of the
 transformed programs. This is assuming that there is enough main memory on the
 machine to hold all of the program's virtual space. When the logical space of programs
 outgrows the main memory, this ratio can jump to 4.0 for balanced jobs and approach
 100 for paging bound jobs.
 
 
 In a multiprogrammed batch environment improving the behavior of programs
 achieved two goals. First, the system was moved from a thrashing state to a maximum
 resource utilization state. Second, a pronounced improvement in throughput was
 achieved. A factor of 1.64 was measured for a balanced load, 3.82 for an I/O bound load
 and 4.19 for a mixed load.
 
 Improving program behavior also leads to reducing the system overhead associated 
 with automatic memory management. A factor of 1.83 was measured when both the original
 and transformed versions of the program were given the same amount of
 memory. Additionally, a factor of 1.81 was measured when the transformed version used
 only one seventh of the amount of memory.
 
 Much more extensive measurement and experimentation is needed to
 produce results that apply to a wide class of application programs. However, the
 results presented in this paper and the more comprehensive work discussed in [AbKL79]
 and [AbKL81] show that compile-time transformations have real and substantial potential
 for improving the performance of paged computer systems.
 
 Acknowledgements
 
 The authors would like to thank the following for contributing
 to this study: Pen-Chung Yew for his input in the early stages of the study; KAI for
 providing the machine time to perform the experiments; and Thomas Macke, Mike Wolfe,
 and especially Jim Davies, all of KAI, for their assistance in performing the
 experiments. Special thanks go to David Kuck for his valuable support throughout the
 course of this study.
 
 Table 2 Memory Requirements for Programs ADD and TADD

 Version  Size  Data pages (DP)  PVWS  PVWS - DP  ADD PWSS  TADD PWSS
 1        128   129              278   149        230       232
 2        256   514              663   149        583       550
 3        384   1155             1304  149        1221      1226
 4        512   2052             2201  149        1080      1095
 5        640   3205             3354  149        1300      1093
 6        768   4614             4763  149        1241      1388
 7        896   6279             6428  149        1300      1387
 8        1024  8200             8349  149        1404      1103
 
 Table 3 Execution Statistics for Programs ADD and TADD

 Version  Size  Time (min.)   Time   Pagefaults    Pagefault
                ADD    TADD   ratio  ADD    TADD   ratio
 1        128   .058   .059   .98    343    343    1.00
 2        256   .115   .087   1.32   918    849    1.08
 3        384   .213   .141   1.51   2016   1811   1.11
 4        512   .363   .254   1.43   4742   4343   1.12
 5        640   1.55   .427   3.63   7342   4013   1.83
 6        768   2.29   .601   3.80   10374  5297   1.96
 7        896   3.21   .806   3.98   13682  6961   1.97
 8        1024  3.98   1.03   3.88   17859  8949   2.00
 
 Table 4 Memory Requirements for Programs MAD and TMAD

 Version  Size  Data pages (DP)  PVWS   PVWS - DP  MAD PWSS  TMAD PWSS
 1        128   384              533    149        451       451
 2        256   1536             1685   149        1500      1293
 3        384   3456             3605   149        1500      1387
 4        512   6144             6293   149        1500      1500
 5        640   9600             9749   149        1477      1396
 6        768   13824            13973  149        1500      1500
 
 Table 5 Execution Statistics for Programs MAD and TMAD

 Version  Size  Time (min.)    Time   Pagefaults      Pagefault
                MAD     TMAD   ratio  MAD      TMAD   ratio
 1        128   .086    .078   1.09   871      692    1.26
 2        256   .199    .167   1.19   3248     1945   1.67
 3        384   1.69    .512   3.30   8467     3804   2.23
 4        512   17.14   .903   18.70  1427244  6777   210.6
 5        640   52.55   1.28   41.20  2456614  10472  234.6
 6        768   141.43  2.18   64.88  3539429  14930  237.1
 
 Table 6 Time Per Job for Multiprogramming ADD2 and TADD2

 MPL  ADD2 t (min.)  TADD2 t (min.)  Ratio  Total PVWS
 1    .116           .086            1.35   663
 2    .109           .083            1.31   1326
 3    .114           .089            1.28   1989
 4    .113           .091            1.24   2652
 5    .115           .088            1.31   3315
 6    .111           .083            1.25   3978
 
 Table 7 Time Per Job for Multiprogramming ADD3 and TADD3

 MPL  ADD3 t (min.)  TADD3 t (min.)  Ratio  Total PVWS
 1    .236           .144            1.64   1304
 2    .268           .178            1.51   2608
 3    .252           .160            1.58   3912
 4    .263           .162            1.62   5216
 5    .273           .170            1.61   6520
 6    .276           .169            1.63   7824
 
 Table 8 Time Per Job for Multiprogramming MAD2 and TMAD2

 MPL  MAD2 t (min.)  TMAD2 t (min.)  Ratio  Total PVWS
 1    .201           .168            1.20   1685
 2    .591           .216            2.74   3370
 3    .534           .246            2.17   5055
 4    .932           .244            3.82   6740
 5    .822           .281            2.93   8425
 6    .819           .283            2.89   10110
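
 The Total PVWS columns of Tables 6, 7 and 8 appear to be the single-job PVWS from
 Tables 2 and 4 multiplied by the multiprogramming level. The short sketch below checks
 that reading against the tabulated values. The dictionary and function names are
 hypothetical, and the interpretation is an inference from the numbers rather than a
 definition taken from the text.

```python
# Single-job PVWS (pages), copied from Table 2 (ADD versions 2 and 3)
# and Table 4 (MAD version 2).
SINGLE_JOB_PVWS = {"ADD2": 663, "ADD3": 1304, "MAD2": 1685}

# "Total PVWS" columns of Tables 6, 7 and 8 for MPL = 1..6.
TOTAL_PVWS = {
    "ADD2": [663, 1326, 1989, 2652, 3315, 3978],
    "ADD3": [1304, 2608, 3912, 5216, 6520, 7824],
    "MAD2": [1685, 3370, 5055, 6740, 8425, 10110],
}


def total_pvws(program, mpl):
    """Aggregate demand of `mpl` identical copies of `program`, in pages."""
    return mpl * SINGLE_JOB_PVWS[program]


consistent = all(
    total_pvws(program, mpl) == listed
    for program, column in TOTAL_PVWS.items()
    for mpl, listed in enumerate(column, start=1)
)
print(consistent)  # prints: True
```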
 
 [Plot not reproduced: RATIO (vertical axis) versus SIZE (horizontal axis, 0 to 1200) for ADD and TADD]

 Figure 1 Turnaround Time Ratio for Original and Transformed ADD
 
 [Plot not reproduced: RATIO (vertical axis, 20 to 80) versus SIZE (horizontal axis, 200 to 800) for MAD and TMAD]

 Figure 2 Turnaround Time Ratio for Original and Transformed MAD
 
 [Plot not reproduced: throughput (vertical axis) versus MPL (horizontal axis)]

 Figure 3 Throughput vs. MPL Model Curve
 
 [Plot not reproduced: time per job in minutes (vertical axis) versus MPL (horizontal axis), with curves for ADD2 and TADD2]

 Figure 4 Time/Job vs. MPL for Programs ADD2 and TADD2
 
 [Plot not reproduced: time per job in minutes (vertical axis) versus MPL (horizontal axis), with curves for ADD3 and TADD3]

 Figure 5 Time/Job vs. MPL for Programs ADD3 and TADD3
 
 [Plot not reproduced: time per job in minutes (vertical axis) versus MPL (horizontal axis) for MAD2 and TMAD2]

 Figure 6 Time/Job vs. MPL for Programs MAD2 and TMAD2
 
 REFERENCES 
 
 [AbKL79] W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie, "Automatic Program
 Transformations for Virtual Memory Computers," Proc. of the 1979
 National Computer Conf., pp. 969-974, June 1979.

 [AbKL81] W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie, "On the Performance
 Enhancement of Paging Systems through Program Analysis and
 Transformations," IEEE Trans. on Computers, Vol. C-30, No. 5, pp. 341-356, May
 1981.

 [AlHK80] T. O. Alanko, H. J. Haikala, and P. H. Kutvonen, "Methodology and
 Empirical Results of Program Behavior Measurements," Performance 80,
 ACM Sigmetrics Performance Evaluation Review, Vol. 9, No. 2, pp. 55-66,
 Summer 1980.

 [ALMY82] W. Abu-Sufah, R. Lee, M. Malkawi, and P-C. Yew, "Experimental
 Results on the Paging Behavior of Numerical Programs," Proc. of the 6th
 International Conf. on Software Engineering, pp. 110-117, September 1982.

 [BrGu68] B. S. Brawn and F. G. Gustavson, "Program Behavior in a Paging
 Environment," Fall Joint Computer Conference, pp. 1019-1032, 1968.

 [Clar83] D. W. Clark, "Cache Performance in the VAX-11/780," ACM Trans. on
 Computer Systems, Vol. 1, No. 1, pp. 24-37, Feb. 1983.

 [DEC82] "VAX/VMS System Management and Operations Guide," Digital
 Equipment Corporation, Maynard, Massachusetts, Order #AAM547A-TE, May
 1982.

 [Denn68a] P. J. Denning, "Working Set Model for Program Behavior," Comm. of the
 ACM, Vol. 11, No. 5, pp. 323-333, May 1968.

 [Denn68b] P. J. Denning, "Thrashing: Its Causes and Prevention," Proc. of 1968
 FJCC, pp. 915-922, 1968.

 [Denn80] P. J. Denning, "Working Sets Past and Present," IEEE Trans. on
 Software Engineering, Vol. SE-6, No. 1, pp. 64-84, Jan. 1980.

 [DKLP76] P. J. Denning, K. C. Kahn, J. Leroudier, D. Potier and R. Suri, "Optimal
 Multiprogramming," Acta Informatica, Vol. 7, pp. 197-216, 1976.

 [Elsh74] J. L. Elshoff, "Some Programming Techniques for Processing
 Multi-Dimensional Matrices in a Paging Environment," Proc. of the National
 Computer Conf., pp. 185-193, 1974.

 [Ferr76] D. Ferrari, "The Improvement of Program Behavior," Computer, Vol. 9,
 No. 11, pp. 39-47, Nov. 1976.

 [HaPo83] H. J. Haikala and H. Pohijanlahti, "On the BLI-Model of Program
 Behavior," Proc. of the 1983 ACM SIGMETRICS Conf. on Measurement
 and Modeling of Computer Systems, pp. 28-38, August 1983.

 [KKLW80] D. J. Kuck, R. H. Kuhn, B. Leasure, and M. Wolfe, "The Structure of an
 Advanced Vectorizer for Pipelined Processors," Proc. of the 4th
 International Computer Software and Applications Conf., pp. 709-715, Oct. 1980.

 [KuLa70] D. J. Kuck and D. H. Lawrie, "The Use and Performance of Memory
 Hierarchies: A Survey," Software Engineering, Vol. 1, J. T. Tou, ed., pp.
 45-77, Academic Press, New York, 1970.

 [LaFe83] E. J. Lau and D. Ferrari, "Program Restructuring in a Multilevel Virtual
 Memory," IEEE Trans. on Soft. Eng., Vol. SE-9, No. 1, pp. 69-79, Jan.
 1983.

 [Lazo79] E. D. Lazowska, "The Benchmarking, Tuning and Analytic Modeling of
 VAX/VMS," Proc. of the 1979 Conf. on Simulation, Measurement and
 Modeling of Computer Systems, pp. 57-63, 1979.

 [Leas76] B. R. Leasure, "Compiling Serial Languages for Parallel Machines," M.S.
 thesis, Univ. of Illinois, Dept. of Computer Science, DCS Rpt. No. 76-805,
 Nov. 1976.

 [LeEc80] H. M. Levy and R. H. Eckhouse, Jr., "Computer Programming and
 Architecture - The VAX-11," Digital Press, 1980.

 [LeLi82] H. M. Levy and P. H. Lipman, "Virtual Memory Management in the
 VAX/VMS Operating System," Computer, Vol. 15, No. 5, pp. 35-41,
 March 1982.

 [PoAg83] A. V. Pohm and O. P. Agrawal, "High-Speed Memory Systems," Reston
 Publishing Company, Virginia, 1983.

 [Smit82] A. J. Smith, "Cache Memories," ACM Computer Surv., Vol. 14, No. 3, pp.
 473-530, Sept. 1982.

 [Stre78] W. D. Strecker, "VAX-11/780 - A Virtual Address Extension to the DEC
 PDP-11 Family," Proc. of the 1978 National Computer Conf., pp. 967-980,
 1978.

 [Wolf78] M. J. Wolfe, "Techniques for Improving the Inherent Parallelism in
 Programs," M.S. thesis, Univ. of Illinois, Dept. of Computer Science, DCS
 Rpt. No. 78-929, July 1978.

 [Wolf83] M. J. Wolfe, "Optimizing Supercompilers for Supercomputers," Ph.D.
 thesis, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign,
 1983.
 
BIBLIOGRAPHIC DATA SHEET

 1. Report No.: UIUCDCS-R-83-1147
 3. Recipient's Accession No.:
 4. Title and Subtitle: Program Behavior Under VAX/VMS
 5. Report Date:
 7. Author(s): Walid Abu-Sufah and Roland L. Lee
 8. Performing Organization Rept. No.: UIUCDCS-R-83-1147
 9. Performing Organization Name and Address:
 University of Illinois at Urbana-Champaign
 Department of Computer Science
 Urbana, IL 61801-2987
 10. Project/Task/Work Unit No.:
 11. Contract/Grant No.:
 US NSF-MCS 83-00981
 US DOE-AC02 81-ER10822
 12. Sponsoring Organization Name and Address:
 Dept. of Computer Science, University of Illinois at
 Urbana-Champaign; the National Science Foundation and
 the U.S. Dept. of Energy, Washington, DC
 13. Type of Report & Period Covered: Technical Report
 14.
 15. Supplementary Notes:

 16. Abstracts

 Direct measurements on a VAX/VMS system reveal that program behavior has a
 significant effect on the performance of this system. For a monoprogrammed batch
 workload the turnaround time of a job can be reduced by up to 50% if its behavior
 is improved. This is for jobs with virtual space that can fit in physical memory.
 For larger jobs the improvement can reach a factor of 100.

 In a multiprogramming batch environment improving the behavior of programs increased
 the throughput of the system by up to 64% for balanced workloads, up to 400% for
 I/O bound workloads and up to 419% for mixes of balanced and I/O bound workloads.
 Improving the program behavior also reduces the overhead time of automatic memory
 management. This was measured to reach up to 83%.

 This case study points towards the more general conclusion that program behavior
 has a significant influence on computer system performance even with the abundance
 of hardware resources available now and in the future.

 17. Key Words and Document Analysis. 17a. Descriptors
 program behavior
 automatic memory management
 system performance

 17b. Identifiers/Open-Ended Terms:
 17c. COSATI Field/Group:
 18. Availability Statement: Release Unlimited
 19. Security Class (This Report): UNCLASSIFIED
 20. Security Class (This Page): UNCLASSIFIED
 21. No. of Pages: 35
 22. Price:

 FORM NTIS-35 (10-70)    USCOMM-DC 40329-P71