Report No. UIUCDCS-R-83-1147
UILU-ENG 83 1727

PROGRAM BEHAVIOR UNDER VAX/VMS

by
Walid Abu-Sufah and Roland L. Lee

November 1983

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

This work was supported in part by the Department of Computer Science, University of Illinois at Urbana-Champaign; and in part by the National Science Foundation under Grant No. US NSF-MCS 83-00981, and the U.S. Dept. of Energy under Grant No. DOE-AC02 81-ER10822.

Digitized by the Internet Archive in 2013: http://archive.org/details/programbehavioru1147abus

ABSTRACT

Direct measurements on a VAX/VMS system reveal that program behavior has a significant effect on the performance of this system. For a monoprogrammed batch workload, the turnaround time of a job whose virtual space fits in physical memory improves by up to 50% (a turnaround time ratio of about 1.5) when its behavior is improved. For larger jobs the improvement can reach a factor of 100. In a multiprogramming batch environment, improving the behavior of programs increased the throughput of the system by up to 64% for balanced workloads (a factor of 1.64), and by factors of up to 3.82 for I/O bound workloads and 4.19 for mixes of balanced and I/O bound workloads. Improving program behavior also reduces the overhead time of automatic memory management; a turnaround time ratio of up to 1.83 was measured. This case study points towards the more general conclusion that program behavior has a significant influence on computer system performance even with the abundance of hardware resources available now and in the future.

Keywords: program behavior, automatic memory management, system performance

1. Introduction

Recent trends in implementing virtual memory operating systems in the real world assume an abundance of hardware resources, mainly physical memory and secondary storage. With the drastic drop in the cost of memory systems over the last decade, this assumption is valid for minicomputers and more powerful machines [PoAg83], at least relative to the resources available in the late sixties or early seventies. System designers consider these hardware resources to be the primary constraints on system performance (for a given CPU speed) [LeEc80].

Early studies pointed out that another factor controlling the performance of virtual memory computers is program behavior [BrGu68], [Denn68a], [KuLa70], [Elsh74], [Ferr76]. However, in the sixties and seventies, the concern was to obtain satisfactory performance with the least possible physical memory on the machine. With the drastic drop in memory cost one wonders whether program behavior is of any significance at present or in the future. This paper is an attempt to answer this question.

Traditionally most empirical studies of program behavior have used trace-driven simulations (for example see [A1HK80], [ALMY82], [HaPo83], [LaFe83]). A program of interest is executed interpretively, a record is made of each of its memory references, and the resulting address trace is used to drive a simulator of a certain environment. The simulation approach has many advantages, mainly exact reproducibility and ease in changing the parameters of the environment being simulated. However, simulations have a major drawback: it would be difficult to use trace-driven simulation to account accurately for the effects of various aspects of modern and future computer systems in our study. Examples of these aspects are multiprogramming, the execution of operating system routines for a nontrivial percentage of CPU time, and I/O interference. Our investigation is therefore done by direct measurements on a real machine.
A similar approach was followed recently in a study to evaluate the performance of a cache system [Clar83]. The study showed a noticeable difference between the performance of the cache system as anticipated by simulation studies [Smit82] and that obtained by direct measurement on a real machine. The case study we use is the DEC VAX/VMS system [Stre78], [LeEc80]. Although the exact reproducibility of results offered by simulation is lost in our approach, the margin of error is small (we discuss this issue in more detail in Section 2). The advantage of our approach is that conclusions are based on measurements of a real machine with all the complexities of its architecture and operating system.

Choosing a specific machine and its operating system may seem to limit the generality of our conclusions. This may be true for the quantitative parts of our conclusions; however, a case study is one legitimate way of exploring the effect of program behavior on system performance. This is specifically because the designers of this system explicitly declare that they assume an abundance of hardware resources. This environment is suitable for exploring the answer to our question: what is the degree of influence which program behavior still has on the performance of a system with an abundance of resources?

We performed experiments on the VAX/VMS using two simple programs. The behavior of these programs can be easily changed through one simple transformation, the loop interchange transformation. The improvement in the behavior of these programs varies from drastic to moderate. Due to the simplicity of the programs, we transformed them manually. However, the automation of this and other transformations has been implemented in the PARAFRASE system of the University of Illinois [Leas76], [Wolf78], [AbKL79], [KKLW80], [AbKL81], [Wolf83].
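
As an illustration of why loop interchange changes paging behavior, the following Python sketch walks an n by n array stored column by column (as FORTRAN stores it) in two loop orders and counts how often consecutive references fall on different pages. It is a minimal sketch, not the authors' programs: the function names are hypothetical, and the figure of 128 array words per page is taken from the page arithmetic used later in this report.

```python
# Illustrative sketch only: the names below are assumptions, not the
# authors' FORTRAN code. The array is assumed to be stored column-major.

WORDS_PER_PAGE = 128  # assumed: 512-byte pages holding 4-byte words


def page_of(i, j, n):
    """Page index of element (i, j) of an n x n column-major array."""
    return (j * n + i) // WORDS_PER_PAGE


def row_order(n):
    """Original loop nest: outer loop over rows, inner loop over columns."""
    return ((i, j) for i in range(n) for j in range(n))


def column_order(n):
    """Interchanged loop nest: outer loop over columns, inner over rows."""
    return ((i, j) for j in range(n) for i in range(n))


def page_switches(order, n):
    """Count references that land on a different page than the previous one."""
    switches = 0
    last_page = None
    for i, j in order:
        page = page_of(i, j, n)
        if page != last_page:
            switches += 1
            last_page = page
    return switches


def row_sums(order, a):
    """Sum each row of matrix a, visiting elements in the given order."""
    sums = [0.0] * len(a)
    for i, j in order:
        sums[i] += a[i][j]
    return sums
```

Both loop orders compute the same row sums, so the transformation preserves the result. For n = 128, each column fills exactly one page, so the row order changes page on every one of its 16384 references while the column order changes page only 128 times.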
In Section Two we discuss our experimental process: the programs used, their transformed versions, the performance measures used, and the experiments conducted. In Section Three we present and discuss the results. In Section Four we present our concluding remarks concerning the questions raised in this paper.

2. The Experimental Environment

The following is a brief description of the environment in which these experiments were performed. The computer used is a VAX 11/750 running version 3.3 of DEC's operating system, VMS. The virtual memory page size of this system is 512 bytes and the system has a total of 4096 pages of real memory. The operating system itself occupies approximately 900 pages of main memory.

VMS uses a local FIFO replacement algorithm. Each user process on the system is given a certain set of main memory pages, called the resident set, on which to execute (in DEC literature this is called the Working Set of the process). VMS gives each process an initial resident set of size WSDEFAULT (this parameter can be set by the user). From this starting point it dynamically changes the amount of memory of the process in response to the process' paging rate and the amount of free memory left in the entire system. If the paging rate is above a certain level, PFRATH, the operating system increases the size of the process' resident set by WSINC; if, however, the paging rate is below a certain level, PFRATL, the operating system decreases the amount of memory for the process by WSDEC. The maximum size of the resident set of any user process is bounded above by the system parameter WSMAX. Additionally, each user process has its own upper limits on the size of its resident set. WSEXTENT is the maximum possible resident set size for the process, while WSQUOTA is the maximum guaranteed resident set size for the process (WSQUOTA must be less than or equal to WSEXTENT).
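
The adjustment rule just described can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not VMS code: the `Limits` record, the function name, and the `floor` field are hypothetical, and the sketch ignores the amount of free memory in the system and the difference between WSQUOTA and WSEXTENT.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    pfrath: float   # paging rate above which the resident set grows
    pfratl: float   # paging rate below which the resident set shrinks
    wsinc: int      # pages added when the set grows
    wsdec: int      # pages removed when the set shrinks
    wsextent: int   # per-process maximum resident set size
    wsmax: int      # system-wide maximum resident set size
    floor: int = 0  # hypothetical lower bound; the text does not give one


def adjust_resident_set(size, paging_rate, lim):
    """Return the resident set size after one adjustment step (simplified)."""
    if paging_rate > lim.pfrath:
        # Grow by WSINC, but never beyond the per-process or system limit.
        ceiling = min(lim.wsextent, lim.wsmax)
        return min(size + lim.wsinc, ceiling)
    if paging_rate < lim.pfratl:
        # Shrink by WSDEC, but never below the assumed floor.
        return max(size - lim.wsdec, lim.floor)
    # Paging rate lies between PFRATL and PFRATH: leave the size unchanged.
    return size
```

With PFRATH = 200, PFRATL = 100, WSINC = 150 and WSDEC = 35 (the installation's values, listed in Table 1) and a 250-page resident set, a paging rate of 300 grows the set to 400 pages and a rate of 50 shrinks it to 215 pages.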
The resident set of a process may exceed its WSQUOTA only when there are more than BORROWLIM free pages in the system. The size of the virtual space (in pages) associated with each process is denoted by PVWS. The maximum number of physical pages occupied by a process during its lifetime is denoted by PWSS.

In addition to the pages allocated to user processes, the operating system keeps a certain amount of memory free in the free page list and the modified page list to act as a page cache. When a page is released from the resident set of a process, it goes into the modified page list if it was modified (and thus requires a disk write); if the page was not modified it goes into the free page list. The operating system keeps the size of the free page list above FREELIM pages and makes sure it is at least FREEGOAL pages large after each freeing of pages from user processes (freeing in response to a memory shortage). The maximum size of the modified page list is MPW_HILIMIT and the minimum size is MPW_LOLIMIT. If a process faults and the page is in either list, the page is returned to the process' resident set. Such a page fault is relatively inexpensive since it can be satisfied without a disk I/O request.

Besides being part of a paging cache, the modified page list acts as a staging buffer for the clustering of disk writes. This clustering serves to reduce the amount of disk I/O. Pages from the modified page list are written out of memory in clusters of MPW_WRTCLUSTER pages. For each page fault requiring a disk read, a cluster of PFCDEFAULT virtually contiguous pages is read into the faulting process' resident set.

VMS also swaps entire working sets between memory and disk. VMS checks the nonresident executable queues to find the highest priority process to be swapped in. Once a process is selected, the operating system must find enough free pages to hold the process' resident set.
There are three ways to obtain these free pages: the first is to take them from the free list, the second is to do a disk write from the modified page list and thus free those pages, and the third is to swap out a process of lower or equal priority. Once swapped in, the process is guaranteed at least one quantum before it becomes eligible to be swapped out. For a more detailed description of the memory management in VMS see [LeLi82].

Table 1 shows the values of the system parameters used in this installation. The values assigned to these parameters are those used by the system manager of the site to suit the workload of the machine. The site is a software house with a day workload consisting mainly of the interactive development of vectorizing compilers for supercomputers, while at night the production runs consist of compilation jobs. We were not free to vary the values of the system parameters, nor did we intend to change them. We felt that for the investigation we are doing in this paper, we should not be concerned with tuning issues. The designers of the system do not advocate putting a lot of effort into tuning the system. Instead, they advocate the use of the default values for the system parameters while adding more hardware resources whenever the workload outgrows the system [DEC82]. This implies that with sufficient hardware resources, the system performance is satisfactory with the default system parameter values. This paper examines this claim from a point of view that considers the effect of program behavior. Other researchers have shown that tuning an early version of this system drastically improved its performance [Lazo79].

Table 1. System Parameters

  Parameter         Value
  BORROWLIM         300 pages
  FREEGOAL          500 pages
  FREELIM           100 pages
  MPW_HILIMIT       500 pages
  MPW_LOLIMIT       100 pages
  MPW_WRTCLUSTER    96 pages
  PFCDEFAULT        32 pages
  PFRATH            200 faults/10 secs
  PFRATL            100 faults/10 secs
  QUANTUM           200 ms
  WSINC             150 pages
  WSDEC             35 pages

The following two programs (coded in FORTRAN) were used for the experiments. The first program, ADD, sums up the values of each row of a square matrix. The second program, MAD, is a matrix addition of two square matrices. In these programs, the matrices are referenced by rows. Additionally, we have transformed versions of each of the programs, called TADD and TMAD. The transformation applied to these programs is loop interchange. Due to the loop interchange transformation, the matrices are referenced by columns. Each of these programs was compiled in eight versions, distinguished by the problem size. The version number is appended to the program name: for each program P in {ADD, MAD, TADD, TMAD}, P1 is the version with a 128 by 128 matrix, P2 the version with a 256 by 256 matrix, P3 the version with a 384 by 384 matrix, and so on up to P8 for 1024 by 1024.

The information about the resource usage of each of the programs is reported through the accounting log files. This log contains the time at which the process terminated, the number of I/O requests serviced, the number of page faults, the peak size of the resident set during the execution of the process (PWSS), the peak virtual memory space in pages (PVWS) allocated to the process, the elapsed CPU time, and the elapsed real time.

The first experiment performed was to run each program in a monoprogramming batch mode. Each program was run once for each size of the data array, both for the original version and for the transformed version. The elapsed time for each program was noted and the ratio between the elapsed times of the original and transformed versions of corresponding sizes was computed. The purpose of this experiment is to illustrate the effectiveness of the transformations on programs that would normally be monoprogrammed on a VAX machine.
Programs with large array sizes seem to reflect such a workload. The processes created for this experiment have the following parameter settings: WSDEFAULT is set to 250 pages, WSQUOTA is set to 1500 pages, and WSEXTENT is set to 1500 pages.

The second experiment examines the effectiveness of the transformation on programs that would normally be run in a multiprogramming environment. The original and transformed versions of the programs at a smaller data array size were used in these experiments. Each version of the program was multiprogrammed with the multiprogramming level (MPL) varying from one to six; MPL copies of the program were started at the same time. The time per job at each MPL was compared between the original and transformed versions of the program. We also multiprogrammed the two programs ADD and MAD by submitting three copies of each to the system. A similar run was done for the transformed programs. The processes created for this experiment have the same parameter settings as those used for experiment one.

The third experiment attempts to measure the reduction in automatic memory management cost due to improved program behavior. This is done by measuring the running time of an original program versus that of a transformed program when the physical memory allocated to the process of the original program is greater than the virtual space needed by the process. The original version of the program was run with WSDEFAULT = WSQUOTA = WSEXTENT set to a value greater than the PVWS of the program. The transformed program was run three times, once for each of the following conditions: (i) WSDEFAULT = WSQUOTA = WSEXTENT = PVWS, (ii) WSDEFAULT = WSQUOTA = WSEXTENT = PWSS, and (iii) restricted memory (WSDEFAULT = WSQUOTA = WSEXTENT = 213 pages). ADD3 and its transformed counterpart were picked for these experiments. This is the largest version of ADD whose PVWS can fit entirely in physical memory.
In order to make the conditions for all experiments as controllable as possible, the experiments were run only when no other user processes were executing. However, some measure of uncertainty is unavoidable. One of the reasons is the interference caused by the execution of operating system routines. The experiments were run from a batch queue that operated when all users were logged off. Synchronization within this queue was done in order to ensure that no other pending job would start until the current experiment completed.

Besides the operating system processes, there is a process executing at the time of the experiments that controls the submission of jobs to be run. This process introduces a slight interference with the execution of the experiments. Also, the job controlling the simultaneous submission of multiprogrammed jobs actually starts the jobs serially. The delays in submitting jobs and the interference of the job submitting program account for some of the slight perturbations in the results between repeated instances of the experiments. These differences were computed in several reruns of several of the experiments and were found to be of no significant consequence.

3. Experimental Results

3.1. Experiment One

As mentioned in Section Two, the purpose of this experiment is to investigate the effect of program behavior on the performance of a monoprogrammed VAX/VMS system. It is the recommendation of the designers of the VAX/VMS to run large production jobs in a batch monoprogramming mode [DEC82]. In fact, these types of jobs are executed in this manner. The effect of program behavior on the performance of the system in such an environment is probably best measured by its effect on the turnaround time of jobs. Other measures of interest are the PWSS and the number of page faults generated. Now we discuss the results we obtained for programs ADD and MAD.

3.1.1. Program ADD

Table 2 shows the space statistics for this program for different array sizes. For the version where the two dimensional array is 128 by 128 (version 1), the space occupied by array elements is 129 pages (a page holds 128 real words, hence 128 pages are occupied by the two dimensional array and one page by the result vector). For a matrix size of 1024 by 1024 (version 8) we need 8*1024 + 8 = 8200 pages. Note that the code and the scalar storage requirements are negligible. Hence, the 149 page difference between the occupied virtual process space, PVWS, and its array data pages is the part of virtual space allocated to the process for its control region (for the user mode stack and other protected process specific data and code, as well as for the stacks of higher access modes such as the supervisor, executive, and kernel) [LeEc80]. This control space was measured to be constant, at 149 pages, for all the programs in all the experiments. It is independent of the problem size; for program ADD at an array size of 128 by 128 it is more than 50 percent of the total process space.

From Table 2 we notice that for a given problem size the difference between the PWSS for ADD and TADD is less than 1% for versions 1 through 4 and between 6% and 21% for versions 5 through 8. Moreover, for some versions TADD.PWSS > ADD.PWSS and for others ADD.PWSS > TADD.PWSS. This indicates a failure of the automatic techniques implemented in VMS to take advantage of the extreme difference between the behavior of ADD and TADD and the difference in their space requirements. Optimally, TADD would only need a maximum of 16 array data pages (taken from analysis of version 8) at any given time during its execution, for a total of 165 pages of process space. It is not within the scope of this paper to discuss why the mechanisms of VMS failed to distinguish between the behavior of the two programs.
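
The page arithmetic behind the data-page and PVWS columns of Table 2 can be reproduced with a short Python sketch. The function names are hypothetical; the constants (128 words per page and a 149-page control region) are the figures quoted in the text above.

```python
WORDS_PER_PAGE = 128   # real words per 512-byte page, as stated in the text
CONTROL_PAGES = 149    # measured control region size, as stated in the text


def pages_for(words):
    """Number of whole pages needed to hold the given number of words."""
    return -(-words // WORDS_PER_PAGE)  # ceiling division


def add_data_pages(n):
    """Array data pages for ADD: an n x n matrix plus an n-element result vector."""
    return pages_for(n * n) + pages_for(n)


def add_pvws(n):
    """Virtual process space for ADD: data pages plus the control region."""
    return add_data_pages(n) + CONTROL_PAGES
```

For the eight matrix sizes from 128 to 1024 this yields 129, 514, 1155, 2052, 3205, 4614, 6279 and 8200 data pages, which agrees with the DP column of Table 2; adding 149 gives the PVWS column.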
The important observation is that VMS failed to distinguish between the two programs irrespective of whether the process space can fit in physical memory or not, and in spite of the fact that TADD's fault rate is much lower than ADD's fault rate. This is shown in Table 3.

Table 3 also shows the turnaround time of each program and the ratio of turnaround times for the original and transformed programs. Note that the improvement in turnaround time is up to 50 percent for versions 1 to 4. For these programs the system can keep the pages of the user process in physical memory (in the process resident set, free page list, and modified page list). For versions 5 to 8, the main memory of the machine is not large enough to hold most of the virtual memory space of program ADD. Disk page faults are now more frequent than for the smaller versions of ADD and the turnaround time ratio jumps to between 3.6 and 4.0. The table also shows the improvement in the number of page faults. Since the number of page faults includes both cheap and expensive ones, we feel that the page fault ratio is less indicative of the improvement than the ratio of turnaround times. Figure 1 shows a plot of the turnaround time ratio for program ADD versus the result vector length (problem size). Notice the step in the curve when most of the process' virtual space does not fit in core.

Our conclusion from Table 3 is that the behavior of programs like ADD is still a major factor in VAX/VMS performance in a monoprogrammed batch mode. Improving the behavior of programs through compile time transformations can reduce the turnaround time by a factor of 4 without the need for expanding the main memory size. Even if the main memory size were increased to adequately hold almost all of the virtual process space, program transformation can reduce turnaround time by an appreciable percentage (50% was measured).

3.1.2. Program MAD

This program adds two 2-dimensional matrices and stores the result in a third matrix.
The space needed for array elements is three times the space needed for ADD; however, the amount of arithmetic is the same. Thus, one can consider this program to be I/O (or paging) bound relative to ADD. Tables 4 and 5 show measurements for MAD similar to those in Tables 2 and 3 for ADD.

Table 4 supports the conclusion reached in the previous section about the inability of VMS to differentiate between MAD and a drastically better behaved TMAD. The differences between the PWSS of the original and transformed programs are insignificant. As for the improvement in the turnaround time, we notice in Table 5 that it grows rapidly from a factor of 3.3 to 64.88 as soon as the process virtual space is too large to fit in core (versions 3 to 6). For programs that can fit in core the improvement is in the range of 20%. Figure 2 illustrates the pronounced improvement that the transformed version gives when the program does not fit in core. Note that the amount of additional memory needed to make this program fit in core is three times what is needed to fit program ADD in core. In contrast with a drastic memory expansion as the way to make running such a program practical on such a machine, the transformation approach allows a drastic improvement in system performance without the need for expanding the memory. The transformed program should run in 167 pages with almost the same performance (for version 6).

3.2. Experiment Two

Most VAX/VMS systems are actually used in a multiprogramming interactive environment for an appreciable percentage of the time. This kind of workload would consist of program development, debugging, running system facilities such as the mailing system, and running interactive systems such as data base systems. In this section we study the effect of program behavior on system performance when the machine is multiprogrammed.
For a workload with the two programs discussed in this paper, one would expect jobs used in a multiprogrammed environment to have smaller array sizes than those used in a monoprogrammed environment. One can think of the runs in the multiprogramming mode as test runs. When numerical programs pass the test runs, they are scheduled for execution with realistically large problem sizes in a batch monoprogrammed mode.

3.2.1. Program ADD

We chose ADD2 (256 by 256 matrix) for our multiprogramming experiment. The total virtual process space for this program is 663 pages, which is reasonable for such an environment; it is not trivial and at the same time not so large as to require monoprogramming. Table 6 shows the time per job for MPL varying from one to six for ADD and TADD. The table also shows the total virtual space of user processes for each MPL.

Examining the time per job for ADD, we notice that the system is being multiprogrammed in a rather optimal way. The system is operating at the maximal flat portion of the model curve of throughput (and CPU utilization) versus degree of multiprogramming [DKLP76] (see Figure 3 and Figure 4). We see that the system is multiprogramming six processes with a total of approximately 4000 pages of virtual space while the time per job is approximately equal to the time for monoprogramming the job. For comparison, Table 3 shows that version 5 of ADD (with PVWS = 3354 pages) is thrashing. From this we conclude that multiprogramming is very effective under VAX/VMS. The same conclusion is reached by examining the time per job for multiprogramming MPL copies of the transformed program TADD. Notice that the throughput of the system improves when the transformed workload is run. The ratio of time per job of ADD to TADD ranges from 1.24 to 1.35 with an average and median of 1.29.

Table 7 and Figure 5 present the results for this experiment with version 3 of ADD. The improvements are more pronounced in this case. This workload is an I/O-CPU balanced workload.
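
The summary statistics quoted for the ADD2/TADD2 ratios can be checked with a short Python sketch. The list below is the Ratio column of Table 6 as reported; the variable and function names are hypothetical.

```python
from statistics import mean, median

# Ratio column of Table 6 (time per job of ADD2 over TADD2), MPL = 1..6,
# copied as reported rather than recomputed from the time columns.
TABLE6_RATIOS = [1.35, 1.31, 1.28, 1.24, 1.31, 1.25]


def summarize(ratios):
    """Return (minimum, maximum, mean, median) of a list of ratios."""
    return min(ratios), max(ratios), mean(ratios), median(ratios)
```

Applied to the Table 6 ratios, the minimum and maximum are 1.24 and 1.35, the mean rounds to 1.29, and the median is the midpoint of 1.28 and 1.31, in line with the values quoted above.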
Next we examine multiprogramming an I/O bound workload.

3.2.2. Program MAD

As discussed earlier, this program generates more page faults and does the same amount of computation as ADD. Table 8 shows our findings for multiprogramming up to six copies of MAD2 and TMAD2.

We notice that for the untransformed workload the system is thrashing [Denn68b]. We are seeing the negative slope part of the throughput (CPU utilization) versus multiprogramming level function. The time per job increases from .201 minutes in the monoprogramming case to .819 minutes at MPL = 6. This is a factor of four. Improving the behavior of the workload reduces the thrashing effect significantly. Figure 6 shows that MAD2 is in the negative slope region of the model multiprogramming curve (Figure 3) while TMAD2 is in the flat region. The ratio of time per job at MPL = 6 to the uniprogrammed case is about 1.7 for TMAD2. Moreover, we see that the ratio of the time per job for the original workload to that of the transformed workload ranges from 1.20 to 3.82, with an average of 2.63 and a median of 2.82. This is a very significant improvement in the throughput and utilization of the machine.

3.2.3. Mix of ADD and MAD

In this experiment three copies of ADD5 and three copies of MAD3 were multiprogrammed together. We repeated this experiment using transformed programs, with three copies of TADD5 and three copies of TMAD3. The total time to finish the original versions of the programs was 8.25 minutes while the total time to finish the transformed versions was 1.97 minutes, giving an improvement by a factor of 4.19. Notice that this improvement is greater than either of the improvements in independently monoprogramming TADD5 over ADD5 (3.63) or TMAD3 over MAD3 (3.30).

3.3. Experiment Three

In this experiment we consider the question of whether program behavior affects the overhead associated with automatic memory management.
To be conservative we picked the more balanced program, ADD, for this experiment rather than the paging bound program MAD. To obtain results whose experimental error percentage is negligible, we chose the largest version of ADD with memory requirements that can fit in core. This version of ADD is ADD3, with PVWS = 1304 pages.

We ran ADD3 in a monoprogrammed batch mode with WSDEFAULT = WSQUOTA = WSEXTENT = 1500 pages. Thus, it was allocated real space greater than its logical space. Then we ran TADD3 with three different settings of these parameters.

(A) WSDEFAULT = WSQUOTA = WSEXTENT = 1500 PAGES

In this case, both the original and transformed programs were allocated the same amount of real memory, enough to exceed their logical space size. The turnaround time of TADD3 was .153 minutes while for ADD3 it was .280 minutes. The ratio of the two times is 1.83. This is a very significant reduction in automatic memory management overhead.

(B) WSDEFAULT = WSQUOTA = WSEXTENT = PVWS (1304 PAGES)

The real space allocated to TADD3 is exactly equal to its virtual space. The turnaround time was practically identical to that in the previous case and hence the improvement was also identical.

(C) WSDEFAULT = WSQUOTA = WSEXTENT = 213 PAGES

In this case we reduce the memory allocation of TADD3 to 213 pages. This is enough space to hold the 149 pages used by the system, a cluster of 32 pages when a fault to the two dimensional array occurs, and another 32 pages when a fault to the result vector occurs. The turnaround time was .155 minutes. The ratio of the turnaround time of the original program to this time is 1.81. Note that the space has been reduced by a factor of 1500/213 = 7.0.

From these results we note that the reduction in the automatic memory management overhead due to improved program behavior is very significant.

4. Concluding Remarks

The experimental data presented in this paper show that program behavior can still have a major influence on computer system performance. The point that this paper makes is that program behavior cannot simply be ignored by system designers. It is certainly true that having an abundance of hardware resources (large fast main memories and high bandwidth I/O) helps to improve the performance of computer systems. However, these hardware resources do not eliminate the influence of program behavior. Moreover, user demands on the memory space and processor speed of computer systems are always greater than what manufacturers supply. With time this demand is growing at a more rapid pace than the growth of hardware resources.

The conclusions presented in this paper are based on a case study of a VAX 11/750 VMS system. Two simple example programs were used. Both do the same amount of arithmetic; however, one requires more space and has higher paging activity. The behavior of these programs can be drastically improved through simple compile time transformations. In a production, monoprogrammed, batch environment, the turnaround time of untransformed programs can be 1.5 times the turnaround time of the transformed programs. This is assuming that there is enough main memory on the machine to hold all of the program's virtual space. When the logical space of programs outgrows the main memory, this ratio can jump to 4.0 for balanced jobs and approach 100 for paging bound jobs.

In a multiprogrammed batch environment, improving the behavior of programs achieves two goals. First, the system was moved from a thrashing state to a maximum resource utilization state. Second, a pronounced improvement in throughput was achieved. A factor of 1.64 was measured for a balanced load, 3.82 for an I/O bound load, and 4.19 for a mixed load.

Improving program behavior also leads to reducing the system overhead associated with automatic memory management.
A factor of 1.83 was measured when both the original and transformed versions of the program were given the same amount of memory. Additionally, a factor of 1.81 was measured when the transformed version used only one seventh of the amount of memory.

Much more extensive measurement and experimentation needs to be done to present results that apply to a wide class of application programs. However, the results presented in this paper and the more comprehensive work discussed in [AbKL79] and [AbKL81] show that compile time transformations have real, substantial potential for improving the performance of paged computer systems.

Acknowledgements

The authors would like to thank the following people for contributing to this study: Pen-Chung Yew for his input in the early stages of this study; KAI for providing the machine time to perform the experiments; and Thomas Macke, Mike Wolfe, and especially Jim Davies, all of KAI, for assistance in performing the experiments. Special thanks go to David Kuck for his valuable support throughout the course of this study.

Table 2 Memory Requirements for Programs ADD and TADD

Version  Size  Data pages (DP)  PVWS  PVWS - DP  ADD PWSS  TADD PWSS
      1   128              129   278        149       230        232
      2   256              514   663        149       583        550
      3   384             1155  1304        149      1221       1226
      4   512             2052  2201        149      1080       1095
      5   640             3205  3354        149      1300       1093
      6   768             4614  4763        149      1241       1388
      7   896             6279  6428        149      1300       1387
      8  1024             8200  8349        149      1404       1103

Table 3 Execution Statistics for Programs ADD and TADD

Version  Size  ADD time (min.)  TADD time (min.)  Ratio  ADD pagefaults  TADD pagefaults  Ratio
      1   128             .058              .059    .98             343              343   1.00
      2   256             .115              .087   1.32             918              849   1.08
      3   384             .213              .141   1.51            2016             1811   1.11
      4   512             .363              .254   1.43            4742             4343   1.12
      5   640             1.55              .427   3.63            7342             4013   1.83
      6   768             2.29              .601   3.80           10374             5297   1.96
      7   896             3.21              .806   3.98           13682             6961   1.97
      8  1024             3.98              1.03   3.88           17859             8949   2.00

Table 4 Memory Requirements for Programs MAD and TMAD

Version  Size  Data pages (DP)   PVWS  PVWS - DP  MAD PWSS  TMAD PWSS
      1   128              384    533        149       451        451
      2   256             1536   1685        149      1500       1293
      3   384             3456   3605        149      1500       1387
      4   512             6144   6293        149      1500       1500
      5   640             9600   9749        149      1477       1396
      6   768            13824  13973        149      1500       1500

Table 5 Execution Statistics for Programs MAD and TMAD

Version  Size  MAD time (min.)  TMAD time (min.)  Ratio  MAD pagefaults  TMAD pagefaults  Ratio
      1   128             .086              .078   1.09             871              692   1.26
      2   256             .199              .167   1.19            3248             1945   1.67
      3   384             1.69              .512   3.30            8467             3804   2.23
      4   512            17.14              .903  18.70         1427244             6777  210.6
      5   640            52.55              1.28  41.20         2456614            10472  234.6
      6   768           141.43              2.18  64.88         3539429            14930  237.1

Table 6 Time Per Job for Multiprogramming ADD2 and TADD2

MPL  ADD2 t (min.)  TADD2 t (min.)  Ratio  Total PVWS
  1           .116            .086   1.35         663
  2           .109            .083   1.31        1326
  3           .114            .089   1.28        1989
  4           .113            .091   1.24        2652
  5           .115            .088   1.31        3315
  6           .111            .083   1.25        3978

Table 7 Time Per Job for Multiprogramming ADD3 and TADD3

MPL  ADD3 t (min.)  TADD3 t (min.)  Ratio  Total PVWS
  1           .236            .144   1.64        1304
  2           .268            .178   1.51        2608
  3           .252            .160   1.58        3912
  4           .263            .162   1.62        5216
  5           .273            .170   1.61        6520
  6           .276            .169   1.63        7824

Table 8 Time Per Job for Multiprogramming MAD2 and TMAD2

MPL  MAD2 t (min.)  TMAD2 t (min.)  Ratio  Total PVWS
  1           .201            .168   1.20        1685
  2           .591            .216   2.74        3370
  3           .534            .246   2.17        5055
  4           .932            .244   3.82        6740
  5           .822            .281   2.93        8425
  6           .819            .283   2.89       10110

Figure 1 Turnaround Time Ratio for Original and Transformed ADD [axes: ratio vs. ADD and TADD size]

Figure 2 Turnaround Time Ratio for Original and Transformed MAD [axes: ratio vs. MAD and TMAD size]
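The throughput factors quoted in the concluding remarks can be recomputed from the per-job times in the tables. The following Python sketch does this for the Table 7 data. It rests on an assumption that the text does not state explicitly, namely that throughput is the reciprocal of time per job. The names TABLE_7, throughput_gain, and gains_by_mpl are illustrative only and do not come from the report.

```python
# Per-job times in minutes, copied from Table 7: MPL -> (ADD3, TADD3).
TABLE_7 = {
    1: (0.236, 0.144),
    2: (0.268, 0.178),
    3: (0.252, 0.160),
    4: (0.263, 0.162),
    5: (0.273, 0.170),
    6: (0.276, 0.169),
}


def throughput_gain(original_time, transformed_time):
    """Return (ratio, percent gain) for two per-job times.

    Assumption: throughput is the reciprocal of time per job, so the
    throughput ratio equals original_time / transformed_time.
    """
    ratio = original_time / transformed_time
    return ratio, (ratio - 1.0) * 100.0


def gains_by_mpl(table):
    """Map each multiprogramming level to its (ratio, percent gain) pair."""
    return {
        mpl: throughput_gain(t_orig, t_new)
        for mpl, (t_orig, t_new) in table.items()
    }


if __name__ == "__main__":
    for mpl, (ratio, percent) in gains_by_mpl(TABLE_7).items():
        print(f"MPL {mpl}: ratio {ratio:.2f}, throughput gain {percent:.0f}%")
```

Under this assumption, the largest Table 7 ratio rounds to the factor of 1.64 quoted for the balanced load, which corresponds to the 64% throughput increase stated in the abstract.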
Figure 3 Throughput vs. MPL Model Curve [axes: throughput vs. MPL]

Figure 4 Time/Job vs. MPL for Programs ADD2 and TADD2 [axes: T/J (min.) vs. MPL]

Figure 5 Time/Job vs. MPL for Programs ADD3 and TADD3 [axes: T/J (min.) vs. MPL]

Figure 6 Time/Job vs. MPL for Programs MAD2 and TMAD2 [axes: T/J (min.) vs. MPL]

REFERENCES

[AbKL79] W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie, "Automatic Program Transformations for Virtual Memory Computers," Proc. of the 1979 National Computer Conf., pp. 969-974, June 1979.
[AbKL81] W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie, "On the Performance Enhancement of Paging Systems through Program Analysis and Transformations," IEEE Trans. on Computers, Vol. C-30, No. 5, pp. 341-356, May 1981.
[A1HK80] T. O. Alanko, H. J. Haikala, and P. H. Kutvonen, "Methodology and Empirical Results of Program Behavior Measurements," Performance 80, ACM Sigmetrics Performance Evaluation Review, Vol. 9, No. 2, pp. 55-66, Summer 1980.
[ALMY82] W. Abu-Sufah, R. Lee, M. Malkawi, and P-C. Yew, "Experimental Results on the Paging Behavior of Numerical Programs," Proc. of the 6th International Conf. on Software Engineering, pp. 110-117, September 1982.
[BrGu68] B. S. Brawn and F. G. Gustavson, "Program Behavior in a Paging Environment," Fall Joint Computer Conference, pp. 1019-1032, 1968.
[Clar83] D. W. Clark, "Cache Performance in the VAX-11/780," ACM Trans. on Computer Systems, Vol. 1, No. 1, pp. 24-37, Feb. 1983.
[DEC82] "VAX/VMS System Management and Operations Guide," Digital Equipment Corporation, Maynard, Massachusetts, Order #AAM547A-TE, May 1982.
[Denn68a] P. J. Denning, "Working Set Model for Program Behavior," Comm. of the ACM, Vol. 11, No. 5, pp. 323-333, May 1968.
[Denn68b] P. J. Denning, "Thrashing: Its Causes and Prevention," Proc. of 1968 FJCC, pp. 915-922, 1968.
[Denn80] P. J. Denning, "Working Sets Past and Present," IEEE Trans. on Software Engineering, Vol. SE-6, No. 1, pp. 64-84, Jan. 1980.
[DKLP76] P. J. Denning, K. C. Kahn, J. Leroudier, D. Potier, and R. Suri, "Optimal Multiprogramming," Acta Informatica, Vol. 7, pp. 197-216, 1976.
[Elsh74] J. L. Elshoff, "Some Programming Techniques for Processing Multi-Dimensional Matrices in a Paging Environment," Proc. of the National Computer Conf., pp. 185-193, 1974.
[Ferr76] D. Ferrari, "The Improvement of Program Behavior," Computer, Vol. 9, No. 11, pp. 39-47, Nov. 1976.
[HaPo83] H. J. Haikala and H. Pohijanlahti, "On the BLI-Model of Program Behavior," Proc. of the 1983 ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, pp. 28-38, August 1983.
[KKLW80] D. J. Kuck, R. H. Kuhn, B. Leasure, and M. Wolfe, "The Structure of an Advanced Vectorizer for Pipelined Processors," Proc. of the 4th International Computer Software and Applications, pp. 709-715, Oct. 1980.
[KuLa70] D. J. Kuck and D. H. Lawrie, "The Use and Performance of Memory Hierarchies: A Survey," Software Engineering, Vol. 1, J. T. Tou, ed., pp. 45-77, Academic Press, New York, 1970.
[LaFe83] E. J. Lau and D. Ferrari, "Program Restructuring in a Multilevel Virtual Memory," IEEE Trans. on Soft. Eng., Vol. SE-9, No. 1, pp. 69-79, Jan. 1983.
[Lazo79] E. D. Lazowska, "The Benchmarking, Tuning and Analytic Modeling of VAX/VMS," Proc. of the 1979 Conf. on Simulation, Measurement and Modeling of Computer Systems, pp. 57-63, 1979.
[Leas76] B. R. Leasure, "Compiling Serial Languages for Parallel Machines," M.S. thesis, Univ. of Illinois, Dept. of Computer Science, DCS Rpt. No. 76-805, Nov. 1976.
[LeEc80] H. M. Levy and R. H. Eckhouse, Jr., "Computer Programming and Architecture - The VAX-11," Digital Press, 1980.
[LeLi82] H. M. Levy and P. H. Lipman, "Virtual Memory Management in the VAX/VMS Operating System," Computer, Vol. 15, No. 5, pp. 35-41, March 1982.
[PoAg83] A. V. Pohm and O. P. Agrawal, "High-Speed Memory Systems," Reston Publishing Company, Virginia, 1983.
[Smit82] A. J. Smith, "Cache Memories," ACM Computer Surv., Vol. 14, No. 3, pp. 473-530, Sept. 1982.
[Stre78] W. D. Strecker, "VAX-11/780 - A Virtual Address Extension to the DEC PDP-11 Family," Proc. of the 1978 National Computer Conf., pp. 967-980, 1978.
[Wolf78] M. J. Wolfe, "Techniques for Improving the Inherent Parallelism in Programs," M.S. thesis, Univ. of Illinois, Dept. of Computer Science, DCS Rpt. No. 78-929, July 1978.
[Wolf83] M. J. Wolfe, "Optimizing Supercompilers for Supercomputers," Ph.D. thesis, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, 1983.

BIBLIOGRAPHIC DATA SHEET

1. Report No.: UIUCDCS-R-83-1147
3. Recipient's Accession No.:
4. Title and Subtitle: Program Behavior Under VAX/VMS
5. Report Date:
7. Author(s): Walid Abu-Sufah and Roland L. Lee
8. Performing Organization Rept. No.: UIUCDCS-R-83-1147
9. Performing Organization Name and Address: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, IL 61801-2987
10. Project/Task/Work Unit No.:
11. Contract/Grant No.: US NSF-MCS 83-00981; US DOE-AC02 81-ER10822
12. Sponsoring Organization Name and Address: Dept. of Computer Science, University of Illinois at Urbana-Champaign; the National Science Foundation and the U.S. Dept. of Energy, Washington, DC
13. Type of Report & Period Covered: Technical Report
14.
15. Supplementary Notes:
16. Abstracts: Direct measurements on a VAX/VMS system reveal that program behavior has a significant effect on the performance of this system. For a monoprogrammed batch workload the turnaround time of a job can be reduced by up to 50% if its behavior is improved. This is for jobs with virtual space that can fit in physical memory. For larger jobs the improvement can reach a factor of 100.
In a multiprogramming batch environment improving the behavior of programs increased the throughput of the system by up to 64% for balanced workloads, up to 400% for I/O bound workloads and up to 419% for mixes of balanced and I/O bound workloads. Improving the program behavior also reduces the overhead time of automatic memory management. This was measured to reach up to 83%. This case study points towards the more general conclusion that program behavior has a significant influence on computer system performance even with the abundance of hardware resources available now and in the future.
17. Key Words and Document Analysis.
17a. Descriptors: program behavior, automatic memory management, system performance
17b. Identifiers/Open-Ended Terms:
17c. COSATI Field/Group:
18. Availability Statement: Release Unlimited
19. Security Class (This Report): UNCLASSIFIED
20. Security Class (This Page): UNCLASSIFIED
21. No. of Pages: 35
22. Price:
FORM NTIS-35 (10-70) USCOMM-DC 40329-P71