?£ 3 Report No. 363 
 
 1 a. 
 
 THE USE AND PERFORMANCE 
 OF MEMORY HIERARCHIES: A SURVEY 
 
 by 
 
 D. J. Kuck 
 D. H. Lawrie 
 
 December h, 1969 
 
 THE LIBRARY 01: t THB 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 
 
REPORT NO. 363 
 
 THE USE AND PERFORMANCE 
 
 OF MEMORY HIERARCHIES: 
 
 A SURVEY 
 
 by 
 
 D. J. Kuck 
 D. H. Lawrie 
 
 December k, I969 
 
 Department of Computer Science 
 
 University of Illinois at Champ aign-Urbana 
 
 Urbana, Illinois 61801 
 
TABLE OF CONTENTS 
 
 Page 
 
 I. Introduction 1 
 
 II. Page Fault Rate 3 
 
 2.1 EFFECT OF PRIMARY MEMORY ALLOTMENT ON PAGE FAULT RATE ... 3 
 
 2.2 EFFECT OF PAGE SIZE AND PRIMARY MEMORY ALLOTMENT 
 
 ON PAGE FAULT RATE 13 
 
 2.2.1 FRAGMENTATIONS AND PAGE SIZE lk 
 
 2.2.2 SUPERFLUITY VS. PAGE SIZE 17 
 
 2.2.3 PRIMARY MEMORY ALLOTMENT AND PAGE SIZE 18 
 
 2.3 REPLACEMENT ALGORITHMS 22 
 
 2.U PROGRAM ORGANIZATION 25 
 
 2.5 SUMMARY 26 
 
 III. Mult iprogranming 26 
 
 IV. Average Tine Per I/O Request 3U 
 
 i+.l PHYSICAL LATENCY OF SECONDARY MEMORY 35 
 
 -.2 EFFECTIVE LATENCY OF SECONDARY MEMORY 36 
 
 U.3 REQUEST QUEUEING 36 
 
 h.k MINIMIZATION OF EXPLICIT I/O REQUEST TIME 37 
 
 V. Summary and Extensions Ul 
 
 LIST OF FOOTNOTES hk 
 
 BIBLIOGRAPHY 1+7 
 
 ii 
 
LIST OF FIGURES 
 
 Figure P a 8 e 
 
 1. Mean time to reference p pages as a function of p. 7 
 
 2a. E vs. (p,T) surface for q = 2 x 10 , a = 3.8, p ■ 2.U. 11 
 2b. E vs. (p,T) surface for q = 5 x 10 , a = 3.6, 3 = 2.U 12 
 3« Memory fragmentation with four pages of size b, = Uq, 
 
 = 1.5Q, b 3 = 3-2Q and b^ = Uq. B = i+Q. 16 
 
 Page fault rate a as a function of primary allotment m and 
 page size B. Data for a FORTRAN compiler is from Anacker 
 and Wane [ ^ ]• Note B scale is logarithmic. 20 
 
 Page fault rate \ vs. M and B. Data for a SN0B0L compiler 
 from Varian and Coffman [135]. Note \ scale is different 
 than Figure 2a. Dashed lines indicate locus of equal ,\. 21 
 5a. CPU efficiency as a function of the number of jobs J and 
 average I/O completion time T. Average page rate is 
 V(3.£ (uk/j) * ) and explicit I/O interrupts occur every 10K 
 instructions on the average. 29 
 
 CPU efficiency as a function of J and T. Average page rate 
 is t[3." (6V ; J) )and explicit I/O interrupts occur every 
 
 instructions on the average. 30 
 
 5c. CPU efficiency as a function of J and T. Average page rate 
 
 is 1/^3.- (32/j) - ' ) and explicit I/O interrupts occur every 10K 
 instructions on the average. 31 
 
 Relative gain G in efficiency over monoprogramming for optimal 
 
 oer of jobs vs. average I/O completion time (normalized). 
 a. = 3»S> 3 = 2.U. Numbers on curves indicate optimal number 
 of jobs. 33 
 
 iii 
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/useperformanceof363kuck 
 
LIST OF TABLES 
 
 Table Page 
 
 I. Summary of Results from Varian and Coffman [135]. 6 
 
 iv 
 
I. Introduction 
 
 The fundamental reason for using memory hierarchies in computer 
 systems is to reduce the system cost. System designers must balance the 
 system cost savings accruing from a memory hierarchy against the system 
 performance degradation sometimes caused by the hierarchy. Since modern 
 computers are being used for a great variety of applications in diverse 
 user environments, the hardware and software systems engineers' task is 
 becoming quite complex. In this paper we shall discuss a number of the 
 hardware and software elements of a memory hierarchy in a computer system. 
 Included are several models and attempts at optimization. 
 
 Computer engineers may choose from a number of optimization 
 criteria in designing a computer system. Examples are system response 
 time, system cost, and central processing unit (CPU) utilization. We 
 shall primarily discuss CPU utilization and then relate this to system 
 cost. Such considerations as interrupt hardware and scheduling algorithms 
 determine response time and are outside the scope of this paper. 
 
 In order to discuss CPU utilization, let us list a number of 
 reasons for non-utilization of the CPU. That is, assuming that a user 
 or system program is being executed by the CPU, what may be the causes 
 of subsequent CPU idleness. 
 
 1) The computation is completed. 
 
 2) A user program generates an internal interrupt due 
 to e.g., an arithmetic fault. 
 
 - 1 - 
 
3) A user program generates an explicit I/O request to 
 
 secondary storage. 
 h) The system generates an internal interrupt due to 
 
 e.g., a page fault. 
 
 5) The system generates a timer interrupt. 
 
 6) The system receives an external interrupt from e.g., 
 a real time device. 
 
 We are using "system" here to mean hardware, firmware, or software. 
 Point l) will be implicitly included in some of our discussions by 
 assuming a distribution of execution times. Point 2) will not be 
 discussed. Point 3) will be discussed in some detail and point h) 
 will be given a thorough discussion. Points 5) and 6) fall under 
 system response time and will note be explicitly discussed. 
 
 If a program (instructions and data) is being executed, let 
 us del'ine a page fault to be the generation by the system of an address 
 outside the machine's primary memory. This leads to the generation by 
 the system of an I/O request to the secondary memory. Now we can des- 
 cribe the CPU idle time for both points 3) and h) above, by 
 
 CPU I/O idle time = 
 number of I/O requests x average time per I/O request 
 
 - 2 - 
 
In this equation, "average tiirj per I/O request" is the interval from 
 when an I/O request occurs until some user program is again started. 
 Notice that we are including both the case of explicit, user initiated 
 I/O requests and the case of implicit system generated page faults which 
 lead to I/O requests to the secondary memory. Much of our discussion 
 will be centered on the minimization of one or the other of the terms on 
 the richt hand side of this equation. 
 
 It should be observed that this equation holds for multiprogrammed 
 as well as monoprogrammed systems. In a monoprogrammed system, the "average 
 time per I/O request" is defined as the interval frcm when an I/O request 
 occurs for some program until that program is again started. We regard 
 the execution of operating system instructions as CPU idle time. In a 
 multiprogramming situation, the average time per I/O request is decreased 
 by allowing several users to interleave their I/O requests and we shall 
 also deal with this case. 
 
 II. Page Fault Rate 
 
 In this section we will deal with the first term on the right 
 hand side of the equation of Section I. In particular, we will restrict 
 our attention to the rate of generation of page fault I/O requests, explicit 
 I requests being ignored. We consider only demand paging where one page 
 at a tir.e is obtained from secondary memory. 
 
 2.1 EFFECT OF PRIMARY MEMORY ALLOTMENT ON PAGE FAULT RATE 
 
 Obviously, the page fault rate will be zero if all of a program's 
 instructions and data are allowed to occupy primary memory. On the other 
 hand, it has been demonstrated that a small memory allotment can lead to 
 
 - 3 - 
 
disastrous paging rates. The relationship between primary memory allot- 
 ment and page faults has been studied by a number of workers [ 12 , 40 , 
 
 Ul, 95, 109, 125, 127, 128, 132] and many experiments have been conducted 
 to determine program paging behavior [ k, 9, 11, 18, 27, 55, 62, 
 
 95 , 108, 111, 133 > 135]' 
 
 One of the statistics which is of interest is the length of the 
 average execution burst. We will define an execution burst <p to be the 
 number of instructions executed by a program between its successive page 
 faults. The average execution burst is measured by allowing a program 
 an initial allotment of pages p (usually 1 page) and then allowing the 
 program to accumulate more pages of memory until it acquires p„ pages. 
 At this point, any new pages required by a program must be swapped for 
 pages already in primary memory. In addition, an upper bound, q, on the 
 i.otal number of instructions executed is sometimes imposed on the program. 
 This q may be thought of as a time slice or as any condition which causes 
 the program to be swapped out of primary memory. 
 
 Fine, et al . [ 55] presents the results of experiments for 
 
 p n = 1, p = «, q < 8 x 10 , and a page size of 1024 words which indicate 
 
 2 
 1 hat almost ^9% °f all execution bursts were less than 20. However, this 
 
 uata includes the results of explicit I/O, and it was assumed that all. of 
 
 a program's pages were swapped out of primary memory when an explicit I/O 
 
 3 
 request was made." This would tend to lower the average cp since cp would 
 
 include the effects of a lot of short execution bursts which occur when 
 
 It 
 
 a program is trying to acquire a sufficient working set of pages. 
 
 Varian and Coffman [135 J also presented this type of data and their 
 results are broken down by program and instructions vs. data for several 
 
 - U - 
 
values of p . These results are summarized in Table 1. They apparently 
 do not include the results of explicit I/O, and are based on a page size 
 cf 102*+ words. These statistics indicate, as did those of Fine, et al ., 
 that execution bursts tend to be quite small. They also indicate that the 
 average execution burst is quite sensitive to p„ as we might expect. 
 
 Another statistic in which we are interested is the mean time 
 required to reference p distinct pages t = f(p). Data cited by Fine, 
 et al indicates the shape of the t = f(p) curve is as shown in Figure 1. 
 Fine's data indicates that a significant number of pages are referenced 
 within a very short tine. For example, in half of the cases measured, 
 tiie first ten pages were required within 500 instructions (median). This 
 has obvious implications for any virtual memory system which does not have 
 sufficient primary memory. 
 
 Shcmer and Sl.ippey [I2y] show that if the arrival of page faults 
 car. be modeled by a Poisson process where \, is the probability of a page 
 fault during At given that i distinct pages have already been referenced, 
 then the average time t to reference p pages is just 
 
 p: 1 ± 
 t \. , p > i . 
 
 p i=i x 
 
 : ir.( T an empirical curve for t = f(p) (Figure l) we can determine 
 
 P, P" 1 
 
 - t - t = \ i/a. - T. i/x. = i/> 
 
 A. 
 
 p i=l ' * i=l p 
 
 - 5 - 
 
P T3 
 
 o 
 
 to >< 
 
 01 
 
 CO P 
 
 to CO 
 P fx< 
 CO 
 « 
 
 OJ 
 
 c co w 
 
 <V P-l +i 
 
 0) H 
 
 p -p co 
 
 o ah 
 
 n c 
 
 3 
 
 o 
 
 •H 
 P 
 O 
 
 P 
 
 w 
 c 
 
 p 
 c 
 
 CD 
 
 cl> e 
 
 M-P 
 CO o 
 
 "3 
 
 a 
 
 cr 
 
 CO 
 P 
 
 CO 
 
 Q 
 
 P 
 to 
 
 c. 
 
 O 
 
 (X, 
 
 COOIIA 
 
 • 00 ir\ 
 
 r-H CO 
 
 H-J- CO 
 
 0> OJ 
 OJ 
 
 H -d- 
 
 OJ 
 
 VO 
 
 P 
 H 
 
 03 
 fa 
 
 2 C 
 O CO 
 
 p 
 
 on t^-co 
 
 CD r-l 
 
 OJ -3" 00 
 
 oDOtn 
 
 ir\_* o 
 
 H ro 
 
 OJ J" (JO 
 
 OJ 
 
 H 
 
 oj 
 C 
 •H 
 
 1 
 
 o 
 
 CO 
 
 p 
 
 CO 
 
 « 
 c 
 
 CO 
 
 p 
 
 co 
 C 
 H 
 
 co H 
 
 OJ j- 
 
 8 
 
 ITS 
 
 m 
 
 H 
 
 OJ 
 
 CO 
 
 c 
 o 
 
 •H 
 
 CO C O P 
 C CD -H CO 
 •H ^ p »-, 
 
 tj cj co ho 
 
 f~. Cm p| oj 
 
 O Cw O* p 
 
 •H W G 
 
 - 6 - 
 
O 
 
 a 
 
 o 
 
 
 u 
 
 a) 
 
 w 
 cd 
 
 u 
 oj 
 
 CO 
 ft 
 
 ft 
 
 CD 
 CJ 
 
 C 
 
 CJ 
 
 u 
 
 0) 
 
 o 
 •p 
 
 0) 
 
 .9 
 
 •p 
 
 e 
 
 cd 
 
 H 
 0) 
 
 •H 
 
 a. 
 
 ■p 
 
 
 •P 
 
 ITS 
 •P 
 
 4* 
 
 CO 
 ■P 
 
 -p 
 
 t 
 
 
 J 
 
 
 
 
 
 V 
 
 
 
 
 
 
 
 
 
 
 - 7 - 
 
Since t p+1 - t p - f(p+l) - f(p) - ^^Pl 4? 
 
 we have 
 
 1\ S ^El ^ . (!) 
 
 Thus, we can determine the X. probabilities by examining empirical t 
 
 curves. 
 
 We will model the t function of a program with the formula 
 
 f(p) = 6P 7 . (2) 
 
 This formula has been applied to the f(p) data presented by Fine, et al 
 and it was determined that 5 = 1.1 and y = 3»^» Using Eqs. (l) and (2) 
 
 vmere Ap = 1, we find 
 
 !A P = ^ = r^ 1 « ^ (3) 
 
 or l/\ = 3.8 p 2,U . 
 
 Given we are in state p (p most recently referenced pages in primary 
 
 armory) the probability of referencing a new. page (page fault) at time t, 
 
 -k t 
 assuming a Poisson distribution is given by p(t|p) = 1-e P . Now, if 
 
 we assume that we force the system to remain in state p by replacing the 
 
 least recently used page with the new page each time a page fault occurs, 
 
 then we might expect the system to continue to behave as before; i.e., 
 
 t..e system will continue to generate faults according to 1-e * . It 
 
 - 8 " 
 
can then be shown that the mean time between page faults in state p 
 is just 
 
 <p.(p) = i/\p = op p (k) 
 
 where we define <p (p) to be the steady state average execution burst given 
 
 00 
 
 p pages in core. 
 
 The average execution burst over time q, <P a (p)> given a program 
 starts with one page and is allowed a maximum of p pages should be 
 derived using distributions of q (see Smith [132] and Freibergs [ 62] 
 for q distributions)! but is beyond the scope of this paper. We shall 
 settle for the approximations 
 
 V p) " ?VJ ' q -*p (5) 
 
 -i ft 
 
 where f (q) is the average number of pages referenced in time q < t . 
 
 In case q > t 
 P 
 
 V p) : F^fey ' q>t P 
 
 (6) 
 
 where p + >. (q - t ) is the total number of page faults generated in 
 
 time a > t . If a » t , then X q » p and we have: 
 p P P 
 
 <p q (p) - iA p q>>t P < T) 
 
 which is Eq. (U) as q •* °°. 
 
 - 9 " 
 
Each time a page fault occjrs, we have to pay an average time 
 T to make space for and make present a page from secondary memory. Thus, 
 we can define the (rnonoprogrammed) CPU efficiency factor as 
 
 <P (p) 
 
 (8) 
 
 where T is measured in average instruction times. Figures 2a and 2b show 
 
 several E vs. (p,t) surfaces for q = 2 x 10 and 5 x 10 J where approxi- 
 
 2.1+ 
 nations (5) and (6) were used to compute <p (p), and l/\ = 3*8 P • 
 
 Looking at Figure 2a we notice that in the region of low p, 
 the only way to get higher efficiency is to use a fast secondary memory. 
 However, secondary memories with T < 1000 average instruction times would 
 correspond to extended core storage. In the region of larger T correspond- 
 ing to drums and fast, head per track disks, the only way to achieve rea- 
 sonable rnonoprogrammed efficiency is by providing sufficient primary memory. 
 
 Figure 2b shows the effect of a smaller time quantum. In this 
 case, efficiency is sensitive to T and insensitive to p over almost the 
 entire surface. This is due to the fact that programs corresponding to 
 a = 3«8 and £ = 2.k seldom reference more than about 12 pages within the 
 quantum time q = 5000. While this surface was computed using a constant 
 q instead of using a statistical distribution of q, it still indicates 
 what can happen to individual program efficiency when programs are swapped 
 out of primary memory for a (non-page fault) 1/0 interrupt or a small system 
 imposed time quantum. The actual degradation will, of course, depend on 
 the characteristics of the program (a,0) as well as the system's ability 
 to mask 1/0 with multiprogramming techniques. 
 
 10 
 
UJ 
 
 J- 
 
 CM 
 II 
 
 ax 
 
 CO 
 CO 
 
 II 
 
 o 
 
 H 
 X 
 
 CVJ 
 II 
 
 o< 
 
 6 
 
 0) 
 O 
 
 «fi 
 
 EH 
 
 ft 
 
 E 
 
 w 
 
 CO 
 
 CVJ 
 0) 
 
 £ 
 
 - 11 - 
 
CVJ 
 
 II 
 ca 
 
 CO 
 
 • 
 00 
 
 II 
 
 a 
 
 H 
 H 
 
 IT\ 
 II 
 
 0< 
 h 
 
 o 
 
 4-4 
 
 0) 
 
 o 
 
 g 
 
 ft 
 
 w 
 > 
 
 W 
 
 C\J 
 
 a> 
 
 - 12 - 
 
In this section we have presented a very simple model of program 
 paging behavior in terms of the average time required to reference p pages 
 
 t - Bp? 
 P 
 
 Then, under the assumption that paging is a Poisson process, we derived 
 the a /era^e execution burst as a function of the number pages in primary 
 memory 
 
 ~ d K ~ s 
 
 ^ (p) ■ -dT = °* 
 
 Using these relations and values X, O. and £ derived from Fine's results, 
 we showed the effect on monoprogrammed efficiency of a gross time charac- 
 teristic T of secondary memory, primary memory allotment, and time quantum 
 q. This was done under the assumption that the page size was 102U words 
 and that a least-recently-used page replacement algorithm was used. In 
 the following sections, we will examine the effects of different page 
 sizes, replacement algorithms, and the use of multiprogramming to mask 
 I/O time. 
 
 2.2 EFFECT OF PAGE SIZE AND PRIMARY MEMORY ALLOTMENT ON PAGE FAULT RATE 
 
 In the previous section we assumed that the page size was fixed 
 at 102U words. As we shall see in this section, the page size, b, will 
 effect the page fault rate \ for two reasons. First, primary memory may 
 be underutilized to some extent due to a) primary memory not being filled 
 with potentially useful words, i.e., fragmentation and b) the presence of 
 
 13 
 
words which are potentially useful but which are not referenced during a 
 period when the page is occupying primary memory, i.e., superfluity. Any 
 underutilization of primary memory tends to increase the page rate since 
 the effective memory allotment is decreased as analyzed in the last section 
 Second, more page faults may be generated when the page size is b than when 
 page size is 2b, simply because we only have to generate one page fault to 
 reference all words in the 2b page whereas to reference the same words we 
 have to generate two faults if the page size is b. 
 
 2.2.1 FRAGMENTATIONS AND PAGE SIZE 
 
 We assume ° that a program consists of a number of segments of 
 size s where s varies according to some statistical distribution with 
 mean s. These segments may contain instructions or data or both. The 
 words of a segment are lo gically contiguous, but need not be stored in a 
 physically contiguous way. Each segment is further divided into a number 
 of pages. The pages consist of b words which are stored in a physically 
 contiguous way. To allow for variable page size, we assume the system 
 imposes a size quantum Q < B on all storage requests such that requests 
 are always rounded up to the next multiple of Q. Page size b may be any 
 multiple of Q, but may not exceed B which is the largest number of neces- 
 sarily physically contiguous words which the system can handle. The ratio 
 B/Q may be thought of as an index of the variability of the page size. All 
 pages of a segment will be of size b = B except the last which will be some 
 multiple n of Q, b = nQ < B. The physical base address of a page may be 
 any multiple of Q; that is, it may be loaded beginning at any address which 
 
 is a multiple of Q. For example, if the maximum segment size s = B = 
 
 21 
 1024 and Q = 1, then we have the case corresponding to the Burroughs B5500. 
 
 - 14- 
 
If Q » B and s » B, then we have the case of more conventional paging 
 systems. 
 
 Thus, we night have several pages allocated in primary memory 
 as shown in Figure 3 where Q ■ B/U. Notice that there are two sources 
 of memory waste evident in Figure 3» First, memory is wasted because 
 every storage request must be rounded up to a multiple of Q as shown by 
 the wavy lines. We refer to this as internal fragmentation . Second, 
 memory is wasted because there are four blocks of Q words which cannot 
 be used to hold a full page because they are not contiguous. This is the 
 classical situation of checkerboarding, which we will refer to as external 
 fragmentation . Notice that as Q ■* 1, internal fragmentation diminishes 
 to zero, while as Q - B, external fragmentation disappears. The exact 
 amount of waste will be dependent on Q, B, and the distribution of segment 
 sizes. 
 
 Randell [113] has studied the effects on memory utilization 
 of variations in these parameters. His results indicate that: l) loss 
 of utilization due to external fragmentation when Q « B is not as great 
 as loss due to internal fragmentation when Q = B; and 2) utilization does 
 net change significantly with changes in the mean segment size if Q « B, 
 but it does change significantly with s if Q = B. It is also apparent 
 that if s » B, then Q makes little difference. 
 
 Tr.e conclusion from this is that if a program is to be segmented 
 where s = B, then small Q is definitely desirable. If the page size must 
 b<i fixed (Q = B),then it should be small. However, the increase in memory 
 u1 ilization due to smaller B must be offset by the possible cost increase 
 for the required paging hardware. For instance, if we can double memory 
 utilization by decreasing the block size by some factor, then we can 
 
 15 
 
Q 
 
 I I I I I J I l I l l , m l I I I i 
 
 / \ J \ / \ f 
 
 b,»B b 2 b 3 b 4 -B ■ 
 
 winfr internal fragmentation 
 
 external fragmentation 
 
 data or instructions 
 
 
 Figure 3. Memory fragmentation with four pages of 
 size b x o UQ, t> 2 = 1.5Q, b - 3.2Q and 
 
 b. » Uq. b = Uq. 
 
 - 16 - 
 
afford to spend 1/2 the total cost of primary memory on the increased 
 paging hardware. 
 
 Unfortunately, a small B or Q is not the entire answer. While 
 small B or Q increases memory utilization and thus reduces the page rate 
 for a given memory allotment, small B or Q may also result in u corres- 
 ponding increase in page rate for reasons we will discuss in 2.2.3* 
 
 2.2.2 SUPERFLUITY VS. PAGE SIZE 
 
 Another factor which leads to an effective underutilization 
 of prir.ary memory arises from instruction or data words which are loaded 
 into primary memory as part of a page but are never referenced during 
 that period of residency. We will refer to these words as superfluous . 
 
 We can obtain slower bound on the number of superfluous words 
 
 by examining the total primary memory requirements M of a program as a 
 
 12 
 function of page size . That is, assume primary memory is unlimited, 
 
 then M(B) is the total amount of primary memory occupied after a given 
 
 execution of the program with page size B. Now, given unlimited primary 
 
 memory, if the program is run with page sizes b = B and b = 1, then at 
 
 least MCE) - H(l) words must be superfluous. If we force the program to 
 
 run with primary memory m < M(B), then page faulting will occur and the 
 
 number of superfluous words may increase over M(B) - M(l) since some words 
 
 which are eventually referenced are not referenced during same period of 
 
 their page residency and are thus superfluous during that period. 
 
 O'Neill [108 ] 13 and Belady [ ll] 1 present M(B) statistics which 
 
 are remarkably linear over the ranges 256 < B < 2048 and 128 < B < 1024, 
 
 respectively. Even for larger page sizes M(B) is reasonably linear, but 
 
 for small B, M(B) drops off sharply. Thus, we can assume 
 
 " 17 
 
M(B) = a Q + a^ 256 < B < 1024 (9) 
 
 and a.B is a lower bound on the number of superfluous words. 
 
 Unfortunately Eq. (9) only establishes a lower bound on the 
 number of superfluous words. It does not tell us anything about the aver- 
 age number of superfluous words present when primary memory is less than 
 that absolutely required by the program. The authors know of no published 
 data which pertain directly to superfluous words in this case, so we shall 
 move on to determine the overall effect of block size on the paging rate \. 
 
 2.2.3 PRIMARY MEMORY ALLOTMENT AMD PAGE SIZE 
 
 In Section 2.1 we discussed the average execution burst cp(p) as 
 a function of memory allotment in units of p, the number of b=B=102U word 
 pages. In this section we will examine the paging rate X = l/cp as a func- 
 tion of primary memory allotment in words m = pB, for various values of 
 page size b=B. 
 
 We would expect that for small m, \ will vary considerably with 
 the page size. This is because for small m, the average time each page 
 is in primary memory will be relatively short, and so the extra words in 
 larger pages will tend to go unreferenced and will only take up space 
 which might better be occupied by new, smaller pages. On the other hand, 
 as m increases, we would expect to see page size have less effect since 
 the probability will be higher that more words in the page will be refer- 
 enced due to the longer expected page residence time. In addition, we 
 might also expect to see, for a given m, a B , such that any B.. > B 
 will only include superfluous words and any B p < B will not include 
 enough words. 
 
 - 18 - 
 
 
Figure 4a is a graph oJ X vs. B and m based on experimental 
 data from a FORTRAN compiler [ h ] . This graph clearly exhibits that 
 when a program is "compressed," i.e., run in a smaller memory, large page 
 sizes lead to excessive paging. When the page size is small, then the 
 program tends to be more compressible. As m gets larger, the paging 
 behavior becomes less a function of B, and for large enough m, small B 
 
 - even increase the page rate. Slight minimum points were observed at 
 the (rn,B) points (2K,64), (MC,256), (8k,256). This illustrates that if 
 talninw exist, then they are not necessarily independent of m. 
 
 Figure 4b is another graph of \ vs. m and B data for a SNOBOL 
 compiler [135l« This program is evidently much less "compressible" than 
 the FORTRAN compiler in Figure 4a. However, it shows the same general 
 tendencies as Ficurc 4a except for the apparent lack of minima. 
 
 Another way to view the X vs. (m,B) relationship can be seen 
 observing in Figure 4b the dashed lines which pass through points of 
 eaual \, Notice that ,\(6K, 256) is only slightly lower than \(4k, 64). 
 Thus, we can affect an almost equal tradeoff between half as much primary 
 memory and 1/4 the page size; i.e., we double the number of pages but each 
 
 page is only 1/4 as large. However, we must also consider the increase 
 
 17 
 paging Hardware necessary to handle the larger number of pages. 
 
 Tr.e r.ain point to be had from these figures is that programs are 
 more cor.pressible when B is small; i.e., they will tolerate a much smaller 
 primary memory allotment if B is small. However, too small a B may lead 
 to a slight increase in paging activity. (See also a study performed on 
 the ATIAS system by Baylis, et al . [9 ].) 
 
 The above results further support arguments for variable page 
 sizes allowing logically dependent words (e.g., subroutines or array rows) 
 
 - 19- 
 
J- 
 
 on 
 
 CVJ 
 
 H 
 
 -d- 
 
 o 
 
 o 
 
 o 
 
 O 
 
 OJ 
 
 
 
 • 
 
 ' 
 
 o 
 
 H 
 
 m 
 
 S£ 
 
 - 20 - 
 
<< 
 
 ,3 
 
 - 21 - 
 
to be grouped in a page without leading to underutilization of memory 
 due to internal fragmentation or superfluity. Logical segmentation of 
 code and data will be taken up more generally in later sections. 
 
 2.3 REPLACEMENT ALGORITHMS 
 
 Whenever it is necessary to pull a new page, i.e., transfer 
 a new page from secondary to primary memory, it is also necessary to 
 select a replacement page in primary memory to be pushed (transferred 
 to secondary memory) or overlayed. If we assume that all programs are 
 in the form of pure procedures, then we never need to push program pages. 
 Data pages need to be pushed only if we have written into them. The 
 selection of a replacement page is done by a replacement algorithm. A 
 number of these algorithms have been proposed and evaluated [ 9, H> 
 12, 17, lo, 27, kO, kl, 86, 116, 125, 135] where Belady [ 11 ] has 
 produced the most extensive summary and evaluation to date. The various 
 algorithms can be classified according to the type of data which is used 
 by the replacement algorithm in choosing the replacement page. 
 
 Type l) The first type of information pertains to the length 
 of time each page has been in primary memory. The page (or 
 class of pages) which has been in memory the longest is 
 pushed or overlayed first. This information forms the basis 
 of what are usually referred to as FIFO algorithms. This is 
 the simplest type of information to maintain and it usually 
 requires no special hardware to implement. 
 
 Type 2) Type 2 information is similar to Type 1 information 
 but "age" is measured by the time since the last reference 
 to a page rather than how long the page has been in primary 
 
 - 22 - 
 
memory. This information is the basis of the so-called 
 least -recently-used replacement algorithms. Many variations 
 exist, e.g., based on the fineness of age measurement. Systems 
 which accumulate this type of information usually employ 
 some type of special hardware to record page use statistics. 
 Type 3) Information as to whether or not the contents of 
 a page have been changed is frequently used to bias the 
 selection towards pages which have not been changed and 
 thus do not have to be pushed (but simply overlayed) since 
 an exact copy is still available in secondary memory. 
 Special hardware is needed to record the read-only/write 
 status of each page in primary memory. 
 
 Type U) In the ATLAS system [ 9 , 86 ] the length of the 
 last period of inactivity is recorded for all pages in a 
 program. This information is used to predict how long the 
 current period of inactivity will be, i.e., how soon a page 
 will be referenced again. Replacement is biased towards 
 pages which, on the basis of this information, are expected 
 to be inactive for the longest time. This type of information 
 is particularly useful for detecting program loops as was 
 intended by the ATLAS designers. 
 Belady [ 11 ] has evaluated the performance in terms of page fault rate of 
 a nu".ber of algoritnr.s as functions of page size and primary memory allot- 
 nent, anu we will now discuss his results. 
 
 The simplest algorithm studied was the RAM)OM algorithm. This 
 uses no information about pages, but chooses a replacement page randomly 
 from those in primary memory. The use of Type 1 information (time in 
 
 - 23 - 
 
primary memory) never significantly improves performance relative to 
 RANDOM and in some cases performance is worse than RANDOM. 
 
 The use of Type 2 information (time since last read or write) 
 leads to the most significant and consistent improvement in performance. 
 With these algorithms the accuracy with which "age" is measured does not 
 seem to have much effect on performance, however. That is, performance 
 does not change significantly whether we keep a complete time history of 
 pages in primary memory, or just divide all pages into two classes— 
 recently used and not-so-recently used. The use of Type 3 information 
 (read only /write status) in addition to Type 2 information does not affect 
 the total number of page faults very much. However, it does increase per- 
 formance due to the fact that no push is required on 10 to 6of of all page 
 faults. 
 
 The ATLAS algorithm [86] which used both Type 2 and h informa- 
 tion is the most complex algorithm studied, and it is interesting to note 
 that it consistently leads to worse results than Type 2 algorithms and is 
 sometimes worse than RANDOM or FIFO. This result has been further sub- 
 stantiated by Baylis, etal. [ 9 ]• Apparently, the problem is that most 
 programs do not have a regular or small enough loop structure to warrant 
 the use of the ATLAS algorithm which is intended to take advantage of 
 program loops. 
 
 Thus, algorithms which make replacements are the basis of least 
 recently referenced pages and bias towards read-only pages would seem to 
 be best in terms of cost effectiveness. However, for existing systems 
 which do not have the hardware necessary to automatically maintain Type 2 
 and/or Type 3 information, RANDOM, FIFO or programmer directed schemes 
 must be used. 
 
 2k - 
 
2.1* PROGRAM ORGANIZATION 
 
 Coroeau [ 30] has shown that simply by reordering the assembler 
 deck of the Cambridge Monitor System to cause logically dependent routines 
 to be grouped together, paging of the monitor was reduced by as much as 
 bO%. Brawn and Gustavson [ 18] and McKellar and Coffman [103] have 
 shown that simple changes in computation algorithms, such as referencing 
 matrices by square partition instead of row or column, can also affect 
 large improvements in paging activity. (See also [ 36, 37 > 51> 7330 
 These studies indicate that: 
 
 1) Programmers need to be aware of the paged and/or segmented 
 environment in which their programs will be executed. 
 Program optimization by reducing page faults is more 
 important than classical optimization techniques (e.g., 
 common subexpression elimination). 
 
 2) Prorp-amners should be able to direct or advise the compiler 
 as to which code should be placed in which page/segment. 
 
 3) If possible, subroutine or procedure code should be placed 
 in the code segment where it is called. If this code is 
 smell and is used in several different segments, then 
 several copies of the subroutine could be generated, one 
 in each segment where it is called. 
 
 h) More emphasis should be placed on compiler optimization 
 of code through strategic segmentation. For example, by 
 analyzing the structure of a program (see Martin and Estrin 
 [99]) the compiler could make better segmentation decisions 
 and provide information which the operating system could 
 use to make replacement decisions, and to perform prepaging. 
 
 - 25 - 
 
In addition, compilers might be able to detect certain 
 cases of poor data referencing patterns and issue appro- 
 priate warnings to the programmer. 
 Thus, we can improve paging behavior both by changing the physi- 
 cal parameters of the system and by intelligent program organization. The 
 latter method would appear to have a higher cost effectiveness and should 
 not be overlookeu. 
 
 2.5 SUMMARY 
 
 As we have noted, CPU efficiency can be related to the page 
 fault rate and the average time T to satisfy these I/O requests. In 
 Section II we have tried to illustrate the relationships between page 
 fault rate and primary memory size, primary memory allotment, page size, 
 replacement algorithm, program organization, and secondary memory 
 characteristics. Our intent has only been to indicate trends and general 
 relationships, and with this in mind our models have not been very elaborate. 
 However, all our models have been based on observed program behavior and 
 are probably accurate, at least for the classes of programs studied. 
 
 III. Multiprogramming 
 
 Multiprogramming arises for two reasons: 
 
 1) In an attempt to overlap I/O time by having one program 
 be executed while other procrams are waiting for I/O 
 (implicit or explicit). 
 
 2) In order to provide quick response to several real 
 time .iobs (time sharing, process controls, etc.) 
 
 will concern ourselves only with the first of these functions. 
 
 - 26 - 
 
Whenever several concurrent programs share memory 
 in order to "mask" I/O time each program operates with less primary 
 memory than it would have if it were running alone. As we have seen, 
 this causes the paging rate for each program to increase. On the other 
 hand, by multiprogramming we are able to decrease the average time per 
 I/O request (both paging and explicit). Several questions now arise: 
 First, when does the degradation of efficiency due to increased page 
 traffic become greater than the increase in efficiency due to more I/O 
 masking. Second, how much of an improvement can we expect with multi- 
 programming over monoprogramming. 
 
 Gaver [ 65] has presented an analysis of multiprogramming based 
 on a probability model which relates CPU efficiency to the number of con- 
 current jobs J. where each job runs for an average of l/h instructions 
 (hyperexponentially distributed) before generating an I/O interrupt, and 
 I/O requires an average of T instruction times to complete (exponentially 
 distributed). Unfortunately, Gaver does not consider the fact that as 
 J increases, each job must be executed with less primary memory and thus 
 paging I/O increases. However, this is fairly easy to add to his model, 
 
 using the results of Section 2.1. 
 
 20 
 
 Suppose the total available primary memory is M pages and all 
 
 programs are identical and are allocated equal amounts of this memory. 
 
 21 
 
 Then the memory allotment for each program is just M/j. ' The paging rate 
 
 >. for each program as a function of J is then 
 
 >-< J > ■ mu ■ (10) 
 
 - 27 - 
 
where cp(p) was defined in Section 2.1. We will assume this is exponent- 
 ially distributed. As in Section 2.1 we will use the function a,^A/jy 
 to model <p(M/j). Thus 
 
 X(J) = — ^r (11) 
 
 We also assume that explicit (non-page) I/O interrupts are generated at 
 a rate r, so the total I/O interrupt rate for each program is 
 
 X(J) = r + ±— (12) 
 
 Using Eq. (12) and Gaver's equations, we have computed CPU 
 efficiency as a function of the number of identical jobs J and average 
 I/O completion time T for several values of r and M and for a = 3»8 and 
 (3 = 2.U (see Section 2.1, Eq. 3)« The results of these computations are 
 plotted in Fif^ires 5a, 5b, and 5c 
 
 Figure 5a shows CPU efficiency for M = 6k and l/r = 10,000 
 (e.g., primary memory is 6hK words and each program generates an explicit 
 I/O interrupt every 10,000 instructions on the average). Notice that effic- 
 iency increases with J up to an optimal value at J . due to interleaving 
 1/0 time. Thereafter, efficiency decreases because of increased paging. 
 Fir. re 5b corresponds to K = 6h and l/r = 5000. Efficiency is less than 
 that in Figure 5a due to the increased explicit 1/0, and the gain in 
 efficiency for monoprogrammed vs. multiprogrammed, E(j . ) - E(l), is 
 more pronounced. Figure 5c corresponds to M = 32 and l/r = 10,000. 
 Efficiency is again smaller than in Figure 5a; this time due to increased 
 
 - 28 - 
 
UJ 
 
 0) 
 
 - 29 - 
 
4h 
 
 - 30 - 
 
bJ 
 
 
 h> 
 
 
 
 
 
 
 
 
 
 
 08 
 
 
 
 
 
 
 B 
 
 
 
 < 
 
 P 
 
 
 • 
 
 
 
 fc< 
 
 
 
 'O 
 
 u 
 
 
 c 
 
 T1 
 
 
 2 
 
 
 
 •"3 
 
 
 
 <n 
 
 
 
 O 
 
 • + H 
 
 
 C 
 
 O 
 
 
 O 
 
 •H 
 
 
 •H 
 
 H 
 
 
 t5 S* 
 
 
 1 
 
 CD 
 
 at 
 i 
 
 <U 
 
 T) 
 
 Q< 
 
 
 C 
 
 
 
 
 
 <P 
 
 •A 
 
 • 
 
 i> 
 
 (4* 
 
 JC 
 
 
 • 
 
 +J 
 
 ^ 
 
 
 G 
 
 •"3 
 
 
 <u 
 
 
 
 •H 
 
 CO 
 
 w 
 
 CJ 
 
 ro 
 
 c 
 
 •H 
 
 
 o 
 
 e-t 
 
 
 
 <♦- 
 
 
 
 0) 
 
 
 
 i 
 
 
 
 
 
 
 
 
 
 V 
 
 
 
 C) 
 
 ai 
 
 
 Lf\ 
 
 a 
 
 
 4) 
 
 k 
 
 
 g. 
 
 
 
 •■-I 
 
 PL, 
 
 cc 
 
 
 - 31 - 
 
paging induced by smaller M. Notice that when T > 6000, there is no gain 
 to be had from multiprogramming. This does not mean that multiprogramming 
 with this system configuration is bad. It merely illustrates that for this 
 system it is not wise to multiprogram programs characterized by a = 3»8, 
 B = 2.k and l/r = 10,000. (if l/r = 5000, then running 2 jobs is advan- 
 tageous; see Figure 6). 
 
 This introduces the scheduling problem. That is, which jobs 
 should be run concurrently? A good scheduler whose purpose is to maximize 
 throughput should be able to use information about programs' working sets 
 or a,e characteristics to determine an optimal load. We will not pursue 
 this subject further here (see Denning [ ho, Ul ] and Heller [ TO]). 
 
 Figure a shows the relative gain in efficiency over monoprogram- 
 ming due to multiprogramming with an optimal number of jobs 
 
 E(J^) - E(l) 
 
 G = °F* (13) 
 
 as a function of T for several combinations of r and M (in all cases, 
 a = 3'b) B = 2.U). This figure illustrates that for multiprogramming to 
 yield a reasonable gain, there must be sufficient primary memory (note 
 the I: = 22 curves). 
 
 Literature on multiprogramming and tine -sharing is extensive 
 I we will not attempt to present a comprehensive bibliography here, 
 (instead, see Buchholz [20], Calingaert [22], McKinney [104] Trimble 
 [13*0 and Bell and Pirtle [ lU]. Some useful studies can be found in 
 [ 12, U9, 52, 56, 65, 107, 130, 131, 132, 136]. 
 
 - 32 - 
 
I- 
 
 o 
 
 w 
 
 cd 
 
 > 
 
 B 
 
 
 • M 
 
 w 
 
 -P 
 
 £> 
 
 ft 
 
 O 
 
 O 
 
 •f-i 
 
 
 
 a) 
 
 Cm 
 
 -p 
 
 O 
 
 ert 
 
 
 o 
 
 m 
 
 •• i 
 
 Q) 
 
 T) 
 
 Jg 
 
 C 
 
 b 
 
 w 
 
 
 <L> 
 
 
 c 
 
 E 
 
 M 
 
 .-i 
 
 ■L) 
 
 4J 
 
 
 ft 
 
 c 
 
 O 
 
 o 
 
 M 
 
 w 
 
 
 
 $-, 
 
 Cvh 
 
 0) 
 
 
 X) 
 
 bO 
 
 E 
 
 £ 
 
 o 
 
 
 
 ■H 
 
 
 M 
 
 
 bD-^ 
 
 o 
 
 • 
 
 ^ 
 
 
 ft 
 
 
 o 
 
 Ck 
 
 c 
 
 
 
 
 
 
 E 
 
 cc 
 
 u 
 
 i ■ 
 
 aj 
 
 
 > 
 
 ii 
 
 o 
 
 
 
 >. 
 
 
 o" 
 
 • 
 
 c 
 
 «*— >. 
 
 o; 
 
 'G 
 
 ■H 
 
 ai 
 
 rj 
 
 M 
 
 • H 
 
 •H 
 
 Cm 
 
 rH 
 
 <+-' 
 
 CtJ 
 
 a; 
 
 E 
 
 c 
 
 o 
 
 ■H 
 
 c 
 
 
 N«_^' 
 
 o 
 
 
 
 a; 
 
 
 E 
 
 c 
 
 
 •H 
 
 -*•-' 
 
 crj 
 
 
 bfl 
 
 t- 
 
 
 o 
 
 0) 
 
 
 > 
 
 -p 
 
 •rH 
 
 a; 
 
 -p 
 
 r i 
 
 a) 
 
 p< ,; 
 
 H 
 
 e » 
 
 CJ 
 
 o -5 
 
 PC 
 
 o o 
 
 
 •1-3 
 
 
 O r 
 
 • 
 
 M *"< 
 
 >o 
 
 C 
 
 
 U . 
 
 a> 
 
 M rM 
 
 M 
 
 S o> 
 
 a 
 
 H 
 
 •H 
 
 > s 
 
 fr< 
 
 crj C 
 
 - 33 - 
 
IV. Average Time Per I/O Request 
 
 In Section II we introduced T as the average interval between 
 the time when a program is forced to stop (due to a lack of instructions 
 or data in primary memory) and the time when the program could resume. 
 In 2.1 and III, we showed that CPU efficiency is highly correlated with 
 the magnitude of T (see Figures 2 and 5)« In the following sections we 
 will examine T in more detail. Specifically, we will discuss techniques 
 whereby T can be reduced. 
 
 Secondary storage devices range from extended core storage to 
 magnetic tape, but the most common device in use today is the disk file. 
 The time required for these devices to deliver a block of b words can be 
 generally characterized by 
 
 T = t + t + b/p (Ik) 
 
 q a 
 
 where t is queueing time before the disk logic recognizes an I/O request; 
 t is the sum of head positioning latency and rotational latency, and p 
 
 3 
 
 is the transmission rate between primary and secondary memory. Four ways 
 in which we can decrease the average T are : 
 
 1) Decrease t by making the disk spin faster using more 
 
 a 
 
 heads per surface or by using extended core storage. 
 
 2) Making the disk spin faster or using higher bit densi- 
 ties increases p. We might also increase p directly 
 by reading more heads simultaneously. 
 
 3) Use parallel queueing techniques so that the average 
 T over n requests is less than T. 
 
 *) Change the distribution of t by planning the layout 
 of data on the disk in such a way that the data is 
 
 - 3 U - 
 
almost under the read heads when it is needed (this 
 technique is only practical in systems doing large 
 calculations where a dedicated disk is available). 
 Alternately, we can prefetch data blocks (buffering). 
 We will now discuss some of these techniques. 
 
 -.1 PHYSICAL LATENCY OF SECONDARY MEMORY 
 
 Consider a disk system with one movable head per surface and 
 with all heads fixed to the same head positioner assembly. Now t , the 
 access time for this device, is the sum of two statistically distributed 
 
 * i~.es: t , the time to position a head, and t., the time required for 
 
 22 
 the desired sector to come under the heads: 
 
 t - t + t f . (15) 
 
 a p I v ' 
 
 Or.e way to make this disk faster is to add more heads to each arm so that 
 
 the arm does not have to move so far to position a head over the right 
 
 track. This tends to decrease t . 
 
 P 
 
 Another way to decrease t would be to have independent posi- 
 
 Jr 
 
 tioners for each surface. Fife and Smith [ 5*+ ] have presented a good 
 analysis of this technique. Several manufacturers have eliminated t 
 altogether by providing one fixed head per track. To provide further 
 speedup we could introduce multiple heads per track (a matter which pre- 
 sents technological difficulties) or use a drum which typically rotates 
 faster than a disk but does not have as large a capacity. Both of these 
 latter techniques reduce t. in Eq. (15). (See also [133]*) 
 
 - 35 - 
 
Any further improvement in the physical response of secondary 
 memory probably must come from the use of extended core storage (ECS). 
 This is potentially quite expensive (the cost per word being typically 
 more than one-tenth that of primary memory) but is considerably faster 
 as latency is on the order of ten microseconds as opposed to tens of 
 milliseconds for disks and drums. This could double CPU efficiency 
 (see Figures 2 and 5) but must be evaluated on the basis of cost effec- 
 tiveness. Several studies of the use of ECS can be found in [ 7> 63, 
 &S 79, 83, 101]. 
 
 k.2 EFFECTIVE LATENCY OF SECONDARY MEMORY 
 
 Several techniques can be used to decrease the effective latency 
 of a disk device without changing its physical characteristics. For 
 instance, if several requests for blocks from the disk are waiting for 
 service, then we can decrease the average latency over all requests by 
 servicing requests in the order in which the required blocks come under 
 the heads. Another possibility which can be used in certain special cases 
 is to coordinate the layout of blocks on the disk with the timing of the 
 program so that blocks will be almost under the heads when they are needed. 
 
 U.3 REQUEST QUEUEING 23 
 
 We will assume that at any given time, there are n requests for 
 service from secondary memory (these requests having been generated by 
 the several programs being multiprogrammed). We also assume that the 
 secondary memory is a rotating device divided into M tracks, each track 
 being further divided into N sectors. Each request is for access to a 
 particular track and sector. The rotation time of the device is T,.. 
 
 - 36 - 
 
Each request waiting for service will experience a delay T, 
 
 sum of t (time in queue), t (access time), and t (transmission time, 
 Go r 
 
 assured constant). 
 
 The simplest way to service these requests is to establish a 
 single queue which is serviced on a first in, first out (FIFO) basis. 
 A better strategy is to service requests according to which request can 
 be serviced next (FSFO), i.e., the request whose required track and sector 
 is due under the heads next is serviced first. Denning [ 39] shows that 
 for a fixed head per track device the ratio of delay time under FIFO to 
 delay tine under FSFO is 
 
 (FIFO) n(N + 2) (l6) 
 
 (FSFO) N+2(n+l) 
 
 For Y = 3*- sectors and n = 10 requests then the relative improvement by 
 Eo. (l6) is 7.66. That is, the response of a fixed head device with 6k 
 
 sectors and 10 waiting requests is J. 66 times better under FSFO than under 
 
 Zk 
 FIFO. An analysis of movable head devices shows that improvement can 
 
 also be affected by similar scheduling algorithms, but the improvement is 
 
 not as dramatic. 
 
 U.U MINIMIZATION OF EXPLICIT I/O REQUEST TIME 
 
 A number of large scale calculations require space for their 
 data and instructions which exceeds the available primary memory. These 
 calculations involve operations on very large arrays and may require 
 several tens of hours per production run on the fastest computers. In 
 such cases there is no point to interval time slicing of the computation 
 for user interaction, although system throughput can be enhanced by multi- 
 programming, as discussed in Section III. If we restrict our attention 
 
 - 37 - 
 
only to these kinds of large jobs, then one limiting case is a large 
 machine with one large job at a time, i.e., batch processing. We will 
 now turn to a discussion of preplanning the layout of a secondary storage 
 device in such a way that explicit I/O request time is minimized. The 
 interleaving of several jobs will not be discussed except to remark that 
 in such cases the execution time requirements become less stringent for 
 each job, but the sequencing of the interleaved steps presents new 
 difficulties. 
 
 Historically, there are many examples of preplanned drum layout. 
 When drums were used as primary memory, optimizing assemblers would locate 
 the sequence of instructions at appropriate intervals around the drum so 
 that (in jump free segments of code) the next instruction would always be 
 available when the previous one was finished [ll8]» For current machines 
 in monoprogramming mode, it is reasonable to assume that enough code resides 
 in primary memory at any time so that the time required to perform instruc- 
 tion overlays is negligible. However, data overlays may be extensive and 
 we might be able to decrease the latency involved in obtaining data blocks 
 from secondary memory by planning the layout of these data blocks and pre- 
 fetching data. 
 
 The question of overlaying data must be considered with respect 
 to the average amount of processing which may be performed on each data 
 element. Many matrix calculations (e.g., multiplication, inversion, eigen- 
 value calculation) require a N operations where a < 1 and N is the dimen- 
 sion of the matrix. Also, it can be empirically observed that a number of 
 partial differential equation solution techniques on N x N meshes require 
 a I operations per iteration, where cc is generally smaller than in the 
 matrix case but usually greater than 0.1. In the partial differential 
 
 - 38 - 
 
equation case it is sometimes possible to iterate several times on a 
 block in memory, thus increasing a. If we assume a fr operations on N 
 data elements, then each element requires a N operations, where an opera- 
 tion may be regarded as, e.g., a multiply, an add, and a memory fetch or 
 
 say, one microsecond on a current machine. Let us assume a machine with 
 
 2 
 N words of memory available for each block transmitted from the disk. 
 
 This allows a flr microseconds of computation per block. If a = .5 and 
 '.: ■ 6U, then we compute for about 125 milliseconds per block. This is 
 more time than is required for the rotation of any current large disk, 
 which is usually in the range of UO to 60 milliseconds. Thus, if we 
 can always keep one input request ahead in a disk queuer, it should be 
 possible to completely mask the I/O request time. 
 
 As the ratio of processor speed to disk rotation speed gets 
 larger, this problem becomes more difficult. Suppose we have a calcula- 
 tion with the same parameters as above, but we wish to use a processor 
 which is ten times faster. Then we have only 12 milliseconds of computa- 
 tion time per block and this is faster than the rotation time of any large 
 iisk. T:.ere are several obvious ways to avoid this problem. One is to 
 increase M; this may require a larger primary memory. Another is to sup- 
 ply the disk queuer with several requests, thereby decreasing the expected 
 time until some request is honored [ 39]. In some cases there are uniform 
 but intricate relationships between the data blocks and their processing 
 sequerce. To handle these cases, we can attempt a third solution, namely 
 the preplanning of block layout on the disk. 
 
 Consider the problem of 
 matrix multiplication using a head per track disk. Suppose that both 
 operand matrices are partitioned into square blocks, that the prernultiplier 
 
 - 39 - 
 
is stored by rows of partitions, and that the postmultiplier is stored 
 by columns of partitions. Let us also assume that the angle on the disk 
 between the positions of successive partitions represents the disk motion 
 time equal to the processor time required to multiply two square partitions. 
 Now if it happens that one row (ana column) of partitions ends just where 
 the next starts, then it is clear that such a disk storage scheme allows 
 matrix multiplication with no CPU time lost due to waiting for data from 
 the disk. It is also clear that if a sequence of matrix operations are 
 required, then the preplanning of the disk layout becomes more complex. 
 Ir. general, some I/O wait time will be required of the CPU. However, in 
 order to use any matrix as a premultiplier or postmultiplier, it is possi- 
 ble to store all matrices in such a way that they may be fetched by row 
 partitions or column partitions. This is achieved by storing the second 
 partition of the first row, say A p , in the same relative position on the 
 disk as the first partition in the second row, say A p ,. This skewing 
 pattern may be continued in the obvious way, given a sufficient number of 
 disk surfaces. Matrix inversion and eigenvalue calculations require much 
 more intricate disk storage schemes, but the problems are similar [ 91]* 
 A somewhat more difficult set of constraints is encountered in 
 some problems, e.g., explicit partial differential equation methods. In 
 these cases it is necessary to sweep through an array of data repeatedly. 
 When any partition of the array is being processed, it is necessary also 
 to have some data elements from neighboring partitions. For example, if 
 a five point finite difference operator is being applied to M element 
 partitions of an array, then vH border elements are required from each 
 of the four adjacent partitions. It should be possible to pack these 
 
 - 1+0 - 
 
border elements in separate arrays, then write and read them on and off 
 the disk at appropriate times. Assume the calculation on an M element 
 partition requires time T . Next assume it is possible to map partitions 
 
 of the array onto the disk such that the one-way transmission time for a 
 
 T -€ 
 
 c 
 
 partition is — = — . Now we can read a new block and write an old block 
 
 T -€ 
 in 2(— — ) = T -e. If the edge values of the neighboring blocks can be 
 
 transmitted in and out to the disk in e time units, then the scheme main- 
 tains a steady state balance between computation time and I/O transmission 
 time. 
 
 A somewhat weakened set of conditions are imposed in Bernott [15] 
 
 "re it is assumed that T is not less than five times the one-way trans- 
 mission time for a bloc';. Various depths of finite difference operators 
 and any rectangular mesh are allowed. Also, the number of variables being 
 computed is a parameter. In terms of several latency considerations and 
 the above mentioned parameters, a disk layout is computed which gives a 
 resulting computation scheme that has an overall expected CPU efficiency 
 
 ater than &0$. 
 
 V. Summary and Extensions 
 
 As computer systems become more complex and as user's require- 
 -_s become more specialized, the computer system designer must give 
 r.ore attention to overall system cost performance when he designs each 
 part of the system. In other words, he must study more and more trade- 
 offs between various parts of the system. 
 
 In this paper we have discussed some interrelations between 
 -tern parameters including: primary memory size, page size, secondary 
 
 - Ul - 
 
memory speed, I/O request queuers, and the number of jobs multiprogrammed* 
 These together with user program parameters including: mean time to access 
 p pages, number of instructions executed per datum and regularity of addres- 
 sing a data structure have a major influence on the CPU efficiency. 
 
 V/e limited our discussion to two -level memory hierarchies, but 
 the techniques mentioned can be applied to more levels by lumping several 
 levels and reducing the problem to one of two levels. This requires approx- 
 imating the parameters of a lumped level using the parameters of the levels 
 being combined. The use of a two-level primary memory is quite successful 
 in the IBM 360/85 [ 66}* It is also common to use a fast drum between 
 primary memory and a slow disk [3*+ ]• Machines which operate on arrays 
 of data and are organized as arrays of arithmetic processes are now being 
 designed. For example, the pipeline processors [12U] (which might be called 
 serial array processors) and ILLIAC IV [10 ] (which might be called a paral- 
 lel array processor) have many individual memory units, and this fact makes 
 it necessary to carefully plan the layout of data in primary memory for 
 maximum CPU utilization. The kinds of storage planning discussed below 
 might be regarded either as minimizing the number of data faults or the 
 time per data fault because the question is that of supplying data to the 
 processor from the primary memory at a maximum rate. 
 
 Serial array processors generally require a memory whose effec- 
 tive cycle time is equal to the CPU clock time. This is achieved by inter- 
 leaving many slower memory units in a large bank. Since, in general two 
 vectors are entering the processor and one is emerging, it is convenient 
 if at least three such banks are available. Clearly, serious memory con- 
 flicts can arise in this situation. If two argument vectors are stored 
 in the same bank, the processing speed may be cut in half. 
 
 - 42 - 
 
Since present serial array processors reach a speed limit due 
 to the fact that the pipeline length can be made no longer than the number 
 of elementary steps in an arithmetic operation, parallel array processors 
 see™ to be a logical necessity for more speed improvement. The memory 
 system of IT..LIAC IV consists of one memory unit per processor. Each mem- 
 ory unit is directly accessible by just one processor. A network of rout- 
 ing logic may be used to get data to other processors. If one -dimensional 
 arrays are stored with one element per processor, then the full speedup 
 over a single processor may be achieved. In two-dimensional arrays, row 
 operations are easy to perform with a straightforward mapping of an array 
 into the memory, e.g., rows are stored across the processors and each 
 column is within a processor. Similarly, column operations are easy with 
 a transposed array. However, if both row and column operations are required 
 with such a storage scheme using an n processor machine, then operations 
 in one direction will realize an n-fold speedup but operations in the other 
 direction will realize no speedup at all over a one processor machine. If 
 row and column operations are required, some kind of skewing scheme as out- 
 lined in Section IV will provide the full speedup [90]. It may be expected 
 that in the future, parallel arrays of pipeline processors will require even 
 rr.ore intricate primary storage mapping schemes. 
 
 It should be remembered that we have been discussing just one 
 underlying subject throughout this paper: the ratio of cost to performance 
 for en overall computer system. We have attempted to relate several memory 
 parameters and program characteristics to the system performance as meas- 
 ured by CPU utlization. 
 
 - 43 - 
 
LIST OF FOOTNOTES 
 
 1. Note we always measure time in instruction executions; i.e., we scale 
 time by the average instruction time. 
 
 2. The results of these experiments consisted of 1737 execution bursts 
 from lo2 service intervals for five programs: l) LISP, 2) an 
 interpretive meta compiler, 3) an interpretive, initially inter- 
 active, display generation system, h) an interactive JOVIAL compiler, 
 and 5) a concordance generation and reformatting program. Page size 
 was 1024 words. 
 
 3- This corresponds to imposing a variable q on the program. Smith [132] 
 
 -+■ IP- Pi 
 
 indicates this q had a hyper exponential distribution, w. e ' + 
 
 -t/U0.7 x 10 3 
 
 v 2 e 
 
 4. See Denninp [ 40, 4l ]. 
 
 5« We assume that the first page is referenced at t=0 with probability 
 
 1 (t =0) which accounts for the difference between this formula and 
 
 that of Shemer and Shippey. 
 
 Determined from a least-squares fit to the function, £ t = a + y&np 
 
 where 5 = e . Average error over 18 points was l6$. 
 
 It should be remembered that values of a and (3 are characteris- 
 tics of a given program or class of programs, and should not be used 
 
 to describe all programs. A similar study of results [135] from a 
 
 SIIOBOL compiler yielded, cp(p) = .51+ p . 
 
 Belady and Kuehner [12 ] suggest the function <p(p) = ap in their 
 
 paper although they fail to state their reasons. 
 
 he general Poisson process, the average number of pages referenced 
 
 in time t isnot f (t) where t = f(p). However, we are using this 
 
 only as an approximation. 
 
 - 44 - 
 
9- The following definitions follow Randell [113]. 
 
 10. Randell 1 s experiments consisted of generating storage requests in the 
 form of a segment size and time to be spent in memory. Time in memory 
 was always generated from a uniform distribution of 1, 2,3>4,5« Segment 
 size was generated from several distributions. B was 1024 and Q was 
 varied from 32 to 1024 in powers of 2. Total memory size was 32K. 
 
 It was assumed that requests for memory were always waiting to be 
 filled. 
 
 11. .For Q = B/32, utilization varied from over 95% for s = 4B to about 
 90^ for s ■ B/2. At Q = b, utilization varied from just under 90% 
 for s = 4B to about h% at s = B/2. 
 
 12. Until stated otherwise, we now assume b = Q = B, i.e., page size is 
 constant over a given experiment. 
 
 13* This data comes from two program loads: 1) "10 small FORTRAN compil- 
 ations and loads" and 2) "FORTRAN compilations, and executions, used 
 to debug the 44x FORTRAN compiler. " Apparently, there is negligible 
 internal and external fragmentation in this experiment. 
 
 14. This data is from an integer programming/calculation. 
 
 15- Since apparently M(l) < a Q + a . 
 
 16. Again in this and the following experiment, there is apparently neg- 
 ligible fragmentation. 
 
 17. See Rosene [119 3- 
 
 16. B<lady'c results are based on the simulated execution of 11,000,000 
 instructions of an integer programming code written in FORTRAN. 
 
 19. Gaver's model considers I channels with both I < J and I > J. We 
 
 will only consider the case where I > J; i.e., there are no conflicts 
 
 - 45 - 
 
for secondary memory. 
 
 The assumption of an exponentional distribution of I/O completion 
 time is not particularly realistic as Gaver admits. Since we are using 
 T to represent the average time required to complete all kinds of I/O 
 requests, paged or explicit, the density of T will probably consist of 
 a collection of exponential, Gaussian, and delta functions. However, 
 even with a simple exponential distribution, the total expectation 
 functions become quite complex, and a more complex distribution would 
 not be warranted here. See Smith [132] for a slightly different model. 
 
 20. Pzces are here assumed fixed at 102U words. 
 
 21. Actually, this could only be true if M were some multiple of J. However, 
 if M » J, this is not a bad approximation. We also assume here that 
 programs are not swapped out of primary memory while waiting for I/O. 
 
 22. See Frank [ 61] for an analysis of the statistical properties of disk 
 systems. 
 
 23. Our development in this section will follow Denning [ 39]. See also 
 [ 26, 132, 139, 1U0]. 
 
 'elk. The particular case of Gaver 's model which we used in Section III 
 
 assumed r.o conflicts for secondary memory, i.e., rate of I/O comple- 
 tion was not dependent on the number of jobs (requests). The tech- 
 niques discussed here are not as good as those assumed in Section III. 
 
 
 - h6 - 
 
BIBLIOGRAPHY 
 
 [1] Arden, B. W. Time sharing systems: a review. Michigan Summer 
 
 Conference on Computer and Program Organization , 19&7* 
 [2] Arden, B. W. Time sharing measurement and accounting. Michigan 
 
 Sumner Conference on Advanced System Programming , 1969* 
 [3] Arden, B. W., Galler, B. A., O'Brian, T. C. and Westervelt, F. H. 
 
 Program ani addressing structure in a time sharing environment. 
 
 JAC.: 13,1 (1/66), 1-16. 
 *[>] Anackcr W. and Wang, C. P. Performance evaluation of computing 
 
 systems with memory hierarchies. IEEE EC-l6,6 (12/67 ), 765-773. 
 
 Asp i nail, D. , Edwards, D.B.G. and Kinniment, D. J. Associative 
 
 rrer.ories in large computer systems. IFTP (1968\D8l-85. 
 
 Aspinall, D. , Edwards, D.B.G. and Kinniment, D. J. An integrated 
 
 associative memory matrix. IF IP (1968), D86-90. 
 
 Badges, G. F. Jr., Johnson, E. A. and Philips, R. W. The Pitt time 
 
 sharing system for the IBM system 36O: two years experience. 
 
 AFIPS FJCC, 33 (1968). 
 
 * 
 Referenced in text. 
 
 - ^1 - 
 
[8] Bairstow, J. n. Time sharing, Electronic Design , 16,9 (1968) C1-C22, 
 *[9l Baylis, M. H. J., Fletcher, D. G. and Howarth, D. J. Paging studies 
 made on the I.C.T. Atlas computer. IFIP (1968), D113. 
 *[10] Barnes, G. H. , et al . The ILLIAC IV computer. IEEE EC-17,8 (8/68), 
 
 7^6-757. 
 *[ll] Belady, L. A. A study of replacement algorithm for a virtual storage 
 
 coirputer. IBM S. J., 5,2 (1966), 78-IOI. 
 *[12 n Belady, L. A. and Kuehner, C. J. Dynamic space sharing in computer 
 
 systems. CACM 12,5 (5/69), 282-288. 
 [13] Belady, L. A., Nelson, R. A. and Shedler, G. S. An anomaly in space- 
 time characteristics of certain programs running in a paging machine. 
 
 CACM 12,6 (6/69), 3^9-353. 
 *[1U] Bell, G. and Pirtle, M. W. Time sharing bibliography. IEEE EC-15,12 
 
 (12/56), 1764-1765. 
 *[15l Bernott, B. A. Disk I/O For Non-Core-Contained P.D.E. Meshes and 
 
 Arrays . DCS Report No. 3H, Department of Computer Science, 
 
 Referenced in text. 
 
 - U8 - 
 
University of Illinois at ^hampaign-Urbana, Urbana, Illinois, (3/69). 
 f 16] Bobrov, D. G. and Murphy, D. L. Structure of a LISP system using 
 
 two-level storage. CACM 10,3 (3/67), 155. 
 *[17] Bovet, D. P. Memory allocation in computer systems. Department of 
 
 Engineering, UCLA Report 68-17 . 
 *[13] Brawn, B. and Gastavson, F. Program behavior in a paging environment. 
 
 AFIPS FJCC 33 (1968), Part 2, 1019. 
 
 Buchholz, W. File organization and addressing. IBM S.J. 2, (6/63), 
 5-111. 
 
 Buchholz, W. A selected bibliography on computer system performance 
 
 evaluation. Computer Group News , (3/69), 21-22. 
 [dl] Burroughs Corp. A .Narrative inscription of the Burroughs B5500 Disk 
 
 File taster Control Program . Burroughs Corp., Detroit, Michigan, 1966. 
 
 Calingaert, P. System performance evaluation: survey and appraisal. 
 
 CACM 10,1 (1967), 12-18. 
 
 Campbell, D. J. and Heffner, W. J. Measurement and analysis of large 
 
 Referenced in text. 
 
 - U 9 - 
 
operating systems during system development. AFIPS FJCC 33 (1968) 
 903-91U. 
 [2k] Chu, Y. Direct execution of programs in floating code by address 
 interpretation. IEEE EC-lU,3 (6/65), U17-U22. 
 
 [25] Coffman, E. G. Stochastic Models of Multiple and Time -Shared Computer 
 Operations . Department of Engineering, University of California, 
 Los Angeles, California, Report 66-38, I966. 
 *[26] Coffman, E. G. Analysis of a drum input /output queue under scheduled 
 
 operation in a paged computer system. JACM 16, 1 (I/69), 73-90. 
 *[27] Coffman, E. G. and Varian, L. C. Further experimental data on the 
 
 behavior of programs in a paging environment. CACM 11,7 (7/68), U7I-U7I+. 
 [28] Cohen, L. J. Stochastic evaluation of static storage allocation. 
 
 CACM U,10 (10/61), 1+60-U6U. 
 [29] Collins, G. 0. Jr. Experience in automatic storage allocation. CACM 
 
 U,10 (10/61), 436-M+o. 
 
 Referenced in text. 
 
 - 50 - 
 
*[30] Cameau, L. W. A study of the effects of user program optimization 
 
 in a pacing system. ACM Symposium on OS (IO/67). 
 [31] Conti, C. J. Concepts for buffer storage. Computer Group News 2,8 
 
 (3/69), 9-13. 
 
 Conti, C. J., Gibson, D. H. and Pitkowsky, S. H. Structural aspects 
 
 of the System 360 model 85: I* General organization. IBM S. J. 
 : (I960), 2. 
 
 Conway, M. E. A multiprocessor system design. AFIPS FJCC 2k (1963), 
 
 139-l 1 +6. 
 *t3^] Corbato, F. J., and Vyssotsky, V. A. Introduction and overview of 
 
 the multics system. AFIPS FJCC 27,1 (1965), I85-I96. 
 
 Daley, R. C. and Dennis, J. B. Virtual memory, processes, and 
 
 sharing in multics. ACM Symposium on OS (IO/67). Also CACM 11,5 
 
 (5/66), 306. 
 
 Deley, R. C. and Neumann, P. G. A general purpose file system for 
 
 secondary storage. AFIPS FJCC 27 (1965), 213. 
 
 erenced in text. 
 
 - 51 - 
 
*[37] Dearnley, F. H. and Newell, G. B. Automatic segmentation of programs 
 for a two level store computer. TCJ 7, 3 (10/61+), I85-I87. 
 [38] Denes, J. E. BROOKNET - an extended core storage oriented network 
 of computers at Brookhaven National Laboratory. IFIP (1968), I9U. 
 *[39] Denning, P. J. Effects of scheduling a file memory operation. AFIPS 
 
 SJCC 30 (19oT), 9-21. 
 *[Uo] Denning, P. J. The working model set for program behavior. ACM 
 
 Symposium on OS (IO/67). Also CACM 11,5 (5/68), 323. 
 *Ol] Denning, P. J., Thrashing and its cause and prevention. AFIPS FJCC 
 33 (1968), 915-922. 
 [-2] Denning, P. J. Resource Allocation in Multiprocessors Computer Systems. 
 
 MIT, MAC-TR-50 (1968). 
 [U3] Dennis, J. B. Segmentation and the design of multiprogrammed computer 
 
 systems. JACM 12, h (10/65), 589. 
 [•■•k] Dennis, J. B. and Glaser, E. L. The structure of on-line information 
 processing systems. Proc. Second Congress on Information Systems 
 
 erenced m text. 
 
 - 52 - 
 
Science*, .19o5, 5-1 1 *. 
 [45] Derrick, M. , Summer, F. H. and Wyld, M. T. An appraisal of the 
 
 Atlas supervisor, Proc. 22 Nat. ACM (1967), 67. 
 
 Dreyfus, P. L. System design of the Gamma 60. WJCC (1958), 130. 
 
 Elmore, W. B. and Evans, G. J. Jr. Dynamic control of core memory 
 
 in a real time system. IFIP (1965), 26l. 
 [<~ Estrin, G., Coggan, B., Crocker, S. D. and Hopkins, D. Snuper 
 
 Computer - a computer in instrumentation automaton. AFIPS SJCC 30 
 
 (1967), , ; 56. 
 )] Estrin, G. and Kleinrock, L. Measures, models and measurements of 
 
 time shared computer utilities. Proc. 22 Nat. ACM (1967), 85-96* 
 
 Evans.. D. C. and Leclerc, L. Y. Address mapping and the control of 
 
 access in an interactive computer. AFIPS SJCC 30 (1967), 23-32. 
 *[5l] Feldman, J. A. and Rovner, P. D. An ALGOL-based associative language, 
 
 CACM 12,8 (6/69), U39-UU9. 
 
 Referenced in text. 
 
 - 53 - 
 
•[52] Fenichel, R. R. and Grossman, A. J. An analytic model of multi- 
 programmed computing. AFIPS SJCC 3^ (1969), 717- 
 
 [ 53 ] Fife, D. W. An optimization model for time sharing. AFIPS SJCC 28 I 
 
 (1966), 97-10U. 
 *[ 5 1*] Fife, D. W. and Smith, J. L. Transmission capacity of disk storage 
 
 systems with concurrent arm positioning. IEEE EC-lM (8/65), 575-582. 
 •[55] Fine, G. H. , Jackson, C. W. and Mdsaac, P.-V. Dynamic program 
 
 behavior under paging. Proc. 21 Nat. ACM (1966), 223-228. 
 .[ 5 o] Fine, G. H. and Mclseac, P. V. Simulation of a time-sharing system. 
 
 Man. Sci. 12 (2/66), Bl8C-19^ 
 
 5 n^ ««j r r> Time sharing on a computer with a 
 [57] Fisher, R. 0. and Shepard, C. D. lime snamifc 
 
 small memory. CACM 10,2 (2,6?), 77-61. 
 [58] Flores, I. Derivation of a waiting-time factor for a multiple-bank 
 
 memory. JACM 11,3 (7/30, 26 5- 
 [59] Flores, I. Virtual memory and paging: Part I, Datamation 13,8 
 ( j/67) 31; Part II, Datamation 13,9 (9/67), l H- 
 
 Referenced in text. 
 
 - 5U - 
 
[60] Fotheringham, J. Dynamic storage allocation in the Atlas computer 
 
 including an automatic use of backing store. CACM U,10 (l0/6l), U35-U36. 
 *[6l] Frank, H. Analysis and optimization of disk storage devices for time 
 
 sharing. JACM l6,U (IO/69), 602-620. 
 
 Freibergs, I. F. The dynamic behavior of programs. AF3PS FJCC 33 
 
 (1966), II63-II08. 
 *L<j3] Fuchel, K. and Heller, S. Consideration in the design of a multiple 
 
 computer system with extended core storage. ACM Symposium on OS 
 
 (10/67). Also CACM 11,5 (5/68), 33^. 
 *i&*] Fuchel, K. and Heller, S. A statistical study of the 6600 operating 
 
 system. Brookhaven Nat. Lab. (Draft U/l/69). 
 
 Gaver, D. P. J. Probability models for multiprogramming computer 
 
 systems. JACK lU,3 (7/67), U23. 
 
 Gibson, D. H. Considerations in block-oriented systems design. 
 
 AFTPS SJCC 30 (1967) , 75-60. 
 
 Glaser, E. L. , Couler, J. and Oliver, G. System design of a computer 
 
 for the time sharing application. AFTPS FJCC 27 (1965), 197-202. 
 Referenced in text. 
 
 - 55 - 
 
[66] Gold, M. M. Time-sharing and batch processing: an experimental 
 
 comparison of their values in a problem- solving situation. CACM 
 
 12,5 (5/69), 21+9-259. 
 [69] Graham, R. F. Semiconductor memories: evolution or revolution. 
 
 Datamation , (6/69), 99. 
 *[70] Heller, J. Sequencing aspects of multiprogramming. JACM 8 (7/0I), 
 
 '-26-439. 
 [71] Hellerman, H« On the average speed of a multiple module storage 
 
 system. IEEE EC-15,U (u/66), 67O. 
 [72] Hellerman, L. and Hoernes, G. E. Control storage use in implementing 
 
 an associative memory for a time-shared processor. IEEE C-17,12 
 
 (12/oo), 11U4-1151. 
 *[73] Hoare, C. A. R. Data structures in two-level store. IFIP (1968), 
 
 III-II7. 
 
 Hobbs, L. C. Proscnt and future state of the art in computer memories. 
 
 IEEE EC-15 (b/66), 53^-550. 
 
 * 
 
 I erenced ir. text. 
 
 - 56 - 
 
[J5] Holt, A. W. Discuasior of the problem of definition of storage 
 
 allocation. CACM U,5 (5/6l), 210-211. 
 
 Holt, A. W. Program organization and record keeping for dynamic 
 
 storage allocation. IFIP (1962), 539- Also CACM 6,10 (10/63), U22. 
 
 Hovarth, D. J., Jonas, P. D. and Wyld, M. T. The Atlas scaeduling 
 
 system. AFIPS SJCC 23 (1963), 59-6?. 
 
 Hovarth, D. J., Payne, R. B. and Summer, F. H. The Manchester 
 
 University Atlas operating system, Part II: User's description. 
 
 TCJ 4 (196Y), 226. 
 * '9] Humphrey, A. LCS [large core storage] utilization in theory and 
 
 practice. AFIPS SJCC 30 (1967), 719-728. 
 [60] Huxtable, D. H. R. and Warwick, M. T. Dynamic supervisors - their 
 
 design and construction. ACM Symposium on OS , (10/67). 
 [6l] Iliffe, J. K. and Jodeit, J. G. A dynamic storage allocation scheme. 
 
 TCJ 5 (10/62), 200-209. 
 
 Irons, E. T. A philosophy for computer sharing. ACM Symposium on 
 
 Referenced in text. 
 
 - 57 - 
 
os, (10/67). 
 
 *[S3] Jallen, G. A. Extended core storage for the Control Data 64-6600 
 
 systems. AFIPS SJCC 30 (1967), 729 • 
 [8h] Katz, J. H. An experimental model of System/360. CACM 10,11 (II/67), 
 
 59^-702. 
 [65] Kntz, J. H. Simulation of a multiprocessor computer system. AFIPS 
 
 sjcc 28 (1966), 127-139. 
 
 *[86] Kilburn, T. and Edwards, D. G. B., Lanigan, M. J., Sumner, F. H. 
 
 One level storage system. IRE EC-11,2 (4/62), 22 3« 
 [3y] Kilburn, T., Howarth, D. J., Payne, R. B. and Sumner, F. H. The 
 
 Manchester University ATLAS Operating System, Part I: internal 
 
 organization. TJC h (1961), 222. 
 [66] Kilburn, T., Payne, R. B. and Howarth, D. J. The ATLAS supervisor. 
 
 EJCC (12/61). 
 [69] Kleinrock, L. Certain analytical results for time-shared processors. 
 
 IFIP (1968), D119-125. 
 
 * 
 Referenced in text. 
 
 - 58 - 
 
♦[90] Kuck, D. J. ILLIAC IV foftvara and tppllcttion programming. IEEE 
 
 EC-17,8 (8/68), 758-770. 
 *[91] Kuck, D. J. and Sameh, A. H. Parallel computation of symmetric 
 
 matrix eigenvalues. Unpublished. 
 [92] Kuehner, C. and Randell, B. Demand paging in perspective. AFIPS 
 
 SJCC 32 (1968), 1011. 
 C93] Lamp son, B. W. A scheduling philosophy for multiprocessing systems. 
 
 ACM Symposium, on OS (IO/67). Also CACM 11,5 (5/68), 3^7. 
 [9k] Lauer, H. C. Bulk core in a 36O/67 time-sharing system. AFIPS FJCC 
 
 31 (1967), 601-609. 
 *[95] Lehman, M. M. and Rosenfeld, J. L. Performance of a simulated multi- 
 programming system. AFIPS FJCC 33 (1968), 1U3I-IUU2. 
 
 Liptay, J. S. Structural aspects of the system 360/85: II* The cache. 
 
 IBM S.J. 7,1 (1968), 15. 
 [97] Maher, R. J. Problems of storage allocation in a multiprocessor 
 
 multiprogrammed system. CACM U,10 (l0/6l), U21-U22. 
 
 Referenced in text. 
 
 - 59 - 
 
[98] Martin, D. F. and Lstrin, G. Experiments on models of computations 
 and systems. IEEE EC-l6,l (2/Gj), 59-69* 
 *[99] Martin, D. P. and Estrin, G. Models of computations and systems. 
 
 JACK lU,2 (4/67), 22-26. 
 [100] Mattson, R. L. and Jacob, Jean-Paul. Optimization studies for computer 
 
 systems with virtual memory. IFIP (I968), l89« 
 *[101] MacDougall, M. H. Simulation of an ECS-based operating system. 
 
 AFIPS SJCC 30 (1967), 735. 
 [102] McGee, W. C. On dynamic program relocation. IBM S. J. U,3 (1965), 
 
 I0U199. 
 *[103] Coffman, E. G. and McKellar, A. C. Organizing matrices and matrix 
 
 operations for paged memory systems. CACM 12,3 (3/69)> 153-165. 
 *[10U] McKinney, J. M. A survey of analytical time-sharing models. Comp . 
 
 Surveys 1,2 (6/69), IO3-II0. 
 
 Holland, F. C. and Merikallio, R. A. Simulation design of a multi- 
 processing system. AFIPS FJCC 33 (1968), 1399* 
 
 
 Referenced in text. 
 
 - 60 - 
 
[10b] Naylor, T. H. , Wertz, K. ind Wonnacott, T. H. Methods for analyzing 
 
 data from cons>uter simulation experiaants. CACM 10,11 (H/67), 703-7IO. 
 *[107] Neilson, N. R. The simulation of time-sharing systems. CACM 10,7 
 
 (1967), 397-^12. 
 *[106] O'Neill, R. W. Experience using a time sharing multiprogramming 
 
 system with dynamic address relocation hardware. AFIPS SJCC 30 (1967), 
 
 611-621. 
 "■[109] Opperheimer, G. and Weizer, N. Resource management for a medium scale 
 
 tine sharing operating system. ACM Symposium on OS (10/67). Also 
 
 CACM 11,5 (5/68), 313. 
 [110] Penny, J. P. An analysis, both theoretical and by simulation, of a 
 
 time-shared computer system. TCJ 9 (5/66), 53-59* 
 [111] Pinkerton, T. Program behavior and control in virtual storage computer 
 
 systems. University of Michigan, CONCOMP Report h (1+/68). 
 [112] Pirtle, M. Intercommunication of processors and memory. AFIPS FJCC 
 
 31 (1967), 621-633. 
 
 Referenced in text. 
 
 - 61 - 
 
*[113] Randell, B. A note on storage fragmentation and program segmentation. 
 
 CACM 12,7 (7/69), 365. 
 [llU] Randell, B. and Kuehner, C. J. Dynamic storage allocation systems. 
 
 ACM Symposium on OS (IO/67). Also CACM 11,5 (5/68), 297- 
 [115] Rehmann, S. L. and Gangwere, S. G. Jr. A simulation study of resource 
 management in a time-sharing system. AFIPS FJCC 33 (1968), IUU-IU30. 
 *[ll6] Riskin, B. N. Core allocation based on probability. CACM U,10 
 (10/61), I+5U-U6O. 
 [117] Roberts, A. E. Jr. A general formulation of storage allocation. 
 CACM 4,10 (10/61), 419-U20. 
 *[ll8] Rosen, Saul. Programming Systems and Languages. McGraw-Hill Computer 
 
 Science Series. (1967), p. 6. 
 *[119] Rosene, A. F. Memory allocation for multiprocessors. IEEE EC-l6,5 
 (IO/67), 659-665. 
 [120] Rosin, R. F. Determining a computing center environment. CACM 8,8 
 (7/o5), U63-U68. 
 
 * 
 Referenced in text. 
 
 - 62 - 
 
L21] Saclonan, H. Time sharing v*. batch processing: the experimental 
 evidence. AFIP3 SJCC 32 (1968), 1-10. 
 
 Scarrott, G. G. The efficient use of multilevel storage. IFIP (1965), 
 137-142. 
 
 Schwartz, J. I. Coffman, E. G. and Weissmen, C. A general purpose 
 tine sharing system. AFIPS SJCC 25 (1964), 397 -411. 
 Senzig, D. N. and Smith, R. V. Computer organization for array 
 processing. AFIPS FJCC 27,1 (1965), 117-128. 
 
 Sherner, J. E. and Gupte, *S. C. On the design of Bayesian storage 
 allocation algorithms for paging and segmentation. Tfflti C-l8,7 (7/69). 
 
 er, J. K, end Gupta, S. C. A simplified analysis of processor 
 "look-ahead" and simultaneous operation of a multimodule main memory. 
 IEEE C-1&,1 (I/69), 64-71. 
 
 Shemer, J. E. and Shippey, G. A. Statistical analysis of paged and 
 segmented computer systems. IEEE EC-15,6 (12/66), 855-863. 
 Sisson, S. S. and Flynn, M. Addressing patterns and memory handling 
 
 fferenced in text. 
 
 - 63 - 
 
algorithms. AFIPS FJCC 33,2 (1968), 957-967. 
 [129] Sherr, A. L. Time-sharing measurement. Datamation 12, h (U/66), 22-26. 
 *[130] Sherr, A. L. An Analysis of Time-Shared Computer Systems . MIT Press, 
 
 Cambridge, Mass. (1967). 
 *[13l] Smith, J. L. An analysis of time-sharing computer systems using 
 
 Markov models. AFIPS SJCC 28 (1966), 87-95. 
 *[132] Smith, J. L. Multiprogramming under a page on demand strategy. 
 
 CACM 10,10 (IO/67), 636-6U6. 
 *[133l Stevenson, D. A. and Vermillion, W. H. Core storage as a slave 
 
 memory for disk storage devices. IFIP (I968), F86-F91. 
 *[13^] Trimble, G. R. Jr. A time sharing bibliography. CR Bibliography 
 
 ;■•' li, Computing Reviews 9,5 (5/68), 291-301. 
 *[135] Varian, L. C and Coffman, E. G. An empirical study of the behavior 
 
 of program in a paging environment. ACM Symposium on OS (10/67). 
 
 Also CACM 11,7 (7/68), U7I-47U. 
 *~136] Wald, B. The Throughput and Cost Effectiveness of Monoprogrammed, 
 
 Referenced in text. 
 
 6k - 
 
Multiprogrammedj and Multl - nroceaaing Digital Computers . NRL Report 
 
 65U9, AD# 65^38U. 
 [137] Wallace, V. L. and Mason, D. L. Degree of multiprogramming in page 
 
 on demand systems. CACM 12,6 (6/69), 305. 
 
 Wagner, P. Machine organization for multiprogramming. Proc. 22 Nat. 
 
 ACM (I967), 135-150. 
 [139] Wingarten, A. The analytical design of real-time disk systems. 
 
 IFIP (1968), D131-137. 
 
 Wingarten, A. The Eschenbach drum scheme. CACM 9,7 (7/66), 509. 
 
 '.veizer, N. and Oppenheimer, G. Virtual memory management in a paging 
 
 environment. AFIPS SJCC 3 1 * (1969), 2U9. 
 
 Wilkes. Slave memories and dynamic storage allocation. IEEE EC-l4,2 
 
 (U/65), 270-271. 
 
 V/ilkes, M. V. A model for core allocation in a time-sharing system. 
 
 :ps sjcc 3^ (1969), 265. 
 
 Referenced in text. 
 
 - 65 - 
 

NOV z 
 
 8 ^72 
 
^ 
 
 ,&