Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/diskiofornoncore311bern 
 

Report No. 311

DISK I/O FOR NON-CORE-CONTAINED
P.D.E. MESHES AND ARRAYS

by

Bruce Allen Bernott

March 5, 1969

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS
 
 
This work was supported in part by the Advanced Research Projects
Agency as administered by the Rome Air Development Center under
Contract No. USAF 30(602)-4144 and submitted in partial fulfillment
of the requirements for the degree of Master of Science in Computer
Science, February, 1969.
 
 
 ACKNOWLEDGMENT 
 
 The author wishes to express his sincere appreciation to 
 Professor Kuck for his efforts and guidance in the preparation of this 
 report. Professor Kuck originally conceived the idea of CAT, a very 
 high level language oriented toward solving partial differential equations 
 numerically. (The language is also known by the name TRANQUIL II, as the 
 origin of the acronym CAT seems to be shrouded in obscurity.) This report 
 describes a part of what will be implemented in the compiler for that 
 language; and with the advice of Professor Kuck, the author was able to 
 keep the investigation relevant to the problem at hand. 
 
 Gratitude is also expressed to Mrs. Sharon Hardman, who has 
 typed the final manuscript. 
 
 
 ABSTRACT 
 
 In finding discrete solutions of systems of partial differential 
 equations on a computer, one is faced with the problem that the desired 
 number of mesh points may exceed the machine's fast memory. This problem 
 will be common on machines such as ILLIAC IV, for which extremely high 
 computing power invites the use of meshes many millions of words in size. 
 Because of the high dollar price of fast memory, it is sensible to look 
 at large disk stores with high transmission speeds as back-up storage for 
 meshes and arrays. The main problem encountered is the access time of 
 such a storage unit. 
 
 The address of a block of data stored on disk might be taken as 
 the address of the first word in the block. This address must specify 
 both track number and radial position of the block. If the computer issues 
 a command to transmit the block immediately after the beginning of the 
 block has passed the reading head, then the system must wait for nearly 
a complete disk rotation before the transmission is started. This access
time is, in general, not predictable; but it is bounded by the disk
rotation time.

The access time during which the computer does useless work is
latency; and the object of the present investigation is to minimize the
latency for a reasonably large class of problems.
 
TABLE OF CONTENTS

1. INTRODUCTION
   1.1 The Problem
   1.2 The Machine
2. MAPPING THE MESH ONTO DISK
3. SWEEPING A MESH
   3.1 Normal Mode
   3.2 Transposed Mode
4. IMPLEMENTING THE SCHEME
   4.1 General
   4.2 Measuring Latency
   4.3 Storage Requirements
       4.3.1 Fast Memory
       4.3.2 Disk
5. NUMERICAL SOLUTIONS
6. CONCLUSION
APPENDIX
LIST OF REFERENCES
 
LIST OF TABLES

Table

1. Positions of activity for row transition in row-normal reading
2. Acceptable values of $\ell$ for normal reading. "A" indicates
   acceptable; blank indicates unacceptable. $\ell = \ell_n$ or $\ell_m$.
   Parentheses indicate value which is unacceptable for transposed reading
3. Positions of activity for column transition k-2 to k-1 in
   column-transposed reading
4. Survey of results obtained on 11 mesh sizes. Latency < .12,
   normal reading in either direction
 
LIST OF FIGURES

Figure

1. Blocking a mesh
2. Schematic of disk
3. Grouping edge values for row-sweeping
4. Mesh blocks and edge groups mapped onto a blocked disk
5. Row-normal reading: row 2 and beginning of row 3. Entries
   enclosed in rectangles are written on disk; other entries
   are read from disk
6. Column-transposed reading
7. Merging the end of column k-2 with the beginning of column k-1
8. Mapping edges into edge group blocks on disk
9. Illustrating relationships of T, T , and T
10. $T_R$ vs. $N_v$ for the solutions found for four mesh sizes.
    The number of solutions is printed for each value of $N_v$
11. Minimum values of $T_R$ found
 
1. INTRODUCTION 
 
 1.1 The Problem 
 
 A reasonably large class of problems is the class of two- 
 dimensional partial differential equation (PDE) problems. Characteristic 
 of finite difference methods for solving PDE's are the stars or stencils 
of the methods, an example of which is shown below.

[Figure: nine-point stencil, d points deep.]

In order to compute
new values of the variables (i.e., to update the variables) at the center
 point of the stencil, one needs to know the values of variables at several 
 neighboring points in the horizontal and vertical directions. We may speak 
 of the depth d of the stencil as the distance from the point to be updated 
 to the furthest neighbor, measured in points. For the nine-point stencil 
 illustrated above, the depth d is two points. 
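
The notion of stencil depth can be made concrete with a small sketch. The report does not give the update formula, so the plain average used here is only a stand-in, and the names `star_offsets` and `update` are hypothetical. What the sketch does illustrate is that a star stencil of depth d touches 4d + 1 points (nine for d = 2) and that every new value is computed from old values only.

```python
def star_offsets(d):
    """Offsets of a star stencil of depth d: the centre point plus d points
    in each of the four mesh directions (4*d + 1 points in all)."""
    offsets = [(0, 0)]
    for s in range(1, d + 1):
        offsets += [(s, 0), (-s, 0), (0, s), (0, -s)]
    return offsets


def update(old, d):
    """Return a new mesh in which every interior point (at least d points
    from each boundary) is replaced by the average over its stencil.
    The averaging rule is an assumption made for illustration only.
    All reads come from `old`, so neighbours are never updated values."""
    rows, cols = len(old), len(old[0])
    new = [row[:] for row in old]          # boundary points are copied unchanged
    offsets = star_offsets(d)
    for i in range(d, rows - d):
        for j in range(d, cols - d):
            total = sum(old[i + di][j + dj] for di, dj in offsets)
            new[i][j] = total / len(offsets)
    return new
```

Because a point needs neighbours up to d points away, a block cut out of the mesh can be updated only if a strip d points deep from each adjacent block is also available, which is the interblock communication discussed next.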
 
 When one wishes to update an entire mesh of points, the stencil 
 must be applied to each of the points simultaneously, so that neighbors 
 used to update each point are old values and not updated values. If the 
 entire mesh cannot be contained in the fast memory of the computer, one 
 must store the mesh in a back-up storage device such as magnetic disk, and 
 read in only a part of the mesh at a time for updating. 
 
 In this investigation, it is assumed that the mesh is rectangular 
 in shape. It can then be sliced into rectangular blocks; and the blocks 
 
may be transmitted back and forth between the disk and the computer memory 
 for updating calculations. The object of this investigation is to formulate 
 a scheme for efficiently swapping these blocks of mesh between disk and 
 fast memory. One problem is that to update one block, the computer must 
 have access to an edge of each of the neighboring blocks. The depth of 
 this interblock communication must be d points, the depth of the stencil 
which is applied to the mesh. Figures 1a and 1b illustrate the slicing
of the mesh and the communication necessary.
 
 The program or subroutine which performs the updating calculations 
 on a block and its neighbor edges can be the same for each block of mesh, 
 with some conditional branches to handle blocks on mesh boundaries. This 
 subroutine can be called the kernel , to distinguish it from the supervisory 
 program which handles input/output and other chores. The main constituent 
 of the kernel will be the stencil calculations of the method used. 
 
There are two ways in which a kernel is likely to sweep or pass
over the mesh: sequentially by rows or by columns. In sweeping by rows,
blocks will be input (and output) in the order 11, 12, ..., 1n, 21, 22,
..., 2n, 31, ..., mn. In this case, it is clear that when updating
block (i,j), nothing special need be done to input edges from blocks (i,j-1)
and (i,j+1). We simply save in fast memory the rightmost edge, d points
deep, of block (i,j-1), which will have just been updated and output to
disk; and we delay calculations on (i,j) long enough to allow input of
block (i,j+1). The more difficult problem is arranging to have access
to the lower edges of blocks in row i+1 and the upper edges of blocks in
row i-1. The scheme presented in this report uses multiple storage of
upper and lower edge points on disk. Upper and lower edge points are
grouped by rows into an "edge area" on disk, and also appear in the block
storage or "mesh area". A similar arrangement is made for sweeping the
mesh by columns; the essential difference between the two possible storage
schemes is the grouping of edge values in the edge area. The transposition
from row storage to column storage of edges and vice versa may be done
during a sweep, so that successive kernels in a program may switch freely from
row-sweeping to column-sweeping, if this is ever required.

Figure 1a. A general rectangular mesh sliced into blocks.

Figure 1b. Edge values needed for kernel calculations
on block (i,j).
 
 1.2 The Machine 
 
We will consider implementing an I/O scheme on one quadrant of
ILLIAC IV with 2048 quadrant words, or 2048 x 64 words, of fast memory [1].
Coupled to the machine is DISK IV, which has a storage capacity of
approximately 15 million words. DISK IV is organized into 48 tracks of 1200
segments each. Disregarding parity bits, each segment is equivalent to
one long word across a four-quadrant ILLIAC IV, or four quadrant words.
The smallest I/O transaction allowed is one segment. In the scheme to be
proposed, a logical track will be divided into a number of blocks of b
segments each. Each mesh block will be stored in some block of disk
segments.

For a slight increase in generality, we will allow stringing
several disk tracks together to form a logical track of 1200t segments, t
an integer. We will have to take account of the fact that the read/write
head cannot switch instantaneously from one track to another. DISK IV
actually has a R/W head for every track, and switching is done electronically.
 
However, it still takes a measurable amount of time to change tracks. In
fact, the switch can be performed within the time the disk takes to revolve
through two segments; thus two empty segments must appear at the end (or
beginning) of each block of b segments and each 1200-segment disk track.
The disk may be pictured as in Figure 2. Points A and A' are
the same radial position, so that segments may be numbered modulo 1200t.
The read/write head is imagined to move constantly from left to right, and
can change tracks within the space of two segments.

Figure 2. Schematic of disk (48/t logical tracks of 1200t segments each).
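
As a quick arithmetic check on these figures, the sketch below multiplies out the disk geometry. The factor of 64 words per quadrant word is taken from the "2048 x 64 words" of fast memory quoted above; the constant and function names are hypothetical, and the use of floor division in `logical_tracks` (for a t that does not divide 48) is an assumption.

```python
TRACKS = 48                      # physical tracks on DISK IV
SEGMENTS_PER_TRACK = 1200
QUADRANT_WORDS_PER_SEGMENT = 4   # one long word across four quadrants
WORDS_PER_QUADRANT_WORD = 64     # from "2048 x 64 words" of fast memory


def disk_capacity_words():
    """Total capacity in words, disregarding parity bits."""
    return (TRACKS * SEGMENTS_PER_TRACK
            * QUADRANT_WORDS_PER_SEGMENT * WORDS_PER_QUADRANT_WORD)


def logical_tracks(t):
    """Number of logical tracks when t physical tracks are strung together."""
    return TRACKS // t


def segment_number(s, t):
    """Segments on a logical track are numbered modulo 1200*t."""
    return s % (SEGMENTS_PER_TRACK * t)
```

The product is 14,745,600 words, which agrees with the "approximately 15 million words" stated for DISK IV.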
 
 
 2. MAPPING THE MESH ONTO DISK 
 
The blocks (i,j) of Figure 1a are skewed on disk to allow row or
column sweeping. For row sweeping, upper and lower edge values from each
row of blocks are duplicated into a number of logical tracks set aside for
edge storage. These edge values are grouped as shown in Figure 3. Upper
or lower edges from k blocks of mesh form one group of edges. The groups
are numbered with Roman numerals to indicate row, superscripts U and L for
upper and lower, and subscripts to distinguish groups in a particular row.
For row R, group $R_1^L$ contains edges from blocks (R,1) to (R,k-1), $R_2^L$
contains edges from blocks (R,k) to (R,2k-1), etc. Grouping for both
$R_i^L$ and $R_i^U$ is summarized in the following table:

            Edges from                          Edges from
Group       Blocks in Columns       Group       Blocks in Columns

$I_1^L$         1 to k-1                $I_1^U$         1 to k-2
$I_2^L$         k to 2k-1               $I_2^U$         k-1 to 2k-2
 ...                                     ...
$I_{\ell_n}^L$  $(\ell_n-1)k$ to n      $I_{\ell_n}^U$  $(\ell_n-1)k-1$ to n

The parameter $\ell_n$ is the smallest integer such that

$$n \le \ell_n k - 2.$$
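
The grouping rule in the table can be written out as a short sketch. It assumes the pattern visible in the table rows: lower-edge group i spans columns (i-1)k to ik-1 and upper-edge group i spans columns (i-1)k-1 to ik-2, with columns outside 1..n left as empty spaces in the group. The function names are hypothetical.

```python
def num_groups(n, k):
    """Smallest integer l such that n <= l*k - 2, i.e. ceil((n + 2) / k)."""
    return -(-(n + 2) // k)


def lower_group_columns(i, n, k):
    """Mesh columns whose lower edges fall in group i (1-based subscript).
    The nominal span (i-1)k .. ik-1 is clipped to the real columns 1..n."""
    return list(range(max(1, (i - 1) * k), min(n, i * k - 1) + 1))


def upper_group_columns(i, n, k):
    """Mesh columns whose upper edges fall in group i (1-based subscript).
    The nominal span (i-1)k-1 .. ik-2 is clipped to the real columns 1..n."""
    return list(range(max(1, (i - 1) * k - 1), min(n, i * k - 2) + 1))
```

For example, with n = 13 and k = 5 there are three groups; the lower groups hold columns 1 to 4, 5 to 9 and 10 to 13, and the upper groups hold columns 1 to 3, 4 to 8 and 9 to 13, so the first group of each kind carries one or two empty edge spaces.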
 

 
 
 
 
 
Figure 3. Grouping edge values for row-sweeping.
 
The overhanging outlines of edge groups at the mesh boundaries merely
indicate that there are empty spaces in those edge groups. For sweeping
the mesh by columns, left and right edges of blocks are grouped analogously.
A smallest $\ell_m$ is chosen such that

$$m \le \ell_m k - 2.$$

Note that k will not be subscripted. In problems requiring transposition,
k will have the same value in both directions. All of the parameters used
in constructing a scheme will be constant throughout problem execution.
 
For the purpose of describing the map of mesh onto disk, we will
further simplify the schematic of disk by assuming the logical track to
have a length of kT-1 (T defined below) blocks of b segments each. We
will store each group of edges and each block of mesh in some block on
disk as shown in Figure 4 (a fold-out chart). Figure 4 is a snapshot of
a radial section of the disk. It is intended to appear as the left end
of Figure 2. The map is drawn for "period" T=5 blocks. T measured in
units of time will be the kernel calculation time for one block of mesh.
T measured in blocks is the ratio of kernel calculation time for a block
to the input transmission time for a block. In the analysis below, T will
be measured in blocks.
 
Figure 4 shows edges grouped as in Figure 3; i.e., for row-sweeping.
For column sweeping, edges would be regrouped and re-allocated, but
mesh blocks would retain the storage allocation shown. The reason mesh
blocks need not be re-allocated is that skewing makes allocation look the
same for both rows and columns if one considers radial positions only
(horizontally across page) and disregards track number (vertically down
page). Block (i,j) is in the same radial position as block (j,i). Let
p(X) be the radial position (disk block number measured from an arbitrary
radial point A) of record X, where X is either an edge group or a mesh
block. The positions of all records are defined in terms of $p(I_1^U)$.
With R a row numeral and R+1 the numeral of the next row:

$$p((R+1)_1^{U/L}) = p(R_1^{U/L}) + T$$
$$p(R_1^{L}) = p(R_1^{U}) + T$$
$$p(R_{i+1}^{U/L}) = p(R_i^{U/L}) + 1$$
$$p((1,1)) = p(II_1^{L}) + 2$$
$$p((i,j+1)) = p((i+1,j)) = p((i,j)) + T$$

Besides these, we add the requirement

$$p((i,j+k)) = p((i,j)) + 1 \quad (= p((i+k,j))),$$

which relates the grouping constant k to the length of the logical track.
It defines the logical track to be of length (kT-1) blocks.
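
The position rules can be collected into two small functions. This is a sketch under stated assumptions: the reference position $p(I_1^U)$ is taken as zero, positions are reduced modulo the logical track length kT-1, and the names `p_edge` and `p_block` are hypothetical. Under those assumptions p((1,1)) = 2T+2.

```python
def p_edge(row, i, kind, T, k):
    """Radial position of edge group `row`_i^kind, with kind 'U' or 'L'.
    Each row numeral adds T, each subscript adds 1, and the lower group
    sits T blocks after the upper group of the same row."""
    pos = (row - 1) * T + (i - 1)
    if kind == "L":
        pos += T
    return pos % (k * T - 1)


def p_block(i, j, T, k):
    """Radial position of mesh block (i, j).
    p((1,1)) = p(II_1^L) + 2 = 2T + 2, and each step in i or j adds T."""
    return (2 * T + 2 + (i + j - 2) * T) % (k * T - 1)
```

With T = 5 and k = 4 the logical track has 19 blocks, p((1,1)) = 12, and moving k columns to the right advances the position by kT = 20 blocks, which is one block modulo 19, as the added requirement demands. The same functions show that $II_1^U$ and $I_1^L$ occupy the same radial position.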
 
Implementation of the simplified structure of disk used in
Figure 4 on an actual disk with 1200-segment tracks will be taken up
 in a later section. For now we will assume zero head-switching time, and 
 proceed to describe the sequence of I/O transactions for sweeping the 
 mesh. 
 
 
 3. SWEEPING A MESH 
 
 3.1 Normal Mode 
 
If we are to sweep the mesh by rows, we wish to read into fast
memory the blocks (1,1), (1,2), (1,3), ..., (1,n), (2,1), (2,2), ... in
sequence. Ignoring edge value transmissions for the time being, we read
block (1,1) from logical track 1 in Figure 4 as the read/write head moves
from left to right. Let calculations on block (1,1) begin as the R/W head
passes the point p((1,2))-1 (point p(X) is the beginning of position p(X)).
While calculations proceed on block (1,1), block (1,2) is read. Since we
intend to read each block of row 1 of mesh as it passes by, the kernel
must finish the calculations on each block within the period T. We will
write the updated block (1,1) in position p((1,3))-1 (on another logical
track if necessary). At the point p((1,3))-1 we begin calculations on
block (1,2). Immediately after writing the new block (1,1) we read block
(1,3). In general, we read block (i,j), write updated block (i,j-1),
read (i,j+1), write (i,j), etc. in sequence. There will, of course, be
a "hiccup" at the end of each mesh row, except in one case to be discussed
later. If necessary, we spend a whole or part of a disk rotation of latency
to ensure that the finishing of one row does not interfere with the
beginning of the next row.

After sweeping the whole mesh in this way, the updated mesh will
be stored on disk in the same configuration, but shifted 2T-1 blocks to
the right. Column-sweeping (i.e., (1,1), (2,1), (3,1), ..., (m,1), (1,2),
(2,2), ...) results in the same shift. It is evident that we may sweep
by rows or columns any number of times in any order.
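
The interleaving of reads and writes within one row can be sketched as follows. The positions are counted in disk blocks from the read of the first block and are not wrapped around the logical track; the function name `row_schedule` and the event tuples are hypothetical. The sketch only encodes the two facts stated above: successive blocks of a row lie T blocks apart, and each block is written 2T-1 blocks after it is read.

```python
def row_schedule(n, T):
    """Read and write events for one row of n mesh blocks.
    Returns (position, operation, column) tuples sorted by position."""
    events = []
    for j in range(1, n + 1):
        read_pos = (j - 1) * T
        events.append((read_pos, "read", j))
        events.append((read_pos + 2 * T - 1, "write", j))
    return sorted(events)
```

For n = 4 and T = 5 the order of operations is read 1, read 2, write 1, read 3, write 2, read 4, write 3, write 4: after the start-up, each write of block j-1 immediately precedes the read of block j+1, as described in the text.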
 
 
We now superimpose edge group transmissions upon the I/O sequence
just described. The term "reading normal" will be applied to sweeping the
mesh by rows with edge values grouped for row-sweeping (as in Figures 3
and 4) and to sweeping by columns with edge values grouped for
column-sweeping. The other two possibilities are column-sweeping with row-grouped
edges and row-sweeping with column-grouped edges; the term "reading
transposed" is applied to these types of sweeps.

In reading normal by rows, edge values to the left and right of
a block (as viewed in Figure 3) are input automatically because of the
sequence of block reading, as mentioned before. Similarly for reading
normal by columns, edge values below and above a block are input
automatically. Edge values below and above for row-normal reading and left and
right for column-normal reading will be input from the edge area on disk.
The procedure for row-normal reading will be shown. Column-normal reading
is analogous: edge group superscript U (upper) is replaced by R for "right"
and L (lower) is replaced by L for "left".

Row-normal reading is shown in Figure 5. This chart is not a
storage map as in Figure 4. Figure 5 is read a line at a time from left
to right. Each numbered line represents one logical disk revolution.
Each entry on a line represents the transmission of a record to or from
the disk. Entries in rectangles are written on disk; other entries are
read from disk. Horizontal position corresponds to radial position on
a logical track exactly as in Figure 4. Each read entry is, of course,
in the position indicated by Figure 4. It may be helpful to align Figure
5 on Figure 4.
 
Only the I/O sequence for one row of mesh blocks need be described,
since the relationship between positions of edge groups required for that
row and positions of blocks in that row is independent of row number.
We will follow through the sequence for row 2, since row 1 is a slightly
degenerate case.
 
On revolution 1 we must initialize the calculation by reading
the first upper edge group of row 1, $I_1^U$, and the first lower edge group
of row 3, $III_1^L$. With this data in fast memory we may read and update the
first k-2 blocks of row 2, since $I_1^U$ contains edges from the first k-2 blocks.
We, therefore, start reading the blocks sequentially from (2,1) on
revolution 1 and write each block, updated, 2T-1 disk blocks after it is read.
This eventually brings us back to radial point A where revolution 2 begins.
We read edge group $I_2^U$, which contains edges below blocks (2,k-1) to
(2,2k-2), as it goes by. Notice that we have read the edge below (2,k-1)
well before we actually need it. Later on in revolution 2 we read $III_2^L$,
which contains edges above blocks (2,k) to (2,2k-1). The edge above block
(2,k) is read in at p((2,k)) + T-2; but this is acceptable since
calculations on (2,k) are not started until the point p((2,k)) + T-1.
 
Besides writing updated blocks as we move along the disk, we
must write updated edge groups. These groups must be written in positions
such that the disk configuration of the updated mesh and edge groups is
the same as Figure 4. Hence, we must write the updated edge group $II_1^{U'}$
2T+2 disk blocks before the position of the updated block (2,1), and
$II_1^{L'}$ T+2 disk blocks before updated block (2,1). Fortunately, so to
speak, there are no other activities which must occupy the R/W head in
these positions on revolution 2*; and the fact that $II_1^{U'}$ and $II_1^{L'}$ must
contain updated edges from blocks (2,k-2) and (2,k-1) respectively is no
problem, since those blocks will have been updated by the time the
corresponding edge groups need be written. There must, of course, be buffers
in fast memory for accumulating updated edge groups prior to transmissions
to disk.
 
 On revolution 3 the same procedure is followed with all trans- 
 missions shifted to the right by one block, all subscripts incremented by 
 1 and all j-values of blocks (i,j) incremented by k. With each revolution 
 transmissions shift right one disk block by the fact that we have arranged 
 the logical track to be kT-1 blocks long and mesh block transmissions have 
 period T. 
 
Earlier we defined $\ell_n$ such that

$$(\ell_n - 1)k - 2 < n \le \ell_n k - 2.$$

Let us say that $n = \ell_n k - 2$. Then the last transmissions for row 2 of mesh
appear as shown on revolution $\ell_n + 1$ in Figure 5. If $n < \ell_n k - 2$, then some
mesh block transmissions disappear, but updated edge groups $II_{\ell_n}^{U'}$ and
$II_{\ell_n}^{L'}$ must still be written in the positions shown. Processing of row 3
on some revolution $r_3$ may begin after row 2 is finished. We may have
$r_3 = \ell_n + 2$; however, in this case we spend more than a revolution of
latency ($kT - 1 + p((3,1)) - (p((2,\ell_n k-2)) + T)$ disk blocks). We would
like to have $r_3 = \ell_n + 1$. This cannot be done with the sequence shown
in Figure 5 because of conflicts in $p(II_1^U)$ and p((3,1)). The positions
of activity on revolution $\ell_n + 1$ are for $\ell_n = T-1$ with T = 5. If $\ell_n$ were
5, line $\ell_n + 1$ would be shifted one block right, and we could have $r_3 = \ell_n + 1$.
It is clear that the value of $\ell_n$ is critical in determining latency.

* We do not attempt to take advantage of the machine's capability to perform
two transmissions simultaneously. This capability arises from having two
electronics units, each with half of the total number of disk storage
units. As a result, any simultaneous transmissions must occur in opposite
halves. We do not attempt to meet this restriction.

Acceptable values of $\ell_n$, i.e., those for which $r_3$ may equal
$\ell_n + 1$, can be found in general by comparing activity in revolution $r_3$
to activity in revolution $\ell_n + 1$. Take $p(II_1^U) = 0$. Then revolution $r_3$
has activity in positions 0, 3T, 3T+2, 4T+2, 5T+1, 5T+2, .... We also
find positions of activity for revolution $\ell_n + 1$ by first determining
$p((2, \ell_n k-2))$:

$$p((2,1)) = 2T + 2$$
$$p((2,k)) = p((2,1)) + 1 - T$$
$$p((2,k-2)) = p((2,k)) - 2T$$
$$p((2,\ell_n k-2)) = p((2,k-2)) + (\ell_n - 1).$$

Therefore, $p((2, \ell_n k-2)) = 2 + \ell_n - T$. Other activity in revolution
$\ell_n + 1$ can be determined from this fix. Positions of activity are
summarized in Table 1.
 
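              ENDING ROW 2                            STARTING ROW 3

ACTIVITY                        POSITION              ACTIVITY            POSITION

$(2, \ell_n k-4)$               $2 + \ell_n - 3T$     $II_1^U$            0
$\boxed{(2, \ell_n k-5)}$       $1 + \ell_n - 2T$     $IV_1^L$            3T
$(2, \ell_n k-3)$               $2 + \ell_n - 2T$     (3,1)               3T + 2
$\boxed{(2, \ell_n k-4)}$       $1 + \ell_n - T$      (3,2)               4T + 2
$(2, \ell_n k-2)$               $2 + \ell_n - T$      $\boxed{(3,1)}$     5T + 1
$\boxed{(2, \ell_n k-3)}$       $1 + \ell_n$          (3,3)               5T + 2
$\boxed{(2, \ell_n k-2)}$       $1 + \ell_n + T$
$\boxed{II_{\ell_n}^{U'}}$      $-2 + \ell_n + 2T$
$\boxed{II_{\ell_n}^{L'}}$      $-2 + \ell_n + 3T$

Boxed entries are written on disk; other entries are read from disk,
as in Figure 5.

Table 1. Positions of activity for row transition
in row-normal reading.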
 
We will not take into account any activity before $p((2, \ell_n k-4))$
in ending row 2 or after p((3,3)) in beginning row 3. This will lead to
an analysis which is correct for $2 + \ell_n - 3T < 0$ and $-2 + \ell_n + 3T < 5T + 2$;
or $\ell_n < 3T - 2$ and $\ell_n < 2T + 4$. It has not yet been mentioned that the
scheme proposed will work only for $T \ge 5$. We look for acceptable values
of $\ell_n \le 10$, which satisfies both inequalities above for $T \ge 5$. A value
is acceptable if the two sets of positions in Table 1 are disjoint. The
sets are disjoint for $\ell_n = T$, for example. $\ell_n = T-3$ is unacceptable for
T = 5, 7, acceptable otherwise ($T \ge 5$). $\ell_n = T-2$ is always unacceptable.

In general, a value can be tested for acceptability by substituting
that value into the left column and comparing each value obtained with all
of the elements in the right column, for all values of T. Whether one
substitutes numbers or expressions in T for $\ell_n$, it is a painful task. Table
2 shows acceptable numerical values of $\ell_n$ versus T. For any numerical
value of $\ell_n$, a minimum $K_{\ell_n}$ can be found such that for $T \ge K_{\ell_n}$ the $\ell_n$ is
either always acceptable or always unacceptable. $K_{10} = 13$: $\ell_n = 10$ is
acceptable for $T \ge 13$. $K_i < K_{10}$, i = 1, ..., 9. The table is therefore
not shown for T > 13. Table 2, along with a corresponding table for
transposed reading, will be useful in finding a scheme with minimum latency
for a given problem.
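
The substitution described above is mechanical, so it can be sketched in a few lines. The two position sets are copied from Table 1 as read here (nine positions for ending row 2, six for starting row 3), with $p(II_1^U) = 0$; the function names are hypothetical. The check is a plain set-disjointness test and, like the analysis in the text, it is only meaningful while $\ell_n < 3T - 2$ and $\ell_n < 2T + 4$.

```python
def ending_positions(l, T):
    """Positions of activity while ending row 2 (left half of Table 1)."""
    return {2 + l - 3 * T, 1 + l - 2 * T, 2 + l - 2 * T,
            1 + l - T, 2 + l - T, 1 + l, 1 + l + T,
            -2 + l + 2 * T, -2 + l + 3 * T}


def starting_positions(T):
    """Positions of activity while starting row 3 (right half of Table 1)."""
    return {0, 3 * T, 3 * T + 2, 4 * T + 2, 5 * T + 1, 5 * T + 2}


def acceptable(l, T):
    """True when the two sets of positions do not collide."""
    return ending_positions(l, T).isdisjoint(starting_positions(T))


def acceptable_periods(l, periods=range(5, 14)):
    """Values of T in `periods` for which l is acceptable."""
    return [T for T in periods if acceptable(l, T)]
```

For instance, `acceptable_periods(5)` gives 5 and 8 through 13, while `acceptable_periods(2)` and `acceptable_periods(4)` are empty, which matches the examples quoted in the text ($\ell_n = T$ acceptable, $\ell_n = T-2$ never acceptable, $\ell_n = T-3$ failing only for T = 5 and T = 7).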
 
                          T
 $\ell$     5    6    7    8    9   10   11   12   13

   1     A    A    A    A    A    A    A    A    A
   2
   3          A    A    A    A    A    A    A    A
   4
   5     A              A    A    A    A    A    A
   6     A    A              A    A    A    A    A
   7          A    A              A    A    A    A
   8               A    A              A    A    A
   9          A         A    A              A    A
  10    (A)        A         A    A              A

Table 2. Acceptable values of $\ell$ for normal reading.
"A" indicates acceptable; blank indicates
unacceptable. $\ell = \ell_n$ or $\ell_m$. Parentheses
indicate value which is unacceptable for
transposed reading.
 
Only row-normal reading has been considered in this section; but
it is clear that all of the results apply also to column-normal reading
with column parameters replacing corresponding row parameters, namely, $\ell_m$,
m, and edge group superscripts R and L. As mentioned before, k is the
same in both directions.

We now examine the problem of changing over from sweeping by rows
to sweeping by columns. Note in Figure 5 that when updated edge groups
$II_i^{U'}$ and $II_i^{L'}$ were written on disk, the upper and lower edges of groups
of k blocks in a row were written. This prepares the disk storage structure
for a subsequent sweep by rows. If the next sweep is to be by columns,
however, we store on disk the updated right and left edges of the blocks
in the row rather than the upper and lower edges respectively. We will
use the notation $II_i^{R'T}$ and $II_i^{L'T}$. $II_i^{R'T}$ is written in place of $II_i^{U'}$,
and $II_i^{L'T}$ is written in place of $II_i^{L'}$. The problem arises that these
right and left edges are still grouped by rows; i.e., $II_i^{R'T}$ contains
updated right edges from blocks (2, (i-1)k-1) to (2, ik-2), and $II_i^{L'T}$
contains updated left edges from blocks (2, (i-1)k) to (2, ik-1).* Edges
in all of the $R_i^{R'T}$ and $R_i^{L'T}$ are grouped in the same way. Since these
edges are not grouped for column-normal reading, a different I/O sequence
will be required. This new sequence will be called column-transposed
reading. Instead of reading k edges from one place in a logical disk
revolution, we will read the edges one at a time from many points in a
revolution, so that the flow of edge values into fast memory becomes a
quasi-continuous process as opposed to the single transmissions in Figure 5.

* It may be helpful to imagine the individual edges (k per edge group) in
Figure 3 to be rotated clockwise by ninety degrees.
 
 3.2 Transposed Mode 
 
Figure 6, a and b, illustrates the procedure for column-transposed
reading. The mesh is assumed to be mapped on disk as in Figure 4, except
that upper edge groups $R_i^U$ have been replaced with right transposed edge
groups $R_i^{RT}$ and lower edge groups $R_i^L$ have been replaced with left
transposed edge groups $R_i^{LT}$.

A sweep on column 2 will be used as a first example. The first
block to be input is (1,2), for which we need the right edge of (1,1) and
the left edge of (1,3). The right edge of (1,1) is contained in edge group
$I_1^{RT}$; specifically, it is the third edge in $I_1^{RT}$, counting the two empty
edge spaces overlapping the mesh in Figure 3. Likewise, the left edge of
(1,3) is the fourth edge in $I_1^{LT}$. The edges needed for block (2,2) are
the third and fourth edges in edge groups $II_1^{RT}$ and $II_1^{LT}$, respectively.
In order to move up column 2, we must have one edge each from all edge
groups with subscript one. Figure 6a shows "simultaneous" transmissions
of single edges from pairs of edge groups. An edge from $II_1^{RT}$ and an edge
from $I_1^{LT}$, for example, are to be read within the space of one disk block.
The reading head will be required to switch tracks within a disk block,
and the edges within the groups involved must be arranged such that no
pair of edges required in any transmission have the same radial position.
This matter will be taken up in a later section; for now we assume the
requirement to be satisfied.
 
 
 As each successive column is swept, the beginning of the mesh 
 block transmissions moves T disk blocks to the right because of skewing. 
 The edge value transmissions, however, start at the same radial position 
 up to and including the sweep on column k-2. We are reading edges farther 
 and farther ahead of corresponding mesh blocks. When mesh blocks have 
finally shifted around an entire logical track, we will be unnecessarily
reading edges a disk revolution ahead of the mesh blocks for which the
 edges are needed. At this point we may shift to reading edges immediately 
 before corresponding blocks. Such a shift is executed in two stages in 
 Figure 6b. 
 
 The first stage of the shift is executed at column k-1. Only 
 the transmissions of left edges from mesh blocks in column k are shifted. 
 Transmissions of right edges of column k-2 are still started a revolution 
 ahead of mesh block transmissions. Column k-1 is a special case, as its 
 eastern neighbor edges are now in edge groups with subscript 2 while its 
 western neighbor edges are still in edge groups with subscript 1. It would 
 be possible to shift transmissions of both left and right edges at column 
k-1; however, shifting the left edges only results in a greater freedom
with the parameter $\ell_m$, when one considers the efficiency of the transition
from column k-2 to column k-1.
 
The second stage of the shift is executed at column k. Now all
edges needed are in edge groups with subscript 2. The sweep up column k+1
is similar to the sweep up column 1, except that column 1 has no western
neighbor edges. Likewise, the sweep up column k+2 is similar to the sweep
up column 2; and in general the sweep up column j is similar to the sweep
up column j modulo k. Columns nk-1, n = 1, 2, ..., $\ell_n - 1$, all require edges
from edge groups with different subscripts.
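
The bookkeeping of which edge group holds a neighbor's edge can be sketched directly from the grouping rule. The sketch assumes, as stated earlier for the transposed groups, that right-edge group i spans columns (i-1)k-1 to ik-2 and left-edge group i spans columns (i-1)k to ik-1; the function names are hypothetical, and each returns a (subscript, slot) pair with both numbers counted from 1.

```python
def right_edge_slot(col, k):
    """Group subscript and slot of the right edge of a block in column col.
    Group 1 begins at the (empty) column -1, hence the shift by one."""
    shifted = col + 1
    return shifted // k + 1, shifted % k + 1


def left_edge_slot(col, k):
    """Group subscript and slot of the left edge of a block in column col.
    Group 1 begins at the (empty) column 0."""
    return col // k + 1, col % k + 1


def neighbor_group_subscripts(c, k):
    """Subscripts of the groups holding the western neighbor's right edge
    and the eastern neighbor's left edge for a sweep up column c."""
    west = right_edge_slot(c - 1, k)[0]
    east = left_edge_slot(c + 1, k)[0]
    return west, east
```

With k = 5, the right edge of column 1 is the third edge of group 1 and the left edge of column 3 is the fourth edge of group 1, as in the column 2 example. The two subscripts differ exactly for columns 4, 9, 14, ..., that is, for columns nk-1.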
 
When the updated edge values are written, they will, of course,
be grouped in the column direction. If left and right edge groups $R_i^{L'}$
and $R_i^{R'}$, R = I, II, ..., n, are written, then the updated mesh will be
organized for column-normal reading. If the subsequent sweep were to be
by rows, lower-transposed $R_i^{L'T}$ and upper-transposed $R_i^{U'T}$ would be written
instead. If R indicates a sweep by rows and C a sweep by columns, then
sequences of the type (RC) will have every read in the transposed mode.
Sequences of the type (RRCC) will have alternating normal and transposed
reading sequences.
 
 Merging the transmissions at the end of one column with the
 transmissions at the beginning of the next column is slightly more complex
 than for reading normal in that edge transmissions at the beginning of a
 column may penetrate very deeply into the transmissions for the preceding
 column. The transitions from column k-2 to column k-1 and from column k-1
 to column k are worst cases. Both of these transitions will be inspected.
 
 Figure 7 illustrates the ending of column k-2 and the beginning
 of column k-1. Let the reference position $p((I_1^{RT}))$ be zero. For starting
 column k-1, there is activity in positions 0, T, 2T, ...,
 $p(((k+1)_1^{RT})) = kT \bmod (kT-1) = 1$, p((1,k-1)) = 3, 1+T, 3+T, 1+2T,
 2+2T, 3+2T = p((3,k-1)). Positions of activity for ending column k-2 may be
 computed by first obtaining a fix on block $((\ell_m-1)k+1,\ k-2)$:

 $$p((1,k-2)) = p((1,k-1)) - T = 3 - T$$

 $$p(((\ell_m-1)k+1,\ k-2)) = p((1,k-2)) + (\ell_m-1) = 2 + \ell_m - T .$$
 
 We must compare revolutions A with C and B with D. Note that for the first
 comparison, we need only compare activity between markers M and 11 on A
 with the first activity on C because of the periodicity T of activities
 and the facts that

 $$p(((\ell_m-1)k,\ k-2)') = 1 + \ell_m > p((I_1^{RT}))$$

 and

 $$p(((\ell_m)_{k-2})) = \ell_m - 2T \le p((I_1^{RT})) \quad\text{for } \ell_m \le 2T .$$

 We will again look for values $\ell_m \le 10$.

 For the comparison of B with D, note that
 $p((\ell_m k-2,\ k-2)') = 2 + \ell_m - 2T < p((1,k-1)) = 3$ for $\ell_m \le 2T$; so that
 we need only take account of possible conflicts between the two updated
 edge group transmissions on B and the mesh block transmissions on D.
 Transmissions to the left on B and D will be compatible if transmissions
 to the right on A and C are compatible. Also the two updated edge group
 transmissions on B will not conflict with edge transmissions on D if
 updated edge group transmissions on A do not conflict with $(I_1^{RT})$ for
 $\ell_m \le 2T$. Of the mesh block transmissions on D we need only consider
 (1,k-1) and (2,k-1), since the next one, (3,k-1), is always to the right
 of the last activity of B for $\ell_m \le 2T$. Actually, we need only consider
 the last transmission on B.
 
 Table 3 lists the activities of interest in making the column
 transition. Sets A and C must be disjoint and sets B and D must be disjoint.
 If we construct a table of acceptable values such as Table 2, we find that
 there is only one unacceptable value pair ($\ell_m$, T) which is not on Table 2.
 That pair is $\ell_m$ = 10, T = 5. If we delete the corresponding "A" from
 Table 2, we have a table of acceptable values for both normal and transposed
 reading.
 
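 ENDING COLUMN k-2

   Revolution A
     ACTIVITY                                        POSITION
     [edge group, label illegible]                   $\ell_m - 2T$
     $((\ell_m-1)k-2,\ k-2)$                         $1 + \ell_m - 2T$
     $((\ell_m-1)k,\ k-2)$                           $2 + \ell_m - 2T$
     [edge group, superscript R', subscript k-2]     $-2 + \ell_m - T$
     $((\ell_m-1)k-1,\ k-2)$                         $1 + \ell_m - T$
     $((\ell_m-1)k+1,\ k-2)$                         $2 + \ell_m - T$

   Revolution B
     ACTIVITY                                        POSITION
     [edge group, superscript L', subscript k-2]     $-2 + \ell_m$
     [edge group, superscript L', subscript k-2]     $-1 + \ell_m$

 STARTING COLUMN k-1

   Revolution C
     ACTIVITY                                        POSITION
     $(I_1^{RT})$                                    0

   Revolution D
     ACTIVITY                                        POSITION
     (1,k-1)                                         3
     (2,k-1)                                         3+T

 Table 3. Positions of activity for column transition
 k-2 to k-1 in column-transposed reading.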
 
 One might question the statement that we have obtained all
 unacceptable pairs ($\ell$, T) for transposed reading since we have considered
 only one transition out of k different transitions. In fact, it can be
 verified that we have all unacceptable values for $\ell_m, \ell_n \le 10$ and $T \ge 5$.
 
 Having investigated most of the logical aspects of the scheme 
 under consideration, we now consider the problem of implementing the scheme 
 on a disk storage unit with 1200 addressable segments per track. 
 
 4. IMPLEMENTING THE SCHEME
 
 4.1 General 
 
 For mesh storage in one quadrant of ILLIAC IV, it is efficient
 to store 8x8 squares of mesh points in "quadrant words" across the 64
 processing elements. For this reason, it will be assumed that the mesh
 is subdivided into 8x8 squares; and mesh blocks will have dimensions of
 p X q squares, p and q integers. The smallest addressable piece of data
 on the disk is the segment, which consists of 256 words, or 4 quadrant
 words. The head-switching time of the disk, again, is taken as two segments.
 
 We construct each logical track pictured in Figure 4 from t disk
 tracks for some integer t. In doing so, we do not take advantage of the
 t-1 times an actual radial position is passed within each logical revolution.
 If b is the number of segments in a disk block, then

 $$b(kT-1) \le 1200t$$

 must be satisfied. We might try

 $$b = \left\lfloor \frac{1200t}{kT-1} \right\rfloor$$

 segments, where the operator $\lfloor\ \rfloor$ indicates the greatest integer less
 than or equal to the argument. The truncation will result in wasting a
 number of segments; we allow this if the number of unused segments is
 reasonably small. These wasted segments become a dead area on disk and
 will never be used for storage. The dead area will, of course, contribute
 to overall latency.
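
 The choice of b can be tried out numerically. The following is a minimal
 sketch, not part of the original report: the function name and the sample
 values k = 6, T = 5, t = 1 are assumptions chosen for illustration, and k,
 T and t are assumed to be integers. It computes the largest b satisfying
 b(kT-1) <= 1200t and the number of segments left over as dead area.

```python
SEGMENTS_PER_TRACK = 1200  # addressable segments per disk track, as stated in the text


def block_size_and_dead_area(k, T, t):
    """Return (b, dead) for one logical track built from t disk tracks.

    b    -- largest integer with b * (k*T - 1) <= 1200 * t
    dead -- segments of the logical track not covered by any disk block
    """
    blocks_per_logical_track = k * T - 1
    capacity = SEGMENTS_PER_TRACK * t
    b = capacity // blocks_per_logical_track
    dead = capacity - b * blocks_per_logical_track
    return b, dead


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the report.
    print(block_size_and_dead_area(k=6, T=5, t=1))
```

 With these sample values there are 29 disk blocks per logical track, so the
 truncation leaves 11 of the 1200 segments unused.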
 Of the b segments in a disk block, we may use b-2 segments, with
 2 segments at the end or beginning of the block for head switching. If b
 does not divide 1200, then some disk blocks may straddle track connections
 and we will, in general, have to allow 2-segment spaces at each of the
 t-1 connections of the disk tracks. If the dead area is not large enough
 to cover this, one might reduce b or try another combination of k and t.
 
 In an actual mesh there will be a number of variables associated
 with each mesh point. Let this number be $N_v$. We must have

 $$N_v\,pq \le 4(b-2) .$$

 In addition, if the mesh has dimensions M X N 8x8 squares,

 $$M \le pm \quad\text{and}\quad N \le qn$$

 where m,n are dimensions in mesh blocks.

 The $\ell_m$, $\ell_n$ satisfy

 $$(\ell_m-1)k-2 < m \le \ell_m k-2$$

 $$(\ell_n-1)k-2 < n \le \ell_n k-2$$

 and they should be acceptable as defined by Table 2.
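
 The relations of this paragraph can be checked mechanically. The sketch
 below is an illustration only and is not the program described in the
 report: the function names and the sample mesh are assumptions. It takes
 m and n as the smallest integers with M <= pm and N <= qn, tests the
 capacity bound N_v pq <= 4(b-2), and uses the fact that the bounds
 (l-1)k-2 < m <= lk-2 are equivalent to l = ceil((m+2)/k). Acceptability
 under Table 2 is not checked, since that table is not reproduced here.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def block_layout(M, N, Nv, p, q, k, b):
    """Return m, n, l_m, l_n for p x q mesh blocks, or None if a block does not fit.

    M, N -- mesh dimensions in 8x8 squares
    Nv   -- variables per mesh point
    b    -- segments per disk block (b-2 usable, 4 quadrant words per segment)
    """
    if Nv * p * q > 4 * (b - 2):
        return None  # mesh block does not fit in the usable part of a disk block
    m = ceil_div(M, p)          # smallest m with M <= p*m
    n = ceil_div(N, q)          # smallest n with N <= q*n
    l_m = ceil_div(m + 2, k)    # (l_m-1)*k - 2 < m <= l_m*k - 2
    l_n = ceil_div(n + 2, k)    # (l_n-1)*k - 2 < n <= l_n*k - 2
    return {"m": m, "n": n, "l_m": l_m, "l_n": l_n}


if __name__ == "__main__":
    # Hypothetical sample problem, not taken from the report.
    print(block_layout(M=47, N=47, Nv=3, p=6, q=2, k=6, b=41))
```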
 
 Noting that there are k edges per edge group, one edge must be
 contained in

 $$\left\lfloor \frac{b-2}{k} \right\rfloor$$

 segments for the reason that individual edges must be addressable for
 transposed reading. An edge consists of $N_v d(8p)$ or $N_v d(8q)$ points. Hence

 $$8 N_v d\,\mathrm{Max}(p,q) \le 256 \left\lfloor \frac{b-2}{k} \right\rfloor .$$

 The number of segments required for an edge is

 $$\left\lfloor \frac{8 N_v d\,\mathrm{Max}(p,q) + 255}{256} \right\rfloor .$$

 If it is not necessary to change the direction of sweeping over
 the mesh, then edges need not be individually addressable, and the
 requirement is less stringent:

 $$8 N_v d\,s\,k \le 256(b-2)$$

 where s is p or q depending on the direction of sweeping.
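
 The two edge requirements can be compared side by side. The sketch below
 is illustrative and rests on the relations as read from this section: one
 edge of 8 N_v d Max(p,q) words must fit in floor((b-2)/k) segments of 256
 words for transposed reading, while a whole edge group of 8 N_v d s k words
 must fit in b-2 segments when the sweep direction never changes. The
 function names and sample values are assumptions.

```python
WORDS_PER_SEGMENT = 256  # one disk segment, as stated in the text


def segments_per_edge(Nv, d, p, q):
    """Segments needed for one individually addressable edge (rounded up)."""
    words = 8 * Nv * d * max(p, q)
    return (words + WORDS_PER_SEGMENT - 1) // WORDS_PER_SEGMENT


def edges_addressable(Nv, d, p, q, b, k):
    """Transposed reading: each of the k edges gets floor((b-2)/k) segments."""
    slot_segments = (b - 2) // k
    return 8 * Nv * d * max(p, q) <= WORDS_PER_SEGMENT * slot_segments


def edge_group_fits(Nv, d, s, k, b):
    """Normal reading only: the whole group of k edges shares b-2 segments."""
    return 8 * Nv * d * s * k <= WORDS_PER_SEGMENT * (b - 2)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the report.
    print(segments_per_edge(3, 3, 6, 2))          # segments for one edge
    print(edges_addressable(3, 3, 6, 2, 41, 6))   # stricter, transposed case
    print(edge_group_fits(3, 3, 2, 6, 41))        # looser, normal-only case
```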
 
 In the description of transposed reading, it was assumed that
 when two edges from different edge groups were required within the space
 of one disk block, the two edges would not be in the same radial position.
 In fact, the two edges must be separated by at least two segments. For
 k > 6 this can be guaranteed as follows.

 Observe from Figure 8 that for sweeping up column (i-1)k+j-2
 we must read the (j-1)th edge from a superscript RT edge group and the jth
 (modulo k) edge from a superscript LT edge group. Each edge is mapped into
 an "edge slot" on a logical track as shown. All LT edge groups begin with
 the (k-1)th edge in the first edge slot; and all RT edge groups begin with
 the 1st edge in the first edge slot. Then the edges required for
 transmission are separated by two edge slots for k > 6. The smallest edge
 slot possible is one segment. In moving up column ik-1, we need the kth edge
 from a superscript RT edge group and the 1st edge from a superscript LT
 edge group, for which the requirement is still satisfied. If the edge slot
 is two segments or more, then k > 4 is sufficient. Of course, the unused
 portion of the b-2 segments allotted might be distributed between edge
 slots to achieve separation.

 [Figure 8: edge slots on a logical track. Figure not reproduced.]
 
 No mention has yet been made of having T be some value $\ge 5$ other
 than an integer. This can certainly be done without upsetting the logic
 of the I/O sequence. Another point concerning T should be considered,
 however.

 Recall that T is the ratio of the allowed compute time for a mesh
 block to the one-way transmission time for a disk block of b segments. T
 is the logical period of the scheme, and it is a parameter of great interest
 to us. When dealing with a calculation kernel, however, we are more
 interested in the ratio of the allowed compute time to the input time for a
 mesh block, which is always less than a disk block by 2 segments or more.
 Let this ratio, the real period, be denoted by $T_R$. $N_v pq/4$ segments are
 used to store a mesh block; therefore

 $$T_R = \frac{4Tb}{N_v\,pq} .$$

 Figure 9 presents the relationship graphically.
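
 A short numerical check of the real period may help. The sketch below is
 illustrative only: it evaluates T_R = 4Tb/(N_v pq), the ratio obtained
 when a mesh block occupies N_v pq/4 segments of a b-segment disk block,
 using exact fractions. The function name and sample values are
 assumptions. When the mesh block fills all b-2 usable segments, the
 expression reduces to Tb/(b-2), which is still larger than T.

```python
from fractions import Fraction


def real_period(T, b, Nv, p, q):
    """Real period T_R = 4*T*b / (Nv*p*q), returned as an exact fraction."""
    return Fraction(4) * T * b / (Nv * p * q)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the report.
    T, b = 5, 41
    print(real_period(T, b, Nv=3, p=6, q=2))    # small block, large T_R
    print(real_period(T, b, Nv=3, p=13, q=4))   # block fills all b-2 segments
```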
 
 Both T and $T_R$ are measures of the same quantity, the amount of
 time available for computation on a mesh block. They are measures in
 different units, however, as indicated in Figure 9. We are more interested
 in $T_R$ than in T because kernel times can be given in mesh blocks. By mesh
 block units of time, we mean, of course, the transmission time for a mesh
 block of $N_v pq/4$ segments. Although a mesh block is actually transmitted
 in $\lfloor (N_v pq+3)/4 \rfloor$ segments, we are interested in the time to transmit that
 part of the $\lfloor (N_v pq+3)/4 \rfloor$ segments containing data. It is obvious that T
 is a lower bound on $T_R$ and that $T_R$ cannot equal T because of the two
 segments reserved for head switching.

 [Figure 9: relationship between T and $T_R$. Figure not reproduced.]
 
 With every kernel there can be associated a number $T_K$ which is
 the ratio of the calculation time per point to the input time per point.
 $T_K$ is the calculation time per point normalized to the disk transmission
 rate; it is the kernel time in mesh blocks. It is assumed that $T_K$ is
 independent of the size of the mesh block which the kernel is updating;
 although if the compute time varies slightly because of I/O interrupts,
 etc., $T_K$ should be an upper bound on the compute time. We must have
 $T_K \le T_R$; and we wish to have $T_R$ as close to $T_K$ as possible in order to
 match the overall transmission rate of the scheme to the speed of the
 kernel. If several different kernels are applied to the mesh, then we
 match the scheme parameter T to the slowest kernel, since it appears that
 the parameters of the scheme cannot be changed during a sweep.

 Given a mesh of dimensions M X N 8x8 squares with $N_v$ variables
 per point, a stencil depth d, and a normalized kernel time $T_K$, the problem
 is to find scheme parameters for which the latency of the disk and the
 value $T_R - T_K$ are as small as possible. In reality, $T_R - T_K$ is part of
 the overall latency since for $T_R > T_K$ the data on disk is "not there"
 exactly when we are ready to compute on it. The value $T_R - T_K$ can be
 thought of as the ratio to Tb segments, of the number of segments for
 which the computer is waiting for the disk to send another mesh block.
 In spite of this, the present discussion will distinguish between overall
 latency and the value $T_R - T_K$.
 
 4.2 Measuring Latency
 
 The measure $T_R - T_K$ represents latency which, in a sense, can be
 made to go away by increasing $T_K$. It is not proposed to increase the
 complexity of a given kernel for no reason; but rather if a solution of the
 scheme parameter relations exists, then one might look for a problem with
 $T_K$ close to the $T_R$ for the scheme. $T_R - T_K$ might be called external
 latency.

 The question now is: "What is internal latency?" We can include
 the wasted time between complete sweeps of the mesh, the "hiccups" between
 sweeps along rows or columns of the mesh, latency due to incompletely
 filled blocks on the last row or column, and the time spent skipping over
 the dead area on disk. The last three elements will be discussed here.

 The size of the dead area on each logical track is 1200t-b(kT-1)
 segments. Since it is passed on each logical revolution, the fraction of
 total time spent on the dead area, i.e., the dead area latency, is

 $$L_D = \frac{1200t - b(kT-1)}{1200t} .$$

 It is unlikely that any code could do useful work to mask the dead area
 latency since the dead area is not distributed across the disk blocks in
 a logical track, and furthermore since it moves left by 2T-1 disk blocks
 relative to the mesh with each complete sweep.
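
 The dead area latency depends sharply on how well kT-1 divides 1200t. The
 sketch below is an illustration under stated assumptions, not the report's
 program: it takes b as the largest integer with b(kT-1) <= 1200t, evaluates
 the dead area as a fraction of the logical track, and scans a range of k
 for the smallest value. The function names and the scanned range are
 hypothetical choices.

```python
from fractions import Fraction


def dead_area_latency(k, T, t):
    """Fraction of a logical revolution spent over the dead area."""
    capacity = 1200 * t
    blocks = k * T - 1
    b = capacity // blocks              # largest b with b*(k*T-1) <= 1200*t
    return Fraction(capacity - b * blocks, capacity)


def best_k(T, t, ks=range(4, 31)):
    """Return the first k in ks with the smallest dead area latency."""
    return min(ks, key=lambda k: dead_area_latency(k, T, t))


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the report.
    print(float(dead_area_latency(6, 5, 1)))
    print(best_k(T=5, t=1))
```

 For T = 5 and t = 1 the scan settles on k = 5, because 24 disk blocks of 50
 segments fill the 1200-segment track exactly.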
 
 The "hiccups" between sweeps along rows and columns will cause
 latency unless $n = \ell_n k-2$ and $\ell_n = 3T$, and likewise for m. For
 $n = \ell_n k-2$ there will be $3T-\ell_n$ disk blocks of latency at the end of
 each row. In addition, if $N/q \le n \le \ell_n k-2$, we may not be using all of
 the $\ell_n k-2$ mesh blocks allowed. Then $\ell_n k-2 - N/q$ blocks are effectively
 wasted. For every block so wasted we incur an additional T blocks of
 latency. This occurs at the end of every row, out of a disk revolution
 through approximately $\ell_n(1200t) + Tb$ segments. The latency incurred at
 row connections is therefore

 $$L_R = \frac{\left(3T - \ell_n + T(\ell_n k-2 - N/q)\right) b}{\ell_n(1200t) + Tb} .$$

 Similarly, for column sweeping,

 $$L_C = \frac{\left(3T - \ell_m + T(\ell_m k-2 - M/p)\right) b}{\ell_m(1200t) + Tb} .$$

 Additional latency occurs on the last row for row-sweeping if
 M < pm. The dimension m should be the smallest integer such that $M \le pm$.
 This latency, like the latency at row connections, might be masked by
 useful calculation on the boundaries of the mesh. Extra computing time can be
 provided for all four boundaries by strictly enclosing the M X N mesh in
 the m X n blocks. It will be assumed here, however, that updating
 calculations on boundary points are no more time-consuming than those on
 interior points. The storage available in the last row, but not used,
 amounts to $N_v N(mp-M)/4$ segments. The latency occurs once in m rows; its
 value is approximately

 $$L_{RL} = \frac{\tfrac{1}{4} N_v N(mp-M)\,T_R}{m(\ell_n(1200t)+Tb)} .$$

 Similarly, for column-sweeping,

 $$L_{CL} = \frac{\tfrac{1}{4} N_v M(nq-N)\,T_R}{n(\ell_m(1200t)+Tb)} .$$

 In order to determine overall internal latency, one must take
 account of the order of sweeping. For an equal number of row and
 column sweeps,

 $$L = L_D + \tfrac{1}{2}(L_R + L_{RL} + L_C + L_{CL})$$

 is a reasonably good measure of overall internal latency.
 
 One latency term has not been included. It is the latency spent 
 re-initializing between complete sweeps. There will be no attempt here 
 to calculate it, although in some problems it may be important. 
 
 4.3 Storage Requirements 
 4.3.1 Fast Memory 
 
 If one examines the sequence of I/O for sweeping the mesh, he 
 can tally all mesh blocks and edges contained in fast memory at every 
 instant, and determine the amount of storage needed for the data as a 
 function of time. If the amount of storage needed for program and scratch 
 area is added, and the maximum over time of the total storage required is 
 determined, one can state whether the scheme will work for a memory of 
 given size. Alternatively, one may examine storage requirements over a 
 sample of problems and problem sizes, and attempt to estimate the amount 
 of fast memory required for a particular installation. 
 
 
 The maximum amount of fast memory required for storage of the 
 mesh data is a function of the mesh and scheme parameters. It is also, 
 in a somewhat odd way, a function of the organization of the fast memory 
 itself, and of the size of the smallest addressable segment on disk and 
 the length of the disk track. 
 
 The storage required for sweeping in transposed mode is greater 
 than for sweeping in normal mode. The difference is of the order of an 
 edge group, but it must be remembered that edge groups are usually larger 
 if a transposed read is required, since storage for individual edges rather 
 than edge groups is rounded upwards to the nearest disk segment. 
 
 Since there are many numerical PDE problems which do not require 
 changing the direction of sweep, it is worthwhile to investigate the require- 
 ments for normal reading separately from transposed reading. In this report 
 only normal reading will be analyzed. 
 
 Because of the parallel structure of ILLIAC IV, special problems 
 arise in the allocation of memory. One must be clever in the design of the 
 program and in the distribution of data across the processing elements. One 
 of the constraints imposed in the analysis, that of considering 8x8 squares 
 as the smallest subdivision of the mesh, resulted from taking account of the 
 structure of the fast memory. This structure also causes problems with 
 storage of edge values. If an edge or an edge group is packed tightly into 
 the smallest number of quadrant words that will contain it, then it is likely 
 that some of the edge points will not be located in the proper processing 
 elements. Additional code will be needed to route data to proper PE's when 
 the data is needed. The space saved by packing the edges or edge groups 
 may be used up by the additional code and scratch areas. 
 
 
 Nevertheless, in this analysis we will calculate storage
 requirements based on having data packed moderately tightly. For
 row-normal reading an edge group consists of $8N_v kqd$ words. The number of
 disk segments needed to contain an edge group is

 $$S^{EGR} = \left\lfloor \frac{8N_v kqd + 255}{256} \right\rfloor .$$

 The number of quadrant words needed as an I/O area for an edge
 group is

 $$W_{SQ}^{EGR} = 4S^{EGR} .$$

 Likewise, the number of segments needed to contain a mesh block is

 $$S^{B} = \left\lfloor \frac{N_v pq + 3}{4} \right\rfloor$$

 and the number of quadrant words needed as an I/O area for the mesh block
 is

 $$W_{SQ}^{B} = 4S^{B} .$$

 For calculations on block (i,j), an area must be set aside to
 store the right edge of block (i,j-1). As calculations on (i,j) sweep the
 block, old values from the block may be moved into the edge area so that
 when (i,j) is completely updated, the edge area contains the right edge
 of (i,j). The number of quadrant words needed for the single edge is

 $$W_{Q}^{E} = \left\lfloor \frac{8N_v pd + 63}{64} \right\rfloor .$$

 We might include an 8 X 8 mesh square of storage as a token
 amount for overhead. This amounts to $N_v$ quadrant words. If the I/O
 sequence for row-normal reading is examined, it can be seen that we need

 $$W_{mem}^{R} = 3W_{SQ}^{EGR} + W_{SQ}^{B} + \mathrm{Max}(W_{SQ}^{EGR}, W_{SQ}^{B}) + W_{Q}^{E} + N_v$$

 quadrant words of fast memory for storage of the data. The third term takes
 account of the case in which the ending of one row moves far into the
 beginning of the next. A similar expression $W_{mem}^{C}$ exists for
 column-normal reading.

 No attempt will be made here to estimate the storage required for
 program and scratch areas.
 
 4.3.2 Disk
 
 An easy way to manage disk is to allocate half of the disk to old 
 mesh and edge groups and half to updated mesh and edge groups. This proce- 
 dure takes no advantage, however, of the space on disk which becomes avail- 
 able for updated mesh blocks as the mesh is swept. 
 
 Referring once again to Figure 4, note that, starting from block
 (1,1) for example, successive disk blocks are filled with each k blocks
 added to the mesh row. The last block in the mesh row is $(1, \ell_n k-2)$. If
 $\ell_n \le T$, the mesh row will fit into one logical track, but if $\ell_n \ge T+1$,
 we will have to use another logical track for storage. The same situation
 occurs for the other mesh rows. One mesh row would require
 $(\lfloor (\ell_n-1)/T \rfloor + 1)$ logical tracks.

 It is now proposed that rather than using T out of T disk blocks
 for storage, we use only T-1 out of T blocks; so that we use another logical
 track for $\ell_n \ge T$, and yet another for $\ell_n \ge 2T$ (the second logical track
 is filled completely). In this way we insure that there will be an empty
 block immediately before blocks (1,2), (1,3), ..., (1,k). We may then write
 updated blocks in these spaces as we sweep the row. Updated block (1,k-2)'
 is written in the space before (1,k). (1,k-1)' is then written over (1,1)
 since we do not need (1,1) any more. Old blocks are thus successively
 overwritten by the updated (k-2)th block following them in the row. One
 mesh row then requires $(\lfloor \ell_n/T \rfloor + 1)$ logical tracks; and m mesh rows
 require $m(\lfloor \ell_n/T \rfloor + 1)$ logical tracks.

 Note that this procedure for managing disk also works for
 column-sweeping. Because of this we may choose the smallest of two possible
 storage requirements:

 $$W^{B} = \mathrm{Min}\left(m(\lfloor \ell_n/T \rfloor + 1),\ n(\lfloor \ell_m/T \rfloor + 1)\right) .$$
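
 The track counts for mesh storage can be tabulated with a few lines of
 code. The sketch below is illustrative and follows the counts as read from
 this section: floor((l-1)/T) + 1 logical tracks per mesh row when every
 block position is used, floor(l/T) + 1 when one position in T is left free
 for updated blocks, and the smaller of the row-wise and column-wise totals
 for the whole mesh. Function and argument names are assumptions.

```python
def tracks_per_row_packed(l, T):
    """Logical tracks for one mesh row when all T of T block positions are used."""
    return (l - 1) // T + 1


def tracks_per_row_overwrite(l, T):
    """Logical tracks for one mesh row when only T-1 of T positions hold old blocks."""
    return l // T + 1


def mesh_logical_tracks(m, n, l_m, l_n, T):
    """Smaller of the row-sweeping and column-sweeping storage requirements."""
    by_rows = m * tracks_per_row_overwrite(l_n, T)
    by_columns = n * tracks_per_row_overwrite(l_m, T)
    return min(by_rows, by_columns)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the report.
    print(tracks_per_row_packed(5, 5), tracks_per_row_overwrite(5, 5))
    print(mesh_logical_tracks(m=8, n=24, l_m=2, l_n=5, T=5))
```

 The design trade is visible in the first line of output: leaving one block
 in T empty costs an extra logical track per row at l = T, in exchange for
 not reserving a second full copy of the mesh for updated blocks.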
 
 No such game can be played with edge group storage. However,
 we may still choose the minimum requirement for two possible storage
 methods. Attention is drawn to the upper edge groups (superscript U) in
 Figure 4. One storage method is shown in the schematic. Groups with
 successive subscripts are stored in adjacent blocks on a logical track
 for $\ell_n \le T$. Group [label illegible] would be stored on logical track 3
 in the first position, although it is not shown since the drawing has
 $\ell_n = T-1$. Groups with successive Roman numerals simply wind around disk
 at intervals of T blocks, until the last one, $\ell_m k-2$. For every increment
 of $\ell_m$, another $(\lfloor (\ell_n-1)/T \rfloor + 1)$ logical tracks are added, as long as
 $\ell_m \le 2T$. If one increases $\ell_m$ to 2T+1, $2(\lfloor (\ell_n-1)/T \rfloor + 1)$ logical
 tracks must be added; however, we will consider only $\ell_m, \ell_n \le 10$ and
 $T \ge 5$.

 The second possible storage method is to put $R_1, R_2, ..., R_{\ell_n}$ on
 different tracks, and to put $I_1, (k+1)_1, ..., (ik+1)_1$ in adjacent blocks on
 the same logical track for $i \le T-1$. $I_1, II_1, III_1, ...$ are still to be
 stored on the same logical track, spaced T blocks apart. The expression
 for the number of logical tracks required is the dual of the expression for
 the first method if $\ell_m, \ell_n \le 2T$.

 The number of logical tracks needed for upper edge groups is

 $$W^{EG} = \mathrm{Min}\left(\ell_m(\lfloor (\ell_n-1)/T \rfloor + 1),\ \ell_n(\lfloor (\ell_m-1)/T \rfloor + 1)\right) .$$

 We allocate four such areas on disk, one each for old upper and lower edge
 groups and one each for the new.

 The amount of disk storage needed is

 $$W_{disk} = t\,(W^{B} + 4W^{EG})$$

 tracks. Note that this measure is in disk tracks and not logical tracks.
 
 
 5. NUMERICAL SOLUTIONS 
 
 For a given problem M, N, $N_v$, d, $T_K$, a scheme must be found with
 a latency which is within an acceptable limit. The difference $T_R - T_K$
 should be accounted for in the latency measure; however, this section will
 be an informal discussion of the existence of schemes for given values of
 M, N, $N_v$, d and will consider only the internal latency of the scheme.

 If the values of M, N, $N_v$, T, t, k, $\ell_m$, $\ell_n$ are specified, a
 scheme can be determined if all of the relations of the last section are
 satisfied. The latency may be calculated and tested for acceptability.
 The value $T_R$ may also be calculated; as well as a maximum allowable value
 of d. If we are interested only in normal reading in either direction,
 then we may leave one of the $\ell$-values unspecified, and impose some
 additional restriction or specify some other parameter.
 
 A program has been written for the purpose of investigating the
 existence of solutions of the parameters for various problems M, N, $N_v$.
 A simple procedure is used: T, t, k, $\ell$ are iterated, and valid solutions
 for which the latency is less than or equal to .12 are printed. For each
 T, t, k, values for $\ell_m$ and $\ell_n$ are tried successively in an attempt to
 find a scheme for column-normal or row-normal reading respectively. The
 program is written in Burroughs B5500 Extended Algol, and it is listed in
 the appendix to this report.
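
 The flavor of such a search can be conveyed by a much reduced sketch. The
 code below is not the Algol program of the appendix and makes several
 simplifying assumptions: it iterates only T, t and k over the ranges quoted
 in this section (5 and 6, 1 to 5, 4 to 30), takes b as the largest integer
 with b(kT-1) <= 1200t, and filters on the dead area term of the latency
 alone. All names are hypothetical.

```python
def search(max_latency=0.12, Ts=(5, 6), ts=range(1, 6), ks=range(4, 31)):
    """Enumerate (T, t, k, b, dead_latency) with dead area latency <= max_latency.

    Only the dead area contribution is evaluated; a full search would also
    have to test the mesh block, edge, and row/column latency relations.
    """
    found = []
    for T in Ts:
        for t in ts:
            for k in ks:
                blocks = k * T - 1
                capacity = 1200 * t
                b = capacity // blocks
                if b <= 2:
                    continue  # nothing left after the 2 head-switching segments
                dead = (capacity - b * blocks) / capacity
                if dead <= max_latency:
                    found.append((T, t, k, b, dead))
    return found


if __name__ == "__main__":
    candidates = search()
    print(len(candidates), "candidate (T, t, k) combinations")
    print(min(candidates, key=lambda c: c[4]))
```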
 
 For each mesh size M X N squares, values 3 to 8 in unit steps
 were assigned to $N_v$. The free parameters T, t, k were iterated in unit
 steps over the ranges 5 and 6, 1 to 5, and 4 to 30 respectively. Values
 for $\ell_m$ or $\ell_n$ were chosen from Table 2. The results of particular interest
 are the values of $T_R$ and $W_{mem}$ obtained. We would like to see many
 solutions, with values of $T_R$ well distributed and with memory requirements
 very low. A survey of the results that have been obtained is presented in
 Table 4. All of the solutions have $W_{mem} \le 1500$ and $W_{disk} \le 48$. The
 smallest problem listed is core contained for $N_v$ = 3 and 4. The largest
 problem listed represents about one-third of disk capacity for $N_v$ = 8.

 For each solution a maximum allowable value of d, $d_{max}$, is
 calculated. If $d_{max} < 3$, the solution is rejected. Tests have shown that
 no more than eight per cent of the solutions obtained by the program are
 rejected on this basis. Fast memory requirements are calculated for d = 3.
 There are few, if any, finite difference stencils in use for which d is
 greater than three; so that it is justifiable to group all solutions with
 $d_{max} \ge 3$ into one class. Each of these solutions is valid for $d \le 3$.
 max — — 
 
 The smallest and largest values of $T_R$, over the entire range of
 the parameter $N_v$, are listed. For the large meshes the solutions are
 numerous over small ranges of $T_R$; and the trend seems to be that solutions
 are fewer for smaller meshes and occur over larger ranges of $T_R$. The
 distribution over $T_R$ is discussed below. At this point a comment should be
 made concerning the method of finding the solutions.
 
 The program used to obtain the results incorporates an artificial
 restriction on the mesh block size which is equivalent to an attempt to
 minimize the value of $T_R$ for given scheme parameters. Within the program
 the parameters p and b are calculated; and from these a largest value of
 q is determined such that the bound $N_v pq \le 4(b-2)$ is satisfied. If q were
 iterated downward from this largest value, solutions with larger $T_R$ might
 be found. The program uses only the largest q; the attempt is repeated
 with the variables p and q interchanged. There is another constraint on
 q which leads us to expect higher $T_R$ for smaller mesh dimensions. Given
 n and N, in order to minimize internal latency, q should be the smallest
 value such that $qn \ge N$. For smaller mesh dimension N, this bound is more
 likely to be lower than the bound mentioned above. In fact, most of the
 solutions for the small meshes had m or n equal to one and p or q spanning
 an entire dimension of the mesh. In such solutions an entire column or
 row of 8 X 8 squares would be read at a time; and the $W_{mem}$ were
 overestimated by the program because the edge group areas allocated would
 not be needed.

                                              Percent of Solutions
                                              With $W_{mem} \le$
   M X N       Number of     $T_R$ Range      400     800     1200
               Solutions
   20 X 20     [illegible]   [illegible]      [illegible]
   28 X 32     270           5.21 - 59.43      55     100     100
   30 X 40     15?           5.83 - 44.00      48      99     100
   35 X 35     27            6.00 - 48.00      41     100     100
   47 X 47     89            5.16 - 28.37      35      85     100
   45 X 55     405           5.10 - 24.44      29      79      97
   55 X 65     497           5.09 - 17.09      23      73      94
   60 X 70     400           5.11 - 14.00      36      94     100
   90 X 70     507           5.10 - 11.91      33      88      98
   60 X 110    675           5.08 - 12.27      29      78      96
   90 X 110    567           5.07 - 11.02      33      83      95

 Table 4. Survey of results obtained on 11 mesh sizes.
 Latency $\le$ .12, normal reading in either direction.
 
 A rigorous mathematical investigation of the existence of solutions
 to the system of relations has not been performed; and the program used does
 not find every solution possible in the given $T_R$ ranges. In fact, because
 of the peculiar behavior of the remainder terms in the integer divisions,
 the program may not even find the solutions with smallest $T_R$ for the
 iterated parameters. The results obtained, however, are interesting even
 without the assurance that all possible solutions have been found.

 The last three columns of Table 4 give the percents of the
 solutions found for which fast memory requirements are less than or equal
 to 400, 800, and 1200 quadrant words. For the larger meshes, a larger
 percentage of solutions have high memory requirements. This does not
 necessarily indicate, however, that larger meshes require more memory.
 No attempt has been made at this time to examine in detail the fast memory
 requirements; but this problem is worthy of further investigation.
 
 The relationship between $T_R$ and $N_v$ is illustrated in Figure 10
 for four pairs of mesh dimensions. The highest and lowest values of $T_R$
 found are plotted for each value of $N_v$. In addition, selected solutions
 for $N_v$ equal to 3 and 8 are plotted, and the number of solutions found is
 printed above the highest point for each $N_v$. The additional points at
 $N_v$ = 3 and $N_v$ = 8 are selected to represent the densities over $T_R$ of the
 solutions obtained. It is seen that the density increases with increasing
 mesh size. Further tests are yet to be performed to determine whether the
 procedure of iterating p or q downward yields values of $T_R$ which would
 close the gaps in the higher regions.

 Note that the highest points for the two smallest meshes form
 straight lines. The six points in each of the graphs represent the same
 solution with only the difference that for smaller $N_v$ the value of $T_R$ is
 greater and the fast memory requirement is lower. A solution for the
 problem M, N, $N_v$, d, $T_R$ is also valid for the problem
 M, N, $N_v-1$, d, $\frac{N_v}{N_v-1}T_R$, but it is not in general valid for the
 problem with $N_v+1$ variables because of the requirement $N_v pq \le 4(b-2)$.
 The term "solution" here refers to the set of parameters
 {T, k, t, $\ell_m$, m, $\ell_n$, n, p, q, b, L, $W_{disk}$} where L is the internal
 latency of the scheme. Note that L is independent of $N_v$ and $T_R$.
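
 The scaling of the real period with the number of variables follows
 directly from the expression for the real period in Section 4.1. As a
 short worked step, assuming $T_R = 4Tb/(N_v pq)$ and holding T, b, p and q
 fixed while the number of variables drops from $N_v$ to $N_v - 1$ (the primed
 symbol is introduced here only for this illustration):

```latex
T_R' \;=\; \frac{4Tb}{(N_v-1)\,pq}
     \;=\; \frac{N_v}{N_v-1}\cdot\frac{4Tb}{N_v\,pq}
     \;=\; \frac{N_v}{N_v-1}\,T_R ,
\qquad
(N_v-1)\,pq \;<\; N_v\,pq \;\le\; 4(b-2) .
```

 The second relation shows why the step down always remains feasible: the
 mesh block only shrinks. Stepping up to $N_v + 1$ enlarges the block and may
 violate the capacity bound.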
 
 A straight line (with the slope indicated) may be drawn to the
 left from every solution on the graph; solutions exist along these lines.
 One such line is drawn on the first graph to indicate the existence of a
 solution at the point marked "x". This solution would also be found using
 the procedure of iterating p or q downward. This procedure might also
 yield solutions for $N_v$ = 3 which are invalid for higher $N_v$.

 [Figure 10: $T_R$ versus $N_v$ for four pairs of mesh dimensions. Figure not reproduced.]
 
The most interesting region of the graph is the low end of the
$T_R$ scale, since for smaller $T_R$ the disk more closely approximates a fast
memory for the restricted class of problems considered here. The results
in this region are better for the larger meshes than for the smaller meshes
in both the density of solutions and in the minimum $T_R$ obtainable. Figure 11
shows the minimum values on the same graph.
 
A fact of moderate interest is that a solution for the problem
$M, N, sN_v, d, T_R$ where $s$ is an integer can be modified to accommodate the
problems $sM, N, N_v, d, T_R$ or $M, sN, N_v, d, T_R$ independently of the direction
of sweeping. The modification consists of multiplying p or q by $s$; $T_R$ is
unchanged and $W_{mem}$ decreases. Non-integer values may be substituted for
$s$ if one is careful to ensure that wherever $s$ is used as a multiplier, the
result is an integer.
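A short numerical illustration of this modification follows. It is only a sketch: it assumes the relation $T_R = 4Tb/(N_v p q)$ read from the Appendix listing, the helper names are hypothetical, the parameter values are arbitrary, and the accompanying change in the memory requirement is not modelled.

```python
from fractions import Fraction


def real_period(T, b, n_v, p, q):
    """T_R = 4*T*b / (N_v*p*q), as read from the Appendix listing (assumed)."""
    return Fraction(4 * T * b, n_v * p * q)


def rescale(n_v_total, p, q, s, grow="p"):
    """Trade a factor s of the variable count for a factor s in p or q.

    n_v_total is the s*N_v of the original problem; the result describes
    the problem with N_v variables and a mesh enlarged by s in one dimension.
    """
    if n_v_total % s != 0:
        raise ValueError("s must divide the variable count")
    n_v = n_v_total // s
    return (n_v, p * s, q) if grow == "p" else (n_v, p, q * s)


# Hypothetical values: a solution for s*N_v = 8 variables with s = 2.
T, b, p, q, s = 6, 50, 4, 6, 2
before = real_period(T, b, 8, p, q)

n_v, p2, q2 = rescale(8, p, q, s, grow="p")
after = real_period(T, b, n_v, p2, q2)

# The product N_v*p*q is unchanged, so T_R is unchanged as well.
assert before == after
print(n_v, p2, q2, after)
```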
 
In an automatic method for finding the optimum solution for a
given problem, the external latency should be added to the internal latency,
and the result could be used as a measure of the total latency. The solution
with the smallest total latency would be the optimum solution. The
external latency is given by

$$L_{\text{external}} = \frac{[\text{numerator illegible in the source}]}{l_m\,(1200\,t) + T\,b}$$
 
for column sweeping. An exact expression would be obtained by replacing
$Tb$ in the denominator by $T$ [subscript illegible in the source]. The search
for a solution could continue over a wide range of the parameter $T$. Real
values of $T < T_R$ or even slightly greater than $T_R$ might be tried. Of
course the interval over which $T$ is iterated should be properly discretized
to assure that disk blocks do not begin within a disk segment.

[Figure 11. Minimum values of $T_R$ found, plotted against $N_v$ for the
28 x 32, 30 x 40, 45 x 55, and 90 x 110 meshes.]
 
 A search over very many values of T would be lengthy, and an 
 automatic method should incorporate a procedure to decide when to stop. 
 The first solution found with an acceptable total latency might be taken, 
 for example. 
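The two selection rules mentioned here (take the solution with the smallest total latency, or stop at the first one that is acceptable) might be sketched as follows. This is not the procedure used to obtain the results in this report; the function names, the candidate records, and the latency figures are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_acceptable(
    candidates: Iterable[dict],
    internal: Callable[[dict], float],
    external: Callable[[dict], float],
    threshold: float,
) -> Optional[Tuple[dict, float]]:
    """Return the first candidate whose total latency is acceptable."""
    for c in candidates:
        total = internal(c) + external(c)
        if total <= threshold:
            return c, total          # stop the search here
    return None


def optimum(candidates, internal, external):
    """Return the candidate with the smallest total latency, if any."""
    return min(candidates, key=lambda c: internal(c) + external(c), default=None)


# Made-up candidates: 'li' and 'le' stand for internal and external latency.
cands = [
    {"T": 5, "li": 0.10, "le": 0.05},
    {"T": 6, "li": 0.04, "le": 0.06},
    {"T": 7, "li": 0.01, "le": 0.02},
]
li = lambda c: c["li"]
le = lambda c: c["le"]

hit = first_acceptable(cands, li, le, threshold=0.12)
best = optimum(cands, li, le)
print(hit[0]["T"], best["T"])
```

Stopping at the first acceptable candidate shortens the search at the cost of possibly missing a better solution further along the range of $T$.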
 
The efficiency of disk allocation (i.e., the percentage of allocated
space which is actually used) has not been considered here, but it
should be noted that for $T \ll T_R$ this efficiency cannot be high. If this
is an important factor, one might start the search with higher values of $T$.
One of the $l_m$ or $l_n$ would have to be close to a multiple of $T$ for high
storage efficiency.
 
 We can expect that solutions would be fewer and memory requirements 
 higher for problems requiring changes in the direction of sweeping. Changing 
 direction might be useful in handling boundaries of the mesh, for example. 
 It is meaningless, however, to speak of direction changes for solutions 
 with p or q spanning a dimension of the mesh; there is only one direction 
 possible. Since most of the solutions for small meshes were of this type, 
 we might expect that changing direction could be meaningful only for large 
 meshes. 
 
 It seems that the ability to change direction of sweep is of 
 questionable importance. In this light the results given would indicate 
 that the scheme presented in this report can be useful for non-core- 
 contained problems. 
 
 
 6. CONCLUSION 
 
 The scheme described in this paper is only one of a family of 
 schemes that could work. Another member of the family is obtained by 
 interchanging the positions of upper and lower edge groups R. and R. 
and shifting all of the mesh blocks in Figure 4 left by T disk blocks.
 
 It might be possible to find a scheme which would work with a 
 logical period T of four. This might be done by allowing transmissions of 
 two edge groups within the space of one disk block. It is clear, however, 
that no scheme could have a logical period T or a real period $T_R$ less than
 two, since every mesh block must be input and output. The value two is 
 an obvious lower limit on the period. 
 
There are a number of problems in which several different kernels
are to be applied to different subsets of the $N_v$ variables. There is no
provision in the present scheme for handling these types of problems
efficiently; on each sweep of the mesh, all of the variables are transmitted.
If one transmits only the variables needed, then the value of $T_R$ is
effectively increased since the logical period $T$ must remain constant during
problem execution. Remapping of the mesh onto disk might be considered,
but this would not be an easy procedure to use. The two problems of
changing the period and handling overlapping subsets of the $N_v$ variables
are somewhat related and are worthy of further investigation.
 
 
 APPENDIX 
 
The following is a listing of the Burroughs B5500 Extended Algol
program used to obtain the numerical results. A note may be helpful to
the reader who is unfamiliar with the language. The operator "/" results
in a floating point division with a floating point result. The operator
"DIV" indicates a fixed point division with truncation of the remainder.
The expression "A DIV B" is equivalent to $\lfloor A/B \rfloor$.
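For readers more familiar with a present-day language, the two operators can be imitated in Python as sketched below. The helper names are hypothetical. Truncation toward zero for negative operands is an assumption; for the positive quantities that appear in the listing, truncation and the floor coincide.

```python
def algol_div(a: int, b: int) -> int:
    """Fixed point division with the remainder truncated (toward zero, assumed)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def ceil_div(a: int, b: int) -> int:
    """The idiom (A + B - 1) DIV B, which rounds up for positive operands."""
    return (a + b - 1) // b


print(7 / 2)             # floating point division, like "/"
print(algol_div(7, 2))   # fixed point division, like "DIV"
print(ceil_div(90, 7))   # smallest integer not less than 90/7
```

The rounding-up idiom appears in the listing wherever a mesh dimension is divided into blocks, for example in the statement that computes P from MM and M.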
 
 
 B t G I *» 
 
 REAL ARHAY PCTtt«20]; 
 
 REAL TRE AL.MilLT,l T C Y , C P> OM, ()m, L TC YM# LK YN> TR, TRM T N, TRM AX > 
 
 INTEGER ARRAY THI I 5 I 1 i! » : 7 ] > H 1 ST t 1 I 20 ]* MMM, NNN [ 1 t 1 5 1 ; 
 
 integer mv.tp»omax*k» 1 »p.u»nvpq»b# ktmi»lt»pm2>fbm2» unfil, 
 
 N»m,nn>mm»LM,LN. ILM# ILN*S» SUM,MAxPQ# I»DSKM#0SKN# 
 
 FnGM,E0GN,Dl<^»LMKTM1,LNKTMl,ISZ#TI»D.ENPK,WEGSQ»WMB<i0#WFg»WMEM) 
 ALPHA SYM,SYM1) 
 FILE PTO'iT 15( 1 #1 I )f 
 LARFI SKTP,SkIP2.C YC J 
 LIST LSTrNV* TP»LM,M»l N.N. TREAL» SYM, I)>K » T*P* Q» MVPO* B # llNF IL * RULT # 
 
 LTCY#SYM,SYM1> OM,DN,wMEM.DISK)J 
 FORMAT FA(//2I5," = MM.NN CRITERION *",F5.3/ 
 
 "N\/ Tp is m LN N TREAL D K T PC NVP 
 
 Q B UNFIL R'lLT LTCY OMRL DNRL MFMRY DISK"}* 
 
 FH(l2.v2#T2»2(x3M^»H)'X2.F5,2»Xi,Al»X?,T2»X3,i2*X3»n»IS»H» 
 
 x3,T4,YJ.n,yi.l3,y?,E4.3,x<»#F<»,3»xl>2Al,X4,x3»2F5,1,I8,I5), 
 Fwv("**«*«* 1PMTn="F5.2 m TRHAXe"F5,2 M SUMx"I5)> 
 
 F MSTD(»FA<;T-Vf M hEylUREMENT I STR I BljT T ON" v5"N0 . OF SOL^S = "I6» 
 *10 ,, F0 ( ? M I^"* ,, I 3// 
 " Q-W"Y6"FWiO PCT ACCOM*/)* 
 FHSTC 14. I 1 0,F 7 , 1 ) I 
 THL F 5 , * 1 WITH 4,1 »b»b» 10> 
 THLTft»*] WITH 5. l»3»t>» 7.91 
 TBL r 7 ,* l wlT H 5,1.3, f #«, 10; 
 THL r«,* 1 WITH 5,1 »3*!>*8»VJ 
 TBL r 9 » * 1 WITH 6, 1 , 3.b#6.9» tOJ 
 T H L T 1 , * ] w T TH 6»l*3*S*6*7*10j 
 TBLfll**] WTTH («■, 1 , 1,5#*» 7»fl » 
 
 FILL 
 FILL 
 FILL 
 FILL 
 FILL 
 FILL 
 FILL 
 FILL 
 FILL 
 FILL 
 
 T H L r 1 2 » * I WTTH J»1»*,5»*#7»B»9; 
 
 MMMf*1 wlTH 30. ?j*i*f 35. <tb»«7.55»60»60»90»9n J 
 NNNT * 1 WITH 40»?u» 32. 35,55»47#65>70» U0»70» 1 10J 
 CH*.12; 
 
 FOR TS/«-< STFP 1 UNTIL 11 00 
 RE<", I M 
 
 FOR I«-l STEP 1 Ll\T 1 1 20 DO HlST[I]«-OJ 
 
 WRITECPTOHTTPAGF J ); 
 
 MM*MMM[ ISZ ] S NN^NNNT ISZ1 J 
 
 WRITE(PTOUT.FA,^M.NN.CR>; 
 
 FOR NV«-3 STEP 1 UNTIL 8 DO 
 
 R E G T N 
 
 IF NVxMmxNn>210uOO THEN GO TO SKIP?I 
 
 TRMIN»099J 
 
 TRMAX4-0 I 
 
 SU^«-01 
 
 FOR Tp^b.6 OH 
 
 FOR T«-1 STfP 1 
 
 FOR KM STEP 1 
 
 FOR ILM«-1 S1EP 
 
 BEGIN 
 
 MM*MMMl I S7 J t NN*NNNt ISZ] J 
 
 s y m l ♦ »• " ; 
 
 KTMl<-K«TP-l ) 
 R«-(LT«-l20nxT) D I V 
 UMF IL*.T-KTMl*b; 
 IF T=l THFN SYM>" 
 FL*E IF 1?00 
 
 UNTTL 5 00 
 
 UNTIL 30 00 
 
 1 UNTIL TRL[TP»01 
 
 ktmi; 
 
 DO 
 
 MOO B s THEN SYM«-"E" 
 
 
 FlSE 
 FlSE 
 
 IF ?x(T-l)iUNFlL THFN SYM*"A« 
 
 riEGTN 
 
 B«-B-U 
 UNFIL*UNFIL+KTM1 I 
 
 SYM«-"B" 
 
 LNO I 
BM2←B-2;
FBM2←4×BM2;
LM←TBL[TP,ILM];
M←LM×K-2;
CYC:
P←(MM+M-1) DIV M;
Q←FBM2 DIV (NV×P);
IF Q=0 THEN GO TO SKIP;
N←(NN+Q-1) DIV Q;
Q←(NN+N-1) DIV N;
LN←(K+1+N) DIV K;
NVPQ←NV×P×Q;
RULT←UNFIL/LT;
DM←M-MM/P;
DN←N-NN/Q;
TREAL←4×TP×B/NVPQ;
LTCYM←((TP×(3+DM)-LM)×B+.25×NV×MM×(Q-NN/N)×TREAL)/
(LM×LT+TP×B);
LTCY←LTCYM+RULT;
IF LTCY>CR THEN GO TO SKIP;
 DMAX«-c2S6*bM2l DIV CENPK«-fl*NV*P*K ) J 
 IF UPAXO 1HEW GO TO SKlPl 
 
 D«-JJ 
 
 WFGS0*ftx((UxE^PK*255) DIV ?56)J 
 
 WMBSQ*4X( (NVPP*3) DTV 4}) 
 
 WFQ«-(B*NV*WxD + 63) 1 V 64* - urM c /. c « 
 
 W MEM«-3«WEf,SQ*WMBS0*WEQ + NV + (IF wEGSQ>WMBSO THEM WEGSO 
 
 FlSE wMBSOJ 
 IF wMFUM^OO THEN GO TO SkTPJ 
 
 S«-((lF WMFM$2n00 THEN KMEm ELSE 20005-1) DIV 100 ♦ U 
 D^KM*(M-ENTTEC(DM))x(LN DIV TP *l)l 
 d<;kn«-n*(Lm DIV TP *l)J 
 FOGM + I mx( (LN-1 ) DIV TP ♦ 1 ) J 
 E0GN4-1 N*((IM-1 ) DIV TP *1)» 
 
 DTSK*Tk((TF D<*M<DSKN THEN OSKM FLSE DSKN)+ 
 U*L\r EUr.H<EDGN THEN EDGM ELSE FDGN))) 
 0lSK>4« THFN GO TO SKlPI 
 SYm1*"N" THEN 
 H I G I N 
 
 t i *l m; lm«-ln» ln«-tii 
 
 T T 4- M i M*N» N*TI) 
 Il4-p; P*QJ 0«-Tl) 
 IR4-HMJ DM*DN) DN«-TR 
 
 WR ITE ( PTOHT»FR*LST ) J 
 IF TRMlN>TREAI THEN TRK I N*TRE AL > 
 ir TRMAX<TKEAI THEN TRMA X«-TRE A L J 
 SDM«-SlJM*l I 
 HIST(5J«-HISTIS]*1J 
 SKlPl IF SYMU- ■ THEN 
 M l G I w 
 
 IF 
 IF 
 
 
 MM*MNN[ TSZ1I NN«-MMM[ ISZ1I 
 
 SYM1*"N"I 
 
 bO TO CYC 
 
 ffcn 
 
 FND» 
 WRTTE(PTOUT#FNV#TRMlN,TRMAX>SUM) 
 
 FNOt 
 MRTTF(PTnuTtPAGEl)| 
 
 TOR T*l STFP 1 UNTIL *0 DO Pr T [ I ]*SUM*SUM*HI ST til J 
 WR!TF(PTnijT#FHSTn»SUM#MMM[IS7]»NNNtISZ])J 
 
 TF SHM>0 THEN 
 
 FOR T«-l STFP 1 UMTIL ?0 DU 
 HtGTN 
 
 PCTtI]*PCT[ 1 1/SuMJ 
 
 WRlTE(PlOUT»FHST»100Kl.HlST[I]#100)«PCTrI]) 
 FNOt 
 SKIP9I 
 FND 
 FNn. 
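The block-dimension statements that follow the label CYC in the listing can be paraphrased in Python as shown below. This is a sketch of one reading of the scanned listing, in which several characters are uncertain, and it is not a replacement for the original program. The function and field names, the guards against non-positive values, and the sample arguments are assumptions introduced here.

```python
from fractions import Fraction
from typing import NamedTuple, Optional


class Blocking(NamedTuple):
    m: int             # M in the listing
    p: int             # P
    n: int             # N
    q: int             # Q
    ln: int            # LN
    t_real: Fraction   # TREAL


def blocking(mm: int, nn: int, nv: int, tp: int, b: int,
             lm: int, k: int) -> Optional[Blocking]:
    """Paraphrase of the statements after CYC (one reading of the scan)."""
    m = lm * k - 2                     # M <- LM x K - 2
    if m <= 0:                         # guard added here, not in the listing
        return None
    p = (mm + m - 1) // m              # P <- (MM+M-1) DIV M
    q = (4 * (b - 2)) // (nv * p)      # Q <- FBM2 DIV (NV x P)
    if q <= 0:                         # listing: IF Q=0 THEN GO TO SKIP
        return None
    n = (nn + q - 1) // q              # N <- (NN+Q-1) DIV Q
    q = (nn + n - 1) // n              # Q <- (NN+N-1) DIV N
    ln = (k + 1 + n) // k              # LN <- (K+1+N) DIV K
    t_real = Fraction(4 * tp * b, nv * p * q)   # TREAL <- 4 x TP x B / NVPQ
    return Blocking(m, p, n, q, ln, t_real)


# Hypothetical arguments for a 90 x 110 mesh with three variables.
result = blocking(mm=90, nn=110, nv=3, tp=6, b=50, lm=5, k=4)
print(result)
```

Exact fractions are used for the last quantity so that the value can be compared without rounding; the original program uses floating point arithmetic at this step.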
 
 
 LIST OF REFERENCES 
 
[1] Barnes, G. H., et al., "The ILLIAC IV Computer," IEEE Transactions
on Computers, Vol. C-17, No. 8 (August, 1968), pp. 746-757.
 
[Diagrams on the original pages 55-59; the surviving labels include
"REFERENCE", "COLUMN", and "REVOLUTION A" through "REVOLUTION C".]
UNCLASSIFIED 
 
 Security Classification 
 
DOCUMENT CONTROL DATA - R & D

(Security classification of title, body of abstract and indexing annotation must be entered when the overall report is classified)
 
1. ORIGINATING ACTIVITY (Corporate author)

Department of Computer Science

University of Illinois at Urbana-Champaign

Urbana, Illinois 61801
 
 2a. REPORT SECURITY CLASSIFICATION 
 
 UNCLASSIFIED 
 
 2b. GROUP 
 
 3. REPORT TITLE 
 
 DISK I/O FOR NON-CORE-CONTAINED P.D.E. MESHES AND ARRAYS 
 
4. DESCRIPTIVE NOTES (Type of report and inclusive dates)

Research Report

5. AUTHOR(S) (First name, middle initial, last name)

Bruce Allen Bernott

6. REPORT DATE

March 5, 1969

7a. TOTAL NO. OF PAGES

67

7b. NO. OF REFS

1

8a. CONTRACT OR GRANT NO.

46-26-15-305
b. PROJECT NO.

USAF 30(602) 4144
c.

d.

9a. ORIGINATOR'S REPORT NUMBER(S)

DCS Report No. 311

9b. OTHER REPORT NO(S) (Any other numbers that may be assigned
this report)
 
10. DISTRIBUTION STATEMENT

Qualified requesters may obtain copies of this report from DCS.

11. SUPPLEMENTARY NOTES

NONE

12. SPONSORING MILITARY ACTIVITY

Rome Air Development Center
Griffiss Air Force Base
Rome, New York 13440
 
 13. ABSTRACT 
 
In finding discrete solutions of systems of partial differential equations on
a computer, one is faced with the problem that the desired number of mesh points
may exceed the machine's fast memory. This problem will be common on machines such
as ILLIAC IV, for which extremely high computing power invites the use of meshes
many millions of words in size. Because of the high dollar price of fast memory,
it is sensible to look at large disk stores with high transmission speeds as
back-up storage for meshes and arrays. The main problem encountered is the access
time of such a storage unit.

The address of a block of data stored on disk might be taken as the address of
the first word in the block. This address must specify both track number and
radial position of the block. If the computer issues a command to transmit the
block immediately after the beginning of the block has passed the reading head,
then the system must wait for nearly a complete disk rotation before the
transmission is started. This access time is, in general, not predictable; but it
is bounded by the disk rotation time.

The access time during which the computer does useless work is latency; and
the object of the present investigation is to minimize the latency for a
reasonably large class of problems.

DD FORM 1473
 
 UNCLASSIFIED 
 
 Security Classification 
 
UNCLASSIFIED 
 
 Security Classification 
 
KEY WORDS

LINK A

LINK B

ROLE WT
 
 partial differential equations 
 
 mesh 
 
 arrays 
 
 back-up storage 
 
 disk rotation time 
 
 reading head 
 
 discrete solutions 
 
UNCLASSIFIED

Security Classification