Digitized by the Internet Archive in 2013: http://archive.org/details/diskiofornoncore311bern

Report No. 311

DISK I/O FOR NON-CORE-CONTAINED P.D.E. MESHES AND ARRAYS*

by Bruce Allen Bernott

March 5, 1969

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801

* This work was supported in part by the Advanced Research Projects Agency as administered by the Rome Air Development Center under Contract No. US AF 30(602)-4144 and submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, February, 1969.

ACKNOWLEDGMENT

The author wishes to express his sincere appreciation to Professor Kuck for his efforts and guidance in the preparation of this report. Professor Kuck originally conceived the idea of CAT, a very high level language oriented toward solving partial differential equations numerically. (The language is also known by the name TRANQUIL II, as the origin of the acronym CAT seems to be shrouded in obscurity.)
This report describes a part of what will be implemented in the compiler for that language; and with the advice of Professor Kuck, the author was able to keep the investigation relevant to the problem at hand. Gratitude is also expressed to Mrs. Sharon Hardman, who has typed the final manuscript.

ABSTRACT

In finding discrete solutions of systems of partial differential equations on a computer, one is faced with the problem that the desired number of mesh points may exceed the machine's fast memory. This problem will be common on machines such as ILLIAC IV, for which extremely high computing power invites the use of meshes many millions of words in size. Because of the high dollar price of fast memory, it is sensible to look at large disk stores with high transmission speeds as back-up storage for meshes and arrays. The main problem encountered is the access time of such a storage unit.

The address of a block of data stored on disk might be taken as the address of the first word in the block. This address must specify both track number and radial position of the block. If the computer issues a command to transmit the block immediately after the beginning of the block has passed the reading head, then the system must wait for nearly a complete disk rotation before the transmission is started. This access time is, in general, not predictable; but it is bounded by the disk rotation time. The access time during which the computer does useless work is latency; and the object of the present investigation is to minimize the latency for a reasonably large class of problems.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 The Problem
   1.2 The Machine
2. MAPPING THE MESH ONTO DISK
3. SWEEPING A MESH
   3.1 Normal Mode
   3.2 Transposed Mode
4. IMPLEMENTING THE SCHEME
   4.1 General
   4.2 Measuring Latency
   4.3 Storage Requirements
       4.3.1 Fast Memory
       4.3.2 Disk
5. NUMERICAL SOLUTIONS
6. CONCLUSION
APPENDIX
LIST OF REFERENCES

LIST OF TABLES

1. Positions of activity for row transition in row-normal reading
2. Acceptable values of ℓ for normal reading. "A" indicates acceptable; blank indicates unacceptable. ℓ = ℓ_m or ℓ_n. Parentheses indicate value which is unacceptable for transposed reading
3. Positions of activity for column transition k-2 to k-1 in column-transposed reading
4. Survey of results obtained on 11 mesh sizes. Latency < .12, normal reading in either direction

LIST OF FIGURES

1. Blocking a mesh
2. Schematic of disk
3. Grouping edge values for row-sweeping
4. Mesh blocks and edge groups mapped onto a blocked disk
5. Row-normal reading: row 2 and beginning of row 3. Entries enclosed in rectangles are written on disk; other entries are read from disk
6. Column-transposed reading
7. Merging the end of column k-2 with the beginning of column k-1
8. Mapping edges into edge group blocks on disk
9. Illustrating relationships of T, T_R, and T_K
10. T_R vs. N_v for the solutions found for four mesh sizes. The number of solutions is printed for each value of N_v
11. Minimum values of T_R found

1. INTRODUCTION

1.1 The Problem

A reasonably large class of problems is the class of two-dimensional partial differential equation (PDE) problems. Characteristic of finite difference methods for solving PDE's are the stars or stencils of the methods, an example of which is shown below.

[Illustration: a nine-point stencil, d points deep.]

In order to compute new values of the variables (i.e., to update the variables) at the center point of the stencil, one needs to know the values of variables at several neighboring points in the horizontal and vertical directions. We may speak of the depth d of the stencil as the distance from the point to be updated to the furthest neighbor, measured in points. For the nine-point stencil illustrated above, the depth d is two points.
When one wishes to update an entire mesh of points, the stencil must be applied to each of the points simultaneously, so that neighbors used to update each point are old values and not updated values. If the entire mesh cannot be contained in the fast memory of the computer, one must store the mesh in a back-up storage device such as magnetic disk, and read in only a part of the mesh at a time for updating.

In this investigation, it is assumed that the mesh is rectangular in shape. It can then be sliced into rectangular blocks; and the blocks may be transmitted back and forth between the disk and the computer memory for updating calculations. The object of this investigation is to formulate a scheme for efficiently swapping these blocks of mesh between disk and fast memory. One problem is that to update one block, the computer must have access to an edge of each of the neighboring blocks. The depth of this interblock communication must be d points, the depth of the stencil which is applied to the mesh. Figures 1a and 1b illustrate the slicing of the mesh and the communication necessary.

The program or subroutine which performs the updating calculations on a block and its neighbor edges can be the same for each block of mesh, with some conditional branches to handle blocks on mesh boundaries. This subroutine can be called the kernel, to distinguish it from the supervisory program which handles input/output and other chores. The main constituent of the kernel will be the stencil calculations of the method used.

There are two ways in which a kernel is likely to sweep or pass over the mesh: sequentially by rows or by columns. In sweeping by rows, blocks will be input (and output) in the order 11, 12, ..., 1n, 21, 22, ..., 2n, 31, ..., mn. In this case, it is clear that when updating block (i,j), nothing special need be done to input edges from blocks (i,j-1) and (i,j+1).
We simply save in fast memory the rightmost edge, d points deep, of block (i,j-1), which will have just been updated and output to disk; and we delay calculations on (i,j) long enough to allow input of block (i,j+1). The more difficult problem is arranging to have access to the lower edges of blocks in row i+1 and the upper edges of blocks in row i-1. The scheme presented in this report uses multiple storage of

[Figure 1a. A general rectangular mesh sliced into blocks.]

[Figure 1b. Panel labels: (i+1,j), (i,j-1), (i,j).]

The overhanging outlines of edge groups at the mesh boundaries merely indicate that there are empty spaces in those edge groups.

For sweeping the mesh by columns, left and right edges of blocks are grouped analogously. A smallest ℓ_m is chosen such that

    (ℓ_m - 1)k - 2 < m ≤ ℓ_m k - 2.

Note that k will not be subscripted. In problems requiring transposition, k will have the same value in both directions. All of the parameters used in constructing a scheme will be constant throughout problem execution.

For the purpose of describing the map of mesh onto disk, we will further simplify the schematic of disk by assuming the logical track to have a length of kT-1 (T defined below) blocks of b segments each. We will store each group of edges and each block of mesh in some block on disk as shown in Figure 4 (a fold-out chart). Figure 4 is a snapshot of a radial section of the disk. It is intended to appear as the left end of Figure 2. The map is drawn for "period" T = 5 blocks. T measured in units of time will be the kernel calculation time for one block of mesh. T measured in blocks is the ratio of kernel calculation time for a block to the input transmission time for a block. In the analysis below, T will be measured in blocks.

Figure 4 shows edges grouped as in Figure 3; i.e., for row-sweeping.
For column sweeping, edges would be regrouped and re-allocated; mesh blocks would retain the storage allocation shown. The reason mesh blocks need not be re-allocated is that skewing makes allocation look the same for both rows and columns if one considers radial positions only (horizontally across page) and disregards track number (vertically down page). Block (i,j) is in the same radial position as block (j,i).

Let p(X) be the radial position (disk block number measured from an arbitrary radial point A) of record X, where X is either an edge group or a mesh block. Write R_i^U and R_i^L for the i-th upper and lower edge groups of row R (R = I, II, ...). The positions of all records are defined in terms of p(I_1^U):

    p((R+1)_1^{U/L}) = p(R_1^{U/L}) + T
    p(R_1^L) = p(R_1^U) + T
    p(R_{i+1}^{U/L}) = p(R_i^{U/L}) + 1
    p((1,1)) = p(II_1^L) + 2
    p((i,j+1)) = p((i+1,j)) = p((i,j)) + T

Besides these, we add the requirement

    p((i,j+k)) = p((i,j)) + 1   (= p((i+k,j))),

which relates the grouping constant k to the length of the logical track. It defines the logical track to be of length (kT-1) blocks.

Implementation of the simplified structure of disk used in Figure 4 on an actual disk with 1200-segment tracks will be taken up in a later section. For now we will assume zero head-switching time, and proceed to describe the sequence of I/O transactions for sweeping the mesh.

3. SWEEPING A MESH

3.1 Normal Mode

If we are to sweep the mesh by rows, we wish to read into fast memory the blocks (1,1), (1,2), (1,3), ..., (1,n), (2,1), (2,2), ... in sequence. Ignoring edge value transmissions for the time being, we read block (1,1) from logical track 1 in Figure 4 as the read/write head moves from left to right. Let calculations on block (1,1) begin as the R/W head passes the point p((1,2))-1 (point p(X) is the beginning of position p(X)). While calculations proceed on block (1,1), block (1,2) is read. Since we intend to read each block of row 1 of mesh as it passes by, the kernel must finish the calculations on each block within the period T.
We will write the updated block (1,1) in position p((1,3))-1 (on another logical track if necessary). At the point p((1,3))-1 we begin calculations on block (1,2). Immediately after writing the new block (1,1) we read block (1,3). In general, we read block (i,j), write updated block (i,j-1), read (i,j+1), write (i,j), etc. in sequence. There will, of course, be a "hiccup" at the end of each mesh row, except in one case to be discussed later. If necessary, we spend a whole or part of a disk rotation of latency to insure that the finishing of one row does not interfere with the beginning of the next row. After sweeping the whole mesh in this way, the updated mesh will be stored on disk in the same configuration, but shifted 2T-1 blocks to the right. Column-sweeping (i.e., (1,1), (2,1), (3,1), ..., (m,1), (1,2), (2,2), ...) results in the same shift. It is evident that we may sweep by rows or columns any number of times in any order.

We now superimpose edge group transmissions upon the I/O sequence just described. The term "reading normal" will be applied to sweeping the mesh by rows with edge values grouped for row-sweeping (as in Figures 3 and 4) and to sweeping by columns with edge values grouped for column-sweeping. The other two possibilities are column-sweeping with row-grouped edges and row-sweeping with column-grouped edges; the term "reading transposed" is applied to these types of sweeps.

In reading normal by rows, edge values to the left and right of a block (as viewed in Figure 3) are input automatically because of the sequence of block reading, as mentioned before. Similarly for reading normal by columns, edge values below and above a block are input automatically. Edge values below and above for row-normal reading and left and right for column-normal reading will be input from the edge area on disk. The procedure for row-normal reading will be shown.
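The skewed map of mesh blocks and the read/write sequence for one row can be sketched in a few lines of Python. This is only an illustration under stated assumptions: T is taken to be an integer number of disk blocks, block (1,1) is placed at an arbitrary position p11 (a hypothetical parameter), edge group transmissions are ignored, and the function names are invented for this sketch.

```python
def block_position(i, j, T, k, p11=0):
    """Radial position of mesh block (i, j) (1-based) on a logical track
    of kT-1 disk blocks, using p((i,j+1)) = p((i+1,j)) = p((i,j)) + T.
    p11 is an assumed position for block (1,1)."""
    track_length = k * T - 1
    return (p11 + (i - 1) * T + (j - 1) * T) % track_length


def row_sweep_events(i, n, T, k, p11=0):
    """Read and write events for row i of n blocks, ignoring edge groups.
    Positions are left unwrapped (not reduced modulo the track length)
    so that they also order the events in time."""
    start = block_position(i, 1, T, k, p11)
    events = []
    for j in range(1, n + 1):
        read_at = start + (j - 1) * T
        events.append((read_at, "read", (i, j)))
        # The updated block is written 2T-1 disk blocks after it is read.
        events.append((read_at + 2 * T - 1, "write", (i, j)))
    return sorted(events)


if __name__ == "__main__":
    for position, op, block in row_sweep_events(1, 4, T=5, k=4):
        print(position, op, block)
```

With T = 5 and k = 4 the printed sequence alternates in the order described above: read (1,1), read (1,2), write (1,1), read (1,3), write (1,2), and so on. The sketch also shows the two skewing properties used later, namely that block (i,j) shares a radial position with block (j,i) and that block (i,j+k) lies one disk block to the right of block (i,j).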
Column-normal reading is analogous: edge group superscript U (upper) is replaced by R for "right" and L (lower) is replaced by L for "left".

Row-normal reading is shown in Figure 5. This chart is not a storage map as in Figure 4. Figure 5 is read a line at a time from left to right. Each numbered line represents one logical disk revolution. Each entry on a line represents the transmission of a record to or from the disk. Entries in rectangles are written on disk; other entries are read from disk. Horizontal position corresponds to radial position on a logical track exactly as in Figure 4. Each read entry is, of course, in the position indicated by Figure 4. It may be helpful to align Figure 5 on Figure 4.

Only the I/O sequence for one row of mesh blocks need be described, since the relationship between positions of edge groups required for that row and positions of blocks in that row is independent of row number. We will follow through the sequence for row 2 since row 1 is a slightly degenerate case.

On revolution 1 we must initialize the calculation by reading the first upper edge group of row 1, I_1^U, and the first lower edge group of row 3, III_1^L. With this data in fast memory we may read and update the first k-2 blocks of row 2, since I_1^U contains edges from the first k-2 blocks. We, therefore, start reading the blocks sequentially from (2,1) on revolution 1 and write each block, updated, 2T-1 disk blocks after it is read. This eventually brings us back to radial point A where revolution 2 begins. We read edge group I_2^U, which contains edges below blocks (2,k-1) to (2,2k-2), as it goes by. Notice that we have read the edge below (2,k-1) well before we actually need it. Later on in revolution 2 we read III_2^L, which contains edges above blocks (2,k) to (2,2k-1). The edge above block (2,k) is read in at p((2,k)) + T-2; but this is acceptable since calculations on (2,k) are not started until the point p((2,k)) + T-1.
Besides writing updated blocks as we move along the disk, we must write updated edge groups. These groups must be written in positions such that the disk configuration of the updated mesh and edge groups is the same as Figure 4. Hence, we must write the updated edge group II_1^{U'} 2T+2 disk blocks before the position of the updated block (2,1), and II_1^{L'} T+2 disk blocks before updated block (2,1). Fortunately, so to speak, there are no other activities which must occupy the R/W head in these positions on revolution 2*; and the fact that II_1^{U'} and II_1^{L'} must contain updated edges from blocks (2,k-2) and (2,k-1) respectively is no problem, since those blocks will have been updated by the time the corresponding edge groups need be written. There must, of course, be buffers in fast memory for accumulating updated edge groups prior to transmissions to disk.

* We do not attempt to take advantage of the machine's capability to perform two transmissions simultaneously. This capability arises from having two electronics units, each with half of the total number of disk storage units. As a result, any simultaneous transmissions must occur in opposite halves. We do not attempt to meet this restriction.

On revolution 3 the same procedure is followed with all transmissions shifted to the right by one block, all subscripts incremented by 1 and all j-values of blocks (i,j) incremented by k. With each revolution transmissions shift right one disk block by the fact that we have arranged the logical track to be kT-1 blocks long and mesh block transmissions have period T.

Earlier we defined ℓ_n such that

    (ℓ_n - 1)k - 2 < n ≤ ℓ_n k - 2.

Let us say that n = ℓ_n k - 2. Then the last transmissions for row 2 of mesh appear as shown on revolution ℓ_n + 1 in Figure 5. If n < ℓ_n k - 2, then some mesh block transmissions disappear, but updated edge groups II_{ℓ_n}^{U'} and II_{ℓ_n}^{L'} must still be written in the positions shown. Processing of row 3 on some revolution r_3 may begin after row 2 is finished. We may have r_3 = ℓ_n + 2; however, in this case we spend more than a revolution of latency (kT-1 + p((3,1)) - (p((2,ℓ_n k-2)) + T) disk blocks). We would like to have r_3 = ℓ_n + 1. This cannot be done with the sequence shown in Figure 5 because of conflicts in p(II_{ℓ_n}^{L'}) and p((3,1)). The positions of activity on revolution ℓ_n + 1 are for ℓ_n = T-1 with T = 5. If ℓ_n were 5, line ℓ_n + 1 would be shifted one block right, and we could have r_3 = ℓ_n + 1. It is clear that the value of ℓ_n is critical in determining latency.

Acceptable values of ℓ_n, i.e., those for which r_3 may equal ℓ_n + 1, can be found in general by comparing activity in revolution r_3 to activity in revolution ℓ_n + 1. Take p(II_1^U) = 0. Then revolution r_3 has activity in positions 0, 3T, 3T+2, 4T+2, 5T+1, 5T+2, .... We also find positions of activity for revolution ℓ_n + 1 by first determining p((2,ℓ_n k-2)):

    p((2,1)) = 2T + 2
    p((2,k)) = p((2,1)) + 1 - T
    p((2,k-2)) = p((2,k)) - 2T
    p((2,ℓ_n k-2)) = p((2,k-2)) + (ℓ_n - 1).

Therefore, p((2,ℓ_n k-2)) = 2 + ℓ_n - T. Other activity in revolution ℓ_n + 1 can be determined from this fix. Positions of activity are summarized in Table 1.

    ENDING ROW 2                          STARTING ROW 3
    ACTIVITY          POSITION            ACTIVITY    POSITION
    (2,ℓ_n k-4)       2 + ℓ_n - 3T        II_1^U      0
    (2,ℓ_n k-5)       1 + ℓ_n - 2T        IV_1^L      3T
    (2,ℓ_n k-3)       2 + ℓ_n - 2T        (3,1)       3T + 2
    (2,ℓ_n k-4)       1 + ℓ_n - T         (3,2)       4T + 2
    (2,ℓ_n k-2)       2 + ℓ_n - T         (3,1)       5T + 1
    (2,ℓ_n k-3)       1 + ℓ_n             (3,3)       5T + 2
    (2,ℓ_n k-2)       1 + ℓ_n + T
    II_{ℓ_n}^{U'}     -2 + ℓ_n + 2T
    II_{ℓ_n}^{L'}     -2 + ℓ_n + 3T

Table 1. Positions of activity for row transition in row-normal reading.

We will not take into account any activity before p((2,ℓ_n k-4)) in ending row 2 or after p((3,3)) in beginning row 3. This will lead to an analysis which is correct for 2 + ℓ_n - 3T < 0 and -2 + ℓ_n + 3T < 5T + 2; or ℓ_n < 3T - 2 and ℓ_n < 2T + 4. It has not yet been mentioned that the scheme proposed will work only for T ≥ 5.
We look for acceptable values of ℓ_n ≤ 10, which satisfies both inequalities above for T ≥ 5. A value is acceptable if the two sets of positions in Table 1 are disjoint. The sets are disjoint for ℓ_n = T, for example. ℓ_n = T-3 is unacceptable for T = 5, 7, acceptable otherwise (T ≥ 5). ℓ_n = T-2 is always unacceptable. In general, a value can be tested for acceptability by substituting that value into the left column and comparing each value obtained with all of the elements in the right column, for all values of T. Whether one substitutes numbers or expressions in T for ℓ_n, it is a painful task. Table 2 shows acceptable numerical values of ℓ_n versus T.

For any numerical value of ℓ_n, a minimum K_ℓ can be found such that for T ≥ K_ℓ the ℓ_n is either always acceptable or always unacceptable. K_10 = 13: ℓ_n = 10 is acceptable for T ≥ 13. K_i ≤ K_10, i = 1, ..., 9. The table is therefore not shown for T > 13. Table 2, along with a corresponding table for transposed reading, will be useful in finding a scheme with minimum latency for a given problem.

     ℓ \ T    5    6    7    8    9   10   11   12   13
       1      A    A    A    A    A    A    A    A    A
       2
       3           A    A    A    A    A    A    A    A
       4
       5      A              A    A    A    A    A    A
       6      A    A              A    A    A    A    A
       7           A    A              A    A    A    A
       8                A    A              A    A    A
       9           A         A    A              A    A
      10     (A)        A         A    A              A

Table 2. Acceptable values of ℓ for normal reading. "A" indicates acceptable; blank indicates unacceptable. ℓ = ℓ_m or ℓ_n. Parentheses indicate value which is unacceptable for transposed reading.

Only row-normal reading has been considered in this section; but it is clear that all of the results apply also to column-normal reading with column parameters replacing corresponding row parameters, namely ℓ_m, m, and edge group superscripts R and L. As mentioned before, k is the same in both directions.

We now examine the problem of changing over from sweeping by rows to sweeping by columns. Note in Figure 5 that when updated edge groups II_i^{U'} and II_i^{L'} were written on disk, the upper and lower edges of groups of k blocks in a row were written.
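The acceptability test for normal reading described in Section 3.1 is mechanical enough to sketch in Python. The sketch below is an illustration, not the author's program: it assumes the Table 1 positions are as read here (reference position p(II_1^U) = 0), it compares positions directly without wrapping around the logical track, and it is therefore only meaningful in the range the text analyzes (ℓ ≤ 10, T ≥ 5). The function names are invented for this sketch.

```python
def ending_positions(l, T):
    """Positions of activity while ending row 2 (left half of Table 1)."""
    return {
        2 + l - 3 * T,   # (2, lk-4)
        1 + l - 2 * T,   # (2, lk-5)
        2 + l - 2 * T,   # (2, lk-3)
        1 + l - T,       # (2, lk-4)
        2 + l - T,       # (2, lk-2)
        1 + l,           # (2, lk-3)
        1 + l + T,       # (2, lk-2)
        -2 + l + 2 * T,  # updated upper edge group
        -2 + l + 3 * T,  # updated lower edge group
    }


def starting_positions(T):
    """Positions of activity while starting row 3 (right half of Table 1)."""
    return {0, 3 * T, 3 * T + 2, 4 * T + 2, 5 * T + 1, 5 * T + 2}


def acceptable(l, T):
    """A value l is acceptable for period T if the two sets are disjoint."""
    return ending_positions(l, T).isdisjoint(starting_positions(T))


if __name__ == "__main__":
    for l in range(1, 11):
        row = "".join(" A" if acceptable(l, T) else " ." for T in range(5, 14))
        print(f"{l:2d}{row}")
```

Under these assumptions the sketch agrees with the statements in the text: ℓ = T is acceptable, ℓ = T-2 is never acceptable, and ℓ = T-3 fails only for T = 5 and T = 7. It covers normal reading only, so it does not flag the parenthesized entry of Table 2.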
This prepares the disk storage structure for a subsequent sweep by rows. If the next sweep is to be by columns, however, we store on disk the updated right and left edges of the blocks in the row rather than the upper and lower edges respectively. We will use the notation II_i^{R'T} and II_i^{L'T}. II_i^{R'T} is written in place of II_i^{U'}, and II_i^{L'T} is written in place of II_i^{L'}. The problem arises that these right and left edges are still grouped by rows; i.e., II_i^{R'T} contains updated right edges from blocks (2,(i-1)k-1) to (2,ik-2), and II_i^{L'T} contains updated left edges from blocks (2,(i-1)k) to (2,ik-1).* Edges in all of the R_i^{R'T} and R_i^{L'T} are grouped in the same way. Since these edges are not grouped for column-normal reading, a different I/O sequence will be required. This new sequence will be called column-transposed reading. Instead of reading k edges from one place in a logical disk revolution, we will read the edges one at a time from many points in a revolution, so that the flow of edge values into fast memory becomes a quasi-continuous process as opposed to the single transmissions in Figure 5.

* It may be helpful to imagine the individual edges (k per edge group) in Figure 3 to be rotated clockwise by ninety degrees.

3.2 Transposed Mode

Figure 6, a and b, illustrates the procedure for column-transposed reading. The mesh is assumed to be mapped on disk as in Figure 4, except that upper edge groups R_i^U have been replaced with right transposed edge groups R_i^{RT} and lower edge groups R_i^L have been replaced with left transposed edge groups R_i^{LT}.

A sweep on column 2 will be used as a first example. The first block to be input is (1,2), for which we need the right edge of (1,1) and the left edge of (1,3). The right edge of (1,1) is contained in edge group I_1^{RT}; specifically, it is the third edge in I_1^{RT}, counting the two empty edge spaces overlapping the mesh in Figure 3. Likewise, the left edge of (1,3) is the fourth edge in I_1^{LT}.
The edges needed for block (2,2) are the third and fourth edges in edge groups II_1^{RT} and II_1^{LT} respectively. In order to move up column 2, we must have one edge each from all edge groups with subscript one. Figure 6a shows "simultaneous" transmissions of single edges from pairs of edge groups. An edge from II_1^{RT} and an edge from I_1^{LT}, for example, are to be read within the space of one disk block. The reading head will be required to switch tracks within a disk block, and the edges within the groups involved must be arranged such that no pair of edges required in any transmission have the same radial position. This matter will be taken up in a later section; for now we assume the requirement to be satisfied.

As each successive column is swept, the beginning of the mesh block transmissions moves T disk blocks to the right because of skewing. The edge value transmissions, however, start at the same radial position up to and including the sweep on column k-2. We are reading edges farther and farther ahead of corresponding mesh blocks. When mesh blocks have finally shifted around an entire logical track, we will be unnecessarily reading edges a disk revolution ahead of the mesh blocks for which the edges are needed. At this point we may shift to reading edges immediately before corresponding blocks. Such a shift is executed in two stages in Figure 6b.

The first stage of the shift is executed at column k-1. Only the transmissions of left edges from mesh blocks in column k are shifted. Transmissions of right edges of column k-2 are still started a revolution ahead of mesh block transmissions. Column k-1 is a special case, as its eastern neighbor edges are now in edge groups with subscript 2 while its western neighbor edges are still in edge groups with subscript 1.
It would be possible to shift transmissions of both left and right edges at column k-1; however, shifting the left edges only results in a greater freedom with the parameter ℓ_m, when one considers the efficiency of the transition from column k-2 to column k-1.

The second stage of the shift is executed at column k. Now all edges needed are in edge groups with subscript 2. The sweep up column k+1 is similar to the sweep up column 1, except that column 1 has no western neighbor edges. Likewise, the sweep up column k+2 is similar to the sweep up column 2; and in general the sweep up column j is similar to the sweep up column j modulo k. Columns nk-1, n = 1, 2, ..., ℓ_n - 1, all require edges from edge groups with different subscripts.

When the updated edge values are written, they will, of course, be grouped in the column direction. If left and right edge groups R_i^{L'} and R_i^{R'}, R = I, II, ..., n, are written, then the updated mesh will be organized for column-normal reading. If the subsequent sweep were to be by rows, lower-transposed R_i^{L'T} and upper-transposed R_i^{U'T} would be written instead. If R indicates a sweep by rows and C a sweep by columns, then sequences of the type (RC) will have every read in the transposed mode. Sequences of the type (RRCC) will have alternating normal and transposed reading sequences.

Merging the transmissions at the end of one column with the transmissions at the beginning of the next column is slightly more complex than for reading normal in that edge transmissions at the beginning of a column may penetrate very deeply into the transmissions for the preceding column. The transitions from column k-2 to column k-1 and from column k-1 to column k are worst cases. Both of these transitions will be inspected.

Figure 7 illustrates the ending of column k-2 and the beginning of column k-1. Let the reference position p((1_1^{RT})) be zero.
For starting column k-1, there is activity in positions 0, T, 2T, ..., p(((k+1)_1^{RT})) = kT mod (kT-1) = 1, p((1,k-1)) = 3, 1+T, 3+T, 1+2T, 2+2T, 3+2T = p((3,k-1)). Positions of activity for ending column k-2 may be computed by first obtaining a fix on block ((ℓ_m-1)k+1,k-2):

    p((1,k-2)) = p((1,k-1)) - T = 3 - T
    p(((ℓ_m-1)k+1,k-2)) = p((1,k-2)) + (ℓ_m - 1) = 2 + ℓ_m - T.

We must compare revolutions A with C and B with D. In what follows, a block in square brackets denotes the written, updated block. Note that for the first comparison, we need only compare activity between markers M and 11 on A with the first activity on C because of the periodicity T of activities and the facts that p([(ℓ_m-1)k,k-2]) = 1 + ℓ_m > p((1_1^{RT})) and p(((ℓ_m k-2)_1^{LT})) = ℓ_m - 2T ≤ p((1_1^{RT})) for ℓ_m ≤ 2T. We will again look for values ℓ_m ≤ 10.

For the comparison of B with D, note that p([ℓ_m k-2,k-2]) = 2 + ℓ_m - 2T < p((1,k-1)) = 3 for ℓ_m ≤ 2T; so that we need only take account of possible conflicts between the two updated edge group transmissions on B and the mesh block transmissions on D. Transmissions to the left on B and D will be compatible if transmissions to the right on A and C are compatible. Also the two updated edge group transmissions on B will not conflict with edge transmissions on D if updated edge group transmissions on A do not conflict with (1_1^{RT}) for ℓ_m ≤ 2T. Of the mesh block transmissions on D we need only consider (1,k-1) and (2,k-1) since the next one, [1,k-1], is always to the right of the last activity of B for ℓ_m ≤ 2T. Actually, we need only consider the last transmission on B.

Table 3 lists the activities of interest in making the column transition. Sets A and C must be disjoint and sets B and D must be disjoint. If we construct a table of acceptable values such as Table 2, we find that there is only one unacceptable value pair (ℓ_m,T) which is not on Table 2. That pair is ℓ_m = 10, T = 5. If we delete the corresponding "A" from Table 2, we have a table of acceptable values for both normal and transposed reading.
    ENDING COLUMN k-2                              STARTING COLUMN k-1
    A                                              C
    ACTIVITY                  POSITION             ACTIVITY      POSITION
    ((ℓ_m k-2)_1^{LT})        ℓ_m - 2T             (1_1^{RT})    0
    [(ℓ_m-1)k-2,k-2]          1 + ℓ_m - 2T
    ((ℓ_m-1)k,k-2)            2 + ℓ_m - 2T
    (k-2)_{ℓ_m-1}^{R'}        -2 + ℓ_m - T
    [(ℓ_m-1)k-1,k-2]          1 + ℓ_m - T
    ((ℓ_m-1)k+1,k-2)          2 + ℓ_m - T
    (k-2)_{ℓ_m-1}^{L'}        -2 + ℓ_m

    B                                              D
    (k-2)_{ℓ_m}^{L'}          -1 + ℓ_m             (1,k-1)       3
                                                   (2,k-1)       3 + T

Table 3. Positions of activity for column transition k-2 to k-1 in column-transposed reading.

One might question the statement that we have obtained all unacceptable pairs (ℓ,T) for transposed reading since we have considered only one transition out of k different transitions. In fact, it can be verified that we have all unacceptable values for ℓ_m, ℓ_n ≤ 10 and T ≥ 5.

Having investigated most of the logical aspects of the scheme under consideration, we now consider the problem of implementing the scheme on a disk storage unit with 1200 addressable segments per track.

4. IMPLEMENTING THE SCHEME

4.1 General

For mesh storage in one quadrant of ILLIAC IV, it is efficient to store 8x8 squares of mesh points in "quadrant words" across the 64 processing elements. For this reason, it will be assumed that the mesh is subdivided into 8x8 squares; and mesh blocks will have dimensions of p × q squares, p and q integers. The smallest addressable piece of data on the disk is the segment, which consists of 256 words, or 4 quadrant words. The head-switching time of the disk, again, is taken as two segments.

We construct each logical track pictured in Figure 4 from t disk tracks for some integer t. In doing so, we do not take advantage of the t-1 times an actual radial position is passed within each logical revolution. If b is the number of segments in a disk block, then

    b(kT-1) ≤ 1200t

must be satisfied. We might try

    b = ⌊1200t / (kT-1)⌋ segments,

where the operator ⌊ ⌋ indicates the greatest integer less than or equal to the argument. The truncation will result in wasting a number of segments; we allow this if the number of unused segments is reasonably small.
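The choice of disk block size just described is simple integer arithmetic, and a short Python sketch may make the trade-off concrete. It assumes integer k, T, and t and the 1200-segment track of the text; the function name is invented here, and the sketch does not account for the 2-segment gaps that may be needed at track connections.

```python
SEGMENTS_PER_TRACK = 1200  # addressable segments per disk track, from the text


def disk_block_size(k, T, t):
    """Largest b with b(kT-1) <= 1200t, plus the segments left unused.

    k: grouping constant, T: period in disk blocks (assumed integer here),
    t: number of disk tracks making up one logical track.
    """
    blocks_per_logical_track = k * T - 1
    total_segments = SEGMENTS_PER_TRACK * t
    b = total_segments // blocks_per_logical_track
    unused = total_segments - b * blocks_per_logical_track
    return b, unused


if __name__ == "__main__":
    for k, T, t in [(4, 5, 1), (8, 6, 2)]:
        b, unused = disk_block_size(k, T, t)
        print(f"k={k} T={T} t={t}: b={b} segments, {unused} segments unused")
```

For k = 4, T = 5, t = 1 the logical track holds 19 disk blocks, so b = 63 segments and 3 segments go unused. Trying several (k, t) pairs this way is one plausible way to look for a combination with a small unused remainder.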
These wasted segments become a dead area on disk and will never be used for storage. The dead area will, of course, contribute to overall latency.

Of the b segments in a disk block, we may use b-2 segments, with 2 segments at the end or beginning of the block for head switching. If b does not divide 1200, then some disk blocks may straddle track connections and we will, in general, have to allow 2-segment spaces at each of the t-1 connections of the disk tracks. If the dead area is not large enough to cover this, one might reduce b or try another combination of k and t.

In an actual mesh there will be a number of variables associated with each mesh point. Let this number be N_v. We must have

    N_v pq ≤ 4(b-2).

In addition, if the mesh has dimensions M × N 8x8 squares, M ≤ pm and N ≤ qn where m, n are dimensions in mesh blocks. The ℓ_m, ℓ_n satisfy

    (ℓ_m - 1)k - 2 < m ≤ ℓ_m k - 2
    (ℓ_n - 1)k - 2 < n ≤ ℓ_n k - 2

and they should be acceptable as defined by Table 2.

Noting that there are k edges per edge group, one edge must be contained in ⌊(b-2)/k⌋ segments for the reason that individual edges must be addressable for transposed reading. An edge consists of N_v d(8p) or N_v d(8q) points. Hence

    8 N_v d Max(p,q) ≤ 256 ⌊(b-2)/k⌋.

The number of segments required for an edge is

    ⌊(8 N_v d Max(p,q) + 255) / 256⌋.

If it is not necessary to change the direction of sweeping over the mesh, then edges need not be individually addressable, and the requirement is less stringent:

    8 N_v d s k ≤ 256 (b-2)

where s is p or q depending on the direction of sweeping.

In the description of transposed reading, it was assumed that when two edges from different edge groups were required within the space of one disk block, the two edges would not be in the same radial position. In fact, the two edges must be separated by at least two segments. For k > 6 this can be guaranteed as follows.
Observe from Figure 8 that for sweeping up column (i-1)k+j-2 we must read the (j-1)th edge from a superscript RT edge group and the jth (modulo k) edge from a superscript LT edge group. Each edge is mapped into an "edge slot" on a logical track as shown. All LT edge groups begin with the (k-1)th edge in the first edge slot; and all RT edge groups begin with the 1st edge in the first edge slot. Then the edges required for transmission are separated by two edge slots for $k \ge 6$. The smallest edge slot possible is one segment. In moving up column ik-1, we need the kth edge from an RT edge group and the 1st edge from an LT edge group, for which the requirement is still satisfied. If the edge slot is two segments or more, then $k \ge 4$ is sufficient. Of course, the unused portion of the b-2 segments allotted might be distributed between edge slots to achieve separation.

[Figure 8]

No mention has yet been made of having T be some value $\ge 5$ other than an integer. This can certainly be done without upsetting the logic of the I/O sequence. Another point concerning T should be considered, however. Recall that T is the ratio of the allowed compute time for a mesh block to the one-way transmission time for a disk block of b segments. T is the logical period of the scheme, and it is a parameter of great interest to us.

When dealing with a calculation kernel, however, we are more interested in the ratio of the allowed compute time to the input time for a mesh block, which is always less than a disk block by 2 segments or more. Let this ratio, the real period, be denoted by $T_R$. $N_v pq/4$ segments are used to store a mesh block; therefore

$$T_R = \frac{4Tb}{N_v pq} .$$

Figure 9 presents the relationship graphically.
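As a worked illustration of the real period, the sketch below evaluates $T_R = 4Tb/(N_v pq)$ after checking the capacity bound $N_v pq \le 4(b-2)$ from Section 4.1. The function name is hypothetical, and exact rational arithmetic is used only so that the example can be checked by hand.

```python
from fractions import Fraction


def real_period(T, b, n_v, p, q):
    """Real period T_R = 4*T*b / (N_v*p*q) for a p x q mesh block of 8 x 8 squares."""
    quadrant_words = n_v * p * q              # one quadrant word per variable per square
    if quadrant_words > 4 * (b - 2):          # 2 segments are kept for head switching
        raise ValueError("mesh block does not fit in b - 2 segments")
    return Fraction(4 * T * b, quadrant_words)


# T = 5, b = 41, N_v = 3, p = 4, q = 13 fills the block exactly (156 = 4 * 39),
# giving T_R = 205/39, slightly above T.
t_r = real_period(5, 41, 3, 4, 13)
```

Because the block is never larger than 4(b-2) quadrant words, the value returned always exceeds T.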
Both T and $T_R$ are measures of the same quantity, the amount of time available for computation on a mesh block. They are measures in different units, however, as indicated in Figure 9. We are more interested in $T_R$ than in T because kernel times can be given in mesh blocks. By mesh block units of time, we mean, of course, the transmission time for a mesh block of $N_v pq/4$ segments. Although a mesh block is actually transmitted in $\lfloor (N_v pq+3)/4 \rfloor$ segments, we are interested in the time to transmit that part of the $\lfloor (N_v pq+3)/4 \rfloor$ segments containing data. It is obvious that T is a lower bound on $T_R$ and that $T_R$ cannot equal T because of the two segments reserved for head switching.

[Figure 9]

With every kernel there can be associated a number $T_K$ which is the ratio of the calculation time per point to the input time per point. $T_K$ is the calculation time per point normalized to the disk transmission rate; it is the kernel time in mesh blocks. It is assumed that $T_K$ is independent of the size of the mesh block which the kernel is updating; although if the compute time varies slightly because of I/O interrupts, etc., $T_K$ should be an upper bound on the compute time. We must have $T_K \le T_R$; and we wish to have $T_R$ as close to $T_K$ as possible in order to match the overall transmission rate of the scheme to the speed of the kernel. If several different kernels are applied to the mesh, then we match the scheme parameter T to the slowest kernel, since it appears that the parameters of the scheme cannot be changed during a sweep.

Given a mesh of dimensions M x N 8 x 8 squares with $N_v$ variables per point, a stencil depth d, and a normalized kernel time $T_K$, the problem is to find scheme parameters for which the latency of the disk and the value $T_R - T_K$ are as small as possible.
In reality, $T_R - T_K$ is part of the overall latency, since for $T_R > T_K$ the data on disk is "not there" exactly when we are ready to compute on it. The value $T_R - T_K$ can be thought of as the ratio, to Tb segments, of the number of segments for which the computer is waiting for the disk to send another mesh block. In spite of this, the present discussion will distinguish between overall latency and the value $T_R - T_K$.

4.2 Measuring Latency

The measure $T_R - T_K$ represents latency which, in a sense, can be made to go away by increasing $T_K$. It is not proposed to increase the complexity of a given kernel for no reason; but rather, if a solution of the scheme parameter relations exists, then one might look for a problem with $T_K$ close to the $T_R$ for the scheme. $T_R - T_K$ might be called external latency.

The question now is: "What is internal latency?" We can include the wasted time between complete sweeps of the mesh, the "hiccups" between sweeps along rows or columns of the mesh, latency due to incompletely filled blocks on the last row or column, and the time spent skipping over the dead area on disk. The last three elements will be discussed here.

The size of the dead area on each logical track is 1200t - b(kT-1) segments. Since it is passed on each logical revolution, the fraction of total time spent on the dead area, i.e., the dead area latency, is

$$L_D = \frac{1200t - b(kT-1)}{1200t} .$$

It is unlikely that any code could do useful work to mask the dead area latency, since the dead area is not distributed across the disk blocks in a logical track, and furthermore since it moves left by 2T-1 disk blocks relative to the mesh with each complete sweep.

The "hiccups" between sweeps along rows and columns will cause latency unless $n = \ell_n k - 2$ and $\ell_n = 3T$, and likewise for m. For $n = \ell_n k - 2$ there will be $3T - \ell_n$ disk blocks of latency at the end of each row. In addition, if $N/q \le n \le \ell_n k - 2$, we may not be using all of the $\ell_n k - 2$ mesh blocks allowed.
Then $\ell_n k - 2 - n$ blocks are effectively wasted. For every block so wasted we incur an additional T blocks of latency. This occurs at the end of every row, out of a disk revolution through approximately $\ell_n(1200t) + Tb$ segments. The latency incurred at row connections is therefore

$$L_R = \frac{\bigl(3T - \ell_n + T(\ell_n k - 2 - n)\bigr)\, b}{\ell_n(1200t) + Tb} .$$

Similarly, for column sweeping,

$$L_C = \frac{\bigl(3T - \ell_m + T(\ell_m k - 2 - m)\bigr)\, b}{\ell_m(1200t) + Tb} .$$

Additional latency occurs on the last row for row-sweeping if M < pm. The dimension m should be the smallest integer such that $M \le pm$. This latency, like the latency at row connections, might be masked by useful calculation on the boundaries of the mesh. Extra computing time can be provided for all four boundaries by strictly enclosing the M x N mesh in the m x n blocks. It will be assumed here, however, that updating calculations on boundary points are no more time-consuming than those on interior points. The storage available in the last row, but not used, amounts to $N_v N(mp-M)/4$ segments. The latency occurs once in m rows; its value is approximately

$$L_{RL} = \frac{\tfrac{1}{4} N_v N (mp - M)\, T_R}{m\bigl(\ell_n(1200t) + Tb\bigr)} .$$

Similarly, for column-sweeping,

$$L_{CL} = \frac{\tfrac{1}{4} N_v M (nq - N)\, T_R}{n\bigl(\ell_m(1200t) + Tb\bigr)} .$$

In order to determine overall internal latency, one must take account of the order of sweeping. For an equal number of row and column sweeps,

$$L = L_D + \tfrac{1}{2}(L_R + L_{RL} + L_C + L_{CL})$$

is a reasonably good measure of overall internal latency.

One latency term has not been included. It is the latency spent re-initializing between complete sweeps. There will be no attempt here to calculate it, although in some problems it may be important.

4.3 Storage Requirements

4.3.1 Fast Memory

If one examines the sequence of I/O for sweeping the mesh, one can tally all mesh blocks and edges contained in fast memory at every instant, and determine the amount of storage needed for the data as a function of time.
If the amount of storage needed for program and scratch area is added, and the maximum over time of the total storage required is determined, one can state whether the scheme will work for a memory of given size. Alternatively, one may examine storage requirements over a sample of problems and problem sizes, and attempt to estimate the amount of fast memory required for a particular installation.

The maximum amount of fast memory required for storage of the mesh data is a function of the mesh and scheme parameters. It is also, in a somewhat odd way, a function of the organization of the fast memory itself, and of the size of the smallest addressable segment on disk and the length of the disk track.

The storage required for sweeping in transposed mode is greater than for sweeping in normal mode. The difference is of the order of an edge group, but it must be remembered that edge groups are usually larger if a transposed read is required, since storage for individual edges rather than edge groups is rounded upwards to the nearest disk segment. Since there are many numerical PDE problems which do not require changing the direction of sweep, it is worthwhile to investigate the requirements for normal reading separately from transposed reading. In this report only normal reading will be analyzed.

Because of the parallel structure of ILLIAC IV, special problems arise in the allocation of memory. One must be clever in the design of the program and in the distribution of data across the processing elements. One of the constraints imposed in the analysis, that of considering 8 x 8 squares as the smallest subdivision of the mesh, resulted from taking account of the structure of the fast memory. This structure also causes problems with storage of edge values. If an edge or an edge group is packed tightly into the smallest number of quadrant words that will contain it, then it is likely that some of the edge points will not be located in the proper processing elements.
Additional code will be needed to route data to the proper PE's when the data is needed. The space saved by packing the edges or edge groups may be used up by the additional code and scratch areas. Nevertheless, in this analysis we will calculate storage requirements based on having data packed moderately tightly.

For row-normal reading an edge group consists of $8N_v kqd$ words. The number of disk segments needed to contain an edge group is

$$W^{EGR}_{S} = \left\lfloor \frac{8N_v kqd + 255}{256} \right\rfloor .$$

The number of quadrant words needed as an I/O area for an edge group is

$$W^{EGR}_{SQ} = 4\,W^{EGR}_{S} .$$

Likewise, the number of segments needed to contain a mesh block is

$$W^{MB}_{S} = \left\lfloor \frac{N_v pq + 3}{4} \right\rfloor$$

and the number of quadrant words needed as an I/O area for the mesh block is

$$W^{MB}_{SQ} = 4\,W^{MB}_{S} .$$

For calculations on block (i,j), an area must be set aside to store the right edge of block (i,j-1). As calculations on (i,j) sweep the block, old values from the block may be moved into the edge area so that when (i,j) is completely updated, the edge area contains the right edge of (i,j). The number of quadrant words needed for the single edge is

$$W^{E}_{Q} = \left\lfloor \frac{8N_v pd + 63}{64} \right\rfloor .$$

We might include an 8 x 8 mesh square of storage as a token amount for overhead. This amounts to $N_v$ quadrant words. If the I/O sequence for row-normal reading is examined, it can be seen that we need

$$W^{R}_{mem} = 3W^{EGR}_{SQ} + W^{MB}_{SQ} + \mathrm{Max}(W^{EGR}_{SQ}, W^{MB}_{SQ}) + W^{E}_{Q} + N_v$$

quadrant words of fast memory for storage of the data. The third term takes account of the case in which the ending of one row moves far into the beginning of the next. A similar expression $W^{C}_{mem}$ exists for column-normal reading. No attempt will be made here to estimate the storage required for program and scratch areas.

4.3.2 Disk

An easy way to manage disk is to allocate half of the disk to old mesh and edge groups and half to updated mesh and edge groups. This procedure takes no advantage, however, of the space on disk which becomes available for updated mesh blocks as the mesh is swept.
Referring once again to Figure 4, note that, starting from block (1,1) for example, successive disk blocks are filled with each k blocks added to the mesh row. The last block in the mesh row is $(1, \ell_n k - 2)$. If $\ell_n \le T$, the mesh row will fit into one logical track, but if $\ell_n \ge T+1$, we will have to use another logical track for storage. The same situation occurs for the other mesh rows. One mesh row would require $\lfloor (\ell_n - 1)/T \rfloor + 1$ logical tracks.

It is now proposed that rather than using T out of T disk blocks for storage, we use only T-1 out of T blocks, so that we use another logical track for $\ell_n \ge T$, and yet another for $\ell_n \ge 2T$ (the second logical track is filled completely). In this way we ensure that there will be an empty block immediately before blocks (1,2), (1,3), ..., (1,k). We may then write updated blocks in these spaces as we sweep the row. The updated block (1,k-2) is written in the space before (1,k). The updated block (1,k-1) is then written over (1,1), since we do not need (1,1) any more. Old blocks are thus successively overwritten by the updated (k-2)th block following them in the row. One mesh row then requires $\lfloor \ell_n/T \rfloor + 1$ logical tracks; and m mesh rows require $m(\lfloor \ell_n/T \rfloor + 1)$ logical tracks.

Note that this procedure for managing disk also works for column-sweeping. Because of this we may choose the smaller of two possible storage requirements:

$$W^{MB}_{disk} = \mathrm{Min}\!\left( m\left(\left\lfloor \frac{\ell_n}{T} \right\rfloor + 1\right),\; n\left(\left\lfloor \frac{\ell_m}{T} \right\rfloor + 1\right) \right) .$$

No such game can be played with edge group storage. However, we may still choose the minimum requirement for two possible storage methods. Attention is drawn to the upper edge groups (superscript U) in Figure 4. One storage method is shown in the schematic. Groups with successive subscripts are stored in adjacent blocks on a logical track for $i \le T$. A further group would be stored on logical track 3 in the first position, although it is not shown since the drawing has $\ell_n = T-1$. Groups with successive Roman numerals simply wind around disk at intervals of T blocks, until the last one, $\ell_m k - 2$. For every increment of $\ell_m$, another $\lfloor (\ell_n - 1)/T \rfloor + 1$ logical tracks are added, as long as $\ell_m \le 2T$. If one increases $\ell_m$ to 2T+1, $2(\lfloor (\ell_n - 1)/T \rfloor + 1)$ logical tracks must be added; however, we will consider only $\ell_m, \ell_n \le 10$ and $T \ge 5$.

The second possible storage method is the dual arrangement: groups with successive subscripts are put on different tracks, and the groups with Roman numerals 1, k+1, ..., ik+1 are put in adjacent blocks on the same logical track for $i \le T-1$. Groups with successive Roman numerals are still to be stored on the same logical track, spaced T blocks apart. The expression for the number of logical tracks required is the dual of the expression for the first method if $\ell_m, \ell_n \le 2T$.

The number of logical tracks needed for upper edge groups is

$$W^{EG}_{disk} = \mathrm{Min}\!\left( \ell_m\left(\left\lfloor \frac{\ell_n - 1}{T} \right\rfloor + 1\right),\; \ell_n\left(\left\lfloor \frac{\ell_m - 1}{T} \right\rfloor + 1\right) \right) .$$

We allocate four such areas on disk, one each for old upper and lower edge groups and one each for the new. The amount of disk storage needed is

$$W_{disk} = t\,\bigl(W^{MB}_{disk} + 4W^{EG}_{disk}\bigr)$$

tracks. Note that this measure is in disk tracks and not logical tracks.

5. NUMERICAL SOLUTIONS

For a given problem M, N, $N_v$, d, $T_K$, a scheme must be found with a latency which is within an acceptable limit. The difference $T_R - T_K$ should be accounted for in the latency measure; however, this section will be an informal discussion of the existence of schemes for given values of M, N, $N_v$, d and will consider only the internal latency of the scheme.

If the values of M, N, $N_v$, T, t, k, $\ell_m$, $\ell_n$ are specified, a scheme can be determined if all of the relations of the last section are satisfied. The latency may be calculated and tested for acceptability. The value $T_R$ may also be calculated, as well as a maximum allowable value of d. If we are interested only in normal reading in either direction, then we may leave one of the $\ell$-values unspecified, and impose some additional restriction or specify some other parameter.
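To make "the latency may be calculated" concrete, the following Python sketch evaluates the internal latency terms of Section 4.2. It is a reading of those expressions under stated assumptions, not the report's program: the function names are hypothetical, and the row-connection term assumes that $\ell_n k - 2 - n$ block positions go unused in each row, which is an interpretation of the scanned formula.

```python
from fractions import Fraction

SEGMENTS_PER_TRACK = 1200  # addressable segments per disk track, as given in the text


def dead_area_latency(t, k, T, b):
    """L_D: fraction of each logical revolution spent over unused segments."""
    track = SEGMENTS_PER_TRACK * t
    return Fraction(track - b * (k * T - 1), track)


def row_connection_latency(t, k, T, b, l_n, n):
    """L_R, assuming l_n*k - 2 - n block positions go unused in each row."""
    wasted = l_n * k - 2 - n
    if wasted < 0:
        raise ValueError("n exceeds the l_n*k - 2 blocks allowed per row")
    idle_blocks = 3 * T - l_n + T * wasted
    return Fraction(idle_blocks * b, l_n * SEGMENTS_PER_TRACK * t + T * b)


def last_row_latency(t, T, b, l_n, n_v, N, M, m, p, T_R):
    """L_RL: latency from the partly filled last row of mesh blocks."""
    unused_segments = Fraction(n_v * N * (m * p - M), 4)
    return unused_segments * T_R / (m * (l_n * SEGMENTS_PER_TRACK * t + T * b))


def internal_latency(L_D, L_R, L_RL, L_C, L_CL):
    """Overall measure for an equal number of row and column sweeps."""
    return L_D + (L_R + L_RL + L_C + L_CL) / 2
```

The column terms follow by symmetry: call the row functions with $\ell_m$, m, M, n, q in place of $\ell_n$, n, N, m, p. A candidate scheme would then be accepted when the combined value does not exceed the chosen limit.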
A program has been written for the purpose of investigating the existence of solutions of the parameters for various problems M, N, $N_v$. A simple procedure is used: T, t, k, $\ell$ are iterated, and valid solutions for which the latency is less than or equal to .12 are printed. For each T, t, k, values for $\ell_m$ and $\ell_n$ are tried successively in an attempt to find a scheme for column-normal or row-normal reading respectively. The program is written in Burroughs B5500 Extended Algol, and it is listed in the appendix to this report.

For each mesh size M x N squares, values 3 to 8 in unit steps were assigned to $N_v$. The free parameters T, t, k were iterated in unit steps over the ranges 5 and 6, 1 to 5, and 4 to 30 respectively. Values for $\ell_m$ or $\ell_n$ were chosen from Table 2. The results of particular interest are the values of $T_R$ and $W_{mem}$ obtained. We would like to see many solutions, with values of $T_R$ well distributed and with memory requirements very low.

A survey of the results that have been obtained is presented in Table 4. All of the solutions have $W_{mem} \le 1500$ and $W_{disk} \le 48$. The smallest problem listed is core contained for $N_v$ = 3 and 4. The largest problem listed represents about one-third of disk capacity for $N_v$ = 8.

For each solution a maximum allowable value of d, $d_{max}$, is calculated. If $d_{max} < 3$, the solution is rejected. Tests have shown that no more than eight per cent of the solutions obtained by the program are rejected on this basis. Fast memory requirements are calculated for d = 3. There are few, if any, finite difference stencils in use for which d is greater than three, so it is justifiable to group all solutions with $d_{max} \ge 3$ into one class. Each of these solutions is valid for $d \le 3$.

The smallest and largest values of $T_R$, over the entire range of the parameter $N_v$, are listed.
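The storage figures quoted above come from the expressions of Section 4.3. As a hedged illustration, the sketch below evaluates the fast-memory requirement for row-normal reading (with d = 3 by default) and the disk requirement in tracks. The function names are hypothetical and the formulas are those of Sections 4.3.1 and 4.3.2 as read here, so the sketch should not be taken as the appendix program itself.

```python
def fast_memory_row_normal(n_v, k, p, q, d=3):
    """Quadrant words of fast memory needed for mesh data in row-normal reading."""
    edge_group_io = 4 * ((8 * n_v * k * q * d + 255) // 256)  # whole segments, 4 words each
    mesh_block_io = 4 * ((n_v * p * q + 3) // 4)
    single_edge = (8 * n_v * p * d + 63) // 64                # one edge, packed
    return (3 * edge_group_io + mesh_block_io
            + max(edge_group_io, mesh_block_io) + single_edge + n_v)


def disk_tracks(t, T, m, n, l_m, l_n):
    """Disk tracks needed: mesh blocks plus four edge-group areas."""
    mesh = min(m * (l_n // T + 1), n * (l_m // T + 1))
    edges = min(l_m * ((l_n - 1) // T + 1), l_n * ((l_m - 1) // T + 1))
    return t * (mesh + 4 * edges)


# N_v = 3, k = 6, p = 4, q = 13, d = 3 needs 584 quadrant words;
# t = 1, T = 5, m = n = 10, l_m = l_n = 2 needs 18 disk tracks.
w_mem = fast_memory_row_normal(3, 6, 4, 13)
w_disk = disk_tracks(1, 5, 10, 10, 2, 2)
```

A search over scheme parameters can use two such functions to discard candidates that exceed the memory and disk limits before any latency is computed.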
For the large meshes the solutions are numerous over small ranges of $T_R$; and the trend seems to be that solutions are fewer for smaller meshes and occur over larger ranges of $T_R$. The distribution over $T_R$ is discussed below. At this point a comment should be made concerning the method of finding the solutions.

The program used to obtain the results incorporates an artificial restriction on the mesh block size which is equivalent to an attempt to minimize the value of $T_R$ for given scheme parameters. Within the program the parameters p and b are calculated; and from these a largest value of q is determined such that the bound $N_v pq \le 4(b-2)$ is satisfied. If q were iterated downward from this largest value, solutions with larger $T_R$ might be found. The program uses only the largest q; the attempt is repeated with the variables p and q interchanged.

                 Number of                      Percent of solutions with W_mem <=
    M x N        solutions    T_R range           400      800     1200
    20 x 20
    28 x 32        270        5.21 - 59.43         55      100      100
    30 x 40        15*        5.83 - 44.00         48       99      100
    35 x 35         27        6.00 - 48.00         41      100      100
    47 x 47         89        5.16 - 28.37         35       85      100
    45 x 55        405        5.10 - 24.44         29       79       97
    55 x 65        497        5.09 - 17.09         23       73       94
    60 x 70        400        5.11 - 14.00         36       94      100
    90 x 70        507        5.10 - 11.91         33       88       98
    60 x 110       675        5.08 - 12.27         29       78       96
    90 x 110       567        5.07 - 11.02         33       83       95

Table 4. Survey of results obtained on 11 mesh sizes. Latency $\le$ .12, normal reading in either direction.

There is another constraint on q which leads us to expect higher $T_R$ for smaller mesh dimensions. Given n and N, in order to minimize internal latency, q should be the smallest value such that $qn \ge N$. For smaller mesh dimension N, this bound is more likely to be lower than the bound mentioned above. In fact, most of the solutions for the small meshes had m or n equal to one and p or q spanning an entire dimension of the mesh.
In such solutions an entire column or row of 8 x 8 squares would be read at a time; and the $W_{mem}$ were overestimated by the program because the edge group areas allocated would not be needed.

A rigorous mathematical investigation of the existence of solutions to the system of relations has not been performed; and the program used does not find every solution possible in the given $T_R$ ranges. In fact, because of the peculiar behavior of the remainder terms in the integer divisions, the program may not even find the solutions with smallest $T_R$ for the iterated parameters. The results obtained, however, are interesting even without the assurance that all possible solutions have been found.

The last three columns of Table 4 give the percents of the solutions found for which fast memory requirements are less than or equal to 400, 800, and 1200 quadrant words. For the larger meshes, a larger percentage of solutions have high memory requirements. This does not necessarily indicate, however, that larger meshes require more memory. No attempt has been made at this time to examine in detail the fast memory requirements; but this problem is worthy of further investigation.

The relationship between $T_R$ and $N_v$ is illustrated in Figure 10 for four pairs of mesh dimensions. The highest and lowest values of $T_R$ found are plotted for each value of $N_v$. In addition, selected solutions for $N_v$ equal to 3 and 8 are plotted, and the number of solutions found is printed above the highest point for each $N_v$. The additional points at $N_v$ = 3 and $N_v$ = 8 are selected to represent the densities over $T_R$ of the solutions obtained. It is seen that the density increases with increasing mesh size. Further tests are yet to be performed to determine whether the procedure of iterating p or q downward yields values of $T_R$ which would close the gaps in the higher regions.

Note that the highest points for the two smallest meshes form straight lines.
The six points in each of the graphs represent the same solution, with only the difference that for smaller $N_v$ the value of $T_R$ is greater and the fast memory requirement is lower. A solution for the problem M, N, $N_v$, d, $T_R$ is also valid for the problem M, N, $N_v - 1$, d, $\frac{N_v}{N_v - 1} T_R$, but it is not in general valid for the problem with $N_v + 1$ variables because of the requirement $N_v pq \le 4(b-2)$. The term "solution" here refers to the set of parameters {T, k, t, $\ell_m$, m, $\ell_n$, n, p, q, b, L, $W_{disk}$}, where L is the internal latency of the scheme. Note that L is independent of $N_v$ and $T_R$.

A straight line (with the slope indicated) may be drawn to the left from every solution on the graph; solutions exist along these lines. One such line is drawn on the first graph to indicate the existence of a solution at the point marked "x". This solution would also be found using the procedure of iterating p or q downward. This procedure might also yield solutions for $N_v$ = 3 which are invalid for higher $N_v$.

[Figure 10]

OM, ()m, L TC YM# LK YN> TR, TRM T N, TRM AX > INTEGER ARRAY THI I 5 I 1 i! » : 7 ] > H 1 ST t 1 I 20 ]* MMM, NNN [ 1 t 1 5 1 ; integer mv.tp»omax*k» 1 »p.u»nvpq»b# ktmi»lt»pm2>fbm2» unfil, N»m,nn>mm»LM,LN.
ILM# ILN*S» SUM,MAxPQ# I»DSKM#0SKN# FnGM,E0GN,Dl<^»LMKTM1,LNKTMl,ISZ#TI»D.ENPK,WEGSQ»WMBK » T*P* Q» MVPO* B # llNF IL * RULT # LTCY#SYM,SYM1> OM,DN,wMEM.DISK)J FORMAT FA(//2I5," = MM.NN CRITERION *",F5.3/ "N\/ Tp is m LN N TREAL D K T PC NVP Q B UNFIL R'lLT LTCY OMRL DNRL MFMRY DISK"}* FH(l2.v2#T2»2(x3M^»H)'X2.F5,2»Xi,Al»X?,T2»X3,i2*X3»n»IS»H» x3,T4,YJ.n,yi.l3,y?,E4.3,x<»#F<»,3»xl>2Al,X4,x3»2F5,1,I8,I5), Fwv("**«*«* 1PMTn="F5.2 m TRHAXe"F5,2 M SUMx"I5)> F MSTD(»FA<;T-Vf M hEylUREMENT I STR I BljT T ON" v5"N0 . OF SOL^S = "I6» *10 ,, F0 ( ? M I^"* ,, I 3// " Q-W"Y6"FWiO PCT ACCOM*/)* FHSTC 14. I 1 0,F 7 , 1 ) I THL F 5 , * 1 WITH 4,1 »b»b» 10> THLTft»*] WITH 5. l»3»t>» 7.91 TBL r 7 ,* l wlT H 5,1.3, f #«, 10; THL r«,* 1 WITH 5,1 »3*!>*8»VJ TBL r 9 » * 1 WITH 6, 1 , 3.b#6.9» tOJ T H L T 1 , * ] w T TH 6»l*3*S*6*7*10j TBLfll**] WTTH («■, 1 , 1,5#*» 7»fl » FILL FILL FILL FILL FILL FILL FILL FILL FILL FILL T H L r 1 2 » * I WTTH J»1»*,5»*#7»B»9; MMMf*1 wlTH 30. ?j*i*f 35. 70» U0»70» 1 10J CH*.12; FOR TS/«-< STFP 1 UNTIL 11 00 RE<", I M FOR I«-l STEP 1 Ll\T 1 1 20 DO HlST[I]«-OJ WRITECPTOHTTPAGF J ); MM*MMM[ ISZ ] S NN^NNNT ISZ1 J WRITE(PTOUT.FA,^M.NN.CR>; FOR NV«-3 STEP 1 UNTIL 8 DO R E G T N IF NVxMmxNn>210uOO THEN GO TO SKIP?I TRMIN»099J TRMAX4-0 I SU^«-01 FOR Tp^b.6 OH FOR T«-1 STfP 1 FOR KM STEP 1 FOR ILM«-1 S1EP BEGIN MM*MMMl I S7 J t NN*NNNt ISZ] J s y m l ♦ »• " ; KTMl<-K«TP-l ) R«-(LT«-l20nxT) D I V UMF IL*.T-KTMl*b; IF T=l THFN SYM>" FL*E IF 1?00 UNTTL 5 00 UNTIL 30 00 1 UNTIL TRL[TP»01 ktmi; DO MOO B s THEN SYM«-"E" 52 FlSE FlSE IF ?x(T-l)iUNFlL THFN SYM*"A« riEGTN B«-B-U UNFIL*UNFIL+KTM1 I SYM«-"B" LNO I BM2*B-2> FRM2*4kBM?J LM«-TRl.tTP» 1LM1J M4-LM*K-2J cyt » p«.(MM + M-l) DIV M) Q«-FrtM? 
DIV (NV*P)J IF 0=0 THFN GO TO SKIP; N«-(NN + U-n DIV Qi Q«-( NN+M-1 ) DIV N) IN«.(K + 1*|0 Dlv KJ NVPQ«-NVXPxQI RtlLT*IINF 11 /LT) DM4-M-MM/PI DN«-N-N\/G) TRE AL«-CR THFN GO TO SKlPI DMAX«-c2S6*bM2l DIV CENPK«-fl*NV*P*K ) J IF UPAXO 1HEW GO TO SKlPl D«-JJ WFGS0*ftx((UxE^PK*255) DIV ?56)J WMBSQ*4X( (NVPP*3) DTV 4}) WFQ«-(B*NV*WxD + 63) 1 V 64* - urM c /. c « W MEM«-3«WEf,SQ*WMBS0*WEQ + NV + (IF wEGSQ>WMBSO THEM WEGSO FlSE wMBSOJ IF wMFUM^OO THEN GO TO SkTPJ S«-((lF WMFM$2n00 THEN KMEm ELSE 20005-1) DIV 100 ♦ U D^KM*(M-ENTTEC(DM))x(LN DIV TP *l)l d<;kn«-n*(Lm DIV TP *l)J FOGM + I mx( (LN-1 ) DIV TP ♦ 1 ) J E0GN4-1 N*((IM-1 ) DIV TP *1)» DTSK*Tk((TF D<*M4« THFN GO TO SKlPI SYm1*"N" THEN H I G I N t i *l m; lm«-ln» ln«-tii T T 4- M i M*N» N*TI) Il4-p; P*QJ 0«-Tl) IR4-HMJ DM*DN) DN«-TR WR ITE ( PTOHT»FR*LST ) J IF TRMlN>TREAI THEN TRK I N*TRE AL > ir TRMAXSUM) FNOt MRTTF(PTnuTtPAGEl)| TOR T*l STFP 1 UNTIL *0 DO Pr T [ I ]*SUM*SUM*HI ST til J WR!TF(PTnijT#FHSTn»SUM#MMM[IS7]»NNNtISZ])J TF SHM>0 THEN FOR T«-l STFP 1 UMTIL ?0 DU HtGTN PCTtI]*PCT[ 1 1/SuMJ WRlTE(PlOUT»FHST»100Kl.HlST[I]#100)«PCTrI]) FNOt SKIP9I FND FNn.

LIST OF REFERENCES

[1] Barnes, G. H., et al., "The ILLIAC IV Computer," IEEE Transactions on Computers, Vol. C-17, No. 8 (August, 1968), pp. 746-757.