Report No. UIUCDCS-R-75-747

NSF - OCA - DCR73-07980 A02 - 0000011

ANALYSIS AND DESIGN OF INTERLEAVED MEMORY SYSTEMS

by

DONALD YI-CHUNG CHANG

August 1975

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

This work was supported in part by the National Science Foundation under Grant No. US NSF DCR73-07980 A02 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, August 1975.

ACKNOWLEDGMENT

I would like to express my sincere thanks to my advisor, Professor D. J. Kuck, for his constant encouragement, patience, and guidance. I would also like to thank Professor D. H. Lawrie for his valuable discussions and suggestions. Special thanks go to Mrs. Vivian Alsip for her help in preparing this thesis. I also wish to thank my wife, Li, for her excellent typing job as well as her long-time assistance during my study at the University of Illinois.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 What is an Interleaved Memory System
   1.2 Thesis Organization
2. HISTORY
   2.1 Early Works
   2.2 Hellerman's Model
   2.3 Burnett and Coffman's Other Models
   2.4 Ravi's Model
   2.5 Conclusion
3. DATA DEPENDENCY
   3.1 Data Dependency Types
   3.2 Summary
4.
OUR MODELS AND RESULTS
   4.1 Introduction
   4.2 Model I: Request Slicing
   4.3 Model II: Conflict Blocking
   4.4 Model III: Queueing in the Processor Units
   4.5 Model IV: Queueing in the Memory
5. COMPARISON OF ALL MODELS
   5.1 Comparison of Performances
   5.2 Comparison of Costs
   5.3 Data Dependency Types and Usefulness
6. LOGIC DESIGN
   6.1 The Problems
   6.2 Circuit Design of Conflict Resolution Box
      6.2.1 Leftmost-first Circuit
      6.2.2 Random-selection Circuit
   6.3 Switching Network Design
7. CONCLUSION

APPENDIX
   A. Expansion of Burnett and Coffman's Recursive Equation
   B. The Proof of S_j
   C. Simplification of Ravi's Bandwidth Equation
   D. Derivation of lim_{p->inf} (1 - 1/m)^p = 1/e^r
   E. Solution of f(m,s)
   F. Solution of g(m,s)
   G. IBM Random Number Generator

LIST OF REFERENCES

1. INTRODUCTION

1.1 What is an Interleaved Memory System

A traditional computer system consists of five components: a processing unit, a primary memory, a control unit, an input/output subsystem, and a secondary memory. The memory is the central element of the whole system, in the sense that it is the source or destination of all information flowing to or from the other portions of the system. Memory hierarchy and memory management are therefore two extremely important subjects in computer design, since they greatly influence the throughput and utilization of a computer system. For years, memory design has been a core problem of computer architecture.

Some computer systems have only one main memory module. In every memory cycle, only one word in the memory can be accessed. Since the memory is usually slower than the processor, the two are mismatched in speed, and the processor often lies idle waiting for the memory. Hence the throughput and utilization of the system are reduced significantly by the slowness of the memory.
Naturally, people have tried to solve this problem with hardware, that is, to construct faster memory from faster circuits. A lot of effort has been spent in this area trying to invent a faster logic component that can be used as a memory unit. Modern memory technology can now produce a memory with a cycle time under a hundred nanoseconds, for example a bipolar memory. However, faster memory is expensive, and cost becomes a discouraging factor, especially in a large computer system.

A simpler but better way is just to break the whole memory into several pieces and attach a decoder to each submodule. Then, given a somewhat more elaborate control unit, we can access several words at the same time, effectively reducing the memory cycle time several-fold. This scheme is called an interleaved memory system. The word "interleaved" comes from the fact that we can access several memory modules at the same time in a random fashion.

Interleaved memory systems have several beneficial effects: (1) each module is smaller, so the memory cycle time is slightly reduced by the shorter address decoding time; (2) several memory modules can be accessed at the same time; (3) they can be implemented with cheaper and easily replaced elements, and hence are more suitable for the modern trend of modular computer systems.

In a modern high-speed system, such as ILLIAC IV or the B6700, the main memory is divided into several modules. Such a modularized memory system also has the advantages of easy replacement and easy expansion. Some systems even use a bus structure to overlap the memory operations, thus effectively reducing the memory cycle time. So modularity is very similar to interleaving in some sense, since they achieve some common goals. We believe that interleaved and modular systems will become the main trend of memory design in the future. That is the reason we study this problem here.
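The interleaving idea can be made concrete with a small sketch. In the usual low-order scheme (an illustration of the general idea, not a design from this thesis), the module number is the address taken modulo m, so consecutive addresses fall into distinct modules:

```python
# Low-order interleaving: with m modules, word address a lives in module
# a mod m at local offset a // m, so m consecutive words sit in m distinct
# modules and can all be fetched in a single memory cycle.

def interleave(address, m):
    """Map a word address to (module number, offset within module)."""
    return address % m, address // m

m = 4
for a in range(8):
    module, offset = interleave(a, m)
    print(f"address {a} -> module {module}, offset {offset}")
```

The function name and the choice of low-order (rather than high-order) address bits are ours; the point is only that the decoder attached to each submodule sees a disjoint slice of the address space.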
1.2 Thesis Organization

In Chapter 2 of this thesis, we describe some of the work other people have done in this area and their models of interleaved memory systems. Most of this work has been done analytically; we will show how they derived their equations and also present our further results.

One thing no one else has explicitly considered before is the data dependency between requests. It is very important in real machines. In Chapter 3, we discuss several possible kinds of data dependency in real programs. Data dependency between requests constrains the usefulness of a model, and so it should be considered carefully.

In Chapter 4, we present four models we have worked on. A description of each model is given first, then we show the performances we have measured. All the models, including some of those studied by other people, are compared on the same basis in Chapter 5. Chapter 6 contains some circuit designs and system layouts for our models. Although our designs are very simple, they show the feasibility of our models. The last chapter is the conclusion of our work. In order to give a concise presentation, we put all the messy details, such as the derivations of equations, in the appendix which follows the conclusion.

2. HISTORY

2.1 Early Works

The idea of using multiple memory modules to increase system throughput arose more than ten years ago. In 1964, Ivan Flores wrote a paper [1] analyzing a multiple-bank memory system. He used queueing-theory arguments to derive the waiting-time factor of the system and showed how waiting time varies with respect to several other parameters. As you will see, the waiting-time factor is similar to the bandwidth, which we will define very soon.

Later, Michael Flynn [2] used a different approach to tackle the same problem. He studied a large number of recorded address traces of real programs and measured the "true" waiting time and the bandwidth of those programs.
He defined bandwidth as the retrieval rate of words from memory, that is, the average number of words that can be accessed in one memory cycle. This is what most people are interested in.

No doubt, Flynn's results are more realistic than Flores'. However, they only reflect the performance of the programs he analyzed. If we want to know the general performance of a certain system, we need to study a rather large number of programs with several different characteristics. This is very time-consuming. Work in this area is usually done by defining a model and then analyzing it mathematically or by simulation. Although most of the models people design are very abstract, they still give some flavor of how a factor will influence system performance.

2.2 Hellerman's Model

In [3], H. Hellerman presented a very simple interleaved memory system, as shown in Figure 1, and gave the system performance in terms of the average relative bandwidth. His definition of the average relative bandwidth is the same as the one Flynn used. Later, many people began to study interleaved memory systems in Hellerman's way and used the same definition of bandwidth as the system measurement. We will start with Hellerman's model and give a more detailed description of the work other people have done along this line.

Hellerman's model is an m-memory-module system with a single string of requests coming in. Each request is an integer from 0 to m-1 that specifies a particular module. All requests are assumed to be equally likely, or generated randomly, and the input stream always contains more than m requests. Then the average relative bandwidth can be identified with the average length of a string of distinct integers. The probability of a string of distinct integers with length k is:

   P_k = k (m-1)! / (m^k (m-k)!)

since the probability of choosing the first request is 1, the second request is (m-1)/m, the third request is (m-2)/m, ..., the kth request is (m-k+1)/m, and the (k+1)th request is k/m, since it must be equal to one of the previous k requests. Multiplying these together gives the above result. Then, by the definition of expectation, the average length (and hence the average relative bandwidth) is:

   B_av = Σ_{k=1}^{m} k P_k        (2.1)

Substituting P_k into equation (2.1) we get:

   B_av = Σ_{k=1}^{m} k^2 (m-1)! / (m^k (m-k)!)        (2.2)

Equation (2.1) is adopted as the definition of bandwidth by most people, and the effort focuses on finding the proper P_k for each model.

For 1 ≤ m ≤ 45, Hellerman found a good numerical approximation to equation (2.2) to be m^0.56, or approximately √m. The error is no more than 4.3%. This is the square-root concept that has been generally accepted: when you increase the number of memory modules in your system, the bandwidth grows as the square root of m. However, in Chapter 4, we will show that this is not quite true.

Very recently, Knuth and Rao have shown in [4] that a nice closed form for equation (2.2) can be mathematically proved to be:

   √(πm/2) - 1/3 + (1/12)√(π/(2m)) + O(m^-1)

which tells us that the average relative bandwidth of Hellerman's model is indeed asymptotic to the square root of m.

One thing Hellerman did not say about his model is how many processors he used to generate those requests. Apparently, he implicitly assumed that the requests were generated by one single processor; otherwise there is no reason why he should stop at the first conflict. If this is the case, the bandwidth should not grow as the square root of m indefinitely. No matter how fast the processor is, it can only generate a finite number of requests per memory cycle, say M requests, where M may be very large. When m > M, the bandwidth cannot grow any more, since the request supply is finite. So a reasonable graphic representation of Hellerman's model should look like Figure 2 below.

Figure 1. Hellerman's Model

Figure 2.
Bandwidth Curve of Hellerman's Model

Beyond the point m = M, the curve becomes a horizontal line B_av = √M.

On the other hand, you may say that in real machine design the number of memory modules used will be at most a few hundred, so we should still accept the square root of m as the fact. However, it is very easy to see that Hellerman's model is only good when there is no logical dependency between the requests. We will discuss this problem in the next chapter. Then we will propose a more realistic way of designing a model.

In a 1973 paper [5], G. Burnett and E. Coffman generalized Hellerman's model and showed a more interesting result. They used the same model, the same definition of bandwidth, and almost the same assumption about the requests, except that they put two probability parameters α and β into the input request stream. They then showed how α influences the bandwidth in addition to m. They assumed that the probability of a request addressing the next module in sequence (modulo m) is α and the probability of addressing any other module out of sequence is β, where β = (1-α)/(m-1). Or formally, let r_1, r_2, ..., r_i, r_{i+1}, ... denote the input request stream; then

   P(r_1 = j) = 1/m,  j ∈ K_m = {0, 1, ..., m-1}
   P(r_{i+1} = (r_i + 1) mod m) = α,  i = 1, 2, 3, ...
   P(r_{i+1} = n) = β,  n ∈ K_m and n ≠ (r_i + 1) mod m

For example, when m = 8 the 8-length sequence 06723145 would have probability (1/8) α^3 β^4. By the pigeonhole principle, the longest possible sequence would be of length m. The successive addressing is called an α-transition and the other a β-transition. Using another form of equation (2.1), they define their bandwidth as:

   B_av = Σ_{k=1}^{m} P_k(w ≥ k)

where P_k(w ≥ k) is the probability of a sequence with length at least k. Then they got the bandwidth equation:

   B_av = Σ_{k=1}^{m} Σ_{j=0}^{k-1} α^j β^(k-1-j) C_m(j,k)

where C_m(j,k) is the total number of k-length sequences with j α-transitions, k-1-j β-transitions, and r_1 = 0. It is very easy to show that the inner summation is the probability P_k(w ≥ k). Since the number of sequences beginning with 0 is the same as the number of sequences beginning with any other number, they only count a special class of sequences and cancel the 1/m in their equation. Note that a k-length sequence with j α-transitions has probability (1/m) α^j β^(k-1-j).

The counting of C_m(j,k) is a very interesting combinatorial problem. Unfortunately, they solved this combinatorial problem by using a state-transition argument and reached a rather complicated result. They first derived a theorem which says that for a fixed ordering of j α-transitions, the total number of k-length sequences with j α-transitions beginning with a 0 is the same as that for any other ordering. Let C0_m(j,k) be the number of k-length sequences that begin with a 0 and have all α-transitions occurring at the first j transitions. Then this theorem can be written as:

   C_m(j,k) = C(k-1, j) C0_m(j,k)        (2.3)

where C(k-1, j) denotes the binomial coefficient, the total number of orderings.

Then by using the following three facts:

   C0_m(j,k) = C0_{m-j}(0, k-j)        (2.4)
   Σ_{j=0}^{k-1} C_m(j,k) = (m-1)_{k-1}        (2.5)
   C_m(0,k) = C0_m(0,k)        (2.6)

they first got the recursive solution for C0_m(0,k) from equations (2.5), (2.6) and (2.3):

   C0_m(0,k) = (m-1)_{k-1} - Σ_{j=1}^{k-1} C(k-1, j) C0_{m-j}(0, k-j)

where (m-1)_{k-1} is the falling factorial of k-1 terms, i.e., (m-1)(m-2)...(m-k+1). Substituting equation (2.4) into (2.3), they got the solution for C_m(j,k):

   C_m(j,k) = C(k-1, j) C0_m(j,k) = C(k-1, j) C0_{m-j}(0, k-j)

where C0_{m-j}(0, k-j) can be evaluated by the above recursive equation. The boundary conditions are:

   C0_m(0,1) = 1
   C0_m(k-1,k) = 1

Although they presented an evaluation ordering for these numbers so that no number will be calculated twice, their equation still takes a long time to solve.

Later, in a short note [6], H.
Stone used an inclusion-exclusion argument to get a direct solution for C0_m(j,k). If a_i is defined to be the property that the ith transition is an α-transition, then

   C0_m(0,k) = N(a_1' a_2' ... a_{k-1}') = Σ_{j=0}^{k-1} (-1)^j S_j

where S_j = Σ N(a_{i1} a_{i2} ... a_{ij}). This formula can be found in any combinatorics book, e.g., Chapter 4 of [7]. Stone claimed that S_j = C(k-1, j) (m-j-1)_{k-1-j} and got the solution:

   C0_m(0,k) = Σ_{j=0}^{k-1} (-1)^j C(k-1, j) (m-j-1)_{k-1-j}

By substituting this into equation (2.3) we can get

   C_m(j,k) = C(k-1, j) Σ_{n=0}^{k-1-j} (-1)^n C(k-1-j, n) (m-j-n-1)_{k-1-j-n}        (2.7)

This is indeed an improvement over Burnett and Coffman's result in terms of computational complexity. However, Stone did not show how he got S_j. Although Burnett and Coffman got a recursive solution, their equation can be expanded into Stone's result. We show this expansion in Appendix A. This reveals that their results support each other.

Actually, the derivation of S_j is not a trivial problem. In Appendix B, we show a way to derive S_j. Our result shows that S_j is indeed equal to C(k-1, j) (m-j-1)_{k-1-j}, and this completes what Stone did not do in his paper. As a matter of fact, if we know S_j we can just plug it into another inclusion-exclusion formula and solve the original problem, i.e., C_m(j,k), even more directly. The formula is:

   e_j = S_j - C(j+1, j) S_{j+1} + C(j+2, j) S_{j+2} - ... = Σ_{n=0}^{k-1-j} (-1)^n C(j+n, j) S_{j+n}

where e_j is the number of objects that have exactly j properties and S_j is the number of objects that have at least j properties. If we define a property to be that an α-transition must occur at a certain position in a k-length sequence, just as Stone did, then e_j = C_m(j,k) and S_j is what we proved in Appendix B. So

   C_m(j,k) = e_j = Σ_{n=0}^{k-1-j} (-1)^n C(j+n, j) C(k-1, j+n) (m-j-n-1)_{k-1-j-n}

Notice that all sequences considered here begin with a 0. By a simple manipulation of those binomial coefficients, it is very easy to show the above result is the same as Stone's result, equation (2.7).
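Stone's closed form for C0_m(0,k) is easy to check numerically against a direct enumeration. The sketch below (ours, not part of the original thesis; function names are hypothetical) counts k-length sequences of distinct module numbers that start with 0 and contain no α-transition, both by the formula and by brute force:

```python
from itertools import permutations
from math import comb

def falling(x, n):
    """Falling factorial x(x-1)...(x-n+1); the empty product is 1."""
    out = 1
    for i in range(n):
        out *= x - i
    return out

def c0_closed(m, k):
    """Stone's closed form: number of k-length sequences of distinct
    integers mod m that start with 0 and have no alpha-transition."""
    return sum((-1)**j * comb(k - 1, j) * falling(m - j - 1, k - 1 - j)
               for j in range(k))

def c0_brute(m, k):
    """Direct enumeration of the same quantity."""
    count = 0
    for tail in permutations([x for x in range(m) if x != 0], k - 1):
        seq = (0,) + tail
        # an alpha-transition is a step to the next module (mod m)
        if all(seq[i + 1] != (seq[i] + 1) % m for i in range(k - 1)):
            count += 1
    return count

for m in range(2, 7):
    for k in range(1, m + 1):
        assert c0_closed(m, k) == c0_brute(m, k), (m, k)
print("Stone's closed form matches brute-force enumeration")
```

For instance, c0_closed(3, 3) = 1: of the two candidate sequences 012 and 021, only 021 avoids successive addressing.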
However, our solution is more direct and doesn't need Burnett and Coffman's theorem.

For the limiting case α = β = 1/m, Burnett and Coffman's bandwidth equation can be found to have the same numerical value as Hellerman's result.

Although Burnett and Coffman did not give a nice result for their model, they did show a very important phenomenon about the serial correlation between requests. That is, when α increases, the bandwidth increases exponentially. They claimed that for most programs α is about 0.25, so the bandwidth will be higher than Hellerman's result when m > 4. All we said in this section can be found in [8], and this problem is now finished. Later, Burnett and Coffman wrote several more papers about interleaved memory systems, and we describe them in the next section.

2.3 Burnett and Coffman's Other Models

In [9], Burnett and Coffman continued to consider Hellerman's model in more detail. However, they changed the strategy. They separated the data and the instruction requests and considered their individual bandwidths DB_av and IB_av. The resultant bandwidth is the sum of these two, which shows the average number of memory modules in operation on data or on instructions during a memory cycle.

In a real program, we usually put successive instructions in successive memory modules, so there won't be any conflict due to instruction references. But there will be branch instructions in the program which cause some instruction requests generated during a memory cycle to be wasted. They called this internal waste. In their analysis, they introduced a parameter λ to denote the probability that a branch will occur during a memory cycle, or the probability that internal waste occurs. Then it is easy to get:

   IB_av(n_i, λ) = Σ_{k=1}^{n_i} k (1-λ)^(k-1) λ

where n_i is the number of instruction requests scanned in one memory cycle. For data requests, they just applied the result of [5], obtaining DB_av as a function of α and n_d, the number of data requests considered in one cycle. Combining IB_av and DB_av yields the total bandwidth. This scheme allows referencing different numbers of instruction and data addresses during one memory cycle. Since the resultant bandwidth is a function of α and λ as well as n_i and n_d, for fixed α and λ we can adjust n_i and n_d to reach the maximum bandwidth.

This paper seems to be closer to the real world. However, control might be a big problem.

Later, Burnett, Coffman and Snowdon presented an interleaved memory model using a queue to store blocked requests [10]. They do not stop accessing at the first conflict; instead they just put the blocked requests into a conflict buffer and keep going until either all memory modules are busy or the buffer is full. In the next cycle, the system scans the buffer first, then goes on to the input stream. They showed the influence of the queue length on the resulting bandwidth. Their result indicates that the use of a queue improves the performance.

However, they treated the input stream as a sequence of independent module numbers. This is not true in real programs, so the result of this model isn't very practical. But it did give a very good suggestion for a system design. In this paper, they didn't give an analytical solution, since one is very difficult to obtain due to the presence of a queue. They used simulation to solve the problem. In Chapter 4 we will show four of our models, two of which have queues in the system, and we also use simulation to get the result.

Recently, they published another paper [11] which uses a technique called Group Request Structure that combines the techniques shown in their last two papers. They used a model with a conflict buffer like that in [10]. However, there are two input streams coming in, one for instruction requests and the other for data requests. In every memory cycle, they analyze M_i instruction requests.
Just as in [9], they assumed that the instruction requests do not have conflicts except with a branching probability λ. So analytically, instruction requests contribute Σ_{k=1}^{M_i} k (1-λ)^(k-1) λ to the total bandwidth.

As for data, they used the new model and simulated it by the method of [10]. The input data requests they generated have a probability α that an α-transition occurs. In every cycle, at most M_d data requests will be satisfied. Again, when a conflict occurs the later request is put into the buffer.

The final bandwidth is the sum of the instruction bandwidth and the data bandwidth, which is a function of M_i, M_d, α, λ, and the queue length L. By changing M_i, M_d and L, they can reach a maximum bandwidth for fixed α and λ. They claimed that this model has the best performance. However, it is even more artificial because of the adding of bandwidths.

2.4 Ravi's Model

In a short note [12], C. V. Ravi presented a model of a multiprocessor with an interleaved memory system. He described a system with p processors and m memory modules. In every cycle, each of the p processors generates a request ranging from 0 to m-1. He predicted the average memory bandwidth by mathematically figuring out the average number of distinct numbers among these p module numbers. This is a well-known combinatorial problem and the solution is:

   B_av = Σ_{k=1}^{t} k k! S(p,k) C(m,k) / m^p        (2.8)

where t = min(m,p) and S(p,k) is the Stirling number of the second kind. k!S(p,k) is the number of ways to put p distinct objects (requests) into k distinct boxes (modules) with each box containing at least one object. C(m,k) is the number of ways to choose k modules out of m modules.

Ravi showed some performance curves for his model. But he did not plot his bandwidth curves against fixed ratios of m and p, and that is something interesting he missed. Actually, equation (2.8) can be reduced to a very simple closed form, that is:

   B_av = m [1 - (1 - 1/m)^p]

We show this derivation in Appendix C. If we plot this result against p for fixed ratios r = p/m, we get a family of curves as shown in Figure 3. As you can see, all these curves are almost linear. This is not surprising, since you can further reduce the above equation to an asymptotic form by using the fact

   lim_{p->inf} (1 - 1/m)^p = 1/e^r

We repeat the derivation of this limit value in Appendix D. So the asymptotic equation is:

   B_av = m (1 - 1/e^r) = α(r) m = β(r) p

which is a linear function of either m or p. Obviously, this result is quite different from what we would expect by using Hellerman's square-root concept. In Chapter 4, we will show that our models have the same result. Of course, you might say that Ravi used p processors rather than 1 and that's why he got a linear result. But Hellerman assumed that there are always more than m requests available, or equivalently, that the processor can generate at least m requests per memory cycle. So we might compare Ravi's result against Hellerman's result with M = p.

In order to compare them, we plot the B_av against m curve of Ravi's model for a certain p in Figure 4, superimposed with Hellerman's result when M = p. Both of them are analytical solutions. Two things can be seen from this diagram: first, Ravi's bandwidth increases exponentially and approaches p when m gets large, and second, Ravi's curve is much higher than Hellerman's curve.

Figure 3. Bandwidth Curves of Ravi's Model for some Fixed Ratios of p and m

Figure 4. Comparison Between Ravi's Model and Hellerman's Model

Although Ravi claimed that his model allows queueing in the memory modules, he did not explain how to build these queues and what the impact of these queues is. Besides, he failed to explain how to handle the conflicts and what happens to those blocked requests that cannot be satisfied in the present cycle, i.e., he ignored them.
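The equivalence of equation (2.8) and the closed form can be confirmed numerically. The following sketch (our own check, not from the thesis; names are ours) computes the Stirling numbers by the standard recurrence S(p,k) = k S(p-1,k) + S(p-1,k-1) and compares the two expressions:

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def stirling2(p, k):
    """Stirling number of the second kind S(p, k), by the usual recurrence."""
    if p == k:
        return 1
    if k == 0 or k > p:
        return 0
    return k * stirling2(p - 1, k) + stirling2(p - 1, k - 1)

def bandwidth_stirling(m, p):
    """Ravi's equation (2.8): expected number of distinct modules among
    p independent uniform module requests."""
    return sum(k * factorial(k) * stirling2(p, k) * comb(m, k)
               for k in range(1, min(m, p) + 1)) / m**p

def bandwidth_closed(m, p):
    """The simplified closed form m [1 - (1 - 1/m)^p]."""
    return m * (1 - (1 - 1 / m)**p)

for m, p in [(2, 2), (4, 8), (8, 4), (16, 16)]:
    assert abs(bandwidth_stirling(m, p) - bandwidth_closed(m, p)) < 1e-9
print("equation (2.8) agrees with the closed form m[1 - (1 - 1/m)^p]")
```

The closed form also makes the near-linearity plain: holding r = p/m fixed, bandwidth_closed(m, r*m) grows almost exactly as m(1 - e^-r).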
But his paper did give us some inspiration in building our models, and the linearity shown in Figure 3 indeed triggered us to find out whether it holds for all other models or not.

2.5 Conclusion

All the models introduced in this chapter have a common flaw: none of them mentions the data dependency that may exist between the requests. Thus they give people the illusion that their models can be fitted to any machine. However, this is not true. For example, let us take a look at the following piece of assembly code which performs the task A = B + C:

   a  LOAD B INTO ACCUMULATOR
   b  LOAD C INTO AUXILIARY REGISTER
   c  ADD
   d  STORE THE RESULT INTO A

where a, b, c, d are assumed to be the addresses of these four instructions. Certainly, when we execute this sequence, we cannot execute (but can fetch) d until c is finished. In other words, there is a data dependency between these two instructions. Using our terminology, we will have an input request stream like:

   a, B, b, C, c, d, A

If we have a very fast and fancy processor, probably using a look-ahead scheme, we can generate all these seven addresses in the same memory cycle. Even though all these requests refer to different memory modules, i.e., there is no conflict, we can't execute STORE A because of the data dependency between c and A. So when we feed this string into any model discussed in this chapter, the real result is not what they would predict. By "real" result we mean the result if we run the program correctly.

There are also some other kinds of data dependency. Data dependency will greatly constrain the usage of a model. Although nothing is wrong with all these analyses, we must keep in mind that no model is "universal". So before we go on to our models, we will study some data dependency problems. Then we can use the result to judge the usefulness of all the models, including ours.

3.
DATA DEPENDENCY

3.1 Data Dependency Types

As far as memory design is concerned, the data dependencies in real machines can be classified into four different types. We depict them graphically in Figure 5. A small circle represents a request (or an address) and an arrow indicates the dependency relation. For example, if in Figure 5 there is an arrow from a to b, this means request a cannot be serviced until request b has been satisfied. A horizontal string of circles represents requests that will be investigated at the same time. They might be generated either by separate processors or by one fast processor. A vertical string always consists of requests generated by the same processor.

Figure 5. Four Types of Data Dependency

Type A is the only one with horizontal data dependency. That means there are data-dependent relations between the requests scanned at the same time. This is the worst case we can have, since we must do everything one by one. The horizontal arrows might not all appear at the same time; here we only show all the possibilities by drawing all the arrows explicitly. Usually, type A occurs when the machine is running a serial program. Of course, we could draw the picture as a single vertical string, but we fold the string in order to cover the case when several processors are running one program. All standard serial machines fall into this category. Here is a FORTRAN example of type A data dependency:

      DO 10 J = 1, P*N
   10 M(I(J)) = M(J) + G

All assignment statements must be executed serially.

Type B data dependency arises when some (or all) of the requests at the second level have data-dependent relations with some (or all) of the requests at the first level. The relations might not be one-to-one and might change every cycle. So in this case, the first level must be completely done before the second level can be served.
In Figure 5, we represent this by showing only one vertical arrow. This type of request usually arises in a vector machine, such as the CDC STAR. The first vector must be completely done before the second vector can be launched. For example, suppose we have a matrix M stored in p memory modules in the standard way, i.e., one column per module, and we do the following double DO loop:

      DO 10 K = 2, N
      DO 10 J = 1, P
   10 M(I(K), I1(J)) = M(K, J)

Since the subscripts on the left-hand side are variables, the best we can do is to proceed row after row. After we fetch a row, we must store all p elements into memory before we can fetch another row. This is exactly type B data dependency. You can find a lot of type B data dependency in differential equation problems.

For type C, there is no arrow across the vertical boundaries. The data-dependent relations only occur in vertical strings. All requests along a vertical string must be done serially. This type happens when several processors are running separate tasks (or programs). Some multiprocessor machines are operated in this fashion. Here is a FORTRAN example with type C data dependency:

      DO 10 K = 1, P
      DO 10 J = 1, N
   10 M(K, I(J)) = M(K, J)

It is easy to see that the permutation in a row must be done one element at a time, but all rows can be operated on simultaneously.

The last type is the one with no data-dependency relation anywhere. All the requests are generated independently. Although this type is not found very often in real machines, it is possible. It is easy to analyze and has been used in many previous models. It usually occurs at the beginning of a program, when we initialize the program. For example:

      T = 0
      DO 10 I = 1, 100
      A(I) = 1
      DO 10 J = 1, 50
   10 B(I,J) = 0
One example of this type of data dependency in the memory for programs with more data dependency is the IBM 360/9I1 where the independency between requests may be made by using Tomasulo's algorithm in the processor, queues and tags. All cases in the real machines can be classified into these four types, and they might be used to justify the usefulness of a model. 3.2 Summary Table 1 below shows the machine type and interleaved memory models most suitable for a certain type of data dependency. We have already ex- plained the second column in the last section. Now we are going to explain how we fit Hellerman's, Burnett, Coffman and Snowdon's and Ravi's models into third column. If you recall, Hellerman's model stops at the first conflict and assumes no data dependency up to that point. The only explanation of this operation is that all the incoming requests are independent so it can keep going until a conflict is met, and the control unit of this model is so simple that it doesn't know how to handle the conflict, hence it stops. So we fit it into type D according to the data dependency. zh Snowdon, Burnett and Goff nan's model also ignores the data dependency. It throws the blocked requests into a conflict buffer for later processing and keeps looking for more requests. So their model is also classified as type D. As for Ravi's model, all p processors will generate a new request independently in every cycle. So we fit it into type G. However, he did not explain how to handle the conflicts. Apparently, he just throws those blocked requests away since there is no queue provided in his model. This is a big flaw. But here we are only concerned with the data dependency type, so we place it in type G group. In the next chapter, we will present four interleaved memory models. Primarily, we design them for solving different types of data dependency. We also put them in Table 1 for later reference. The reason will become apparent after we describe them. 
Data Dependency Type | Machine Type                                                 | Interleaved Memory Models
A                    | Old standard serial machines                                 | None
B                    | Vector machines, e.g. CDC STAR                               | I, II
C                    | Standard multiprocessor machines executing independent tasks | Ravi, III
D                    | Tomasulo type machines, e.g. IBM 360/91                      | Hellerman, Coffman, IV

Table 1. Data Dependency Types with Their Associated Machine Type and Most Suitable Interleaved Memory Models

4. OUR MODELS AND RESULTS

4.1 Introduction

All the models described in Chapter 2 are based on one assumption: the system has an infinite supply of requests and the processors never run out of requests. In other words, the system is always in a steady state. This is certainly not always true in real machines. In our analysis, we also consider the case where the request supply is finite, and we call this a transient state. In a transient state, a fringe effect occurs, and we are going to show how this fringe effect influences the performance of the system.

In this chapter, we will present four interleaved memory systems we have been working on. The approach we use is more realistic than others. Both steady state and transient state performances will be analyzed. One thing that nobody else has worried about before, viz. how to queue the memory conflicts, will be considered here. This is an important factor which will greatly influence the design and use of a model.

All our models are multiprocessor and multimemory systems, i.e. each has p processors and m memory modules. The basic difference between these four models is the way they handle requests. As we said in the last chapter, each model can only fit a certain class of machines. For example, Model I and Model II will satisfy all requests generated in one cycle before going on, so they can be implemented in a vector machine like CDC STAR where all processors are running the same program and data dependency is important.
Model III and Model IV allow every processor to generate a new request in every cycle if possible, and the blocked requests will be stored in waiting queues. So they can only be used in an MIMD machine, such as the B6700, where each processor is running an independent job, or in some special cases on an SIMD machine, or in a vector machine where vectors have tags.

The reason we limit our models to a certain class of machines is to fit the type of data dependency between requests. We pointed this out in the last chapter. We are not proposing a best way of designing interleaved memory systems here. Instead, we are trying to tackle this problem in a more realistic way in order to give the flavor of how to design a proper system for a particular machine.

In the following sections, we will first give a description of each model, followed by analytical or simulation results. Then comparisons of their performances and costs will be given in the next chapter, which can be used as a design guide.

In all our models, we will assume that all memory modules operate synchronously and with identical cycle time. The requests generated by the processors will be the memory module numbers instead of conventional storage addresses. The definition of bandwidth we are going to use is slightly different: we will divide the total number of requests satisfied by the total number of memory cycles required. So we are viewing things from a macroscopic point of view. But essentially, this is the same as the other definitions.

4.2 Model I: Request Slicing

Figure 6 shows the logic structure of our Model I. There are p processors and m memory modules. Single lines represent the control flows and double lines represent the data flows. At the beginning of a cycle, each processor will generate a request which is an integer from the set {0, 1, ..., m-1}.
All p requests will be sent into a conflict resolution box, which handles the memory conflicts that occur when two or more processors generate the same module number. Those requests that will be served then go to a switching network, such as a crossbar alignment network, which will switch them to the proper memory modules. Those which are blocked will be reprocessed in the next memory cycle. After all these p requests have been satisfied, the processors are allowed to generate a new burst of requests. This mode of operation is the way of handling type B data dependency, so this model can be used in type B machines. The details of how the conflict resolution box works can be seen in Chapter 6, where we give a design for it.

If it is a read operation, the data fetched from the memory must also be switched back to the processors. So the switching network must be bi-directional or have two copies. Notice that on the way back there is no conflict problem, so the returning information can be sent to the switching network directly.

Let's take an example to see how this model operates. Suppose we have 8 processors and 8 memory modules. The 8 requests generated are:

3, 4, 0, 1, 3, 4, 7, 4

These requests can be viewed as a distribution histogram, shown in Figure 7. We may slice them into 3 pieces and satisfy them in 3 memory cycles. This is why we call this model a "request slicer". Obviously, the bandwidth of this model depends on the maximum height of the request distribution. In the above case, the maximum height is 3, so we must spend 3 memory cycles to access 8 words. Consequently, the bandwidth (in words/cycle) is 8/3 = 2.67. So the performance, or the average bandwidth of the model, can be found by figuring out the average height of the request distribution.

This is an interesting combinatorial problem. We derived an analytical solution for the average height, shown as equation (4.1), and we will explain every term in this expression.
The average height is:

    H_av. = SUM[h=1..p] h * P(h)

          = SUM[h=1..p] h * ( SUM[j=1..floor(p/h)] ( PROD[k=0..j-1] C(p - k*h, h) ) * C(m, j) * f_{p-j*h}(m - j, h - 1) ) / m^p        (4.1)

where C(n, k) denotes the binomial coefficient. The function P(h) is the probability that the maximum height of the distribution of p requests is h. We multiply it by h and sum over all possible h's, which gives the average height of the distribution H_av.. The expression in the numerator of P(h) is the total number of sequences such that at least one module number occurs exactly h times and the other numbers occur no more than h times. The denominator m^p is the number of all possible sequences with repetitions. The ratio gives the probability function. Notice that there might be more than one number that occurs h times; that is why there is a summation in the numerator. The maximum number of such numbers (each occurring exactly h times) is floor(p/h), which serves as the upper limit of the summation. The product of the binomial coefficients is the total number of ways to choose j*h positions out of p positions so as to distribute those j numbers such that each occurs exactly h times, or equivalently, to assign j different requests to j*h processors so that every h of them have the same request. C(m, j) chooses the j numbers from the m possible numbers. And f_{p-j*h}(m - j, h - 1) takes care of the rest of the positions. Hence the numerator indeed covers all the possible cases.

[Figure 6. Block Diagram of Model I]

[Figure 7. An Example of Request Slicing]

The function f_n(m, s) is the number of ways to distribute n distinct objects into m distinct boxes such that each box contains at most s objects. Here the objects are processors and the boxes are memory modules.
We use it in equation (4.1) to count the number of ways to place, with repetition, the remaining m-j numbers into the remaining p-j*h positions with the restriction that the maximum occurrence is h-1, or equivalently, to assign requests to the remaining p-j*h processors so that no more than h-1 of them have the same request. The solution of this function can be found in Chapter 4 of [13], and we repeat it in Appendix E for reference.

After finding H_av., we can calculate the average bandwidth by B_av. = p / H_av., as we explained at the end of the last section. Since the analytical result is essentially the steady state result, we put a superscript ss. to denote this fact.

Figure 8(a) shows the B^ss_av. versus m curves for various p values. Apparently, we can increase the bandwidth either by increasing the number of processors or by increasing the number of memory modules, or both. From this diagram, we can see the former effect is slightly greater than the latter. As Ravi argued in his paper, Hellerman's model assumes that there are at least as many requests in the input stream as there are memory modules, and when p = m his model is much better than Hellerman's. As you can see from Figure 8(a), our model shows the same thing, although the bandwidth is not as good as Ravi's. However, our Model I is more realistic.

[Figure 8. Bandwidth Curves of Model I: (a) Steady State Bandwidth Curves, (b) Normalized Bandwidth Curves]

Figure 8(b) shows the normalized bandwidth B^ss_av. / m plotted against m. This graph shows the utilization of the memory modules. When p = m, the utilization is quite low. This is what you would expect if new requests cannot be generated until the old ones have all been served.
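The combinatorics above can be checked numerically. The sketch below is our own implementation of equation (4.1), with f_n(m, s) computed by a small recursion rather than the closed form of Appendix E, and compares the result against brute-force enumeration of all m^p request sequences; all function names are ours.

```python
from functools import lru_cache
from itertools import product
from math import comb

@lru_cache(maxsize=None)
def f(n, m, s):
    """f_n(m, s): ways to distribute n distinct objects into m distinct
    boxes with at most s objects per box (the function of Appendix E)."""
    if n == 0:
        return 1
    if m <= 0:
        return 0
    # decide how many of the n objects go into the last box
    return sum(comb(n, k) * f(n - k, m - 1, s) for k in range(min(n, s) + 1))

def avg_height(p, m):
    """Average maximum height H_av of the distribution of p requests
    over m modules, per equation (4.1)."""
    total = 0
    for h in range(1, p + 1):
        num = 0
        for j in range(1, p // h + 1):
            ways = comb(m, j)          # choose the j numbers occurring h times
            rest = p
            for k in range(j):         # choose their positions, h at a time
                ways *= comb(rest, h)
                rest -= h
            ways *= f(p - j * h, m - j, h - 1)   # fill the rest, max h-1 each
            num += ways
        total += h * num
    return total / m ** p

def avg_height_brute(p, m):
    """Exhaustive check: average of the max multiplicity over all m^p sequences."""
    tot = 0
    for seq in product(range(m), repeat=p):
        tot += max(seq.count(x) for x in set(seq))
    return tot / m ** p

print(avg_height(3, 2))        # 2.25
print(avg_height_brute(3, 2))  # 2.25
```

The Model I steady-state bandwidth of Figure 8(a) is then `p / avg_height(p, m)`, keeping with B_av. = p / H_av. above.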
4.3 Model II: Conflict Blocking

One other possible way to handle the conflicts is just to block all those requests that cause a conflict and satisfy the others in the present cycle. All blocked requests will be piled up in a separate area, and will be served one by one in the following cycles. After all p requests have been satisfied, the processors will generate another set of requests. This is the second model we analyzed, and it also belongs to type B. The logic structure is shown in Figure 9. Here we substitute the conflict resolution box by a conflict detection box, which is a combinational circuit that signals a 1 to those requests which cause a conflict. The design of this box is simple but costs a lot of gates.

If we use the same example as in the last section, we need 6 memory cycles to satisfy these requests, hence the bandwidth is reduced by one-half. The order in which these requests will be served is shown in Figure 10.

We also have an analytical solution for this model. The average number of cycles needed to satisfy p requests is:

    C_av. = ( (m)_p + SUM[c=3..p+1] c * C(p, p-c+1) * (m)_{p-c+1} * SUM[j=1..floor((c-1)/2)] C(m-p+c-1, j) * g_{c-1}(j, 2) ) / m^p

where (m)_k = m(m-1)...(m-k+1) is the falling factorial.

[Figure 9. Block Diagram of Model II]

[Figure 10. The Service Order of the Same Example Using Model II]

(m)_p is the number of sequences with p distinct numbers, which can be satisfied in only one cycle. Since no sequence will be satisfied in two cycles when m > 1, the summation goes from 3 up to the worst case of p+1 cycles (when all p requests conflict). C(p, p-c+1) * (m)_{p-c+1} is the number of ways to choose p-c+1 distinct numbers to be the p-c+1 requests that have no conflict and can be satisfied in the first cycle. The summation in the numerator is the total number of ways to select j numbers out of the remaining m-p+c-1 numbers and distribute, with repetition, these j numbers into the remaining c-1 positions with each number occurring at least twice. The upper limit floor((c-1)/2) is the maximum possible number of such events.
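Model II's cycle count is easy to simulate: every request whose module number is unique is served in the first cycle, and each request involved in a conflict then costs one further cycle (so, under this reading, a burst in which every request conflicts takes p + 1 cycles). A sketch with our own function names:

```python
from collections import Counter
from itertools import product

def model2_cycles(requests):
    """Model II (conflict blocking): requests whose module number is
    unique are satisfied in the first cycle; every request involved in
    a conflict is piled up and served one per cycle afterwards."""
    counts = Counter(requests)
    blocked = sum(c for c in counts.values() if c > 1)
    return 1 + blocked

# The running example: 3,4,0,1,3,4,7,4 -> 0, 1, 7 in cycle 1, then 5 more cycles.
example = [3, 4, 0, 1, 3, 4, 7, 4]
print(model2_cycles(example))                  # 6
print(len(example) / model2_cycles(example))   # bandwidth 8/6 ≈ 1.33

def model2_avg_cycles(p, m):
    """Exhaustive average C_av over all m^p request sequences."""
    return sum(model2_cycles(s) for s in product(range(m), repeat=p)) / m ** p
```

For small p and m, `model2_avg_cycles` agrees with the closed-form average above, which is how we checked the reconstruction of the equation.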
The function g_n(m, s) is the number of ways to distribute n distinct objects into m distinct boxes where each box contains at least s objects. In our case, s = 2. Again, the solution can be found in Chapter 4 of [13], and we repeat it in Appendix F.

The bandwidth B^ss_av. is equal to p divided by the average number of cycles C_av.. In Figure 11, we show the bandwidth and the normalized bandwidth of our Model II. Both of these quantities are much smaller than those of Model I, as we would expect. One interesting thing is that when we plot the bandwidth curves against p, they show maximum values at p = sqrt(m). This is shown in Figure 11(a). In other words, if you want to use Model II, you had better use p^2 memory modules, or use sqrt(m) processor units.

Since the bandwidth and utilization of Model II are much worse than Model I's and the cost is often higher, we will discard this model. Hence in Chapter 6, we will not show the design of this model.

[Figure 11. Bandwidth Curves of Model II: (a) Steady State Bandwidth Curves, (b) Normalized Bandwidth Curves]

4.4 Model III: Queueing in the Processor Units

In Models I and II, we did not use any queueing technique. Since Burnett and Coffman have shown that better results will be obtained if queues are used, we construct two other models which use queues in two different places to store blocked requests. However, their use will be different.

Figure 12 shows the logical structure of our Model III. Basically, the structure is very similar to Model I except that we build a queue in each processor unit. At the beginning of each cycle, every processor will generate a new request into the queue if the processor queue is not full. Then all the first elements of the queues will be processed. Again, these requests go into a conflict resolution box.
If there is a conflict, the circuit will choose one request to be honored according to some criterion, e.g. leftmost first, random selection, round robin, etc. Rejected requests will remain at the heads of the queues. Those which are selected will be sent into the switching network and dispatched to the address registers of the proper memory modules, and all other requests waiting in the queues will move one place forward. Then another cycle begins.

Figure 13 shows some history of a 4-processor and 4-memory-module system with queue length 3. COUNT records how many requests have already been generated by each processor. The circled numbers are the newly generated requests. We see that in the second cycle one of the processors does not generate a new request, since its queue is full. Here we use the leftmost-first algorithm. It is easy to see that this model is very suitable for handling type C data dependency. As a matter of fact, this model is designed for type C programs.

[Figure 12. Block Diagram of Model III]

[Figure 13. An Example of the History of Processor Queues in a 4-Processor System]

Since an analytical solution for this model is very difficult to get, we used Monte Carlo methods to simulate it. The random number generator we used to generate the requests is the IBM random number generator RANDU (see Appendix G) with the initial value suggested by [14].

The first simulation result of Model III is given in Figure 14, which shows the transient state bandwidth B^T_av. versus H, the maximum number of requests each processor generates, when p = m. The transient state bandwidth is defined as the total number of requests, i.e. H*p, divided by the time to finish the whole process.
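The simulation is easy to reproduce. The sketch below is our own Monte Carlo rendering of Model III with leftmost-first resolution (using Python's generator in place of RANDU); it returns the transient state bandwidth H*p divided by the number of cycles needed to drain everything.

```python
import random
from collections import deque

def model3_transient_bw(p, m, H, Q=2, seed=1):
    """Monte Carlo sketch of Model III: a queue of length Q in each
    processor unit, leftmost-processor-first conflict resolution.
    Returns the transient bandwidth H*p / (cycles to finish)."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(p)]
    generated = [0] * p              # the COUNT of Figure 13
    served = cycles = 0
    while served < H * p:
        cycles += 1
        # each processor generates a new request if it still has some
        # left to issue and its queue is not full
        for i in range(p):
            if generated[i] < H and len(queues[i]) < Q:
                queues[i].append(rng.randrange(m))
                generated[i] += 1
        # leftmost-first: scan processors left to right and grant each
        # module to the first queue head that asks for it
        granted = set()
        for i in range(p):
            if queues[i] and queues[i][0] not in granted:
                granted.add(queues[i].popleft())
                served += 1
    return H * p / cycles
```

For p = m, increasing H should push the result toward the steady state value, reproducing the shape of Figure 14.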
For other ratios of p and m, the shapes of the curves remain the same. The transient state bandwidth is lower than the steady state bandwidth: because every processor will only generate H requests, as soon as some of the processors stop generating requests the bandwidth of the whole system will start going down. After all processors have stopped, there are still some requests left in the queues, and a few more cycles are needed to drain the queues. When we average the whole thing, this fringe effect at the end will lower the resulting bandwidth.

Figure 15 shows the histograms for four different H values. B is the number of requests that are satisfied in a certain cycle, or we may call it the "instantaneous bandwidth". From this figure, we can get a rough idea of how the requests are satisfied. The average bandwidths are marked by X's. If we connect these points together, we will get a curve similar to those shown in Figure 14. So Figure 15 also shows the formation of Figure 14. As H gets larger, the bandwidth approaches the steady state value since the fringe effect becomes smaller.

We will not show the normalized bandwidth (or utilization) curves here since they are proportional to the bandwidth curves. Better bandwidth, of course, means better utilization.

One interesting observation about Figure 15 is that the area under the histogram is actually equal to H*p, so the shape of the curve will influence the transient state bandwidth. The tail part of the curve is indeed the reason that the transient bandwidth is less than the steady state bandwidth. Thus, if we can manage the queues so that the histogram becomes more rectangular, or the tail part becomes smaller, then we will get better bandwidth.

We have done three experiments, each using a different conflict resolution algorithm, and they showed very interesting results. The first algorithm is always to choose the request from the leftmost processor if several requests refer to the same module.
The second one is to choose a request randomly. The third one is to choose the one generated by the least used processor, i.e. the processor with the smallest COUNT. The results are shown in Figure 16. From Figure 16, we can see the first algorithm is the worst; the second one has a 15% improvement over the first one, and the third one is the best, with a 25% improvement over the first one. So the fancier the conflict resolution circuit we use, the better performance we get and the faster the job can be done. Figures 14 and 15 use the first algorithm. We will get the same kind of results if we use the other algorithms.

The design of a conflict resolution circuit is not a trivial problem. The first one is relatively easy to implement, and we show an example in Chapter 6. If we make some modifications, we will get a random selection circuit. However, the conflict resolution circuit for the last scheme is prohibitive, since it needs a lot of sorting or comparison circuits. So our Model III provides a designer the freedom to choose his own conflict resolution circuit according to the trade-off between cost and performance.

[Figure 14. Transient State Bandwidth Curves of Model III when m = p]

[Figure 15. Formation of a Transient State Bandwidth Curve (m = p = 16)]

[Figure 16. Histograms for Three Different Resolution Algorithms (m = p = 16)]

[Figure 17. Bandwidth Curves of Model III]

Figure 17 shows the steady state bandwidth B^ss_av. for different ratios of m and p, together with Ravi's result. The reason we plot bandwidth against p is to show that when we fix p and increase m, the bandwidth will increase and the curve will move upward. Besides, the perfect bandwidth is a slope-1 line in this graph. The B^ss_av. curves show linearity with respect to fixed r.
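The three resolution algorithms just described can be compared with a variant of the same simulation, parameterized by the tie-breaking policy. This is our own sketch, with Q = 1 (which, as noted below, costs nothing in bandwidth):

```python
import random

def finish_time(p, m, H, policy, seed=7):
    """Cycles for p single-register queues (Q = 1) to push H requests
    each through m modules under a given conflict-resolution policy.
    policy(cands, count, rng) picks one processor index per module."""
    rng = random.Random(seed)
    heads = [None] * p           # the request at the head of each queue
    count = [0] * p              # requests generated so far per processor
    served = cycles = 0
    while served < H * p:
        cycles += 1
        for i in range(p):
            if heads[i] is None and count[i] < H:
                heads[i] = rng.randrange(m)
                count[i] += 1
        # group the queue heads by requested module
        per_module = {}
        for i in range(p):
            if heads[i] is not None:
                per_module.setdefault(heads[i], []).append(i)
        # each module honors exactly one candidate
        for module, cands in per_module.items():
            winner = policy(cands, count, rng)
            heads[winner] = None
            served += 1
    return cycles

leftmost    = lambda cands, count, rng: cands[0]
random_pick = lambda cands, count, rng: rng.choice(cands)
least_used  = lambda cands, count, rng: min(cands, key=lambda i: count[i])
```

In runs such as `finish_time(16, 16, 40, least_used)` versus `finish_time(16, 16, 40, leftmost)`, the least-used-first policy tends to finish sooner, in the spirit of Figure 16, though the exact margin depends on the generator and seed.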
In fact, the transient state bandwidth curves are also linear when r is fixed. Here we show two such cases in this graph, namely r = 1 with H = 5 and r = 1 with H approaching infinity. We can see that the steady state bandwidths of our Model III are essentially the same as Ravi's.

Let us explain the reason with two cases. Suppose we have eight processors and eight memory modules, and in the last cycle three requests were blocked and remained in the queues, as shown in Figure 18(a). Suppose the newly generated requests are those shown in Figure 18(b); then Ravi's model will give you bandwidth 7 and our model will only give you 5. However, if the new requests are those in Figure 18(c), then our model will give you 7 but Ravi's only gives you 5. These two cases are equally likely. If a large number of trials have been collected, or the machine runs for a long time, then the outcomes that favor us should be almost equal to those that favor Ravi. However, when a multiple conflict (a number occurring several times) occurs, our model needs several cycles to get rid of it, but Ravi's model just throws the extra requests away. That is probably the reason why our steady state bandwidth is a little worse than his.

[Figure 18]

Of course, we do not always increase both m and p at the same time. What happens if we hold one fixed and increase the other one? Figure 19 shows B^ss_av. versus m when p is fixed. The curve is very similar to Figure 4. That means, in real cases, bandwidth should grow as m(1 - e^(-r)) when the number and speed of the processors are fixed. On the other hand, when m is fixed, the B^ss_av. versus p curve will have a similar shape.

The other parameter of this model is the queue length Q. However, both the transient state bandwidth and the steady state bandwidth show that Q really does not make any difference. The reason is quite simple: if the request at the head of the queue is blocked, then the flow of requests is clogged.
Generating a new request to pile up in the queue is the same as blocking the processor and generating the request later. So it is wasteful to build a long queue; one or two registers in the processor unit can already do a good job. This is why we did not put the parameter Q in any of the Model III diagrams. However, the use of longer queues allows us to do address look-ahead, and the contents of the queues can also be used as a decision factor in conflict resolution.

One more interesting thing is that our Model I is essentially a special case of Model III with H = 1. So Model I is the lower bound of Model III. Thus the bandwidth of Model III is bounded above by Ravi's model and below by our Model I.

As we said at the beginning of this chapter, this model is very useful for MIMD machines. The reason is that some processors might generate requests faster than the others. This asynchronism indeed ties our model to MIMD machines and some similar environments. Actually, if we add some extra control, Model III can also be used in SIMD machines, since Model I is a special case of Model III.

[Figure 19. Steady State Bandwidth Curve when p is Fixed]

4.5 Model IV: Queueing in the Memory

Our fourth interleaved memory model is very similar to the previous one except that the queues are built into the memory modules instead of the processor units. The logic structure is shown in Figure 20, which has p processors and m memory modules, and each module contains a queue of length Q.

At the beginning of every cycle, each processor will generate a new request or regenerate the request which was blocked in the last cycle. Then all p requests will be sent down to a conflict resolution box. If more than one processor references the same memory module, the conflict resolution box will decide in what order these requests will go into the queue of their destination module. Then they will be gated into the queue, one per clock pulse.
So the clock rate should be carefully chosen in order to secure proper operation. If the queue does not have enough room to contain all the incoming requests, the conflict resolution circuit will block the unlucky ones at the end of the line. When the memory module finishes the request of the last cycle, it will send out a completion signal which pushes the queue down one place, and the first request in the queue will enter the address register of the memory module. Then the memory module can begin another cycle, and the whole thing starts again.

[Figure 20. Block Diagram of Model IV]

Again, we used a random number generator to simulate this model, with all parameters having the same meanings as in Model III. Figure 21 shows the transient state bandwidth B^T_av. versus H for different m values and m = p. The dotted lines are for queue length 1 and the solid lines for queue length 2. As you can see, in this model the longer queue will give you better results. For other ratios of m and p, the curves have the same shape.

[Figure 21. Transient State Bandwidth Curves of Model IV for Two Different Queue Lengths]

[Figure 22. Steady State Bandwidth Curves of Model IV for Two Different Ratios of m and p]

The histograms, i.e. the B versus T curves, we have collected show that we can also improve our Model IV by using a better conflict resolution algorithm. When using the least-used-processor-first algorithm, we got almost a 27% improvement over the leftmost-processor-first algorithm. The improvement in the histogram is the same as we have shown in Figure 16. In Figure 21, we use the worst algorithm.

In Figure 22, we show the steady state bandwidth B^ss_av. for different ratios of m and p, again with Ravi's result.
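Our own sketch of Model IV's discipline follows, simplified so that a request may enter a module queue and be served within the same cycle, and with Python's generator in place of RANDU:

```python
import random
from collections import deque

def model4_transient_bw(p, m, H, Q=1, seed=3):
    """Monte Carlo sketch of Model IV: a queue of length Q in front of
    each memory module.  A processor retries a blocked request until it
    fits; every module completes one queued request per cycle."""
    rng = random.Random(seed)
    mem_q = [deque() for _ in range(m)]
    pending = [None] * p         # a request blocked in the last cycle
    count = [0] * p
    served = cycles = 0
    while served < H * p:
        cycles += 1
        # each processor re-offers its blocked request or generates a new one
        for i in range(p):
            if pending[i] is None and count[i] < H:
                pending[i] = rng.randrange(m)
                count[i] += 1
            r = pending[i]
            if r is not None and len(mem_q[r]) < Q:
                mem_q[r].append(r)   # accepted into the module queue
                pending[i] = None    # otherwise it stays blocked
        # every memory module completes one request per cycle
        for q in mem_q:
            if q:
                q.popleft()
                served += 1
    return H * p / cycles
```

Runs with Q = 2 can be compared against Q = 1 to reproduce the dotted-versus-solid separation of Figure 21; the precise figures depend on the generator.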
As you can see, even when the queue length is 1, our Model IV is better than Ravi's. Consequently, Model IV is better than Model III. The reason is very simple: no matter what the incoming requests are, the requests left in the queues can only enhance the bandwidth. So it is wise to build the queue in the memory module.

Figure 22 also reveals one thing: when we increase the queue length, the bandwidth curve moves upward and approaches the perfect bandwidth. One experiment shows that for infinite queue length the bandwidth curve moves very close to the perfect bandwidth. So the maximum possible bandwidth is p. Just like Model III, this model is also bounded below by Model I, since Model I also corresponds to the special case H = 1 of this model. So the bandwidth may swing from the bandwidth of Model I up to the perfect bandwidth. Both Q and H can increase the bandwidth.

However, this model has a very big drawback which could make it useless, viz. the order in which the requests are satisfied is unpredictable. The time at which a request is served depends on the place it enters the queue. Hence a request might get satisfied earlier than its predecessors. If the data dependency is important, this model might not work properly. One solution to this problem is to apply this model to a Tomasulo type machine. There, all the data dependency relations are handled in the processor by using Tomasulo's algorithm, so there is no data dependency between requests in the memory, and the memory serving order is unimportant. Then our Model IV will give a very good performance. However, this does substantially complicate the processor.

Besides, for read operations, this model has a big problem that no other one has: more than one fetched datum might go back to the same processor. This causes a two-way conflict problem which might seriously degrade the system performance. So this model unfortunately is not very useful.
The performance we showed did not take these problems into consideration, since we only wanted to show the potential of this model.

When Q = 1, this model still might be used in an MIMD machine. This is due to the fact that instruction addresses and data addresses are interleaved in real programs. In the worst case, two consecutive requests will be served at the same time, but this still preserves the proper order of execution. However, we then need a very complicated switching network to switch two pieces of information back at the same time in order not to degrade the performance. So the cost will be very high.

When we delete the queues from our Model IV, or equivalently make Q = 0, this model essentially becomes Model III. Our simulation results also show that when Q = 0 the steady state bandwidth is indeed the same as Model III's.

5. COMPARISON OF ALL MODELS

5.1 Comparison of Performances

In order to compare the performances, we summarize in one graph the bandwidth curves of all the models we have discussed. Figure 23 shows the bandwidth versus p curves when m = p. Of course, the perfect bandwidth (the slope = 1 line) is the best we can ever achieve; no model will pass this line. Also, nobody will fall below the bandwidth = 1 line, which is the worst case for any model. All the performances lie between these two lines.

Line (a) is the analytical result of Model I when m = p. Analytical results can be viewed as the steady state bandwidth. As we explained in the last chapter, this line is the lower bound for both Model III and Model IV, since it corresponds to the special case (or the worst case) of H = 1. Line (e) is the analytical result of Ravi's model when r = 1, B = (1 - 1/e)*p, which is the upper bound for Model III. So the performance of our Model III swings from line (a) to line (e). If you recall, the queue length of this model does not influence the bandwidth.
So when we increase H, the number of requests each processor can generate, the bandwidth curve will move from line (a) to line (b), and then to line (d) when H becomes very large. Line (b) is the transient state bandwidth when H is only 5, and line (d) is the steady state bandwidth of Model III.

Line (f) is the steady state bandwidth of Model IV with queue length 1. This line is above Ravi's line. When we increase the queue length, the bandwidth curve keeps going up until reaching line (g), which corresponds to infinite queue length. So the bandwidth of Model IV ranges from line (a) to line (g), depending on H and Q.

[Figure 23. Bandwidth Curves of All Models: (a) I; (b) III, H = 5; (c) IV, H = 5; (d) III, H = infinity; (e) Ravi; (f) IV, H = infinity, Q = 1; (g) IV, H = infinity, Q = infinity; plus Hellerman's curve, the perfect bandwidth (slope = 1) line, and the bandwidth = 1 line]

From this diagram, you can get a rough idea of how good these models can be and what happens if some of the parameters are changed. For other ratios of m and p, the relative positions of these curves remain the same. When r gets smaller (more memory modules than processors), all the curves move up, and when r gets larger, all the curves move down.

The most important thing about this diagram is that all the bandwidth curves, steady state or transient state, are linear with respect to p when we hold the ratio of m and p fixed. This is also true when we plot them against m. This contradicts the often-quoted square root result accepted by people for a long time. This linearity has important implications for the design of multiprocessor machines. In Figure 23, we also plot Hellerman's curve. As you can see, all the other models obey the linearity principle instead of the square root principle.

5.2 Comparison of Costs

Another important thing we should consider is the cost to implement a model. When we are choosing a model, we must consider the trade-off between cost and performance.

Hellerman's model only needs a very simple control, since it stops at the first conflict. So the cost of this model will be very cheap. But this way of handling conflicts is indeed the reason for its bad performance.

Burnett, Coffman and Snowdon's model uses one more queue and a slightly more complicated control than Hellerman's model. So the cost will be a little bit higher. However, the performance has been greatly improved due to queueing the conflicts. As we said in Chapter 3, these two models can only be used in a Tomasulo type machine where no data dependency problem is involved in handling the requests. This needs very complicated processors, and hence
Hellerman's model only needs a very simple control since it stops at the first conflict. So the cost of this model will be very cheap. But this way of handling conflicts is indeed the reason of bad performance. Burnett, Coffman and Snowdon's model uses one more queue and a slightly more complicated control than Hellerman's model. So the cost will be a little bit higher. However, the performance has been greatly improved due to queueing the conflicts. As we said in Chapter 3» these two models can only be used in a Tomasulo type machine where no data dependency problem is involved in handling the requests. This needs very complicated processors and hence 57 the total cost of the whole system will be increased accordingly. In Table II, we only consider the cost for the control portion. Since Ravi did not consider how to handle the conflicts, his model becomes unrealistic. So we are not interested in implementing this model. However, there are a lot of similarities between this model and Model III, thus we may think our model III is a realistic realization of Ravi's model. Although our model I displays the worst bandwidth and utilization among Models I, III and IV, the implementation of this model is the easiest and cheapest. Since there is no queue involved in this model, no extra hardware register and control circuit are needed. The only thing we need here is to build a relatively simple conflict resolution circuit. Since all p requests must be satisfied before new requests can be generated, there is no need to build a fancy conflict resolution circuit. The simplest circuit proposed in the next chapter can be used for this purpose. As we said before, Model II has the worst performance and the implementation is not cheaper. Besides, all Model II can do Model I can do too. So Model II should not be considered. Model III is a fairly good model in both performance and cost. 
Most processors already have several hardware registers which can be used as an address queue after adding some control circuitry. So this model seems to be the most plausible one. As we mentioned before, the bandwidth can be improved further by using a fancier conflict resolution circuit.

From a bandwidth point of view, Model IV is the best model. But several reasons prevent us from choosing it. First, it is very expensive to build queues in the memory, because of the extra registers and control circuitry. Second, it is very complicated and expensive to build the conflict resolution box and the switching network. Third, the usefulness of this model is very limited when Q ≥ 2. So we do not favor this model despite its excellent performance.

Table II below shows the performances and the costs of all models. As you can see, Model III might be the best choice in the sense of cost effectiveness and usefulness.

Model                          Data Dependency Type  Performance  Cost            Usefulness
Hellerman                      D                     Very bad     Very cheap      Tomasulo-type machines
Burnett, Coffman and Snowdon   D                     Good         Cheap           Tomasulo-type machines
Ravi                           C                     Good
I                              B                     Bad          Cheap           SIMD machines
II                             B                     Very bad     Expensive       SIMD machines
III                            C                     Good         Cheap           1. MIMD machines
                                                                                  2. SIMD machines with extra control
IV                             D                     Very good    Very expensive  1. MIMD machines when Q = 1
                                                                                  2. SIMD machines when Q > 1 for some applications
                                                                                  3. Tomasulo-type machines

Table II. Comparisons of All Models

5.3 Data Dependency Types and Usefulness

In Chapter 3, we defined four data dependency types and classified other people's models accordingly. In the last chapter, we explained why our Models I and II belong to type B, Model III belongs to type C, and Model IV belongs to type D. We repeat the classification in the second column of Table II.

The data dependency type is the major factor that decides the usefulness of a model. Just as we said before, no model is universally good for every kind of machine.
If we want to use a model to its full power, we must consider the data dependency type it fits best. So before we choose a model, we should decide what type of data dependency plays the major role in our machine. In Table II, we also list the machine type for each model; this essentially depends on the second column.

6. LOGIC DESIGN

6.1 The Problems

All our models shown in Chapter 4 have four basic components: processor units, conflict resolution box, switching network, and memory modules. We have shown their structures in Figures 6, 9, 12 and 20. The conflict resolution box and the switching network are the most important parts; together they control the operation of the whole system. Two interesting questions about these logic structures are: how the conflict resolution box handles the conflicts, and how the switching network uses the resolution result as control information to switch the requests. Of course, the complexities depend on the model.

As we said before, there are many tricks you can play in conflict resolution. We mentioned three strategies in Chapter 4, and we will show the designs of two of them in the next section. Any one of these conflict resolution circuits can be fitted into any model.

6.2 Circuit Design of Conflict Resolution Box

Since every processor might reference any memory module independently, we must use m identical pieces of circuit to cover all the possible cases. Each piece is associated with one memory module and handles the conflicts occurring at that module. In the worst case, all p requests will reference the same memory module, so every piece of conflict resolution circuit should have p inputs. There should also be p outputs, where each output tells the corresponding processor whether it will be honored by this memory module or not. At any time, at most one output is active.
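The fan-out of requests to the m circuit pieces can be sketched in software. In the sketch below (Python; the routine and the address-mod-m module mapping are illustrative assumptions of ours, not the thesis's design), each of the m circuits receives a p-bit input vector:

```python
def dispatch(addresses, m):
    """Group p requests by memory module: the low-order part of each
    address (here modeled as address mod m) selects the module, giving
    each of the m conflict resolution circuits its p-bit input vector."""
    p = len(addresses)
    inputs = [[0] * p for _ in range(m)]
    for proc, addr in enumerate(addresses):
        inputs[addr % m][proc] = 1
    return inputs

# 3 processors, 4 modules: processors 0 and 2 collide on module 1.
print(dispatch([5, 2, 9], 4))  # [[0, 0, 0], [1, 0, 1], [0, 1, 0], [0, 0, 0]]
```

Module 1's input vector [1, 0, 1] is exactly the worst-case situation the circuit must arbitrate: more than one active input, at most one output allowed.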
So the problem is to design a combinational circuit with p inputs and p outputs such that when some of the inputs are active, exactly one of their corresponding outputs will be active. By active, we mean that a certain kind of signal occurs at the input or output port; it might be a level signal or a pulse signal. Of course, the solution is not unique: a different conflict resolution algorithm yields a different circuit design. We show two different designs below, one for the leftmost-first algorithm and the other for the random-selection algorithm.

6.2.1 Leftmost-first Circuit

Figure 24 shows a conflict resolution circuit which signals a 1 at the output corresponding to the leftmost input that is active. At the beginning of a cycle, those requests that want to reference this module send a signal to the corresponding input port. In Figure 24, we also show a possible way of doing this, i.e., attaching a decoder to each address register to decode the first few bits, which represent the memory module number. A level signal is then sent to the proper conflict resolution circuit.

When this circuit starts working, a sense signal is sent into the leftmost AND gate by the control unit. If I_j is the leftmost input port that is active, or has an input 1, then S_1 = S_2 = ... = S_j = 1 and O_1 = O_2 = ... = O_{j-1} = 0, due to I_1 = I_2 = ... = I_{j-1} = 0. Hence O_j will be 1, since both I_j and S_j are 1. But I_j forces all of S_{j+1}, ..., S_p to be 0, which in turn forces all of O_{j+1}, ..., O_p to be 0.

Of course, this circuit works. But just like a ripple-carry adder, it needs p gate delays to propagate the signal through the circuit in the worst case. So it is not very attractive, although it takes only 3p gates per memory module, or O(mp) gates in total. However, one can easily derive the following set of equations for the outputs:

O_1 = I_1
O_2 = Ī_1 I_2
...
O_p = Ī_1 Ī_2 ... Ī_{p-1} I_p

So we can build an AND-gate tree to speed up this circuit.
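The behavior of these output equations is easy to check in software. The following sketch (Python; purely illustrative, not part of the hardware design) evaluates O_j = Ī_1 Ī_2 ... Ī_{j-1} I_j for all j by carrying a "blocked" flag that plays the role of the ripple signal:

```python
def leftmost_first(inputs):
    """Conflict resolution by the leftmost-first rule: at most one
    output is 1, the one for the leftmost active input, computed
    from the equations O_j = ~I_1 & ... & ~I_{j-1} & I_j."""
    outputs = []
    blocked = False          # plays the role of the ripple signal
    for i in inputs:
        outputs.append(1 if (i and not blocked) else 0)
        blocked = blocked or bool(i)
    return outputs

print(leftmost_first([0, 1, 0, 1, 1]))  # [0, 1, 0, 0, 0]
```

Only processor 2 (the leftmost active input) is granted; processors 4 and 5 must try again in a later cycle.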
The tree version takes O(p) gates per memory module but has only log_2 p gate delays. The implementation is easy and we omit it here.

6.2.2 Random-selection Circuit

As we mentioned in Chapter 4, the leftmost-first algorithm always gives the worst result. One algorithm that shows the best improvement is the least-used-first algorithm. But in order to find out which processor is the least used, we need many sorting or comparison circuits. This would cost a tremendous amount, so we do not favor this scheme. The one that performs only slightly worse than the least-used-first algorithm is the random-selection algorithm. Its implementation is very similar to the circuit in Figure 24. Figure 25 shows the circuit that uses random selection.

[Figure 24. Conflict Resolution Circuit Using the Leftmost-first Algorithm: p input/output pairs (I_1, O_1) through (I_p, O_p) with the ripple chain S_1, ..., S_p, a sense input, and a decoder attached to each processor unit's address register.]

[Figure 25. Conflict Resolution Circuit Using the Random-selection Algorithm: the same input/output structure, with the sense input replaced by a ring counter driven by a change signal.]

Here we just replace the sense input of Figure 24 by a p-flip-flop ring counter. Only one output of this ring counter is 1 at any time. The outputs of the ring counter are OR-ed with the S_i's through control gates. When the control signal comes, only one output of those OR gates will be 1, according to the content of the ring counter. Then the closest input with a 1 will have a 1 at its output port, and the rest will all be 0. Equivalently, we just move the sense input to a point decided by the content of the ring counter. The content can be changed arbitrarily, so we have a random-selection circuit. The ring counter is just an end-around shift register. If we shift the counter one place per cycle, we have something very similar to the polling scheme for handling interrupts. This implementation takes 5p gates plus a ring counter per memory module. The gate delay is p.
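The ring-counter scheme amounts to a rotating priority: scanning circularly from the position marked by the counter, the first active request wins. A behavioral sketch (Python; the function name and the sense_pos parameter are our own, not the thesis's notation):

```python
def random_selection(inputs, sense_pos):
    """Rotating-priority conflict resolution: scan the inputs
    circularly, starting at the position marked by the ring
    counter (sense_pos), and grant the first active request."""
    p = len(inputs)
    outputs = [0] * p
    for step in range(p):
        j = (sense_pos + step) % p
        if inputs[j]:
            outputs[j] = 1
            break
    return outputs

# With the sense point at position 2, request 3 beats request 0.
print(random_selection([1, 0, 0, 1, 0], sense_pos=2))  # [0, 0, 0, 1, 0]
```

Advancing sense_pos by one each cycle reproduces the polling behavior mentioned above; loading it with a random value gives the random-selection scheme.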
We can also use the tree connection to speed the random-selection circuit up to log_2 p gate delays. However, the output equations are more complicated than those of the leftmost-first circuit. Here we show only the output equation for O_1:

O_1 = I_1 S_1 + I_1 Ī_2 Ī_3 ... Ī_p S_2 + I_1 Ī_3 ... Ī_p S_3 + ... + I_1 Ī_p S_p

The other output equations are similar, except that the subscripts are permuted cyclically. Obviously, this equation takes O(p) gates, and hence the whole circuit takes O(p^2) gates.

Figures 24 and 25 only show the conflict resolution circuit for one memory module. As we said earlier, a conflict resolution box takes m identical copies of this circuit, so the gate count is m times what we gave above. Table III summarizes the gate delays and the total gate counts for the four possible designs.

Algorithm          Connection Type  Gate Delays  Total Gate Count
Leftmost-first     Ripple           O(p)         O(mp)
Leftmost-first     Tree             O(log_2 p)   O(mp)
Random-selection   Ripple           O(p)         O(mp) plus m ring counters
Random-selection   Tree             O(log_2 p)   O(mp^2) plus m ring counters

Table III. Gate Delays and Total Gate Counts for the Four Conflict Resolution Box Designs

6.3 Switching Network Design

The outputs of the conflict resolution box are not only used to signal the processor units whether their requests have been accepted or not, but also to give the switching network control information for switching requests to the memory modules. We have just described the conflict resolution circuit in detail, so here we treat it as a well-known black box in order to simplify the picture.

The switching problem in our system is not as simple as we might think. Although we could use any non-blocking alignment network, such as a crossbar network, some control problems make them inadequate for our systems. In this section, we show the design of a switching network using another method, namely, the jam transfer method.
Jam transfer is a very simple and easily controlled method, although it takes many gates. The typical configuration of a jam transfer is shown in Figure 26. When the control signal C is given, the content of flip-flop A is gated into flip-flop B. In our system, this control signal is provided by the output of the conflict resolution circuit. The OR gates in front of flip-flop B allow different sources to be connected to B.

[Figure 26. Jam Transfer: flip-flop A gated through AND and OR gates into flip-flop B under control signal C.]

The implementations of the switching network for all our models are very similar, so we only show the system layout of Model I, in Figure 27. The other models have the same basic connections but with fancier control circuits, so we only describe the differences in words.

In order to simplify the picture, we show only the connection between two processor units and one memory module. The rest can be connected in the same way. Again, single lines in the figure indicate individual data or control lines, and double lines indicate sets of lines. The DR's represent data registers, which hold the data transferred from or to the memory modules. The PC's are the address registers that hold the addresses. The first part of a PC contains the memory module number and is connected to a decoder that sends a signal to the proper conflict resolution circuit. Only the second part is sent down to the memory. In order to provide the returning processor unit number, we attach a tag to each address. AB is a bit which tells whether the request in the PC has been accepted by the memory or not. All rectangles in Figure 27 actually represent arrays of gates.

At the beginning of every cycle, each processor generates an address in its PC. The decoder then decodes the first part of the PC and sends a level signal to the proper conflict resolution circuit. If this processor wins the contest, the second part of the PC and the TAG are gated into the memory. Then the memory cycle begins.
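The jam-transfer step that moves an address between registers can be modeled in a few lines. In this sketch (Python, behavioral only; the class and method names are our own), clock() stands for one pass through the gate array in front of flip-flop B, with the control signal C supplied by the conflict resolution circuit:

```python
class JamTransfer:
    """Behavioral model of the jam-transfer path of Figure 26:
    when the control signal C is 1, register B is loaded from A;
    otherwise B keeps its old contents."""
    def __init__(self):
        self.b = 0
    def clock(self, a, c):
        if c:                 # the AND/OR gate array in front of B
            self.b = a
        return self.b

mar = JamTransfer()
mar.clock(a=0x3F2, c=1)   # request accepted: address jammed into MAR
mar.clock(a=0x7A1, c=0)   # control withheld: MAR unchanged
print(hex(mar.b))         # 0x3f2
```

The second call models a cycle in which the processor loses the contest: without C, the new address never reaches the memory's register.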
The BUSY bit is then set to disable any change to the MAR, and at the same time the AB bit is set to block the address in the next cycle. At the end of the memory cycle, the memory module generates a completion signal which resets the BUSY bit, allowing a new address to be gated into the MAR. This completion signal is also used to gate the data back to the processor if the operation is a read.

[Figure 27. System Layout of Model I: two processor units, each with DR, TAG, PC and AB registers and a decoder, connected through the conflict resolution circuit and jam-transfer gates to a memory module with its MAR and BUSY bit.]

APPENDIX

A. Expansion of Burnett and Coffman's Recursive Equation

We will need the binomial identity

C(n,i) - C(n,1) C(n-1,i-1) + C(n,2) C(n-2,i-2) - ... + (-1)^i C(n,i) C(n-i,0) = 0    (A.2)

Start from equation (A.1) and expand the first term in the summation. The two leading terms of the resulting summations can be combined; applying equation (A.2) with n = k-1 and i = 2 reduces them to C(k-1,2) C^0(0,k-2). Repeating the substitution once, we have:

C^0(0,k) = (m-1)_{k-1} - C(k-1,1)(m-2)_{k-2} + C(k-1,2) C^0(0,k-2) - ...

All three remaining summations have k-3 terms, and their leading terms are all of the same form a * C^0(0,k-3); by equation (A.2) with n = k-1 and i = 3, their coefficients sum to -C(k-1,3). Substituting for C^0(0,k-3) by equation (A.1) then generates -C(k-1,3)(m-4)_{k-4} and another summation. If we keep going, we produce (-1)^j C(k-1,j)(m-j-1)_{k-j-1} at the j-th step. After k-1 steps, we get Stone's result:

C^0(0,k) = sum_{j=0}^{k-1} (-1)^j C(k-1,j) (m-j-1)_{k-j-1}

where (x)_n denotes the falling factorial x(x-1)...(x-n+1).

B. The Proof of S_j

Although Stone gave the right solution for S_j, he did not actually prove that it is correct or show how he obtained it. We must point out that the proof of S_j is not trivial. Here we show one way to derive this number. Again, S_j is the total number of sequences of length k having at least j α-transitions. First, let us relax the restriction on the first element of a sequence; that is, we allow it to be any number from 0 to m-1.
The total number of transitions in a k-length sequence is k-1. Of course, there are C(k-1, j) ways to select j positions for the α-transitions. For a fixed selection, without loss of generality, we can assume that these j α-transitions are distributed into w disjoint groups, with group i containing u_i α-transitions; hence u_1 + u_2 + ... + u_w = j. By a group we mean those α-transitions that occur consecutively along the line. For example, let k = 10 and j = 6; then the selection βαααβαβαα has three groups (w = 3) with u_1 = 3, u_2 = 1, and u_3 = 2.

Originally, we have m possible α-transitions to choose from, namely 01, 12, ..., (m-2)(m-1), (m-1)0. We can think of them as a circle of m numbers. Obviously, there are m ways to select a sequence of u_1 + 1 consecutive numbers (u_1 α-transitions) to be the first group. After that, we have m - u_1 - 2 α-transitions left to be used, since no matter how we choose the first group, the two α-transitions at its ends cannot be chosen by any other group. You can easily convince yourself of this from our definition of a group. Now the circle has been broken, and we can think of the remaining α-transitions as lying on a line. Our problem then is to choose w-1 groups from this line, no two adjacent to each other, and place them back into the sequence.

Actually, the length of a group is not important, since the first component of a group decides the fate of the whole group. So we can think of our problem as finding the number of ways of selecting w-1 objects, no two consecutive, from m - u_1 - 2 - sum_{i=2}^{w} (u_i - 1) = m - j + w - 3 objects arrayed in a line, and then permuting them in the w-1 positions. The reason we subtract sum_{i=2}^{w} (u_i - 1) is to represent each group by its first element.

Finding the number of such selections is actually a famous problem, namely the Kaplansky lemma, and the solution is:

C((m-j+w-3) - (w-1) + 1, w-1) = C(m-j-1, w-1)

Then we have (w-1)! ways to place them into the w-1 positions.
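Kaplansky's lemma, that there are C(n-k+1, k) ways to choose k objects, no two consecutive, from n objects arrayed in a line, can be confirmed by brute force for small cases. A quick check (Python; the helper name is ours):

```python
from itertools import combinations
from math import comb

def non_consecutive_selections(n, k):
    """Count k-subsets of {0,...,n-1} with no two elements adjacent,
    by brute force, to check Kaplansky's formula C(n-k+1, k)."""
    return sum(
        1
        for c in combinations(range(n), k)
        if all(b - a >= 2 for a, b in zip(c, c[1:]))
    )

for n, k in [(7, 3), (10, 4), (6, 1)]:
    assert non_consecutive_selections(n, k) == comb(n - k + 1, k)
print("Kaplansky's lemma checked")
```

For example, with n = 7 and k = 3 both sides give C(5, 3) = 10.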
Since these w groups use w + j numbers, the number of ways to pad the remaining positions is (m-w-j)(m-w-j-1) ... (m-k+1). Combining all these results, we have:

S_j = C(k-1, j) m C(m-j-1, w-1) (w-1)! (m-w-j)(m-w-j-1) ... (m-k+1)
    = C(k-1, j) m (m-j-1)_{k-j-1}

This is the same as Stone's S_j except for a factor m, which comes from the fact that we allowed the first element of a sequence to be any number from 0 to m-1. If the first element must be 0, S_j will be 1/m of the above result, by symmetry.

C. Simplification of Ravi's Bandwidth Equation

Ravi's bandwidth equation (equation (2.8)) is repeated here:

B_av. = [ sum_{k=1}^{t} k k! S(p,k) C(m,k) ] / m^p,   where t = min(m,p)    (C.1)

This equation can be simplified to a simple closed form in m and p only. Since the upper limit t has two possible values, we split the derivation into two cases: one for t = m and one for t = p. Amazingly, the results of the two cases are the same.

First let t = p, i.e., p ≤ m. Equation (C.1) becomes

B_av. = [ sum_{k=1}^{p} k k! S(p,k) C(m,k) ] / m^p

Let us take a look at the numerator. Since

k k! C(m,k) = k m(m-1)(m-2) ... (m-k+1)
            = (m - (m-k)) m(m-1)(m-2) ... (m-k+1)
            = m [ m(m-1) ... (m-k+1) ] - m(m-1) ... (m-k+1)(m-k)
            = m k! C(m,k) - m k! C(m-1,k)

the numerator becomes:

sum_{k=1}^{p} [ m k! C(m,k) - m k! C(m-1,k) ] S(p,k)
  = m sum_{k=1}^{p} k! C(m,k) S(p,k) - m sum_{k=1}^{p} k! C(m-1,k) S(p,k)

By the definition of the Stirling numbers of the second kind,

m^p = sum_{k=1}^{p} k! C(m,k) S(p,k)

the above expression becomes m^{p+1} - m(m-1)^p. Hence, when p ≤ m, equation (C.1) simplifies to:

B_av. = [ m^{p+1} - m(m-1)^p ] / m^p = m (1 - (1 - 1/m)^p)

Now let t = m, i.e., p > m. We cannot play the same trick again, since the upper limit is no longer p, so we need another way. One thing we can do is substitute k! S(p,k) = sum_{i=0}^{k} (-1)^i C(k,i)(k-i)^p; then we get

B_av. = [ sum_{k=1}^{m} k ( sum_{i=0}^{k} (-1)^i C(k,i)(k-i)^p ) C(m,k) ] / m^p

The upper limit of the inner summation can be changed to k-1, since when i = k, (k-i)^p becomes 0.
Now, if we expand these two summations and place the m(m+1)/2 terms in an upper triangle (for k = 1 we get one term, for k = 2 two terms, etc.), then all terms on a diagonal have the same factor (m-i)^p and the coefficients change regularly. So, if we start from the upper right corner and sum diagonally, we get expression (C.2) for the numerator:

m^{p+1} - m(m-1)^p + sum_{i=2}^{m-1} a_i (m-i)^p    (C.2)

where a_i = sum_{j=0}^{i} (-1)^j (m-i+j) C(m, m-i+j) C(m-i+j, j). It is very easy to prove that a_i is equal to 0 for all such i:

a_i = sum_{j=0}^{i} (-1)^j (m-i+j) C(m, m-i+j) C(m-i+j, j)
    = sum_{j=0}^{i} (-1)^j (m-i+j) C(m, i) C(i, j)
    = C(m, i) [ m sum_{j=0}^{i} (-1)^j C(i,j) - i sum_{j=0}^{i} (-1)^j C(i,j) + sum_{j=0}^{i} (-1)^j j C(i,j) ]
    = 0

since sum_{j=0}^{i} (-1)^j C(i,j) = 0 and sum_{j=0}^{i} (-1)^j j C(i,j) = 0 for i ≥ 2. So only the first two terms in expression (C.2) survive, and the others become 0. Thus for p > m, B_av. is the same as for p ≤ m:

B_av. = m (1 - (1 - 1/m)^p)

D. Derivation of lim_{p→∞} (1 - 1/m)^p = 1/e^r

Since r = p/m, we have

(1 - 1/m)^p = (1 - r/p)^p = (p-r)^p / p^p = (p-r)^{p-r} (p-r)^r / p^p

By using Stirling's formula, n! ≈ n^n e^{-n} sqrt(2πn), we can change the above expression to:

≈ [ (p-r)! e^{p-r} / sqrt(2π(p-r)) ] (p-r)^r / [ p! e^p / sqrt(2πp) ]
= (1/e^r) [ (p-r)/p ] [ (p-r)/(p-1) ] ... [ (p-r)/(p-r+1) ] sqrt( p/(p-r) )

As p → ∞, all factors except the first go to 1. So the limit of (1 - 1/m)^p is 1/e^r.

E. Solution of f_n(m,s)

f_n(m,s) is the number of ways of putting n distinct objects into m distinct boxes with each box containing at most s objects. This is the coefficient of the t^n/n! term of the following generating function:

( 1 + t + t^2/2! + ... + t^s/s! )^m = sum_n f_n(m,s) t^n/n!

We can derive the following recurrence:

f_n(m,s) = m f_{n-1}(m,s) - m C(n-1, s) f_{n-1-s}(m-1, s)

For any set of n, m, s values, we can calculate f_n(m,s) by using the above equation and the following facts (with (m)_2 = m(m-1)):

f_n(m,s) = m^n                      if n ≤ s
f_{s+1}(m,s) = m^{s+1} - m
f_{s+2}(m,s) = m^{s+2} - m - (m)_2 (s+2)
f_{s+3}(m,s) = m^{s+3} - m - (m)_2 (s+3) - (m)_2 (m-1) C(s+3, 2)

F.
Solution of g_n(m,s)

g_n(m,s) can be regarded as the complement function of f_n(m,s). It is the number of ways of putting n distinct objects into m distinct boxes with each box containing at least s objects. The generating function is:

( e^t - 1 - t - t^2/2! - ... - t^{s-1}/(s-1)! )^m = sum_n g_n(m,s) t^n/n!

The recurrence for g_n(m,s) is:

g_n(m,s) = m g_{n-1}(m,s) + m C(n-1, s-1) g_{n-s}(m-1, s)

For s = 2, the boundary conditions are:

g_n(m,2) = 0              if n < 2m
g_{2m}(m,2) = (2m)!/2^m

G. IBM Random Number Generator

The random number generator we used in simulation is the IBM library routine RANDU. We show our version here:

      INTEGER FUNCTION RANDU ( ISEED, M )
      ISEED = ISEED * 65539
      IF ( ISEED .LT. 0 ) ISEED = ISEED + 1073741824 + 1073741824
      RANDU = ( ISEED * 0.4656613E-9 ) * M
      RETURN
      END

The initial value of ISEED is 637823409, which is suggested by [14].

LIST OF REFERENCES

[1] Flores, I., "Derivation of a Waiting-Time Factor for a Multiple-Bank Memory," Journal of the ACM, Vol. 11, No. 3, pp. 265-282, July 1964.

[2] Sisson, S. S. and M. J. Flynn, "Addressing Patterns and Memory Handling Algorithms," AFIPS Conference Proceedings, 1968 Fall Joint Computer Conference, Vol. 33, Part 2, pp. 957-967, 1968.

[3] Hellerman, H., Digital Computer System Principles, pp. 228-229, New York, McGraw-Hill, 1967.

[4] Knuth, D. and C. Rao, "An Activity in the Interleaved Memory System," Unpublished Paper, 1975.

[5] Burnett, G. J. and E. G. Coffman, Jr., "A Combinatorial Problem Related to Interleaved Memory Systems," Journal of the ACM, Vol. 20, No. 1, pp. 39-45, January 1973.

[6] Stone, H. S., "A Note on a Combinatorial Problem of Burnett and Coffman," Communications of the ACM, Vol. 17, No. 3, pp. 165-166, March 1974.

[7] Liu, C. L., Introduction to Combinatorial Mathematics, pp. 110-111, New York, McGraw-Hill, 1968.

[8] Chang, D. Y., "Another Note on the Combinatorial Problem of Burnett and Coffman's Model," Unpublished Memo, 1975.

[9] Burnett, G. J. and E. G.
Coffman, Jr., "A Study of Interleaved Memory Systems," AFIPS Conference Proceedings, 1970 Spring Joint Computer Conference, Vol. 36, pp. 467-474, 1970.

[10] Coffman, E. G., Jr., G. J. Burnett and R. A. Snowdon, "On the Performance of Interleaved Memories with Multiple-Word Bandwidth," IEEE Transactions on Computers, Vol. C-20, pp. 1570-1573, December 1971.

[11] Burnett, G. J. and E. G. Coffman, Jr., "Analysis of Interleaved Memory Systems Using Blockage Buffers," Communications of the ACM, Vol. 18, No. 2, pp. 91-95, February 1975.

[12] Ravi, C. V., "On the Bandwidth and Interference in Interleaved Memory Systems," IEEE Transactions on Computers, Vol. C-21, pp. 899-901, August 1972.

[13] Riordan, J., An Introduction to Combinatorial Analysis, New York, John Wiley and Sons, 1958.

[14] Richardson, B., "A Comparison of the IBM-SSP Random Number Generator with the Payne Feedback Shift Register Generator," CSO Document, University of Illinois at Urbana-Champaign.

[15] Budnik, P. and D. J. Kuck, "The Organization and Use of Parallel Memories," IEEE Transactions on Computers, Vol. C-20, pp. 1566-1569, December 1971.

[16] Lawrie, D. H., "Memory-Processor Connection Networks," (Ph.D. thesis) University of Illinois at Urbana-Champaign, Department of Computer Science Report No. 557, February 1973.
ABSTRACT

High-speed computer systems usually organize their storage into several modules, each of which can operate independently. Hence several modules can be accessed at the same time, which effectively increases the throughput and the computation speed of the system. In this report, we first describe several interleaved memory models designed and analyzed by other people; we repeat their results and also show some further improvements. We then present four new models and derive appropriate performance figures. The difference between the previous models and our new models is the way we handle the requests and conflicts. Some of the circuit design problems are also discussed briefly in this report.

Key Words: conflict resolution; data dependency; interleaved memory system; normalized bandwidth; queueing; steady-state bandwidth; transient-state bandwidth.