Report No. UIUCDCS-R-77-908

FURTHER RESULTS REGARDING MULTIPROCESSOR SYSTEMS

by
Donald Yi-Chung Chang
October 1977

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

This work was supported in part by the National Science Foundation under Grant No. US NSF MCS76-81686 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, October 1977.

Acknowledgements

I would like to express my very deep appreciation to my thesis advisor, Professor Duncan H. Lawrie, for his continuous encouragement and guidance throughout this research. It is his pleasant and humorous personality that makes my study at the University of Illinois really enjoyable. I would also like to thank Professor David J. Kuck for his valuable advice, kind understanding, and most importantly, the long-time financial support. Thanks are also due to a very dear and special person, Professor C. L. Liu, for his constant moral support and all kinds of help throughout my study here. Fellow students Wilson K. Wen (now with Sperry-Univac), D. Stott Parker, Jr., Bruce R. Leasure, and Jackson K. C. Hu, who provided valuable comments, encouragement, and friendship, are greatly appreciated. Special thanks go to Mrs. Vivian Alsip and Mrs. Gayanne Carpenter for their advice and support. Also, I want to thank Cathy Gallion for an excellent job of typing. Finally, I want to thank my lovely wife, Li, for her love, patience, and understanding throughout this long undertaking, and my great parents for providing me the chance to receive the best education.

Table of Contents

1. INTRODUCTION 1
1.1 A Brief Survey of Multiprocessor Systems 1
1.2 Comparison of Three Multiprocessor Systems 4
1.2.1 PRIME System 5
1.2.2 C.mmp System 10
1.2.3 NonStop System 14
1.2.4 Overall Comparison 17
1.3 Major Design Questions 22
1.3.1 Software Related Questions 22
1.3.2 Hardware Related Questions 28
1.4 Thesis Outline 30
2. SYSTEM PERFORMANCE MEASUREMENT 32
2.1 Queueing Analysis 32
2.1.1 Our Queueing Model 33
2.1.2 Avi-Itzhak and Heyman's Method 37
2.1.3 Konheim and Reiser's Method 43
2.1.4 Brown, Browne and Chandy's Method 50
2.2 Simulation 56
2.2.1 Memory Bandwidth Problem 57
2.2.2 The Simulator 64
2.2.3 Definitions of System Measurements 74
3. EXPERIMENTAL RESULTS 76
3.1 Results for Software Related Questions 76
3.1.1 The Workload 77
3.1.2 Monoprogramming versus Multiprogramming 85
3.1.3 Memory Allocation Schemes 97
3.1.4 Job Scheduling Algorithm 121
3.1.5 Effects of Job Characteristics 128
3.2 Results for Hardware Related Questions 139
3.2.1 Hardware Quantity Effect 143
3.2.2 Hardware Speed Effect 159
3.2.3 Partial Connection 165
4. CONCLUSION 191
4.1 Summary 191
4.2 Some Design Problems 203
4.2.1 Address Interleaving 203
4.2.2 I/O Connection 217
4.3 Further Problems 223
References 225
Appendix A 230
Vita 235

Chapter 1
INTRODUCTION

1.1 A Brief Survey of Multiprocessor Systems

The advent of the large scale integrated circuit has had a tremendous impact on computer system design. In particular, LSI technology has made great strides in the areas of memory packaging and microprocessor design. The existence of inexpensive but powerful microprocessors and extremely high density semiconductor memory chips has led people to consider the design of large computer systems incorporating a large number of processors and memory modules.

In fact, the idea of using multiple processing units to handle various functions of the whole system is not new. People started thinking about and building machines with multiple PEs at least twenty years ago. Back in 1958, Unger [1] designed a machine to perform pattern-recognition processing, which consisted of a central control computer and a processing element array. During the same year, three other systems were designed and manufactured, namely, the National Bureau of Standards' PILOT system [2] and USAF's AN/FSQ-31 and 32 air defense systems. Although these old machines do not quite fit the definition of a multiprocessor system commonly accepted today (some of them we would prefer to call multiple-computer systems), they do show the approach people used to speed up their systems, i.e., using several processing units to carry out several operations at the same time.

Before we go on, we would like to present a definition of a multiprocessor in order to clarify some ambiguities. According to the American National Standard Vocabulary for Information Processing [3], a multiprocessor system is defined as "a computer employing two or more processing units under integrated control." A better definition was proposed by Enslow [4]. He defines a multiprocessor to be a system with:
• Two or more processing units
• Shared common memory
• Shared I/O channels, control units, and devices
• A single integrated operating system
• Hardware and software interaction at all levels
We will use this definition throughout this report. Obviously, a group of computers connected by some communication means, such as the ARPANET, which does not have all five of these characteristics, does not qualify to be and will not be called a "multiprocessor" system.

So, the first "true" multiprocessor under this definition should be Burroughs' D-825 system [4,5,6,7], announced in 1960. A lot of multiprocessor machines have been designed and built since then, for example, the Burroughs B-5000, IBM 704X/709X, CDC 6600, Univac 1108/1110, etc. A complete list can be found in [5], and a very good bibliography in [8]. Most of these multiprocessor systems have only a small number of processors, say 2 to 10. This is not surprising because they were built before LSI became popular and the hardware was still very expensive. Only a few machines were designed to have a large number of processing units. The most famous one is, of course, ILLIAC IV, as well as its two predecessors, SOLOMON I and SOLOMON II, which were designed by Slotnick et al. [9,10] to work on problems involving differential equations, linear algebra, and weather data processing. Since all these problems contain a lot of matrix operations, sometimes thousands by thousands, they do need a machine with a large array of PEs in order to get a reasonably fast response time.
In the past few years, the tremendous improvement in circuit performance and the drastic reduction in hardware price have made the multiprocessor design even more attractive. In particular, the advent of the LSI microprocessor has brought the system designer into a new world. People have started building systems using tens, hundreds, or even thousands of cheap but very powerful microprocessors. Recently, several projects have been proposed, e.g. [11], to construct systems with 1024 or more processing elements. Only the very low cost, say a few hundred dollars per PE, can make this kind of design possible. This was still a dream even five years ago.

Of course, a lot of questions arise in this kind of new design. For example, how do we interconnect so many processors, how do these processors share resources and communicate with each other, and how do we control the operation of the whole system and fully utilize the hardware? Needless to say, all these questions need to be answered satisfactorily before we can come up with a good design. People are getting more and more concerned with these problems. It is our intention to make a thorough study of these problems in order to get a better understanding of how to design such a multiprocessor system.

Before we try to answer those questions, we would like to briefly discuss three well-known systems to give readers an idea of what kind of system we are dealing with. They are: the PRIME system at the University of California at Berkeley [12], the C.mmp system at Carnegie-Mellon University [13], and the Tandem 16 NonStop system by Tandem Computers, Inc. [14]. All these systems are made up of a certain number of microprocessors and memory modules. However, due to a different set of design objectives, e.g., degree of resource sharing, expandability, etc., each system has a completely different architecture and operating system design philosophy. For example, they use three fundamentally different interconnection schemes to connect their processors and memories, namely:
• Multiport memories
• Crossbar switch
• Time-shared common bus
We would like to list their differences, and try to compare their advantages and disadvantages. Hopefully, we can learn something from this study which can be used as a valuable design guide in the future.

1.2 Comparison of Three Multiprocessor Systems

1.2.1 PRIME System

The PRIME system is a medium-size, general-purpose time-sharing system whose design is aimed at improving the cost/performance ratio, reliability, and privacy of current time-sharing systems. Figure 1 shows the system architecture of the PRIME system. It consists of five Meta 4 microprocessors (by Digital Scientific Corporation) and thirteen four-port memory modules. Every processor is connected to eight memory modules via a dedicated processor bus, so it is a multiport memory connection. (Since each processor only connects to about two-thirds of the total memory, we will call this a "partial" connection in later discussions.)

Meta 4 is a microprogrammable microprocessor that has a processor cycle time of 100 ns. It operates on 16-bit operands and has a 32-bit microstore. The MOS memory is 16 bits per word, with a 400 ns access time and a 600 ns cycle time. Each four-port memory module has 8K words made up of two 4K-word submodules.

[Figure 1. Structure of the PRIME System: disk drives and external devices connect through the interconnection network (External Access Network) and I/O control logic to the five processors, each with a map; the processors connect to the 13 memory modules, each consisting of two 4K blocks. Each line into the network represents 16 terminal connections.]
There is a four-by-two switch inside each module which can connect a memory port to any submodule. This memory organization allows two-way interleaving inside a module.

All the peripheral devices (except user terminals) are connected to the processors and the memories via a big interconnection network called the External Access Network (EAN), which is essentially a crossbar network [15]. The network is controlled by five I/O control boxes. These I/O control logic units not only control the information flow in and out of the peripheral devices but also control some inter-processor communication.

At any given time, the whole system is partitioned into five physically separated subsystems [16]. No two subsystems will share the same memory module or disk space. This is to achieve high privacy, which is very important in a multiprocessor system. One of these five subsystems is assigned to be the "control subsystem," and the rest are "program subsystems." All the users compete for these four program subsystems.

The operating system [17] is also partitioned into two parts, namely, the Central Control Monitor (CCM) and the External Control Monitor (ECM), as shown in Figure 2.

[Figure 2. Structure of the PRIME Operating System: the control subsystem runs the CCM, while each program subsystem runs its own copy of the ECM together with a user-defined Local Monitor (LM) and its user processes.]

The control subsystem is assigned the Central Control Monitor and each program subsystem has a copy of the External Control Monitor. The CCM is the centralized part of the system-wide operating system which controls all the system tasks like job scheduling, resource allocation, interrupt handling, and inter-processor communication. The CCM also monitors the I/O control boxes to determine the connections made in the interconnection network. Whenever a program processor wants to talk to another program processor or access a peripheral device, it must send a request to the CCM and seek its permission. Then, the CCM will make the connection by telling the interconnection network to do so.

The ECM, on the other hand, is the local representative of the operating system at each subsystem, including the control subsystem. It performs the local management functions related to processes running on that subsystem, e.g., teletype I/O for the teletypes physically connected to the subsystem's processor, swapping out the current process and swapping in the next process, etc. It also controls the communication between user processes and the CCM, and does independent verification of CCM decisions. In fact, all five ECMs work like a communication subnet.

Each subsystem also has a Local Monitor (LM). Every user can define his own LM to control all the intra-subsystem tasks, e.g., the management of resources allocated to him, the generation of interrupts, etc. So, the software is modularized and partially distributed into all the subsystems. This is a very important factor for achieving high availability. For reliability reasons, any subsystem can become the control subsystem.
Whenever there is a failure in the current control subsystem (which may be detected by another subsystem's ECM), any other subsystem can take over the job immediately. If one program subsystem goes down, the whole system only suffers a performance degradation of 25%. Hence, the PRIME system is a highly reliable, highly secure, and highly available system. Besides, due to its multiport memory connection, it is very easy to expand and reconfigure.

Of course, the physical boundary between two subsystems essentially eliminates the possibility of code sharing by two processes. Thus, some software duplication is needed, which effectively reduces the available memory in each subsystem. However, the designers do not consider this a drawback. A paper by Ravi [18] points out that code sharing will actually generate more cons than pros, e.g., the system will need higher memory bandwidth due to the higher memory interference caused by competing processes.

There are two very interesting things we would like to point out. First, each program subsystem, and hence each processor, is dedicated to one user job until this job is swapped back onto the disk. A user job will not be brought into the main memory unless there is a free processor and the available memory space attached to it is large enough. So, at any given time, at most four user programs reside in the main memory being executed. In other words, the PRIME system does not allow more than one job to be executed by a processor subsystem at the same time. It would seem that they are not fully utilizing the processors. However, there is no overhead due to changing of jobs, e.g., swapping out the status information of the current job and reconfiguring a new subsystem. Furthermore, the operating system is much simpler, which increases the software reliability.

The second thing is the partial connection scheme. The physical partition sometimes eliminates the chance for a new job to enter the system, even though the total free memory space is large enough and there is some processor available. Of course, there is some amount of performance degradation due to this fact. We are interested in finding out how bad this is, compared to a more expensive full connection scheme like a crossbar switch. Needless to say, the PRIME system does provide us a lot of interesting subjects to study. We will discuss them in more detail later.

1.2.2 C.mmp System

C.mmp is the multi-mini-processor system at the computer science department of Carnegie-Mellon University [13,19]. The overall architecture is shown in Figure 3. It consists of sixteen PDP 11 minicomputers connected through a crossbar switch to sixteen memory modules. Every processor (Pc, a modified PDP 11 processor) can access any memory module via the crossbar switch. So, the memory is completely shared by all processors. This is one basic difference between PRIME and C.mmp. We will call this kind of connection a full connection.

C.mmp is designed for solving large artificial intelligence problems. This kind of problem needs a number of processors to work on a large common data base simultaneously in order to obtain the answer in real time. So, complete memory sharing is crucial, and that is why it uses a 16 x 16 crossbar switch for the memory-processor connection. Of course, there will be some memory contention due to memory sharing. In the next chapter, we will give an analytical solution for this problem. However, the memory contention can be reduced by using local (or private) memory.
In C.mmp, each processor has a 4K local memory which is not shared by other processors (Figure 3).

[Figure 3. C.mmp architecture: sixteen processors (Pc), each with a local memory and a map, connect through a 16x16 crossbar switching network to sixteen memory modules; the control, clock, interrupt, and external devices attach on the processor side.]

Each processor can be a slightly modified version of any model in the PDP 11 family. In the first stage of implementation, five PDP 11/20's were installed. Another four PDP 11/40's were scheduled to be added in the summer of 1975. The PDP 11/40 operates on 16-bit operands and has a processor cycle time of 650 ns. Notice that, although the processor cycle time of the PDP 11/40 is much larger than that of the Meta 4, it does not mean the PDP 11/40 has a smaller instruction execution rate. The instruction execution rate is determined by the number of cycles each instruction takes. For the PDP 11/40, most instructions take only one or two processor cycles, and it averages roughly 0.44 million instructions per second. On the other hand, the Meta 4 has a similar rate since each of its instructions needs several micro-instructions to execute.

Both the PDP 11/20 and 11/40 use core memory, which has a 500 ns access time and a 1.2 µs cycle time. Although the memory speed is not very fast, we can interleave a program across all 16 modules to get a high memory bandwidth.

One other area where C.mmp differs from PRIME is that peripheral devices are not shared. Each peripheral device is connected to the unibus of a processor and can only be used by that processor. Hence, the processors must use the primary memory for interchanging information. Both this and memory sharing are possible sources of privacy violations. Software protection is a very important issue in the operating system design.

Although the main purpose of C.mmp is to use the system as a whole to work on a large program, it can also be partitioned, either dynamically or statically (manually), into several independent subsystems and operated in a fashion like PRIME. Due to the partitionability of the crossbar switch, the hardware can be partitioned into two, three, or even 16 totally separated subsystems. This greatly eases maintenance, since if any processor or memory module is down, it can be isolated from the rest of the system and turned over to the hardware engineer for replacement. Thus, it does not require taking the entire machine down for maintenance.

Unlike the PRIME system, C.mmp does not designate a single processor as the control processor. This is because C.mmp is designed to have up to 16 processors, and when the number of processors increases, the master (or control) processor quickly becomes a bottleneck. This is the reason why PRIME can only have a small number of processors (5), since the Meta 4 is a relatively slow minicomputer. However, this means that each processor in C.mmp should have its own copy of the operating system if it is working alone. In order not to occupy too much memory, the size of the operating system should be minimized while still meeting all the users' requirements. Certainly, this needs a special kind of operating system design.

HYDRA, the operating system for C.mmp, is designed for this purpose [20]. The central core of HYDRA is a "kernel" set of operating system facilities which provide both basic protection and management of the hardware resources. However, the kernel does not provide software for things like the file system, job control language, or scheduling policy. These are supplied by the user.
This approach has several advantages. First, the user has the freedom to define his own operating system, for example, a job control language. This not only allows the user to minimize the size of his operating system, but also allows him to specify some facility not provided by the existing programs or to replace an existing facility by one more closely attuned to his own needs. Second, an error in one user operating system can only affect his own program. It will not crash the entire system. This greatly increases the reliability of the software. Since the kernel is rather small and well-defined, an error in it is very rare.

In the C.mmp system, a program usually can be run on any available processor. This is why it needs a crossbar switch in order to provide a full connection. Of course, the processor utilization will be higher than that on PRIME. However, we can see there must be a big overhead associated with job swapping, especially when each user has defined his own operating system. We will show later that this scheme might not be a good idea in some cases.

The use of a crossbar switch, of course, has some disadvantages: first, it is very expensive; second, it is not easy to expand. C.mmp can have a maximum of sixteen processors and sixteen memory modules. Although the system, using PDP 11/40s, can yield up to 7 million instructions per second (MIPS), which is comparable to an IBM 370/158, it will be very difficult and expensive to expand the system beyond sixteen processors. This scheme certainly will not work in future systems where we might have thousands of processors. But, in general, C.mmp is a highly reliable (in both software and hardware), highly available, and easy-to-maintain system. In particular, the ideas of HYDRA will be very helpful in operating system design for future multiprocessor systems.

1.2.3 NonStop System

Figure 4 shows the architecture of a recently announced multiprocessor, Tandem Computers' NonStop system [14].

[Figure 4. Tandem 16 NonStop system: processor modules, disk controllers, tape controllers, and communication controllers all attach to dual redundant communication buses.]

This system, configured with up to 16 minicomputers, is designed to handle heavy banking transaction processing and to provide very high availability. The basic difference is that the processors are connected together by a pair of time-shared common buses. Processors communicate with each other via these buses.

The use of a common bus offers the advantages of very low cost and ease of modifying the hardware configuration. For example, we can add or remove a functional unit fairly easily. However, the overall system performance is limited by the bus transfer rate, and a failure of the bus would cause a catastrophic disaster. Hence, NonStop uses dual redundant buses to increase the transfer rate and the availability.

Each processor module is actually a complete minicomputer, having its own control unit, arithmetic unit, private memory, and its own copy of the operating system. So, every processor has the ability to keep working even if all other processors are down. Also, whenever a processor goes down, other processors can take over without much difficulty.

There is no memory sharing; instead, the processors share the peripheral devices (e.g., disk and tape) via the controllers. This is because the system is designed for handling banking transactions and all the processors are supposed to work on a big data base on disks or tapes.
Therefore, unless each minicomputer can provide a large amount of primary memory, the use of this kind of architecture is perhaps only appropriate for data base management.

The number of processor modules that can be attached to the common data bus seems to be unlimited. However, as the number of processors increases, the bus contention increases drastically. This will seriously degrade the system throughput. Besides, the longer the bus is, the larger the time skew is, and the slower the clock rate will be. So, the expandability is limited by a number of constraints. When the workload grows beyond a certain limit, this architecture will no longer be able to expand and perform satisfactorily.

The I/O is controlled by the communication controller. The controller can assign a task (transaction) to any available processor. This achieves high availability and high utilization of processors.

Perhaps the most appealing part of the design is the software. Guardian, the operating system of NonStop, is a virtual memory system which contains automatically re-entrant, recursive, and shareable code. Whenever a component fails, Guardian automatically reassigns both processor and I/O resources to ensure that in-process tasks, including file updates, are completed correctly. This guarantees the process can be restarted in a very short time. For a system that provides high availability, this type of action is extremely important. When one of the disks fails in the middle of a file update, Enscribe, Tandem's NonStop data base manager, ensures that the damaged record or file is restored. Enscribe uses a duplicate file technique to continue the operation by using the back-up file. Hence, the faulty disk will not cause any interruption of service.

Overall, NonStop uses redundant hardware and duplicated software in order to give the user continued service without any interruption or termination. This is very important to a system where the user cannot afford any system downtime, for example, a telephone switching network or an online banking operation. However, this might not be a good candidate for a scientific research environment.

1.2.4 Overall Comparison

We went through three multiprocessor systems very briefly in the last three sections. As we can see, each system has a different architecture, and its own advantages and disadvantages. It is not fair to say which system is the best or which one is better than the other, since they have different design objectives.
[Table 1 of the original, a multi-page side-by-side comparison of the PRIME, C.mmp, and NonStop systems, is not recoverable from the scan. The opening of Section 1.3, Major Design Questions, and the first question of Section 1.3.1, Software Related Questions, are lost with it; the text resumes partway through Section 1.3.1.]
Now, our second question is:
• What kinds of performance will be given by these schemes, and which scheme is the best to use?
In Chapter 3, we will measure all these schemes, and discuss some problems related to them, for example, how we interleave the addresses if we want to use the partitioned system or the mixed system.

The next question we are interested in is:
• What kind of scheduling algorithm should we use?
Every operating system designer must face this question. In order to answer it, we have to know what type of system we are dealing with, what kind of measurement we are interested in, and what penalty we will suffer if we let some job wait in the queue. In general, people measure the goodness of a scheduling algorithm by the average turnaround time it produces. So, most people use round-robin in a time-sharing system, shortest-job-first in a batch system, and give higher priority to time-sharing jobs in a mixed system.

In our study, we will assume that we are dealing with a batch system. However, this does not imply the shortest-job-first (SJF) algorithm will always win if we consider the average turnaround time, since we will deal with some special architectures like a partially connected system. Some other algorithm, e.g., best-memory-fit-first, might perform better than SJF under that circumstance. We are even interested in seeing how badly the first-come-first-serve algorithm will perform, since FCFS does not involve any scheduling and that means a simple operating system.

The fourth question we want to answer is:
• How do the job characteristics affect the system performance?
The job characteristics include the mean, the variance, and the distribution of the job size, the processing time, the inter-arrival time, and the number of I/O requests. These parameters certainly have great influence on the system performance. For example, if the mean job inter-arrival time is too small, i.e., the jobs come in too fast, the system will become over-saturated and the queueing time might go to infinity. In order to avoid all these undesirable phenomena, we must understand how the system responds when a certain parameter changes, and how sensitive it is. Then, we can always keep our system in a safe region. Whenever the system load changes, we will know what we should do in order to maintain satisfactory performance.

1.3.2 Hardware Related Questions

Cost-effectiveness is perhaps the subject people are most concerned about in system design. Every designer wants to know how much performance improvement he can get for a certain piece of hardware he adds. Of course, everyone will try to invest his money where he can buy the best improvement. So, this trade-off problem usually is the first thing people will solve. Like most people, our first hardware related question is:
• How is the system performance affected by the architectural parameters?
We want to know how the job turnaround time and the processor and memory utilizations vary when we change certain hardware parameters, like the number of processors, the number of memory modules, the number of I/O devices, or the size of a memory module. We also want to know how the system performance will be affected by the hardware characteristics, like the memory cycle time, the processor cycle time, and the I/O speed.

As we mentioned in Table 1, the architectural difference between those three multiprocessor systems is the interconnection scheme they use, namely, the crossbar switch, the multiport memories, and the time-shared common buses. In fact, we can classify them into two groups according to the degree of connection each scheme provides. The crossbar switch used in C.mmp will be called a "full connection," since every processor can access any memory module via this switch. The multiport memories used in PRIME or the common buses used in NonStop, on the other hand, will be called a "partial connection," since every processor can only access part of the memory. Naturally, we would like to know:
• How much degradation will we suffer if we use a partial connection instead of a full connection?
Of course, the degradation goes up as we decrease the number of connection points. But at the same time, we reduce the cost of the whole system. Obviously, this is a trade-off problem. We will do some comparisons in Chapter 3.

There is another very interesting problem associated with the partial connection, which we call the "connectivity" problem. Since in a partial connection, say multiport memories, each memory module is connected to only some of the processors, the question is how many memory modules should be connected to a particular processor. For example, in Figure 1, each memory module has four ports (but only three are used, except for one module which uses four), and each processor is connected to eight memory modules in a fairly regular manner. However, this uniform connection might not be the best way. Especially when the number of ports is small relative to the number of processors, some uneven connection might be necessary in order to meet certain requirements; for example, one processor might have to be connected to half of the memory modules in order to take care of big jobs. Again, we will devote a section to discussing this interesting problem.

1.4 Thesis Outline

In order to answer the questions we raised in the last section, we need to do some performance measurements. Two methods are commonly used for measuring system performance, namely, queueing analysis and simulation. Usually, queueing analysis can reveal more insight about the system behavior, since we can see from the analytic solution how a certain variable affects the system performance. In the first part of Chapter 2, we will discuss some analytic work people have done in measuring computer systems. However, queueing techniques are only good for simple models with some simple assumptions. As the complexity of a system model increases, the queueing analysis soon becomes intractable. Therefore, people switch to simulation. The nice thing about a simulation model is that you can put in as many parameters as you want, and as many constraints as you like. So, the simulation technique can be applied to a very complicated model. Unfortunately, it can be very costly.
Since our model is rather complicated, we will use a combination of the analytic approach and simulation to measure the relative performance of various systems. In the second part of Chapter 2, we will describe the simulation model we use. We will also talk about some memory bandwidth problems, since we will use memory bandwidth to determine how we advance the virtual clock of our simulator. In Chapter 3, we will present all the results and try to answer the questions we raised in the last section. Finally, in Chapter 4, we will discuss some logic design problems and give a summary of all our results.

Chapter 2
SYSTEM PERFORMANCE MEASUREMENT

2.1 Queueing Analysis

In this chapter, we are going to talk about how we measure the performance of a multiprocessor system. As we just mentioned at the end of the last chapter, two common techniques can be used for this purpose: queueing analysis and simulation. We will start by looking at some queueing models that have been proposed for the analysis of multiprocessor systems.

Using queueing techniques to study system performance is a very old subject. People have been active in this area for quite a number of years and a lot of papers have been written on this subject. However, most of the effort has been spent in the following areas:
(1) Performance analysis of auxiliary and buffer storage like disk [21,22], drum [23], or magnetic-bubble [24].
(2) Waiting time analysis of job scheduling disciplines [25,26].
(3) Performance analysis of single processor multiprogramming systems [27] and single processor time-sharing systems [28].
(4) Performance study of communication networks like the ARPANET [26] or ALOHA [29].

Relatively few papers have been written about multiprocessor systems. Perhaps the biggest difficulty is formulating the resource contentions into the model. Besides, if we want to consider the finiteness of memory size and I/O operations, then the whole system becomes a queueing network with blocking. This is a notoriously hard problem to solve exactly. In some papers, e.g., [30], people just ignore all these problems and treat the multiprocessor as an M/M/p queueing system, which yields a poor approximation. Of course, we are interested in a more accurate solution.

Recently, three papers have been written which provide some analytic methods of studying multiprocessor systems [31,32,33]. In two of these methods, we can accurately include the effects of finite memory size and workload memory requirements in the queueing model. We feel that these papers deserve to be discussed in some detail in order to give readers a better understanding of the problem and of the strength of these analytic methods. We will show how we apply these methods to our queueing problem, and discuss the advantages and disadvantages of each method. However, due to the complexity of the systems we are going to study, we will have some difficulty including all the effects of system architecture and resource contention in a queueing model. Unless we can nicely formulate everything, we will not be able to get accurate results from any of these methods. After we discuss these three papers, the reader should realize why we rely on simulation rather than queueing analysis.

Before we talk about these analytic models, let us first describe our queueing model for a multiprocessor system. This will aid in understanding the later discussion.

2.1.1 Our Queueing Model

Figure 6 is the basic queueing model we are interested in.
[Figure 6. The queueing model of the multiprocessor system: arriving jobs wait in an outside queue until they can enter the service box, then cycle between the processor queue and the I/O queue until they depart.]

We assume that the system has p processors, r I/O devices, and a total of M kbytes of primary memory divided into m modules. When a job arrives, if either there are already D jobs in the system or the available memory is not big enough, it will be queued in the outside waiting queue. Otherwise, this job will enter the service box and queue in the processor queue for its first service. If the job gets a processor, it will be served for some amount of time nonpreemptively until it requests an I/O operation. Then it proceeds to the I/O queue and waits for an I/O operation. After the I/O, this job will depart the whole system with probability 1-σ, or return to the processor queue with probability σ, and the cycle starts again.

The outside waiting queue corresponds to the HASP queue in our IBM 360/75, which holds the jobs that are blocked from service. The average number of jobs queued here is an indication of how well the system performs. A good scheduling algorithm could be used here to reduce the average queue length. The number D indicates the maximum number of jobs allowed in the service box. It is equal to p under monoprogramming and to some constant d under multiprogramming (d is called the degree of multiprogramming).

Each job in the service box will cycle through the tandem queue a certain number of times. This number has a geometric distribution:

$$\Pr\{\text{a job needs } \ell \text{ cycles}\} = (1-\sigma)\,\sigma^{\ell-1}, \qquad \ell = 1, 2, \ldots,$$

with mean $\bar{a} = 1/(1-\sigma)$. Here $\bar{a}$ is actually the average number of I/O requests for a job. The parameter σ can be arbitrarily defined or obtained from analyzing some real data.

Figure 7 shows the timing diagram of a job from its arrival until its departure. Of course, in any stage, if the resource is available when the job arrives, it will get served immediately without waiting.

[Figure 7. History of a job: between arrival and departure the job accumulates queueing time in the outside queue box, queueing time in the processor queue, execution time, queueing time in the I/O queue, and I/O time.]

In order to make analysis easier, people always assume the job arrival is a Poisson process. In other words, the job arrival rate is constant, or the inter-arrival time is exponentially distributed. The literature on queueing analysis has shown evidence that this is a pretty acceptable assumption.

However, the most controversial part is the service rates of the processor and I/O stages, i.e., the rates $\mu_1$ and $\mu_2$ in Figure 6. In order to use the nice results of queueing theory, we have to assume they are constant, that is, to assume both the processing time and the I/O time are exponentially distributed. This is a very strong assumption to make. For the I/O service rate, if we neglect the interference between I/O requests, this assumption may be all right. But this cannot be true for the processor.
The service rate of a processor should be a function of the system architecture, the memory allocation strategy, the number of jobs currently being executed, the memory interference they create, and the original inter-I/O time distribution. In general, this is not easy to formulate. In addition, due to the finiteness of the memory size and the maximum number of jobs allowed in the system, this model becomes a queueing system with blocking. It is a tough problem and no exact solution is known yet [34]. The best thing people can do is to use an approximate model. One example is Avi-Itzhak and Heyman's model, which we are going to discuss in the next section.

2.1.2 Avi-Itzhak and Heyman's Method

The approach adopted by Avi-Itzhak and Heyman consists of two stages [31]. The first stage is to view the system as a closed queueing network with a fixed number of jobs, and to obtain the average cycle time for each job. Then we approximate the open system by an M/G/D queue, use the result of stage one to solve the state balance equations, and get the expected time a job will spend in the system.

By a closed queueing network, we mean a network with no job coming in or going out. Figure 8 shows the closed queueing network used in Avi-Itzhak and Heyman's analysis.

[Figure 8. The closed queueing network used in stage one: station 1 (the processors) feeds stations 2 through k (the peripheral device groups), which feed back into station 1.]

It consists of k service stations with a fixed number, say n, of jobs cycling through the stations. Station 1 represents a group of processors, and stations 2 to k represent various groups of peripheral devices such as disks, tapes, etc. Station ℓ contains $r_\ell$ servers operating in parallel with a common queue, and each server has the same expected service time $E(S_\ell)$. When service at a processor is completed, the job moves to station ℓ with probability $\pi_\ell$, where $\pi_1 = 0$ and $\sum_{\ell=2}^{k} \pi_\ell = 1$. Upon completion of service at station ℓ, the job moves back to station 1 and the same process is repeated.

The exact solution of this queueing model has been obtained by Jackson [35], and Gordon and Newell [36]. We will repeat their result here and show how to relate it to our model. Let $\rho_\ell$ be the steady-state expected number of busy servers at station ℓ. The average number of jobs flowing into a given station must equal the average flow out of the station. Therefore,

$$[\rho_1 / E(S_1)]\,\pi_\ell = \rho_\ell / E(S_\ell), \qquad \ell = 2, \ldots, k.$$

If we define $a_\ell = \pi_\ell\,[E(S_\ell)/E(S_1)]$, we obtain $\rho_\ell = a_\ell\,\rho_1$, with $a_1 = 1$ by definition.

Assume $p(x_1, x_2, \ldots, x_k)$ is the steady-state joint probability of there being $x_\ell$ jobs at station ℓ, ℓ = 1, 2, ..., k. Then we have

$$p(x_1, x_2, \ldots, x_k) = c \prod_{\ell=1}^{k} \bigl[\rho_\ell^{\,x_\ell} / \beta_\ell(x_\ell)\bigr]. \qquad (1)$$

In this equation all $x_\ell \ge 0$, $\sum_{\ell=1}^{k} x_\ell = n$, c is the normalization constant, and

$$\beta_\ell(x_\ell) = \begin{cases} x_\ell! & \text{if } x_\ell \le r_\ell, \\ r_\ell!\, r_\ell^{\,x_\ell - r_\ell} & \text{if } x_\ell > r_\ell. \end{cases}$$

The summation of $p(x_1, x_2, \ldots, x_k)$ over the set $D_n = \{x : x \ge 0,\ \sum_{\ell=1}^{k} x_\ell = n\}$ must yield the value 1. Therefore, from equation (1) we have

$$1 = c \sum_{D_n} \prod_{\ell=1}^{k} \bigl[\rho_\ell^{\,x_\ell}/\beta_\ell(x_\ell)\bigr] = c\,\rho_1^{\,n} \sum_{D_n} \prod_{\ell=1}^{k} \bigl[a_\ell^{\,x_\ell}/\beta_\ell(x_\ell)\bigr]. \qquad (2)$$

However, the expected number of busy servers at station 1 is $\rho_1$, that is,

$$\rho_1 = \sum_{x_1=1}^{r_1} x_1 \sum_{D_n - x_1} p(x_1, x_2, \ldots, x_k) + r_1 \sum_{x_1=r_1+1}^{n} \sum_{D_n - x_1} p(x_1, x_2, \ldots, x_k), \qquad (3)$$

where $D_n - x_1 = \{x : x \ge 0,\ \sum_{\ell=2}^{k} x_\ell = n - x_1\}$. Substitution of (1) and (2) into (3) yields

$$\rho_1 = \Bigl\{ \sum_{x_1=1}^{r_1} x_1 \sum_{D_n - x_1} A + r_1 \sum_{x_1=r_1+1}^{n} \sum_{D_n - x_1} A \Bigr\} \Big/ \sum_{D_n} A, \qquad \text{with } A = \prod_{\ell=1}^{k} \bigl[a_\ell^{\,x_\ell}/\beta_\ell(x_\ell)\bigr].$$

The only thing we need now is an algorithm to generate all the elements in $D_n$ and $D_n - x_1$; then we can easily enumerate $\rho_1$.
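A minimal sketch of such an enumeration is shown below (Python; the function names and the station parameters in the example are purely illustrative, not taken from the original). It generates the compositions of n over the k stations, i.e., the set D_n, and evaluates ρ_1 directly from equations (1)-(3); writing min(x_1, r_1) for the number of busy processors combines the two sums of equation (3) into one.

```python
from math import factorial

def beta(x, r):
    # beta_l(x) from equation (1): x! for x <= r, else r! * r**(x - r).
    return factorial(x) if x <= r else factorial(r) * r ** (x - r)

def compositions(n, k):
    # Generate every vector (x_1, ..., x_k) of non-negative integers
    # summing to n -- the set D_n.
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def rho1(n, a, r):
    # Expected number of busy servers at station 1 with n circulating jobs.
    # a[l] are the relative loads a_l (a[0] = 1 by definition) and r[l] the
    # server counts r_l.  Brute-force enumeration, so only for small n and k.
    k = len(a)
    def A(x):
        w = 1.0
        for l in range(k):
            w *= a[l] ** x[l] / beta(x[l], r[l])
        return w
    num = sum(min(x[0], r[0]) * A(x) for x in compositions(n, k))
    den = sum(A(x) for x in compositions(n, k))
    return num / den

# Hypothetical example: k = 2 stations, 4 processors, 2 I/O servers,
# relative I/O load a_2 = 1.5, and n = 6 jobs in the closed network.
print(rho1(6, a=[1.0, 1.5], r=[4, 2]))
```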
After obtaining $\rho_1$, we can get the average inter-arrival time at station 1, i.e., $E(S_1)/\rho_1$, by applying Little's theorem. Since there are n jobs in the system, the average cycle time for each job, i.e., the time between two visits to station 1 by each job, should be

$$T(n) = n\,E(S_1)/\rho_1.$$

This is the result we want in stage one.

Now, let us come back to the original open system. Avi-Itzhak and Heyman propose to view the open computer system as an M/G/D queue, as shown in Figure 9, with arrival rate λ, D servers, and state-dependent service rate $\mu_n$.

[Figure 9. The approximating M/G/D queueing model.]

If each job takes $\bar{a}$ cycles on the average, then clearly

$$\mu_n = \begin{cases} n / [\bar{a}\,T(n)], & n = 1, \ldots, D, \\ D / [\bar{a}\,T(D)], & n > D, \end{cases}$$

which is the rate at which jobs depart from the system. Denoting the steady-state probability of having n jobs in the system by $P_n$, we have the following set of balance equations:

$$\lambda_0 P_0 = \mu_1 P_1,$$
$$(\lambda_n + \mu_n) P_n = \lambda_{n-1} P_{n-1} + \mu_{n+1} P_{n+1}, \qquad n = 1, 2, \ldots$$

The solution to this set of equations must also satisfy $\sum_{n=0}^{\infty} P_n = 1$. If we assume $\lambda < \mu_D$ (so that the queue is stable) and $\lambda_0 = \lambda_1 = \cdots = \lambda$, we get

$$P_n = \begin{cases} P_0\,\lambda^n / (\mu_1 \mu_2 \cdots \mu_n), & n = 1, 2, \ldots, D, \\ P_D\,(\lambda/\mu_D)^{\,n-D}, & n > D. \end{cases}$$

By Little's theorem, we obtain the expected time a job spends in the whole system,

$$E(T) = \frac{1}{\lambda} \sum_{n=0}^{\infty} n\,P_n.$$

This is the final result we are looking for.

For our queueing model, the calculation in stage one is much simpler, since we only consider two stations (k = 2). Therefore, there are only n+1 elements in $D_n$ and one in $D_n - x_1$, i.e., $x_2 = n - x_1$. The equation for $\rho_1$ then becomes

$$\rho_1 = \Bigl\{ \sum_{x_1=1}^{r_1} x_1\, \frac{a_2^{\,n-x_1}}{\beta_1(x_1)\,\beta_2(n-x_1)} + r_1 \sum_{x_1=r_1+1}^{n} \frac{a_2^{\,n-x_1}}{\beta_1(x_1)\,\beta_2(n-x_1)} \Bigr\} \Big/ \sum_{x_1=0}^{n} \frac{a_2^{\,n-x_1}}{\beta_1(x_1)\,\beta_2(n-x_1)}.$$

If we assume monoprogramming, i.e., $n \le r_1 = p$, the above equation can be further reduced to

$$\rho_1 = \sum_{x_1=1}^{n} x_1\, \frac{a_2^{\,n-x_1}}{\beta_1(x_1)\,\beta_2(n-x_1)} \Big/ \sum_{x_1=0}^{n} \frac{a_2^{\,n-x_1}}{\beta_1(x_1)\,\beta_2(n-x_1)}.$$

The rest of the calculation remains the same.

The advantage of this method is, obviously, the ease of enumeration. However, there are several problems with this model. First, we do not know how accurate it is to approximate an open computer system by an M/G/D queue. Second, this model does not consider the effects of finite memory size and things like the scheduling algorithm. Third, and the biggest reason, we do not know the value of $a_2$ (the ratio of service rates). Of course, we can figure out how $a_2$ affects the performance. However, we do not know how the system architecture, memory allocation strategy, and other factors affect the value of $a_2$. An analytic formula is very
The only thing we are interested in is the method Konheim and Reiser use, which we can apply to our problem with a little modification. In fact, the method they use is very basic, namely, using the state balance equations. For the queueing network shown in Figure 10, they come out with the following equation: 44 J^M Figure 10. A Tandem Queue with Finite Waiting Room and Blocking. 45 [X + Ol, . n . M % + jJI, . n %] P; ; = XI, . n xP. , • L U>0,j0) J >c,j U>0) -c-l.j + aI U>O,/o) p -t+i,/-r o + P (^+l)(y-l)(fe+l)9(fe + l)(l-^) P r (a Job can enter|sys- tem is in state (^c+1 ,y-l ,fe+l )} for I > 0, D > y+fe > 0. (4) Where f(j) is the total service rate of processors when / jobs are in the processor stage, and similary g(fe) is the total service rate of I/O devices when k jobs are in the I/O stage. We drop the indicator function from the equation. Of course, any term on the right hand side will vanish if it has a negative index. The calculation of those conditional probabilities is a yery interesting subject. It depends on several things, for example, the job scheduling algorithm, the memory allocation strategy, the total memory size (M), and the job size distribution. Of course, it also depends on the state (X,y,fe). Perhaps the best way to explain this is to give an example. Let us look at the first conditional probability, namely, the probability that the new arrival cannot enter the memory, given that the system is in state (^c-l,y,fe). If we assume a f irst-come-first-serve (FCFS) scheduling algorithm, then, trivially, this probability should be 1 when there is more than one job in the waiting queue, i.e., -t-1 >_ 1 or I >_ 2. However, when i=1 , i.e., the system is in state (0,y,fe), this probability will become the probability that this new job can enter the memory given there are already j+k jobs in there. To answer this question, we must know the last three things we mentioned in the last paragraph. Let us assume we use distributed memory allocation (cf. Chapter 1), and the job size is 2 normally distributed with mean n and variance v , then the probability will be: ^ ( M-Q>fe)M } _ ^ ( M- ( , / + fe+1)M ) //+fe v /y+fe+i v r x " t2/2 dt where 4>(x) = l//2w J e — oo As we can see, all the terms on the right hand side might have quite different forms from equation to equation, since they depend heavily on the state and other factors. In general, it is impossible to solve equation (4) analytically by using a transformation technique. This is the same problem Konheim and Reiser encountered, although their equation is much simpler than ours. If we only want to solve the state probability, then we can use a numerical method Konheim and Reiser use for their model . The method is very straightforward. However, it requires two huge arrays. If we let P = [P-'jl] be the state probability "vector," then we can rewrite equation (4) in the following form: P = PA where A is the state transition matrix. Each entry in A can be enumerated by the method we just described. Figure 11 is a three dimensional representation of equation (4). We can see equation (4) forms an irreducible Markov-chain since every state can be reached by any other state. For example, from state U,j',fe-1) we can go to state (^c,j,fe) via state (-c,j'-l,fe). Only the direct transitions to and from state (0, all c, 0, for n>£ , all y. v.. Then, g(n,y|C) can be computed from the following recursive equation: 55 g(n,y|«) = g(n,y|fi-l) [l-P(c-y) y + 2 g(n-l,x|K-l) p(y-x), for £ = 1 ,2,.. . ,m. 
where p(x) is the probability that a job will be of size x, and P(x) is the cumulative probability, i.e., $P(x) = \sum_{y=0}^{x} p(y)$. The proof of the above equation is very simple and can be found in [33]. After getting all these conditional probabilities, we can calculate h(n|m) by summing over all values of y, i.e.,

$$h(n \mid m) = \sum_{y=0}^{c} g(n, y \mid m).$$

In the third step, we consider the queueing network shown in Figure 13(b), which is called the overall model. This model consists of just user terminals and a single queue/server called the computer system. All the servers in the device model are coalesced into a single server. The number of jobs in this model is the total number of jobs in the system, i.e., M. The ready jobs are those in the computer system. Let q(m) be the expected number of jobs serviced by the computer per unit of time given that there are m jobs ready. It can be computed by

$$q(m) = \sum_{n=1}^{m} T(n)\,h(n \mid m), \qquad 1 \le m \le M.$$

Therefore, what we have now is the queue-length-dependent service rate q(m) when there are m jobs in the computer queue. We can then apply the balance equation technique to solve for the state probabilities, and calculate the system throughput.

Basically, this method uses the same technique as the first method by Avi-Itzhak and Heyman. Both methods find a state-dependent service rate first and then use the classical balance equation method to solve the rest of the problem. However, this method is more powerful and accurate, since it considers several things the first method does not include. Unfortunately, like the other two methods, this method still does not include the effects of system architecture and resource contention. In addition, the calculation of g(n,y|ℓ) is only practical for very simple scheduling algorithms. Since we are interested in the effects of all these factors, we have decided not to use an analytic approach. Instead, we will use simulation, which allows us to consider as many parameters as we want. In the next section, we will talk about how we do the simulation and discuss some problems associated with the simulator. We will also give some definitions of the parameters we are measuring, which will be very useful in our discussion in the next chapter.

2.2 Simulation

In the last three sections, we talked about some analytic tools for studying computer performance. Although we devoted quite a number of pages to these methods, what we really wanted to do was to show their limitations and to explain why we cannot use them for our work. In Chapter 1, we discussed several problems we are interested in: we want to compare several memory allocation schemes; we want to study the effect of scheduling algorithms; we want to compare partial connection with full connection; we want to see whether we should use multiprogramming or monoprogramming; and so on. All these make our system extremely complicated, and none of those analytic models can cover all the problems. Hence, the simulation technique must be used to meet all our requirements. Although simulation is a very expensive method, it can handle any
Usually, bandwidth will be measured in number of words per memory cycle. In other words, the memory bandwidth represents the information flow rate into or out of the main memory. Since most of the processor operations are related to the memory, e.g., the instruction and operand fetches, memory bandwidth significantly affects the system throughput. The higher the bandwidth is, the faster the processors operate. Hence, in order to determine how fast the system operates, we must know how much memory bandwidth we can get from the system. As we will see later, this is what we use to advance the virtual clock in our simulator.

In the next section, we will first derive a simple bandwidth equation. Then, we will show a general equation we use in the simulator. This general equation can be used to handle different kinds of memory allocation and different types of system architecture. We can also put in some parameters, e.g., the memory-processor speed ratio, in order to take care of things that might exist in a real computer system.

2.2.1 Memory Bandwidth Problem

We just defined the memory bandwidth to be the average number of words we can access during one memory cycle. From the memory's point of view, it is the average number of busy memory modules per unit of time. Of course, this quantity depends on a lot of factors, for example, how many references a processor will make in one memory cycle; the inter-relationship or pattern of these references; how many processors are accessing the memory; how they interface with each other; etc. In general, this is not a simple problem to solve. However, if we can make some reasonable assumptions, then we might be able to come out with some closed form solutions.

Let us start with a simple problem. Assuming there are m identical memory modules operating in parallel, and a processor is generating s randomly distributed references to these memory modules per memory cycle, what is the memory bandwidth of this system? For example, the processor is s times faster than the memory and it is accessing s items at random addresses. Or, equivalently, there are s (usually we use p, so s=p in this case) independent processors and each one is making only one memory reference per memory cycle. This is an interesting combinatorial problem whose answer is given by the following theorem [38].

Theorem: Given m identical memory modules operating in parallel, if we generate s randomly distributed references, then the average number of busy memory modules (bandwidth) will be

B_w = (1/m^s) Σ_{k=1}^{t} k · (m choose k) · k! · S(s,k),

where t = min(m,s) and S(s,k) is the Stirling number of the second kind.

We prove in [39] that this equation can be reduced to a very simple closed form, that is,

B_w = m[1 - (1 - 1/m)^s].      (5)

If we are interested in asymptotic behavior, then the above result can be transformed into the following linear form as we keep the ratio of s and m fixed and let m and s go to infinity:

B_w = m[1 - e^{-r}],   where r = s/m.

What this result implies is that, when we double the number of processors and memory modules, the memory bandwidth we get is also doubled, provided each processor is generating one independent reference per memory cycle. This contradicts what people have always called the biggest disadvantage of a multiprocessor, namely, that doubling the cost will not double the performance.
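As a quick numerical check of the theorem and of equation (5), the following sketch (written in Python purely for illustration; the helper names are ours, not the thesis's) compares the Stirling-number sum with the closed form and shows that doubling m and s at a fixed ratio r = s/m roughly doubles the bandwidth:

```python
from math import comb, factorial

def stirling2(s, k):
    """Stirling number of the second kind, S(s, k), via the usual recurrence."""
    if k == 0:
        return 1 if s == 0 else 0
    if k > s:
        return 0
    return k * stirling2(s - 1, k) + stirling2(s - 1, k - 1)

def bw_stirling(m, s):
    """Expected number of busy modules when s random references hit m modules."""
    t = min(m, s)
    return sum(k * comb(m, k) * factorial(k) * stirling2(s, k)
               for k in range(1, t + 1)) / m**s

def bw_closed_form(m, s):
    """Equation (5): B_w = m * (1 - (1 - 1/m)**s)."""
    return m * (1 - (1 - 1 / m) ** s)

if __name__ == "__main__":
    for m, s in [(8, 4), (16, 8), (32, 16)]:          # r = s/m fixed at 1/2
        print(m, s, round(bw_stirling(m, s), 4), round(bw_closed_form(m, s), 4))
    # Both forms agree; the bandwidth roughly doubles each time m and s double,
    # approaching the linear limit m * (1 - e**-r).
```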
The most famous result is, of course, the square root equation proposed by Hellerman [40], which says that the memory bandwidth of an interleaved memory system grows only as the square root of the number of memory modules. So, when we double the modules, the memory bandwidth only increases by roughly 40%. Apparently, this result is too conservative.

In fact, the result in equation (5) can be obtained in another way. Let us look at one specific memory module. Since each processor (or reference) will access this module with probability 1/m, the probability that it will not be accessed by a given processor is 1 - 1/m. Since all processors are independent, the probability that none of them will reference this module is (1 - 1/m)^s. Therefore, the probability that at least one processor will reference this module is 1 - (1 - 1/m)^s, which is the probability that this memory module will be busy. Summing over all the memory modules, we get equation (5) as the average number of busy modules, or the memory bandwidth. This method is very useful, and later we use it to get our general bandwidth equation.

Now, let us go back to the first problem. It is generally acknowledged that references generated by a processor should not be considered as randomly distributed. Instead, there should be some kind of relationship between two references. Hence, Burnett and Coffman [41] introduced the effects of seriality. They assume that the probability of the next reference addressing the next module in sequence (modulo m) is α, and the probability of addressing any other particular module is β, where β = (1-α)/(m-1). Or, formally, let r_i be the module number of the i-th reference; then

Pr{r_{i+1} = (r_i + 1) mod m} = α,
Pr{r_{i+1} = ℓ} = β   for any ℓ ≠ (r_i + 1) mod m,
for i = 1, 2, ..., s-1,

where s is the number of references generated per memory cycle. Then, the memory bandwidth is given by the following theorem [30,42].

Theorem: Assume the processor generates s memory references per memory cycle. If the next reference in line addresses the next module in sequence (modulo m) with probability α and any other module out of sequence with probability β = (1-α)/(m-1), then the memory bandwidth will be

B_w = Σ_{k=1}^{t} Σ_{j=0}^{k-1} α^j β^{k-1-j} C_m(j,k),

where t = min(m,s) and C_m(j,k) is a signed sum of binomial coefficients whose exact form is given in [30,42].

If we plot B_w against α, we can see that the bandwidth grows exponentially as α increases. This implies that, if a program has a high seriality α and if its addresses are distributed across the memory modules, then we can get very high bandwidth out of the memory. However, if there is more than one processor the problem becomes very complicated, since we should also consider the interference between processors. The solution for one processor is already very messy; it will be even more difficult to use the same kind of approach. Therefore, we do need a new technique for finding the memory bandwidth.

Recall the probability approach we just described to derive equation (5). If we can figure out the probability that a certain module will be busy, then we can sum all these quantities together and obtain the total memory bandwidth. This is a very basic and useful idea for getting memory bandwidth. Using this idea, we can write down the following general bandwidth equation:

B_w = Σ_{j=1}^{m} [ 1 - Π_{i=1}^{p} (1 - p_ij) ],      (6)

where m and p are the numbers of memory modules and processors, and p_ij is the probability that the i-th processor will reference the j-th module.
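Equation (6) is easy to evaluate directly. A minimal sketch (Python, illustrative only; the function and variable names are ours) that also checks the uniform-reference special case against equation (5):

```python
def general_bandwidth(p_matrix):
    """Equation (6): B_w = sum over modules j of [1 - prod over processors i of (1 - p_ij)].

    p_matrix[i][j] is the probability that processor i references module j
    in a given memory cycle.
    """
    num_modules = len(p_matrix[0])
    bw = 0.0
    for j in range(num_modules):
        idle_prob = 1.0
        for row in p_matrix:            # probability that no processor hits module j
            idle_prob *= (1.0 - row[j])
        bw += 1.0 - idle_prob
    return bw

if __name__ == "__main__":
    m, p = 8, 4
    uniform = [[1.0 / m] * m for _ in range(p)]
    print(round(general_bandwidth(uniform), 4))        # 3.3105
    print(round(m * (1 - (1 - 1 / m) ** p), 4))        # equation (5) with s = p: same value
```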
Of course, the only problem is to find out all the p_ij's. For example, if we let all p_ij = 1/m, then equation (6) reduces to equation (5). Let q_ij = 1 - p_ij be the probability that the i-th processor will not reference the j-th module. The above equation can be rewritten as

B_w = m - Σ_{j=1}^{m} Π_{i=1}^{p} q_ij.

Sometimes it is easier to get q_ij than p_ij.

Now we can solve the multiprocessor bandwidth problem mentioned above. We state the problem and the solution in the following theorem. The only thing we have to do is to figure out the p_ij's and substitute them into equation (6).

Theorem: Assume we have p processors referencing m memory modules. Each processor generates s references per memory cycle, and the references have the seriality relationship stated in the previous theorem. Let p_{jk}^{(i)}(ℓ) be the probability that the i-th processor generates a reference to module k in the ℓ-th position with no reference to module j occurring before it, i.e., the probability of the event shown in Figure 14. Then the bandwidth equation for s > 1 will be

B_w = m - Σ_{j=1}^{m} Π_{i=1}^{p} q_ij^{(s)},

where q_ij^{(s)}, the probability that the i-th processor makes none of its s references to module j, is given by a recursive expression in terms of the p_{jk}^{(i)}(ℓ). The proof of this theorem and the calculation of the p_{jk}^{(i)}(ℓ) are given in Appendix A. Another derivation using Markov chains is also given there.

Figure 14. The event used in computing p_{jk}^{(i)}(ℓ): the sequence of references r_1, r_2, r_3, ... generated by the i-th processor, with no reference to module j occurring before the ℓ-th position.

This theorem shows the usefulness of equation (6). We can use it for deriving the memory bandwidth of a great variety of systems. We will also use equation (6) in our simulator. However, the real meaning of α is very vague. Every program has a different value of α. Unless we do a thorough analysis of program traces, we are in no position to say what the α value for a program should be. We cannot afford to do this in our study. Besides, when s is small, say 2 or 3, a moderate change of α really does not make too much difference in the result. Therefore, we prefer not to use the above recursive solution. Instead, we just assume no seriality between references, i.e.,

q_ij^{(s)} = (1 - p_ij)^s.

A good discussion of memory bandwidth can be found in [43] or [39].

2.2.2 The Simulator

Our simulator uses the so-called event-driven technique; that is, the whole simulation process is driven by a sequence of event times. An event time is the time that an event occurs, which could be the arrival of a job, the completion of a processing period or I/O operation, the departure of a job, etc. The virtual clock is advanced from the current time to the next event time. Every time we advance the clock to a new event time, we calculate all the statistics we want between this new event time and the previous event time, and update the system status. Then, we use this new status information to figure out the time that the next event will occur. This process keeps going until a certain number of jobs has been simulated. In a simulation where timing is the most important statistic, the event-driven technique is the most convenient and useful tool.

Figure 15 is the overall structure of the simulator, and Figure 16 is the flow chart of the main program. Obviously, the most important part is how to generate the next event time (the dotted box in Figure 16).

Figure 15. Overall Structure of Our Simulator: a main program coordinating the scheduler (manages the waiting queue), the scoreboard (keeps the status information), the memory manager (assigns memory to each job), the segmenter (partitions the CPU time), and the statistics collector (collects all the statistics wanted).
Figure 16. Flowchart of the Simulator: input the machine parameters and set up all the switches; generate (or input) the job stream; schedule arrived jobs into the waiting queue; input as many jobs as possible into the memory; partition the CPU time of each job just entered; record the partitioned segments on the scoreboard; calculate the instantaneous bandwidth and distribute it to the active processors; find the next event time; advance to the next event time and collect statistics; update the scoreboard; when a job is finished, release all the resources occupied by this job; when the run is complete, output the results and stop.

The input to the simulator is a sequence of jobs. Each job is a four-tuple that consists of the arrival time of this job, the CPU time required (in milliseconds), the number of I/O requests, and the memory space (in K bytes) required by this job. Usually, people call this kind of information a "workload." We use two kinds of workload in our analysis, namely, the real workload and the artificial workload. The real workload is obtained from the System Management Facilities (SMF) data of our IBM 360/75 system. The SMF routines store on tape a complete record of the processing information of all the jobs run on the 360/75 every day. We pull all of the above information off the SMF tapes to constitute the input workload of our simulator. On the other hand, we can use random number generators to generate an artificial job stream. The means and variances of the job parameters can be obtained by analyzing the real data we got from the SMF tapes, and these can be used by the random number generators to produce fake jobs. The real workload can reflect what really happened in a computer system, but the artificial workload is easier to modify. We will use both and compare their results.

When a job "arrives," that is, when the content of the virtual clock is equal to or greater than the job arrival time, it will be placed into the waiting queue according to some scheduling algorithm. The scheduling algorithm will greatly influence the system performance, especially the average turnaround time. In the next chapter, we will compare some non-preemptive scheduling algorithms.

The jobs in the waiting queue are then considered for entering the system according to their ordering. If the memory has enough room for the job under consideration, this job will join the system and start getting service. Of course, in a monoprogramming system this job also should have a free processor assigned to it. In the job scanning, we usually allow a certain distance of look-ahead. This means that if a job cannot be selected for service, we are allowed to go down the line and look at the next job. This scheme might improve the performance. This is particularly true for a short look-ahead distance, but for a long look-ahead distance we might get some negative results, since allowing a large look-ahead tends to reduce the effect of the scheduling algorithm. Our results in the next chapter indeed show this phenomenon. One negative side of the look-ahead scheme is that a job requiring a large amount of memory space might get blocked all the time, since smaller jobs might sneak in and never leave enough space for this big job. In order to avoid this problem, we adopt the following strategy. When a job first joins the waiting queue, we will attach a count to this job.
Later on, we will reduce this count by one e\/ery time we scan this job. When this count becomes zero and has not been served, we will stop looking ahead and force the system to accept this job eventually. In the 360/75 system, people use a similar method to accomplish this, namely, by gradually reducing the magic number attached to a job. Of course, the 360/75 also uses a more complicated method, viz. adjusting the job initiators to give large jobs more chance to get into the memory. When a job gets into the memory, we first partition its CPU time into as many pieces as the number of I/O requests required by this job in the following way. Let us assume the job requires I I/O requests. We gen- erate I exponentially distributed random numbers in (0,1), sum them together, divide the total CPU time by this sum to get a proportionality constant, and then multiply each original random number by this constant to become the 70 length of each piece. It is easy to show the sum of these normalized pieces is equal to the total CPU time. The reason we are doing this is that we do not know the length of the time period between two I/O requests. Although the SMF tapes do contain this information, it is difficult and time-consuming to get it from the tapes. During the simulation process, a job will go to the I/O stage after one segment of CPU time has been served. It will then do some I/O and return to process the next segment. Each I/O operation will be assumed to be nominally 42 ms long, which is of the same order as a disk operation on a 2314 disk unit. Of course, this parameter can be changed. As we said before, the virtual clock of the simulator advances from an event time to the next event time. The next event time depends on how fast the system operates, and this in turn depends on the memory band- width. The faster we can get data out of the memory, the faster the sys- tem can operate. The memory bandwidth is a function of several variables, for example, the processor-memory speed ratio, the number of active processors, the number of memory modules, and the memory allocation scheme. In the last section, we showed a general equation which will be used in our simula- tor. This equation is used in the dotted box of Figure 16. However, one thing we still have not mentioned, i.e., p.. in equation (6). p.- depends on what percentage of the program resides in this module and how we allocate memory to the program. In our simulator, we will assume the program is interleaved horizontally, i.e., across the memory modules. So, it is fair to assume p • is the fraction of the program residing in a certain module. 71 In other words, if a program is distributed in four memory modules, then all four p.'s will be assumed to be 1/4. Of course, some people might argue about this, but we think it is a reasonable assumption. At any time instant, how fast those processor can run is deter- mined by the "instantaneous" bandwidth. By instantaneous bandwidth, we mean the memory bandwidth of the current state, or before the state change. This is because the memory bandwidth is state-dependent, i.e., depends on how many processors are running. Once the instantaneous bandwidth has been figured out, we will distribute it to all the "active" processors. The share each processor will get is proportional to the contribution it makes toward the total bandwidth. This partial bandwidth can be viewed as the processing power the processor uses to execute a job. 
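Returning for a moment to the CPU-time partitioning step described a few paragraphs above, here is a minimal sketch (Python, illustrative only; the function name and the sample values are ours) of cutting a job's total CPU time into one exponentially shaped segment per I/O request so that the pieces sum exactly to the total:

```python
import random

def partition_cpu_time(total_cpu_ms, num_io_requests, rng=random):
    """Split total_cpu_ms into num_io_requests pieces whose lengths are proportional
    to exponential random draws, normalized so they sum to total_cpu_ms."""
    draws = [rng.expovariate(1.0) for _ in range(num_io_requests)]
    scale = total_cpu_ms / sum(draws)        # the proportionality constant
    return [d * scale for d in draws]

segments = partition_cpu_time(22680.0, 757)  # mean CPU time and I/O count from Table 3
assert abs(sum(segments) - 22680.0) < 1e-6   # the normalized pieces add up to the total
```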
We can then figure out the time that the next event will occur and the amount of work each processor has done between two event times. In general, the instantaneous bandwidths of two intervals will be different, because the number of active processors might be different.

Now, let us give a simple example to show how to calculate the memory bandwidth and compute the next event time. Assume we have a distributed system with eight memory modules (m=8) and a speed ratio of two (s=2). Also, assume we have three jobs in execution that require 10.5, 13.0, and 20.8 milliseconds of CPU time until their next I/O operations, and that the current time T equals 2145.32 (all these figures are arbitrarily chosen). Since we are using the distributed memory allocation scheme, all p_ij's will be equal to 1/8. First, we have to figure out the total instantaneous bandwidth. This can be obtained by using the following formula:

B_w = 8 - Σ_{j=1}^{8} Π_{i=1}^{3} q_ij^{(s)},   where q_ij^{(s)} = (1 - p_ij)^s.

Since q_ij^{(s)} = (7/8)², we get B_w = 4.41. If we assume another memory allocation scheme, e.g., the partitioned or mixed scheme, the calculation of the total instantaneous bandwidth is very similar, although it will be a little bit more complicated. In the next chapter, we will show the calculation for a partitioned system. Since we are using the distributed scheme, the total bandwidth will be equally distributed to all the jobs. Therefore, each processor will get a bandwidth of 1.47 to execute its job.

Now, we can compute the next event time by using all the information provided above. Apparently, the job with 10.5 milliseconds of work will be done and go to the I/O stage first. In other words, the next event time is the time instant this job stops processing and issues an I/O request. The other two jobs will still be in execution by their processors at that point, since all the jobs get the same amount of bandwidth and will progress at the same rate. So, the problem is to find out how long it will take to finish 10.5 milliseconds of work, given that the processor has a memory bandwidth of 1.47. We can solve this by manipulating the dimensions. Let w be the average number of memory references a processor will issue per millisecond and w' the number of memory cycles per millisecond. The dimension of memory bandwidth is memory references per memory cycle. So, the time to finish 10.5 ms of work is

ΔT = 10.5 ms × (w mr/ms) / (1.47 mr/mc × w' mc/ms) = 7.14 (w/w') ms.

If we assume a processor issues a memory reference every processor cycle, then w/w' is equal to the memory-processor speed ratio s. (In fact, we define the processor cycle time to be the average period of time a processor takes to issue a memory reference.) Hence, ΔT is equal to 7.14 × s = 14.28 ms. The next event time will be T + ΔT = 2145.32 + 14.28 = 2159.60 ms. In other words, the first job will stop processing and issue an I/O request at time instant 2159.60. Since we assume each I/O operation takes a constant amount of time, we can know the time this job will finish the I/O operation if it gets the I/O device it wants.

Of course, we must assume that nothing else will happen between the current time T and T + ΔT. For example, if a fourth job finishes an I/O transaction and resumes processing between these two time instants, the next event time will be the time instant that this fourth job resumes processing. (The clock-advance arithmetic for this example is sketched in code below.)
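A minimal numerical sketch of this clock-advance step (Python, purely illustrative; the variable names are ours), reproducing the figures in the example above:

```python
# Distributed allocation: three active jobs, m = 8 modules, speed ratio s = 2.
m, s, active_jobs = 8, 2, 3
p_ij = 1.0 / m                              # each job is spread over all modules
q_s = (1.0 - p_ij) ** s                     # no-seriality assumption from Section 2.2.1

bw_total = m * (1.0 - q_s ** active_jobs)   # equation (6): about 4.41
bw_per_job = bw_total / active_jobs         # equal shares under the distributed scheme: about 1.47

work_remaining_ms = [10.5, 13.0, 20.8]      # CPU time until each job's next I/O request
now = 2145.32

# One "CPU millisecond" of work needs s/b real milliseconds when the job's share is b.
delta_t = min(work_remaining_ms) * s / bw_per_job
print(round(bw_total, 2), round(bw_per_job, 2), round(delta_t, 2), round(now + delta_t, 2))
# -> 4.41 1.47 14.29 2159.61   (matches the worked example up to rounding; the other
#    jobs' remaining work is then reduced by delta_t * bw_per_job / s before repeating)
```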
In that case, instead of computing ΔT, we have to figure out how much work will have been done between the current time and the time instant the above event occurs. We then have to subtract this from the remaining processing time of each job and repeat the whole thing with four jobs.

We can see that the principle behind our simulator is quite simple. However, it is a very useful tool, and we use it to generate all our results. In the next chapter, we will show some interesting results we have obtained. Before we do that, we will define some of the measurements which will be used in the later discussion.

2.2.3 Definitions of System Measurements

In order to give a more succinct presentation of the simulation results, we will use the following definitions very often in the next chapter. As always, p, m, and r denote the numbers of processors, memory modules, and I/O devices in the system. The memory-processor speed ratio, i.e., the ratio of memory cycle time to processor cycle time, is denoted by s.

Ta will be used to denote the average turnaround time, i.e., the average amount of time a job spends in the system. Ta can be broken into two parts, namely, the average queueing time q and the average service time e. In other words, Ta = q + e. Here q is the average amount of time a job has to spend in the outside waiting queue, i.e., the time period from the moment the job arrives to the moment it enters the memory. In a multiprogrammed system, q is usually caused by insufficient memory space. In a monoprogrammed system, it could also be caused by the lack of a free processor. Sometimes, we might put a superscript on q to denote queueing time which occurs somewhere else. For example, q^io represents the queueing time which occurs in the I/O queue. On the other hand, e denotes the average amount of time a job spends in the memory, which can further be broken into processing time, I/O time, and some possible delays due to queueing for resources, e.g., q^io. We gave a graphical representation of these parameters earlier in Figure 7.

n will denote the average number of jobs in the whole system. Again, a superscript will be used to denote which part of the system we are talking about (for example, the average number of jobs in the outside waiting queue).

B_w is used to represent the memory bandwidth. A superscript will be used to indicate the memory allocation scheme in use. So, B_w^d, B_w^m, and B_w^p will denote the memory bandwidths for distributed, mixed, and partitioned systems respectively.

U_m and U_p will denote the utilizations of memory and processors. U_m is the average fraction of the memory which is occupied by jobs. We will explain later that there are two kinds of memory utilization in a partitioned system due to its unusual way of allocating memory. U_p, on the other hand, is the fraction of time a processor is busy executing a job.

Chapter 3
EXPERIMENTAL RESULTS

3.1 Results for Software Related Questions

We roughly described some interesting design problems in Section 1.4. In this chapter, we are going to present a lot of simulation results to answer these problems. Each problem is affected by a number of variables, and we will include the effects of as many of these variables as possible. Basically, we will follow the same order as that in Chapter 1. Therefore, we will start with the software related questions.

Before we proceed, a word of caution is in order.
In Sections 3.1.2 and 3.1.3, which deal with monoprogramming versus multiprogramming and the memory allocation schemes, the reader will quickly come to the conclusion that multiprogramming and the distributed memory allocation scheme are superior in terms of performance. This is in fact true. However, the results in these two sections assume the existence of a complete processor-to-memory connection, i.e., any processor can access any memory module subject only to possible momentary delays due to memory conflicts. As is well known, such a connection network gets very expensive as the system grows and is difficult to expand. The effectiveness of multiprogramming and the distributed scheme depends to a large extent on this complete but expensive connection capability. Thus, we are interested primarily in the degree of degradation due to monoprogramming, whole-module memory allocation, the poorer bandwidth resulting from the partitioned and mixed schemes, and the various factors which affect this degradation. In later sections of this chapter we will present similar results using partial connection networks. We will see the advantages of multiprogramming and the distributed scheme diminish considerably in those cases.

Before we talk about the first software question, we should first describe some properties of the workload used in the simulation, since the workload will greatly influence system performance. Although we are going to discuss how a change of the workload will affect the result, we feel that a description is needed here in order to give the reader a better understanding of the whole discussion.

3.1.1 The Workload

As we said in Chapter 2, our workload is a sequence of four-tuples. Each four-tuple consists of four pieces of information about a job: the arrival time, the CPU time, the memory requirement, and the number of I/O requests. The original source of this information is the IBM SMF tape. Table 3 displays some statistical data on these parameters, which were obtained by analyzing 1300 real jobs run on the University's IBM 360/75 system. Note that we show the mean and the standard deviation of the job interarrival time instead of the job arrival time. This is because the job arrival time is an absolute measurement which does not show the distance between two arrivals unless we lay out all the arrival times. On the other hand, the job interarrival time is a relative measurement which can give us some idea of how fast the jobs arrive.

    Data                 Mean     Std. Dev.   Unit
    Interarrival time     6.87       6.96     sec.
    CPU time             22.68      22.40     sec.
    Job size              117         80      K bytes
    I/O requests           757        739     No./job

    Table 3. Some Statistical Data of the Workload.

The data in Table 3 are obtained directly from the SMF tape. Sometimes we will scale these data in order to load our system properly. For example, if we want to see the effect of doubling the system load, we can achieve this by reducing the interarrival times by one half, thus making it appear as though the jobs are arriving twice as fast. Scaling will be used when we are studying the effect of various workloads.

The real workload, of course, reflects what really happens in a computer system. However, it is very difficult to modify or enlarge. For example, if we want a job stream which is twice as long as what we have now, then we will have to get the second half from the SMF tape and append it to the first half. If we are unlucky, the second half might have completely different characteristics from the first half.
For example, if the next day is the due date of a CS101 machine problem, the number of small jobs submitted to the system is suddenly doubled or tripled, which greatly perturbs the characteristics of the workload. This is a very undesirable thing in doing simulation. In addition, it is very difficult to modify some job parameters, e.g., the standard deviations and the distributions. Most of all, it only represents the workload on our 360/75 system, and we would like to see the result of a more general job stream. Therefore, we will use the other method we mentioned in Chapter 2, i.e., producing an artificial workload by using random number generators.

To generate an artificial workload, we have to know the distributions as well as the means and the standard deviations of all four parameters. Of course, we could arbitrarily make up this information. However, in order to maintain some reality, we will obtain it by analyzing the real workload. Our analysis shows that the distributions of the interarrival time, the CPU time, and the number of I/O requests are approximately exponential, with the means and standard deviations shown in Table 3. Therefore, we can easily reproduce them by using the following equation:

y = m·log_e(1/(1-x)),

where m is the mean, x is a uniformly distributed random number in (0,1), and the resulting y is an exponentially distributed random number. The proof can be found in [44].

However, the distribution of the job size is not so simple. Figure 17 shows the density curve of the job size. It contains two bumps, one at around 20K bytes and the other at around 120K bytes. Of course, this depends heavily on the system. The analysis by Chandy, et al. [33] also shows the same phenomenon. It is very difficult to write down an equation and compute the inverse function, as we did above. We have to do this numerically, i.e., get the cumulative probability function F(y), generate a uniformly distributed random number x in (0,1), and then compute the inverse function y = F^{-1}(x). In fact, this is the basic method of generating a generally distributed random number. However, it is very time-consuming, since it involves a searching procedure to determine the interval x is in, and perhaps an interpolation if we want a more accurate value. But this is all we can do to handle a general distribution. We will use this method for generating the job size.

After knowing these distributions and methods, we can generate four sequences of random numbers to form the artificial workload. This method is very flexible, since we can produce a workload with any characteristics we want. Most of the simulations will use the artificial workload.

Figure 17. The Density of the Job Size (number of jobs versus job size in K bytes, 20 to 300; statistics of 1300 jobs).

Of course, one disadvantage of this method is that these parameters are now completely independent of each other. In the real job stream, there might be some correlation among them; for example, a job requiring very large space might use a long CPU time and do a large amount of I/O operations. So, there is some difference between the artificial workload and the real workload. These large jobs will, of course, seriously degrade the average turnaround time. If one of these jobs gets into the memory, it will occupy a large portion of the memory for a long time. This will then block the jobs in the waiting queue from being executed until it finally gets done and releases the memory.
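A minimal sketch of the workload-generation procedure just described (Python, illustrative only; the helper names and the toy size table are ours, standing in for the empirical F(y) of Figure 17):

```python
import bisect
import math
import random

def exponential(mean, rng=random):
    """y = mean * ln(1/(1-x)), with x uniform on (0,1)."""
    x = rng.random()
    return mean * math.log(1.0 / (1.0 - x))

def inverse_cdf_sample(sizes, cdf, rng=random):
    """Job size by table lookup: find the interval of F containing x, then interpolate."""
    x = rng.random()
    i = bisect.bisect_left(cdf, x)
    if i == 0:
        return sizes[0]
    x0, x1 = cdf[i - 1], cdf[i]
    s0, s1 = sizes[i - 1], sizes[i]
    return s0 + (s1 - s0) * (x - x0) / (x1 - x0)   # linear interpolation

if __name__ == "__main__":
    interarrival = exponential(6.87)       # means taken from Table 3
    cpu_time     = exponential(22.68)
    io_requests  = exponential(757)
    # a made-up, roughly bimodal cumulative table standing in for the measured F(y)
    sizes = [10, 20, 40, 80, 120, 160, 240, 300]           # K bytes
    cdf   = [0.05, 0.30, 0.45, 0.55, 0.80, 0.90, 0.98, 1.0]
    job_size = inverse_cdf_sample(sizes, cdf)
    print(interarrival, cpu_time, io_requests, job_size)
```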
All of the waiting jobs therefore suffer a very long delay, which causes a significant increase in the total average turnaround time.

Figure 18 shows some simulation results using both real and artificial workloads, where we use a system with 1024K bytes of main memory, sixteen memory modules, a constant I/O time of 42 ms, a memory-processor speed ratio of 4, monoprogramming, the shortest-job-first scheduling algorithm, the distributed scheme of memory allocation, a full connection switching network, and 800 jobs. All these parameters were explained in the last chapter and are indicated in the figure. As we can see from this figure, there is a significant difference between the two curves. After some analysis, we find that this difference is indeed caused by a few very large jobs in the real job stream. Each of these jobs claims a large amount of space and requires a large CPU time, so they contribute a significant amount of queueing delay to the final average turnaround time. However, this does not happen in the artificial workload.

Figure 18. A Comparison of the Turnaround Times of the Real and Artificial Workloads (Ta in seconds; M=1024, m=16, s=4, 42 ms I/O time, monoprogramming, SJF, distributed allocation, 800 jobs).

If we delete these big jobs and run the simulation again, we get a result which is just a little bit higher than the result using the artificial workload. Therefore, we believe that the artificial workload is a pretty good approximation of the real workload if we ignore a few big jobs which occasionally occur in the job stream.

However, we are not saying that we will ignore the existence of these big jobs. Actually, this kind of job always exists in a typical university workload. Most of them are so-called number-crunching jobs, since they require a large number of floating-point operations. A number-crunching job needs a processor with a fast floating-point arithmetic unit for fast, efficient execution. Most of the minicomputers and microprocessors, however, do not provide floating-point hardware. The floating-point operations are done by software or microprogrammed subroutines. So, the number-crunching jobs are not suitable for these small machines. Although a few minicomputers which came out recently do have floating-point hardware, e.g., the PDP 11/70, they still cannot provide a fast and efficient execution for this type of job. The best way to handle these big jobs is to use a big machine like the CDC 7600 or Amdahl 470. These machines all have fast pipelined floating-point arithmetic units, which can execute a number-crunching job very quickly and efficiently. This is why these big machines are important in the computation world.

Although a minicomputer or a microprocessor is not appropriate for handling these big jobs, we should not consider this a fatal disadvantage of building multiprocessor systems out of minicomputers or microprocessors. We can easily solve, or actually we should say "avoid," this problem by the following method. Whenever a big job arrives, we can send it to a big machine elsewhere via a computer network, which can give better service to this job. This is why we use the artificial workload, since it approximates the real workload without those big jobs. In fact, it does not matter very much which workload we use in our simulation, since we are doing comparison work or finding the effect of a certain parameter.
However, the artificial workload seems to be more convenient for us to use, and so we will use it in the following discussion. The real job stream will only be used to provide the necessary information, e.g., mean, variance, and distribution, in generating the artificial job stream. One thing we would like to point out here is that the absolute value of a certain measurement, for example a 150 sec. average turnaround time, in general does not have too much meaning alone. Only the relative magnitude or the percentage of difference can indicate the goodness of one system over another. We will try to use percentages in our presentation.

3.1.2 Monoprogramming versus Multiprogramming

Our first problem is to compare monoprogramming and multiprogramming. As we said in Chapter 1, we are more interested in monoprocessing systems, i.e., systems where each job can only be run on one processor, as in the PRIME system. Hence, we are not talking about ILLIAC-IV type machines. Using the definition of Flynn [45], we are dealing with MIMD type machines, not SIMD type machines. Again, we want to remind the reader that all the results in Section 3.1 assume a full connection.

By monoprogramming, as was defined in Chapter 1, we mean that each processor is dedicated to a job once it is assigned to this job, and can execute only this job until it is finished. When this job is doing I/O, the processor remains idle. In other words, no overlapping of processing and I/O will take place. Due to this rule, we only allow at most as many jobs as the number of processors in the system. On the other hand, in a multiprogramming system, we will pack as many jobs as possible into the memory and let these jobs share all the processors. Once a processor becomes free, it will try to obtain a job from the processor queue and execute it. No processor will be left idle intentionally if there is a job ready to be executed.

Of course, multiprogramming can result in higher processor utilization and memory utilization, which means higher system throughput. This in general implies we can get a shorter average job turnaround time (Ta). Figure 19 shows Ta versus m curves for both monoprogramming and multiprogramming. For p=4, the gap between the two curves is very big: monoprogramming is about 60% (182 vs 114) higher than multiprogramming. However, when p=6, the gap closes to about 13% (102 vs 90).

This is not surprising, since when we increase the number of processors we also increase the maximum number of jobs allowed in the memory under monoprogramming. Apparently, for six processors and a total memory size of 1024K bytes, monoprogramming is already competitive with multiprogramming. Figure 20 shows how the monoprogramming curve approaches the multiprogramming curve when we increase the number of processors. As we can see, for small p, multiprogramming does show a superiority over monoprogramming.

Figure 19. Comparison of Monoprogramming and Multiprogramming (Ta in seconds versus m = 16, 24, 32; M=1024, SJF, 42 ms, distributed allocation, s=4, r=4, full connection, 800 jobs; curves for p=4 and p=6, monoprogramming and multiprogramming).

Figure 20. The Effect of Increasing the Number of Processors (Ta in seconds; M=1024, SJF, m=16, distributed allocation, 42 ms, 800 jobs, s=4, r=4, full connection; curves for multiprogramming and multiprogramming with 10% overhead).

But with a moderate number of processors, e.g., 6 in this case, the two results are more or less the same already.
The reason is rather simple: for the job size distribution we use, the main memory can only contain about six jobs most of the time. Hence, there is really no reason we should use multiprogramming if we can afford "enough" processors. Of course, "enough" is determined by the total main memory size and the distribution of the job size.

Besides, we have not taken the software overhead of multiprogramming into account. By overhead here, we mean the extra work the multiprogrammed operating system must do, e.g., updating the outgoing user's file, restoring the incoming user's status information, etc. We do not know exactly how high this overhead will be. However, in most computer systems a large portion of CPU time is spent in the operating system. In other words, the processors will have to do more work in a multiprogrammed system than in a monoprogrammed system. If we assume this overhead increases the job CPU time by 10%, then the Ta curve for multiprogramming moves up to the dotted curve shown in Figure 20. Now monoprogramming wins for p > 6.

The effect of this 10% overhead is different for each p. It causes a larger increment of Ta for smaller p. For larger p, say 8, the increment is about 10%, and for smaller p, say four or less, the increment is more than 20%. Apparently, the overhead is very important when the number of processors is small. This phenomenon can be explained by Table 4, where for each p value we show the degree of multiprogramming and the queueing delay due to no available processor. The degree of multiprogramming is defined to be the average number of jobs each processor has to take care of at the same time. In other words, on the average there will be p × (degree of multiprogramming) jobs in the memory sharing p processors.

    p    Degree of           Queueing Delay Due     Queueing Delay /
         Multiprogramming    to No Processor        Average Service Time (%)
    2        4.22                  85                     57.7
    3        2.61                  41                     39.0
    4        1.71                  20                     23.3
    5        1.35                  13                     17.3
    6        1.17                  10                     15.4
    7        1.08                   9                     14.3
    8        1.03                   8                     11.2

    Table 4. Degree of Multiprogramming and Queueing Delay Due to No Available Processor.

From Table 4, we can see the degree of multiprogramming is high when p is small. This means a lot of jobs have to compete for a few processors, and the queueing delay caused by waiting for a free processor will be very high. In the second and third columns, we show the queueing delay and its percentage of the total service time. We can see that the queueing delay of a two-processor system occupies almost 60% of the total service time and is more than ten times that of an eight-processor system. If we add an overhead to each job, the queueing delay will grow as a function of the degree of multiprogramming, since each job will cause a certain amount of delay (the increment of the CPU time) to every job waiting for a processor. This means that the overhead will result in a longer delay for smaller p. Therefore, the average turnaround time will increase more for smaller p. We will further explain this in Section 3.1.5.

Before we add the overhead, we can see that multiprogramming wins by a wide margin when p is small. However, the difference will be reduced significantly if we just add a moderate amount of overhead. The increment of the average turnaround time is very sensitive to the overhead, especially for small p. Therefore, the superiority of multiprogramming in that region will disappear rather quickly if the overhead goes up.
If we insist on using multiprogramming, the software design will be extremely important. A bad design can easily degrade the performance seriously. From Figure 20, we can see another interesting point about these curves. If we do not consider the overhead, the multiprogramming curve is only two processors to the left of the monoprogramming curve. If we 92 use two more processors in the monoprogrammed system, the monoprogramming curve will be shifted to the left by two processors. Thus, two curves will almost overlap. In fact, in this case, the monoprogramming curve will be slightly under the multiprogramming curve. This means that if the difference of the software costs is less than the cost of two processors, then we should use monoprogramming with two more processors. Although we do not have exact figures of these costs so we can draw any conclusion, this ap- parently is the case in the current trend since the software cost is soaring up rapidly and the hardware cost is going down significantly ewery year. If we do take the 10% overhead into account, we can see the dif- ference is only about one processor. Suppose the overhead is even higher, the monoprogramming curve might become completely superior to the multi- programming curve. However, we are not completely against multiprogramming. In some cases, multiprogramming is still a better solution for system design. For example, if we have a job mix with all small and I/0-bound jobs, then multiprogramming might give us better results. By I/0-bound, we mean a job which spends most of its life time in doing I/O and relatively small amount of time in processor. In other words, an I/0-bound job will have an I/O time which is, say, several times longer than its CPU time. Most COBOL programs, for example, are I/0-bound under this definition. The job mix we are using, however, is not I/0-bound. This can be seen from the mean values we show in Table 3. If we assume an I/O operation takes 42 ms, then on the average, a job will spend about 30 seconds in doing I/O, which is of the same order as the average CPU time. Of course, the real average CPU time will not be the same as that shown 93 in Table 3, since it will be affected by several factors, e.g., the memory allocation scheme (to be discussed in the next section), the degree of inter- leaving, the memory and processor speeds, etc. However, our simulation results show that it ranges from 30 to 65 seconds. So, the CPU time has been increased by a factor of 1.35 to 2.9. This is caused by several factors. We assume a processor cycle (pc) to be the average amount of time between two successive memory references issued by a processor. In other words, a processor will generate one memory reference in one processor cycle. Since s is defined to be the ratio between the memory speed and the proces- sor speed, a memory cycle (mc) will be s * pc. Let us also assme that one CPU second is equal to w processor cycles, or w memory references. So, a program needs 10w memory references if its CPU time requirement is 10 seconds. If the average memory bandwidth a processor can get is b, it will take w/b memory cycles to satisfy one CPU second of work. The average memory bandwidth b in general will be less than s due to memory interference. Since pc = 1/w second by definition, w/b memory cycles is equivalent to s/b seconds. Since b is bound by s, s/b will always be greater than 1. That is, it always takes more than one second of time to complete one CPU second of work. 
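As a small numerical illustration of this stretch factor (our own arithmetic, not figures from the thesis): with the speed ratio s = 4 used in most of the experiments, average per-job bandwidths b of roughly 2.96 and 1.38 references per memory cycle give

    t_real = (s / b) · t_CPU,   so   s/b = 4/2.96 ≈ 1.35   and   s/b = 4/1.38 ≈ 2.9,

which brackets exactly the range by which the average CPU time is observed to grow.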
This explains why the average processing time ranges from 30 to 65 seconds, instead of being the 22.68 seconds shown in Table 3. Therefore, the job mix we are using is not I/O-bound, since on the average a job will spend roughly the same or more time in execution than in doing I/O.

If we use a job mix with all small and I/O-bound jobs, the results will be quite different from those of Figure 20. Table 5 shows the result of using a new job mix, where we increase the number of I/O requests of each job by 50% and reduce the CPU time and the job size by 25% each.

    p    Multiprogramming    Multiprogramming        Monoprogramming
                             (with 10% overhead)
    4          90                   94                     671
    5          86                   90                     204
    6          83                   87                     132
    7          83                   88                     107
    8          84                   88                      90

    Table 5. The Results (Ta in seconds) of Using the New Job Mix.

The average I/O time is now about twice as large as the average CPU time (46 to 24). Therefore, the new job mix is I/O-bound with smaller job sizes. We only show the results for p=4 to 8. As we can see, the gap between multiprogramming and monoprogramming is rather large even for p as large as 7. If we add in the 10% overhead as we did earlier, multiprogramming (middle column of Table 5) still wins by a slight margin for p=8. Apparently, multiprogramming will yield better results for a job mix which contains small and I/O-bound jobs. Therefore, which strategy we should use depends on the job mix we are dealing with. Our conclusions in this report are all based on the job mix we described in the early part of this chapter, which is a typical workload on a university batch system.

Table 6 shows the memory utilization and processor utilization of both monoprogramming and multiprogramming for the system described in Figure 20. The memory utilization of monoprogramming goes up as p increases, and the memory utilization of multiprogramming essentially remains the same. This is what we would expect. The processor utilizations, on the other hand, both decline as p increases. This is because U_p is the utilization of one processor; in other words, it is the "normalized" utilization. If we multiply U_p by p, we can see that the results are increasing. This shows the same trend as the Ta curve in Figure 22. Therefore, we can think of U_p·p as a representation of the work being done by the system in a certain time unit. In fact, the processor utilization is strongly related to the system performance. If U_p·p is higher, the processors can finish more work in one unit of time, and the average turnaround time will be lower.

    p    Monoprogramming                  Multiprogramming
         U_m    U_p    U_p·p              U_m    U_p    U_p·p
    4    .44    .50    2.00               .69    .65    2.60
    6    .59    .44    2.64               .64    .45    2.70
    8    .61    .34    2.72               .62    .32    2.72

    Table 6. Hardware Utilizations for Monoprogramming and Multiprogramming.

Therefore, Ta can indirectly tell us what the relative processor utilization should be. In the following discussion, we will not show the utilizations except on a few occasions, since they behave like those in Table 6.

The other measurement that can also indicate the work being done by the system is the total memory bandwidth B_w^T. The total memory bandwidth is the memory bandwidth generated by all the active processors, i.e., processors that are executing jobs. Like U_p·p, the higher the total memory bandwidth is, the faster the processors will operate. Figure 21 shows B_w^T versus p curves for the system of Figure 20. As we can see, the total memory bandwidths of the monoprogrammed and multiprogrammed systems both go up as we increase the number of processors. The multiprogrammed system has a higher total memory bandwidth.
However, when p is large, say 8, the total memory bandwidths of both systems are very close to each other. Recalling Figure 20, the average turnaround times of both systems also become very close to each other as p gets large.

Figure 21. The Total Memory Bandwidth for the System of Figure 20 (B_w^T versus p, monoprogramming and multiprogramming).

3.1.3 Memory Allocation Schemes

As we described in Figure 5 of Chapter 1, we are interested in three kinds of memory allocation scheme: the partitioned scheme (Figure 5-a), the distributed scheme (Figure 5-b), and the mixed scheme (Figure 5-c). We briefly explained there how they work and their advantages and disadvantages (Table 2). In this section, we are going to investigate their performance and look at some problems related to these schemes.

As we shall see, the memory allocation scheme affects performance in three different ways. First, the space efficiency of an allocation will affect the number of jobs which can be in the memory at any time. The partitioned scheme tends to waste memory, since memory can only be allocated in whole modules. The other two allocation schemes do not waste any memory in this way. Hence, the partitioned scheme has less potential to pack jobs into the memory than the other two schemes. Second, the allocation scheme affects the memory bandwidth available to any given job. If, for example, a job requires a small amount of memory (less than one module), then under the partitioned scheme or the mixed scheme only one module will be allocated to this job and the memory bandwidth is limited to 1. On the other hand, distributed allocation causes the job to be spread across all memory modules, thus allowing a higher potential bandwidth (although this bandwidth is subject to interference from other jobs in the memory). Suppose, however, that the job requires four memory modules. It may well get worse bandwidth using the distributed scheme than using the partitioned scheme, since the latter is not subject to interference from the other jobs. Finally, the third effect of allocation on performance has to do with the classical problem of address interleaving, which affects the ability of a job to utilize the potential memory bandwidth. This question has been discussed extensively in the literature [39,42,46,47,48]. We see no way of providing definitive answers in this area short of using actual address streams. But this would lead to results of questionable generality and would be prohibitively expensive. We will, however, establish some bounds on the possible effects of good versus bad address interleaving.

The factors above interact in complex and frequently unpredictable ways. We will attempt to isolate the effects of each factor as much as possible. We begin by analyzing the effect of memory waste on overall performance.

Figure 22 shows the curves of the average turnaround time versus the number of memory modules for all three schemes, and Figure 23 shows their corresponding total memory bandwidth curves. Notice that the total amount of memory does not change; m is simply the number of modules into which this total is divided. The solid lines represent the monoprogramming curves and the dotted lines represent the multiprogramming curves. Only five curves are shown in both figures; the two curves for the distributed scheme are very close to each other and only one is shown. As we can see, m has a great influence on both the partitioned and mixed schemes, especially from 8 to 16.
For m=8, the turnaround time of the partitioned scheme is almost ten times as large as that of the distributed scheme. This is very easy to understand: with so few memory modules, a large portion of the memory can easily be wasted and not too many jobs can be in the memory at the same time. For example, a job requiring 130K bytes of memory will occupy two modules, since the module size is 128K bytes, and almost one-eighth of the useful memory has been wasted by this job. Therefore, a job in general has to spend a lot of time in the waiting queue before enough memory modules are finally available for it to enter the memory.

Figure 22. Average Turnaround Times of Three Memory Allocation Schemes (Ta in seconds versus m = 8, 16, 24, 32; M=1024, p=8, SJF, 42 ms, s=4, r=4, full connection, 800 jobs; partitioned, mixed, and distributed, monoprogramming and multiprogramming).

Figure 23. Total Memory Bandwidths of Three Memory Allocation Schemes (for the system in Figure 22).

    Scheme                    m=8     m=16    m=24    m=32
    Partitioned    B_w^T      4.47    6.43    6.51    6.52
                   b_w        1.15    1.54    1.75    1.97
                   U_m        0.58    0.67    0.62    0.60
                   n          5.5     6.8     6.4     6.2
                   q          1343    112     41      25
    Mixed          B_w^T      5.57    6.50    6.54    6.56
                   b_w        1.10    1.53    1.76    1.98
                   U_m        0.79    0.68    0.63    0.60
                   n          7.6     6.9     6.4     6.1
                   q          445     41      26      19
    Distributed    B_w^T      6.50    6.64    6.78    6.82
                   b_w        1.46    2.57    2.81    3.18
                   U_m        0.80    0.62    0.60    0.57
                   n          7.2     5.5     5.3     5.1
                   q          58      15      12      10

    Table 7. The Total Memory Bandwidth (B_w^T), Average Job Bandwidth (b_w), Memory Utilization (U_m), Average Number of Jobs in Memory (n), and Average Queueing Time (q) of Three Memory Allocation Schemes (for the monoprogrammed system of Figure 22).

Table 7 shows the total memory bandwidth (B_w^T), the average job bandwidth (b_w), the memory utilization (U_m), the average number of jobs in memory (n), and the queueing time of each job (q) for these three allocation schemes. The average job bandwidth is the average memory bandwidth each job can get while it is being executed. It depends on how we allocate the memory, how we interleave the program, and the speed ratio s of memory and processor. We described how we calculate the memory bandwidth in the last chapter. Since a processor can only generate up to s memory
The job bandwidth of the distribured scheme, on the other hand, is much better than the job bandwidths of the other two schemes. The distributed scheme will spread out every job across the whole memory, thus providing each processor the potential of referencing every memory module. Although all the jobs are sharing the memory and the mutual interference is large, a large bandwidth still can be obtained through the large degree of interleaving. Later in this section, we will explain why the distributed system can produce a higher bandwidth than the mixed system by using a numerical example. 105 The memory utilization in Table 7 is defined to be the per- centage of the memory that is actually used by jobs. For the mixed scheme and the distributed scheme, there will be no problem since a job will be allocated exactly the amount of memory it asks for. As for the par- titioned scheme, it is a little bit more complicated since the memory is allocated by the module, and in general a job will get more memory than it really needs. So, there are two different memory utilizations we should distinguish. One is the utilization we defined above, which is to calculate the percentage of the memory really used by the jobs. The other one is the percentage of the memory that is occupied by the jobs. Of course, the latter is larger than the former since some memory will be occupied but not be used. In other words, some memory is wasted under the partitioned scheme. In order to distinguish these two types of memory utilization, we will call the first one "word memory utilization" and the second one "module memory utilization." Of course, both types of memory utiliza- tion will be the same in a mixed system or a distributed system, i.e., the word memory utilization. The utilizations we show in Table 7 are all word memory utilizations. Most of the time we will just call this memory utiliza- tion for short. One interesting thing is that the difference of these two memory utilizations is the percentage of the memory which has been wasted, i.e., occupied but unused. This is yery easy to understand. We will show some results of the memory waste of the partitioned scheme later. As we can see from Table 7, for m=8 and under monoprogramming, the partitioned scheme has a total memory bandwidth of 4.47, a job bandwidth 106 of 1.15, a (word) memory utilization of only 58%, and averages 5.5 jobs in the memory, which results in an average queueing time of 1343 seconds! Under the same condition, the distributed system has a total bandwidth of 6.50, a job bandwidth of 1.46, a memory utilization of 80%, averages 7.2 jobs, and has an average queueing time of only 58 seconds. Of course, one way to improve the performance of the partitioned system is to increase the number of memory modules and to decrease the size of each module. This can reduce the amount of wasted memory, since on the average each job will waste one-half of a module (see the proof in Chapter 2). Thus, the probability that a job gets blocked due to insufficient memory will be reduced. In Table 8, we show the word utilization, the module utilization, and the memory waste of the partitioned system in Figure 22. As we can see, when we double the number of modules from 8 to 16, the word memory utilization of the partitioned system increases to 67%, and the module memory utilization drops a little down to 92%. Meanwhile, the memory waste has been reduced from 37% to 25%. This is why the average queueing time reduces sharply to 112 seconds (a gain of 12)! 
If we further increase the number of modules to 24, the memory waste decreases to 14%, and the average queueing time drops to 41 seconds. Apparently, the partitioned system is very sensitive to the number of modules. The main reason is, of course, that the memory waste reduces the system's ability to accept jobs. Therefore, it is very important to provide enough memory modules in the partitioned system.

                                 m=8     m=16    m=24    m=32
    Word Memory Utilization      0.58    0.67    0.62    0.60
    Module Memory Utilization    0.95    0.92    0.76    0.71
    Memory Waste                 0.37    0.25    0.14    0.11

Table 8. Memory Waste for the Partitioned Scheme

Actually, the memory utilization and the average number of jobs in memory do not grow as the system performance improves. On the contrary, all the memory utilizations and the average numbers of jobs decrease when we increase m, except for the case we just mentioned. And surprisingly, the distributed scheme has the smallest values of U_m and n among these schemes, yet it has the best turnaround time. This means that a higher utilization (as we define it) does not necessarily imply a better throughput.

In fact, U_m and n should decrease as the system throughput increases, since if the arrival rate is fixed, the faster the system operates, the faster the jobs leave, and the emptier the system will be.* Especially in a distributed system, the fewer jobs in the memory, the less memory contention each job will suffer, and the higher the bandwidth each processor will get to execute a job. The system throughput (the memory bandwidth) goes up when we increase the number of memory modules. This explains why U_m and n decrease as m increases.

The distributed scheme, however, does not have this memory utilization advantage over the mixed scheme, since both the distributed and mixed schemes can fully utilize the memory. This was shown in Figure 5. In the first column of Table 7, the mixed scheme indeed shows a memory utilization (79%) and an average number of jobs in memory (7.6) comparable to those of the distributed scheme. Despite this, the distributed scheme still yields a better turnaround time for every m value. Apparently, the distributed scheme can produce a higher bandwidth than the mixed scheme can, if both are given the same set of jobs in the memory. The difference must come from the degree of interleaving a scheme provides to each job, since this is the only difference between these two schemes.

*This can be explained by using Little's Theorem, i.e., n = lambda * x, where lambda is the job arrival rate and x is the average turnaround time.

Let us look at an example which can explain why the distributed scheme generates a higher memory bandwidth than the mixed scheme does. Assume we have a memory system of 8 modules and four jobs of sizes 1 2/3 modules, 2 1/2 modules, 1 1/3 modules, and 2 modules respectively. These jobs are stored in the memory as shown in Figure 24. The numbers shown in Figure 24-a are the fractions of these jobs in each individual module. We assume they are the reference probabilities we need in the general bandwidth equation, i.e., Equation (6) in the last chapter. Of course, for the distributed system shown in Figure 24-b, all the reference probabilities are 1/8. Let us also assume that the references generated by a processor are all independent, as we do in our simulation. In order to mimic the real operation, we will in addition assume job a is doing I/O and hence does not contribute to the total bandwidth.

Figure 24. Storage of the Four Jobs under (a) the Mixed Scheme and (b) the Distributed Scheme.
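The hand calculation that follows can be checked with a short script. This is a minimal sketch rather than the thesis simulator, and the per-module fractions used for the mixed case are our reading of Figure 24 (job a split 3/5, 2/5 over modules 1-2; job b split 2/5, 2/5, 1/5 over modules 3-5; job c split 1/4, 3/4 over modules 5-6; job d split 1/2, 1/2 over modules 7-8), so they should be treated as an assumption.

    # Sketch of the general bandwidth equation (Equation (6), Chapter 2): for each
    # module k, the probability that at least one active job references it in a
    # memory cycle is 1 - prod_j (1 - p_jk)^s, where p_jk is the probability that
    # a reference from job j goes to module k and s is the speed ratio.

    def bandwidth(layout, active, s, num_modules):
        total = 0.0
        for k in range(num_modules):
            miss = 1.0
            for job, fractions in layout.items():
                if job in active:
                    miss *= (1.0 - fractions.get(k, 0.0)) ** s
            total += 1.0 - miss
        return total

    s, m = 4, 8
    mixed = {                                  # assumed Figure 24-a layout
        "a": {0: 3/5, 1: 2/5},                 # 1 2/3 modules
        "b": {2: 2/5, 3: 2/5, 4: 1/5},         # 2 1/2 modules
        "c": {4: 1/4, 5: 3/4},                 # 1 1/3 modules
        "d": {6: 1/2, 7: 1/2},                 # 2 modules
    }
    dist = {j: {k: 1/m for k in range(m)} for j in mixed}   # Figure 24-b

    active = {"b", "c", "d"}                   # job a is doing I/O
    print(bandwidth(dist, active, s, m))       # ~6.39
    print(bandwidth(mixed, active, s, m))      # ~5.48
    print(bandwidth(dist, set(mixed), s, m),   # ~7.06 when all four jobs are active
          bandwidth(mixed, set(mixed), s, m))  # ~7.32

The printed values reproduce the numbers worked out by hand below.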
Now, for the distributed system, the bandwidth can be calculated as follows:

    B_w^D = m [1 - (1 - 1/m)^(p*s)]
          = 8 [1 - (1 - 1/8)^12]
          = 8 [1 - (7/8)^12]
          = 6.389

where s=4 is the memory-processor speed ratio we assumed in Figure 22 and p=3 is the number of active jobs. As we can see, s contributes a lot in the above calculation, since it raises the power of the second term in the parentheses.

As for the mixed system, the bandwidth can be calculated by first finding the q_jk values, where q_jk^(s) = (1 - p_jk)^s is the probability that job j does not reference module k during a memory cycle. Here are the numerical values:

    q_11^(4), q_12^(4): not needed, since job a (modules 1 and 2) is doing I/O
    q_23^(4) = q_24^(4) = (1 - 2/5)^4 = 0.1296
    q_25^(4) = (1 - 1/5)^4 = 0.4096
    q_35^(4) = (1 - 1/4)^4 = 0.3164
    q_36^(4) = (1 - 3/4)^4 = 0.0039
    q_47^(4) = q_48^(4) = (1 - 1/2)^4 = 0.0625

Using the same equation, module by module, we get

    B_w^M = 0 + 0 + 0.8704 + 0.8704 + 0.8704 + 0.9961 + 0.9375 + 0.9375 = 5.482

Comparing these two results, we can see the distributed scheme produces 16.5% more bandwidth than the mixed scheme.

In fact, if all four jobs are active, i.e., all four processors are accessing the memory, the mixed scheme can produce a total bandwidth of 7.321, and the distributed scheme only produces 7.055. However, the probability that all jobs are active is rather small, especially if a job spends a significant amount of time doing I/O, as in the workload we are using. If more than one job is doing I/O, the distributed system opens an even larger margin over the mixed system, since more modules will be idle in the mixed system. Therefore, the distributed system wins by a large margin most of the time. This explains why the turnaround time of the distributed system is the lowest among these schemes.

In summary, the performance difference between the partitioned system and the mixed system is caused by the job packing, and that between the mixed system and the distributed system is caused by the job bandwidth. This has been carefully explained above and can be seen in Table 7.

Recalling Figure 22, we can see all the curves pull together as m gets larger. For m=32, the monoprogramming curve of the partitioned system is only 35% (28/80) higher than the multiprogramming curve of the distributed system. This shows that the partitioned system is comparable to the distributed system if we can afford a large number of modules.

Actually, one very important factor that makes the distributed scheme better than any other scheme is the memory-processor speed ratio s. The second term in the parentheses of the equation for B_w^D diminishes very fast as s gets larger, which makes the bandwidth approach m (the perfect bandwidth) very quickly. On the other hand, the distributed system loses its superiority as s gets smaller. Table 9 shows the turnaround time versus m for s=2. This table is the same representation as Figure 22, except that we are emphasizing the numerical values this time. As we can see, the monoprogramming result of the partitioned system pulls within 14% when m=24. In the limiting case when s=1, the partitioned system will have the best bandwidth if we assume the same number of jobs in the memory.
This is because the partitioned system does not have any memory interference between jobs, while the other two do. The only reason the distributed or mixed system can win is that they can fully utilize the memory, and hence the probability that a job cannot enter due to insufficient memory is the smallest.

                     m=8        m=16      m=24      m=32
    Partitioned      186, 191   91, 92    83, 83    80, 81
    Mixed             99,  95   83, 82    80, 79    78, 78
    Distributed       81,  82   73, 74    73, 73    71, 72

Table 9. Ta Versus m for Three Memory Allocation Schemes, Monoprogramming and Multiprogramming ( s=2 )

Our simulation results show that the turnaround time of the partitioned system is higher than that of the distributed system by only 15% (78/68) when m=16, and by only 7% (72/67) when m=24. Therefore, the partitioned system does perform very well when s is small. Currently, memory technology can provide us with semiconductor memory with a cycle time of less than 100 ns. If we use a microprocessor with a similar cycle time, then an s value of 1 is realizable. Hence, the use of the partitioned scheme is indeed very favorable, since it has the many advantages described in Chapter 1 and yet performs as well as any other scheme.

We compared the performance of monoprogramming and multiprogramming in the last section, and we claimed that monoprogramming is very comparable with multiprogramming when we have enough processors. Both Figure 22 and Table 9 again give strong support for this. As a matter of fact, some results in Table 9 even show monoprogramming to be slightly better than multiprogramming!

When we decrease the memory-processor speed ratio s, it could mean we use either a slower processor or a faster memory. In our simulation, we hold the processor speed constant. So, changing s from 4 to 2 means reducing the memory cycle time by half. From Figure 22 and Table 9, we can see s has a significant effect on the partitioned scheme, particularly for small m. When we reduce s from 4 to 2, the turnaround time improves by anywhere from 35% to 800%. Of course, we have to pay the price of faster memory. We will come back to this subject later in this chapter.

Although there are several advantages to using the partitioned scheme (for example, it is very reliable and easy to expand), there are some implementation problems, e.g., address mapping. Let us give a simple example to explain this problem.

Suppose a job requires three modules of memory. How do we store this job in the memory when it gets these modules? We can either three-way interleave the job or store it in a sequential manner, i.e., store the first 1/3 of the program in the first module, the second 1/3 in the second module, and the rest in the last module. The second method does not create any particular addressing problem, since the instructions and data are stored sequentially inside a module and the ordinary address generation mechanism can be used to produce physical addresses. So, as long as we know which three modules contain this job, we will have no problem fetching or storing in these modules. Of course, the module size in general will be a power of 2, which makes the address mapping extremely easy. However, this scheme may not allow us to take advantage of the independent memory modules, since if we assume a serial address stream, only one word will be accessed at a time. This implies we may get a bandwidth of only 1! For s > 1, this scheme apparently will waste processing power due to insufficient memory bandwidth.
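A minimal sketch of the two mappings, the sequential storage just described and the interleaved quotient-remainder mapping taken up next, may make the distinction concrete. The module numbers and module size below are hypothetical; the thesis considers hardware realizations of the quotient-remainder step in the next chapter.

    MODULE_WORDS = 32 * 1024                 # assumed module size, in words

    def map_sequential(addr, modules):
        """Logical address -> (module, offset); the job fills its modules in order."""
        return modules[addr // MODULE_WORDS], addr % MODULE_WORDS

    def map_interleaved(addr, modules):
        """Logical address -> (module, offset) with k-way interleaving, k = len(modules).
        The degree of interleaving varies with the job size, and the modules need
        not be adjacent, so both a quotient and a remainder by k are required."""
        k = len(modules)
        return modules[addr % k], addr // k

    job_modules = [5, 0, 11]                 # three non-adjacent modules allocated to the job
    for a in (0, 1, 2, 3, MODULE_WORDS, MODULE_WORDS + 1):
        print(a, map_sequential(a, job_modules), map_interleaved(a, job_modules))

With the sequential mapping, consecutive addresses stay in one module; with the interleaved mapping, they rotate across all three, which is what buys the extra bandwidth at the cost of the extra address arithmetic.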
In order to get a higher bandwidth, we should use address interleaving, which is what we assume in our simulator. If we use this scheme, we need a more complicated mapping mechanism to generate the physical addresses, since consecutive instructions or data are now stored in different modules. The difficulties are that the degree of interleaving is variable, depending on the job size, and that the modules allocated to the same job might not be adjacent to each other. Hence, we cannot get the next physical address by simply incrementing the current program counter by one, as we can in the non-interleaved scheme above or in the distributed system. Of course, indirect addressing and branching instructions are even more difficult to handle. Therefore, we do need some extra hardware and an address generating algorithm in the instruction unit if we want to use the interleaving scheme in a partitioned system. We are interested in the hardware design of this problem. In the next chapter, we will discuss a few feasible methods, which involve the use of a quotient-remainder operation.

From Figure 22 and Table 9, we can see the mixed scheme only out-performs the partitioned scheme by a small margin. However, it is less reliable, since the failure of a shared module might affect several jobs. Moreover, it needs a more complicated operating system, which is one thing we are against. So, we feel that the partitioned scheme is a very good choice.

In the example of Figure 24, we assumed the probability that a processor accesses a certain module to be the fraction of the program that is stored in that module. This assumption obviously is only valid for a random access system, that is, a processor which generates references to the memory modules in a random fashion. In other words, there is no relationship between any two successive references generated by a processor, and the second reference has the same chance of referring to any module occupied by the program, independent of the first one.

Of course, this assumption about random addressing is not necessarily valid. It is well known that address streams tend to be somewhat serial. A serial address stream will produce better performance than a random address stream if the addresses are interleaved across several modules, and worse performance if the addresses run vertically in each module. Unfortunately, it is difficult to adequately quantify this seriality or to determine a typical value of it. This has forced us to use the random addressing assumption in our simulation. Now, let us find out how reasonable our assumption is. Let us look at some performance bounds to see how well and how badly our system will perform if we assume the perfect and the worst memory bandwidth cases.

In the perfect memory bandwidth case, we assume there is no memory conflict between processors, so each processor gets the maximum possible bandwidth of s. However, in a distributed system, if the number of active processors (n_a) times s is greater than the total number of memory modules (m), we assume each processor only gets a bandwidth of m/n_a (< s). In a mixed or partitioned system, if s is greater than the number of modules assigned to a job, m_j, then the processor will only get a bandwidth of m_j. Of course, we will then get the best possible performance among all systems that have the same set of system parameters, i.e., a lower bound on the turnaround time.
This case can only happen if we horizontally interleave every program across the memory modules, and assume the perfect condition of no memory conflict. On the other hand, for the worst memory bandwidth, we assume each active processor gets a bandwidth of only 1, which corresponds to the situation where we vertically store each program inside a memory module (no interleaving at all) and, unluckily, no two references ever go to two different modules. Of course, this case might not happen, but it does give us the worst performance, which can serve as an upper bound on the turnaround time.

Figure 25 repeats the curves of Figure 22 together with four curves which represent the performance of the perfect and worst memory bandwidth cases. We use monoprogrammed systems to derive the two upper bound curves, since they contain fewer jobs. On the other hand, we use multiprogrammed systems to obtain the two lower bound curves, since they can contain more jobs. As we can see, the monoprogrammed, partitioned system yields the worst result, which we call the largest upper bound, and the multiprogrammed, distributed system yields the best result, which we call the smallest lower bound. Any performance curve will be bounded between these two curves, no matter what memory allocation scheme we use and how we assume the reference probabilities, as long as we are using eight processors, the shortest-job-first algorithm, an average I/O speed of 42 ms, 1024K bytes of main memory, a full connection, and a speed ratio of 4 (cf. Figure 22).

Figure 25. The Performance Curves of Figure 22 and Some Performance Bound Curves. ( Bound curves include the lower bound (multi., part.), the smallest lower bound (multi., dist.), and the largest upper bound (mono., part.) )

One interesting thing is that when m >= 16, all the curves are clustered just above the lower bound curve. Obviously, the random distribution assumption already gives us pretty good results. Further complication of the memory bandwidth calculation apparently can have only a very minor effect on these performance curves. This explains why we use the random distribution assumption in all our simulations. Furthermore, as we can see, all the curves are far below the upper bound curves. This tells us that the (horizontal) interleaving scheme can be a very important factor in system performance.

Now, let us briefly summarize the results of this section. Table 10 shows an overall comparison of these three memory allocation schemes.

    Parameter                    Partitioned         Mixed      Distributed
    Total Memory Bandwidth       Moderate            Moderate   High
    Job Memory Bandwidth         Moderate            Moderate   High
    (Word) Memory Utilization    Bad                 Good       Good
    Reliability                  Good                Moderate   Bad
    Turnaround Time              Bad for small m,    Good       Best
                                 good for large m
    Memory Waste                 Yes                 No         No

Table 10. Summary of the Performance of Three Memory Allocation Schemes.

The distributed scheme leads in all items except reliability. On the other hand, the partitioned scheme trails in all items, except that it is the most reliable one. However, we have shown that when we increase the number of memory modules in the system or improve the memory speed, the performance of the partitioned scheme improves very quickly. For a moderately large number of modules, say 24, the partitioned system already has performance very comparable to the distributed system. Besides, the partitioned scheme provides very high reliability.
In a system where reliability is extremely important, the partitioned scheme should be considered first. If we need higher performance and can sacrifice a little bit of reliability, then the mixed scheme might be a better choice. Of course, if we are primarily interested in performance and have very reliable hardware, i.e., the mean time between failures (MTBF) is long, then the distributed scheme will be the best candidate.

3.1.4 Job Scheduling Algorithm

One of the major factors that affect the turnaround time of a job is the scheduling algorithm. The scheduling algorithm is used to determine the order in which the jobs in the waiting queue will enter the system. This is based on some attribute of these jobs, not necessarily the order in which they arrive. Therefore, some new jobs might enter the memory and get executed before a job which arrived earlier. This, of course, increases the time the older job has to spend in the waiting queue; but on the other hand, those jobs which get into the memory earlier will suffer shorter queueing delays. In fact, the purpose of using a scheduling algorithm is to rearrange the execution order of a given set of jobs so that some measure of performance is improved. Most of the time, the average turnaround time or the average queueing time is what people are trying to improve.

The queueing time of scheduling algorithms has been extensively studied by queueing theorists. In [26], Kleinrock has a very complete discussion of this subject. Most of the analytic results are expressed in terms of the average conditional queueing time, i.e., the queueing time of a job which needs a certain amount of processing time. For example, Figure 26 shows the average conditional queueing time curves of three commonly used scheduling algorithms in time-sharing systems, namely FCFS (first-come-first-serve), RR (round-robin), and FB (foreground-background), assuming the system is an M/M/1 queue. We do not show the scales, since they depend on the arrival rate, the mean service time, and the service-time distribution. Only the shape of each curve is shown, which gives the effect of a scheduling algorithm on jobs with different processing time requirements.

Figure 26. Average Conditional Queueing Time for an M/M/1 System. ( Curves for FCFS, RR, and FB versus processing time )

As we can see, the average conditional queueing time for FCFS is the same (constant) for any job, whether it requires a long or a short processing time. This type of scheduling algorithm is called non-discriminating. FCFS gives the shortest queueing time to long jobs; however, it gives the longest queueing time to short jobs. The average conditional queueing time for RR, on the other hand, grows linearly as the processing time increases: the longer the job, the larger the queueing time. This kind of scheduling algorithm is called linear-discriminating; it discriminates against long jobs. Similarly, FB is called most-discriminating, since it gives the longest queueing time to long jobs among all known scheduling algorithms. But FB yields the shortest queueing time to short jobs. RR is a very popular scheme in a lot of time-sharing systems. FB is used in the famous MULTICS system. There is no absolute standard for which algorithm is the best among these three. It all depends on what kind of measurement we are most interested in and the job mix we are dealing with.
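The shapes in Figure 26 can be reproduced from standard M/M/1 results. The sketch below uses the classical formulas for FCFS and for round-robin in its processor-sharing limit (FB is omitted because its conditional expression is considerably messier); the arrival and service rates are illustrative values, not taken from the thesis workload.

    # Conditional waiting time W(t) for a job needing t seconds of service, M/M/1:
    #   FCFS:                          W(t) = rho / (mu * (1 - rho))   (constant in t)
    #   RR (processor-sharing limit):  W(t) = t * rho / (1 - rho)      (linear in t)
    # These reproduce the "non-discriminating" and "linear-discriminating" shapes
    # of Figure 26.

    lam, mu = 0.5, 1.0                    # illustrative arrival and service rates
    rho = lam / mu

    def w_fcfs(t):
        return rho / (mu * (1.0 - rho))

    def w_rr(t):
        return t * rho / (1.0 - rho)

    for t in (0.5, 1.0, 2.0, 4.0):
        print(f"t={t:>4}: FCFS wait={w_fcfs(t):.2f}  RR wait={w_rr(t):.2f}")

Short jobs fare better under RR than under FCFS, and long jobs fare worse, which is exactly the discrimination the text describes.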
For example, if we are interested in the overall average queueing time and most of the jobs are short jobs, then apparently we should go for FB. However, since we are dealing with batch systems, we will not use these algorithms. In our study, we will use the average turnaround time (Ta) of all jobs as our measure of goodness. This is the same as using the overall average queueing time (q), since a longer q always implies a longer Ta. We will show both measures later. The following eight scheduling algorithms will be studied here:

    1. FCFS: first-come-first-serve
    2. SJF:  shortest-job-first
    3. LJF:  longest-job-first
    4. SMF:  smallest-memory-first
    5. LMF:  largest-memory-first
    6. SMNF: smallest-magic-number-first
    7. SPTF: shortest-processing-time-first
    8. BMFF: best-memory-fit-first

All these names are self-explanatory. SMNF is a scheme used on our IBM 360/75 system. Each job is assigned a magic number, calculated by the following formula:

    MN = 3*(processing time) + 0.01*(job size) + 0.05*(number of I/O requests)

Then the job with the smallest magic number is executed first. This scheme penalizes not only long jobs but also large jobs (jobs requiring large memory space), since the above formula takes both processing time and job size into account. BMFF chooses the job which fits into the memory best. If the available memory space is very large, then BMFF acts just like LMF, since the largest job in the waiting queue will be chosen. However, if the remaining space cannot hold the largest job, some smaller job which fits best will be chosen instead. SPTF chooses the job with the shortest processing time. It is slightly different from SJF, since SJF also takes I/O time into account; in other words, SJF chooses the job with the smallest CPU plus I/O time.

When a job arrives, it is placed somewhere in the waiting queue according to the scheduling algorithm. For example, SMNF lines up the jobs so that their magic numbers are in increasing order. Then the queue is considered from the beginning every time. Table 11-a shows Ta versus p for all these algorithms, and Table 11-b displays the numerical values of the average queueing time (q).

    (a) Ta Versus p
    Algorithm    p=3     p=4     p=5    p=6    p=7    p=8
    LJF          4663    1707    235    124    95     87
    SMF          3415    1354    198    114    94     86
    FCFS         3218    1140    179    109    92     84
    BMFF         3121    1151    171    106    88     80
    LMF          2691    1006    174    120    97     89
    SMNF         2604    1029    166    101    87     79
    SPTF         2592    1031    160    103    89     80
    SJF          1960     776    152    100    88     79

    (b) q Versus p
    Algorithm    p=3     p=4     p=5    p=6    p=7    p=8
    LJF          4600    1643    169    57     27     19
    SMF          3352    1291    133    47     26     18
    FCFS         3065    1077    114    42     24     17
    BMFF         3059    1088    106    39     19     12
    LMF          2629     943    108    54     29     22
    SMNF         2541     966    101    35     19     12
    SPTF         2530     968     95    37     21     13
    SJF          1898     712     86    33     20     12

Table 11. Comparison of Eight Different Scheduling Algorithms. ( M=1024, partitioned, m=24, mono., 42 ms, 800 jobs, s=2, r=4, full connection )

As we can see, SJF gives the smallest turnaround time among all these scheduling algorithms. This is what we would expect, since it has been proven analytically [26]. Therefore, we use SJF in all other discussions. In fact, for p > 6 all these algorithms perform more or less the same; for example, FCFS is within 10% of SJF. This means that when we have enough hardware, the scheduling algorithm really does not make much difference in performance. This is easy to understand: when the throughput is high the system is lightly loaded, and no matter how we schedule the jobs, each job will suffer only a small delay. Only when the system is heavily loaded will the scheduling algorithm be important. Perhaps the following adaptive method can be used: if the system is lightly
loaded, all the jobs will be served according to their arrival order and no scheduling algorithm will be used. When the system load exceeds a certain threshold, a scheduling algorithm, e.g., SJF, becomes effective to schedule the jobs waiting in the queue.

In Figure 20, we showed the sensitivity of the turnaround time when we increase the number of processors. Table 11 shows the same phenomenon; moreover, the drop is even sharper here. This again shows the importance of having enough processors in the system.

As we said at the beginning of this section, the scheduling algorithm is used to determine the order in which the waiting jobs will be considered for entering the memory. In general, all the jobs are lined up in the queue according to the scheduling algorithm, so the job at the head of the queue is always considered first. Of course, this job might not be able to get into the memory when we are looking for a job to execute. A new problem arises here: shall we skip this job and consider the next one? This is the so-called "look-ahead" problem. Apparently, there is no reason we should not consider the second job if the first job gets blocked due to lack of memory, since we can shorten the queueing time of the second job if it can fit into the memory. So, it is conceivable that we might improve the average turnaround time by doing look-ahead.

Naturally, the next question is: if we allow look-ahead, do we consider the third job if the second one still cannot enter the memory? In other words, do we allow look-ahead to be carried on to the third job? This is usually called the "look-ahead distance" problem. The look-ahead distance is defined to be the maximum number of jobs we can look at down the queue. For example, if the look-ahead distance is 2, then we can look at at most the first two jobs and cannot look beyond the second job. Therefore, a look-ahead distance of one is equivalent to no look-ahead.

It is true that the chance of finding a job that fits is better if we allow a longer look-ahead distance. But this does not necessarily mean we can get a better average turnaround time by increasing the look-ahead distance, since the original order set up by the scheduling algorithm is perturbed by the look-ahead scheme, and the longer the distance, the more this order is perturbed. In other words, look-ahead cancels part of the effect of the scheduling algorithm, so a large look-ahead distance might not be desirable. Figure 27 shows the effect of look-ahead. We can see that when we allow a moderate look-ahead distance, say 4, we do gain some benefit. However, a larger look-ahead distance might even cause a negative effect! Therefore, we suggest a look-ahead scheme with a moderate distance.

Figure 27. The Effect of the Look-Ahead Scheme. ( M=1024, partitioned, p=5, mono., m=24, SJF, 42 ms, 800 jobs, s=2, r=4, full connection; look-ahead distances 1 through 10 )
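A small sketch may make the interaction between the scheduling order and the look-ahead distance concrete. The job attributes, the memory-fit test, and the SMNF weights are taken from the description above; everything else (data layout, function names, the sample jobs) is our own illustration rather than the thesis simulator.

    # Jobs are kept sorted by a scheduling key (SJF and SMNF shown); the
    # dispatcher then scans at most `distance` jobs from the head of the queue
    # and starts the first one whose memory requirement fits, which is exactly
    # the look-ahead rule discussed above.

    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        cpu: float        # seconds of processing time
        io_time: float    # total I/O time, seconds
        size: int         # K bytes of memory requested
        n_io: int         # number of I/O requests

    def sjf_key(j):   return j.cpu + j.io_time
    def smnf_key(j):  return 3 * j.cpu + 0.01 * j.size + 0.05 * j.n_io   # magic number

    def pick_next(queue, free_kb, distance):
        """Return the first job among the first `distance` in the queue that fits."""
        for job in queue[:distance]:
            if job.size <= free_kb:
                return job
        return None

    jobs = [Job("A", 30, 10, 400, 50), Job("B", 5, 2, 700, 10), Job("C", 12, 20, 128, 80)]
    queue = sorted(jobs, key=sjf_key)          # or key=smnf_key
    print([j.name for j in queue], pick_next(queue, free_kb=500, distance=4))

With distance=1 the blocked head job ("B" in this toy example) would stall the queue; with a moderate distance the smaller job "C" slips in, which is the benefit Figure 27 measures.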
3.1.5 Effects of Job Characteristics

In this section, we are going to study the effects of job characteristics. As we mentioned earlier, each job is characterized by four parameters, namely, the arrival time, the CPU time, the job size, and the number of I/O requests. Needless to say, every one of them affects the system performance. Our purpose here is to find out how sensitive the effect of each parameter is.

Let us first look at the arrival time. The arrival time determines how fast the job stream puts work on the system. Of course, the faster the jobs arrive, the heavier the workload will be. Since the processing power and the memory space are limited, more jobs will accumulate in the waiting queue if the jobs come in faster, and thus each job will suffer a longer delay. Apparently, the average turnaround time should go up as we increase the arrival rate. In order to change the arrival rate, we multiply the arrival time of each job by a variable called the "arrival scaling factor." We can speed up the arrival rate by using a smaller arrival scaling factor, since the arrival time of each job will then be scaled down to a smaller value.

Figure 28 shows how the average turnaround time responds when we change the arrival scaling factor. As we can see, when we decrease the scaling factor from 1.0 down to 0.3, the average turnaround time does not change very much. Apparently, the system is unsaturated, or lightly loaded, within this range. In other words, the jobs do not arrive as fast as the processors can process them. This does not mean our IBM 360/75 system is underloaded, even though the job characteristics are obtained from analyzing the real workload on that machine; it is because we are using more processors in our model, and hence our system has higher processing power.

However, when we further decrease the arrival scaling factor, Ta starts going up. Beyond 0.1 in particular, Ta increases very sharply. Obviously, the system starts getting saturated at around 0.1. Heavier loading pushes the system into oversaturation, and the average turnaround time starts to blow up.

When the jobs arrive faster, it is conceivable that the system will become busier, because the possible idle periods become smaller. Figure 29 shows how the percentage of time the system is busy increases as we decrease the arrival scaling factor. By busy, we mean at least one job is in the memory, whether it is doing I/O or being processed by a processor. At 0.08, the system is busy for more than 95% of the time. This confirms that the system is indeed getting saturated when we decrease the scaling factor below 0.1. When the system is in saturation, measures of throughput or turnaround time might not reflect the true effects of some system parameters, and similarly when the system is far below saturation. Therefore, we scale the arrival time by a factor of 0.1 in all our simulations. This places the system in an interesting region.

Figure 28. The Effect of the Arrival Scaling Factor on the Average Turnaround Time. ( Multiprogramming, M=1024, distributed, p=4, SJF, m=16, LA=4, 42 ms, 800 jobs, s=4, r=4, full connection )

Figure 29. The Effect of the Arrival Scaling Factor on the System Busy Factor.
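The scaling mechanism itself is trivial, but a two-line sketch shows exactly what is being varied; the trace timestamps and job names below are made up for illustration.

    # Scaling the trace's arrival times by the arrival scaling factor (ASF).
    # ASF < 1 compresses the trace in time, i.e., raises the offered arrival
    # rate by a factor of 1/ASF while leaving every other job attribute alone.

    arrivals = {"job1": 0.0, "job2": 12.0, "job3": 31.5}   # hypothetical trace, seconds
    asf = 0.1
    scaled = {name: t * asf for name, t in arrivals.items()}
    print(scaled)   # the same job stream, arriving ten times faster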
Now, let us look at the effect of the job size distribution. As we said earlier, one major reason that a job gets blocked from being executed is the lack of memory. If we fix the memory size, the job sizes clearly have a big impact on the system performance. Of course, the larger the job sizes are, the smaller the average number of jobs the memory can contain, and the more frequently a job will get blocked. Therefore, we should expect a longer average turnaround time if we use a job mix with a larger job size distribution.

Figure 30 shows the effect of the job size on the average job turnaround time. We fix all other parameters and change the job size by multiplying the size of each job by a job size scaling factor. This is similar to what we did with the job arrival time. Thus, if the scaling factor is 2, the size of each job is doubled. We can see the turnaround time is very sensitive to changes in the job size. When we double the scaling factor from 1.0 to 2.0, the average turnaround time increases by almost 150% (from 83 to 209). Again, Ta doubles when we increase the scaling factor from 2.0 to 2.5. Apparently, Ta grows exponentially when we increase the job size. This indicates the importance of having enough memory.

Figure 30. The Effect of the Job Size on the Average Turnaround Time. ( M=1024, partitioned, p=8, mono., m=24, SJF, 42 ms, LA=4, s=2, ASF=0.1, r=4, 800 jobs, full connection )

Table 12 shows the corresponding memory utilization. Just as we expected, the memory utilization increases when the job size goes up. This again tells us that the memory utilization should not be used as an indication of performance.

    Job Size Scaling Factor    Ta     U_m
    0.5                         84    0.42
    1.0                         83    0.58
    1.5                         96    0.76
    2.0                        209    0.87
    2.5                        402    0.88

Table 12. The Average Turnaround Times and Memory Utilizations for Different Job Sizes.

The limited size of memory is usually the bottleneck in a computer system. Figure 30 shows that this is especially true in a multiprocessor system. However, this does not mean that an extremely large memory will always do some good. When we reduce the job size scaling factor to 0.5, which is equivalent to using a memory twice as large, we still get the same performance, but the memory utilization has decreased to only 42%. This means we are wasting the memory and getting no improvement at all. So, an appropriate amount of memory should be used in order to get both good performance and good utilization. From Table 19, we can see 1024K bytes is the best memory size for the workload we are using. This is why we use this memory size throughout this chapter.

Finally, let us look at the effect of increasing the CPU time or the number of I/O requests. Both of them increase the time a job has to spend in the memory. This in turn increases the time the waiting jobs have to spend in the queue. Therefore, an increase in either the CPU time or the number of I/O requests of a job has a twofold effect on the average turnaround time. We briefly mentioned this in Section 3.1.2 when we compared monoprogramming and multiprogramming.

We will discuss the effect of increasing the CPU time first. Again, we use a CPU time scaling factor to scale the processing time of each job.

Figure 31. The Effect of Increasing CPU Time. ( Same system as Figure 30 )

Figure 31 shows how the average turnaround time responds when we increase
(or decrease) the CPU time. The curve first grows slightly more than linearly, and then starts taking off when we double the CPU time scaling factor. This is not surprising, since the system is getting saturated when we scale the CPU time beyond 2.0.

The use of a scaling factor effectively "stretches out" the distribution curve, since every job is enlarged by the same factor. Theoretically, if we use a random number generator with the same distribution and the scaled mean to generate a job stream, these jobs should have a distribution roughly the same as the stretched distribution. In other words, the two methods should produce the same characteristics. The dotted curve in Figure 21 displays the result of using the random number generator method. We can see there is not much difference between these two methods. Therefore, we use the scaling factor method, since it is easier to apply.

The curves of Figure 31 are very similar to that of Figure 30; in fact, they almost coincide. Apparently, both CPU time and job size have the same effect on the system performance. One other interesting point is that all these curves are very similar to the turnaround time curve of an M/M/1 queueing system. The latter can be expressed by the following equation:

    T = (1/mu) / (1 - rho) = 1 / (mu - lambda)

where mu is the service rate, lambda is the arrival rate, and rho = lambda/mu is the system utilization. Increasing the CPU time or the size of a job is in fact equivalent to reducing the service rate. T increases when we decrease mu, and when mu gets very close to lambda, T becomes extremely large, since the system is about to saturate. This explains why the Ta curve goes up very sharply when we scale the CPU time beyond a certain limit.

Table 13 shows the average turnaround, queueing, and service times of a job when we use different scaling factors.

    CPU Time Scaling Factor    Ta     q     x
    0.8                         74    11    63
    1.0                         83    12    71
    1.2                         94    19    75
    1.4                        108    29    79
    1.6                        128    41    87
    1.8                        155    62    93
    2.0                        230   131    99

Table 13. The Effect of Increasing CPU Time on the Average Queueing Time (q) and Average Service Time (x).

As we can see, most of the increase comes from the average queueing time. This is what we should expect: each job remains in the memory for a longer time while the arrival rate stays the same, so more jobs accumulate in the waiting queue and a new arrival has to spend much more time in the queue. When we design a system, we should be very careful about the job arrival rate, the average CPU time, and the processing power we have. If the average turnaround time falls on the steeply rising edge of the curve, we should try to lower it by adding more hardware.

3.2 Results for Hardware Related Questions

In the second half of this chapter, we are going to answer the hardware questions we listed in Chapter 1. Mainly, we will investigate the effects of using different amounts and different speeds of hardware, and then find a cost-effective way of building a machine. We will also look into a very interesting problem, namely, the processor-memory interconnection problem.

The results we have shown so far use a full processor-memory connection scheme, i.e., each processor is physically connected to every memory module and can access any module. For example, a crossbar switch is a full connection scheme. Under this scheme, each module can be assigned to any processor. Jobs cannot be prevented from entering the system due to the inability to connect available memory to an available processor.
However, as we mentioned in Chapter 1, a full connection is very expensive. Its cost can go up very quickly if we want to expand the system. For example, if we want to double a 4x8 system to an 8x16 system, the cost of the connection network goes up four times. Besides, a full connection network usually is not easy to expand. For example, in a crossbar switch it is very difficult to increase the size of the fan-out tree, since that requires a complete rewiring between the fan-out tree and the fan-in tree. Or, in a multiport memory system, we would have to replace all the memory interfaces if the number of processors in the expanded system exceeds the number of ports in a single module. Therefore, we are interested in using a partial connection network.

Let us recall the connection network of the PRIME system shown in Figure 1. Each processor is connected to 8 memory modules via a private bus. Each memory module has four ports, so up to four processors can connect to one module. In the current PRIME system, there are five processors and 13 memory modules, so 40 of the 52 ports are being used. This is a typical partial connection system. In our study of partial connection, we will assume this kind of architecture.

Of course, we would expect some degree of performance degradation, since each processor only connects to a subset of the modules and thus can only access part of the memory. Hence, a processor cannot be assigned to a job if the memory attached to this processor is not big enough, even though the total available space in the memory is big enough. Consequently, the probability that a job will be blocked is larger in a partial connection system. However, the cost of a partial connection does not grow as fast as that of a full connection; in fact, it grows linearly if we use multiport memories. Most of all, the partial connection does allow us to expand the system without too much trouble. If we are increasing the memory, we can just connect the additional modules to some processors arbitrarily or following some rule. If we are also increasing the number of processors, we might need to move some connectors to reconfigure the whole system, but no hardware modification is needed.

We will study the performance degradation of a partial connection network. In addition, we will look at an interesting problem, namely, how we should interconnect the processors and the memories so as to get minimal degradation. In general, a processor is allowed to connect to only part of the memory in a partial connection system. How many modules are assigned to a particular processor greatly influences the job handling ability of that processor. Obviously, the more memory a processor connects to, the larger the job it can handle. In the PRIME system, all processors are connected to the same number of memory modules, namely 8; hence, all of them are equally "capable." Of course, this is based on the assumption that no job will ever require more than eight modules, and on the fact that there are enough ports in the memory. If either of these two conditions is violated, this connection will not work any more, and we need a new configuration. For example, if some jobs need 12 memory modules, then at least one processor should be connected to that many modules.
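The port-budget constraint developed in the next paragraph can be checked with a back-of-the-envelope sketch. The module and port counts below are the PRIME-style numbers quoted above; the particular "uneven" allocation is our own illustration, not a configuration from the thesis.

    # Port budget for a PRIME-style partial connection: every module supplies a
    # fixed number of ports, and each module a processor connects to consumes one.

    def ports_needed(modules_per_processor):
        return sum(modules_per_processor)

    def ports_available(num_modules, ports_per_module):
        return num_modules * ports_per_module

    even   = [12, 12, 12, 12, 12]          # every processor gets 12 modules
    uneven = [12, 10, 10, 10, 10]          # one "big" processor, four smaller ones
    avail  = ports_available(13, 4)        # 52 ports in the PRIME configuration

    print(ports_needed(even),   "ports needed vs", avail, "available")   # 60 > 52
    print(ports_needed(uneven), "ports needed vs", avail, "available")   # 52 <= 52

Giving every processor the largest connection exceeds the available ports, while an uneven allocation that reserves the large connection for one processor fits, which is the motivation for the distribution problem discussed below.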
However, not all of them can have 12 memory modules, since for 5 processors that would require a total of 60 memory ports. Therefore, an "uneven" connection might be more effective; that is, some processors will have more memory and some will have less. How to distribute the total available ports among the processors in this case is a very interesting distribution problem. This is part of the interconnection problem we will be looking at later.

After we determine the number of memory modules each processor will get, another problem arises immediately, namely, which modules we should choose to connect to a certain processor. In the PRIME system, this is done in a rather uniform way: each processor connects to 8 consecutive modules, with the leading module two modules to the right of the leading module of the previous processor. Of course, this might not be a good way to arrange the connection. However, it is very difficult to come up with an appropriate analytic argument to show what a good connection might be. What we will try to do is to simulate several combinations and compare their results. Then, by analyzing the connections which yield better results, we can get some idea of what we might need to do in order to achieve a good connection. This is the second part of the interconnection problem we will be studying.

3.2.1 Hardware Quantity Effect

Let us now look at how the system responds when we increase the hardware. Of course, the more hardware we add to the system, the better the performance we should get. What we want to know here is how sensitively each type of hardware resource affects the system performance. Then we will know what to buy in order to achieve a certain percentage of improvement while spending as little money as possible. In our system, there are four kinds of hardware we can increase: the total amount of memory M, the number of processors p, the number of memory modules m, and the number of I/O devices r. We will investigate the effect
We can see in Table 14-b the job memory bandwidth, i.e., the average memory bandwidth each job gets, does not change when we increase p. This is what we should expect since we are dealing with a monoprogrammed system. So, the performance improvement must come from the increase of memory utilization alone. In Table 14-a, we can see the 145 Ta (sec.) 2000 1800 1600 1400 1200 1000 800 600 400 200 100 M=1024 PART. 42ms. MONO. s=2 SJF r=4 LA=4 FULL ASF=0.1 800 JOBS ® m = 16 o m = 24 • m = 32 Figure 32. The Effect of p on the Average Turnaround Time Ta 146 Ta (sec.) 900 875 850 825 800 775 750 725 700 225 200 175 150 125 100 75 50 25 M=1024 PART. 42ms. MONO. s=2 SJF r=4 LA=4 FULL ASF=0.1 800 JOBS 16 24 _j 32 p=4 p=5 p=6 p=7 p=8 Figure 33. The Effect of m on the Average Turnaround Time Ta 147 P X \ 16 24 32 4 41.3, 14.1 41.3, 9.5 41.4, 7.0 5 51.5, 17.6 50.0, 11.5 49.1, 8.3 6 53.0, 18.0 51.8, 11.8 51.1, 8.7 7 53.4, 18.4 52.4, 12.0 51.6, 8.7 8 54.2, 18.5 53.1, 12.2 52.2, 8.9 (a) Percentages of Memory Utilization and Memory Waste of the Partitioned Scheme n. m P \. 16 24 32 4 1.35 1.45 1.52 5 1.36 1.46 1.52 6 1.36 1.46 1.52 7 1.36 1.46 1.52 8 1.36 1.46. 1.52 (b) Job Memory Bandwidth Table 14. The Memory Utilization, Memory Waste, and Job Memory Bandwidth for the System of Figures 32 and 33. 148 ( B; , q ) ^\ m 16 24 32 4 2.69, 852 2.78, 708 2.84, 629 5 3.27, 114 3.28, 74 3.29, 67 6 3.30, 41 3.31, 30 3.31, 29 7 3.31, 25 3.32, 17 3.33, 16 8 3.32, 23 3.33, 15 3.33, 14 Table 15. The Total Memory Bandwidth (B ) and Average Queueing Time (q) for the System of Figures 32 and 33. 149 memory utilization increase by as much as 13% as we double the processors from 4 to 8. When p=4, the memory is indeed under-utilized. Of course, this is because we only allow up to four jobs in the memory so a lot of jobs have to wait in the outside queue. We can see from Table 15 that the queueing time for p=4 is yery large. When we increase p, we actually allow more jobs to be in the memory at the same time, and hence the memory utiliza- tion goes up, and the queueing time goes down. Meanwhile, the average ser- vice time increases due to the competition of I/O devices. For large p, say 7 or 8, the curves become flat, and no improvement will result even if we add more processors. From our simulation result, we see that most of the time the memory can only contain six to seven jobs. So, any more processors beyond that will simply be wasted. From Figure 33, we can see the turnaround time also goes down when we increase m. In fact, the increase of m has two fold effect on the system performance. It can reduce the memory waste since the module size will be reduced. (Notice that, when we increase m, we are holding the total amount of memory fixed.) This can be seen in Table 14-a. Also, it can increase the memory bandwidth since the degree of interleaving for each job will be in- creased. This can be seen in Tables 14-b and 15. Since the speed ratio s is only 2, the increase of m will only cause small improvements on the total memory bandwidth and the job memory bandwidth. When we increase m from 16 to 24, a significant improvement has been achieved. This is because the memory waste has been reduced significantly, and the job memory bandwidth has been increased by a non-trivial percentage (about 8% in this case). When we again increase m from 24 to 32, a small change has been made on the turnaround time except when p is small. Therefore, 150 an m value of 24 should be enough to achieve good performance. 
This phenomenon can also be seen in Figures 19 and 22. In Table 14-a, we can see the memory utilization actually decreases when we increase m. This is because the throughput has been increased, and a job stays in the memory for a shorter period of time. We explained this when we discussed Table 7.

From Figures 32 and 33, we can see that the number of processors has the most profound effect on the system performance. If we do not have enough processing power, the increase of any other hardware will turn out to be wasteful.

Figure 34 shows the effect of another parameter, r, the number of I/O devices. Here we use 8 processors, 24 memory modules, and the partitioned scheme. As we can see, the breaking points occur at r=4. When we increase the number of I/O devices from 2 to 4, the average turnaround time improves significantly. This can be explained by Table 16, which shows the average queueing time of a job waiting for an I/O device. For r=2, the queueing time is very large; apparently, many jobs jam up in the I/O queue due to lack of I/O channels. When we double the number of I/O devices to 4, the queueing time drops drastically. This can be seen in the first two rows of Table 16. The reason is rather simple: since we are using monoprogramming and 8 processors, at most 8 jobs will be in the memory simultaneously, and since the jobs are not particularly I/O-bound, the probability that more than four jobs are doing I/O is small. Hence, four I/O channels are enough in this case. Further increases in the number of I/O channels give very little improvement in performance. In fact, this is also true for the multiprogramming case, except that the multiprogramming case shows slightly higher queueing times, since on the average more jobs will be competing for the I/O channels. Of course, if we were dealing with an I/O-bound job mix, more I/O devices would be needed, since the I/O stage would become the bottleneck of the system.

Figure 34. The Effect of the Number of I/O Devices. ( M=1024, partitioned, p=8, mono., m=24, SJF, s=4, LA=4, ASF=0.1, 800 jobs, full connection; one curve each for average unit I/O times of 21, 28, 35, and 42 ms )

           Unit I/O Time (ms)
    r      21      28      35      42
    2      10.2    23.3    40.5    53.6
    4       0.9     1.9     3.6     6.6
    6       0.9     1.8     3.5     6.3
    8       0.9     1.8     3.5     6.3

Table 16. The Queueing Time for an I/O Device.

From Figure 34, we can also see that the percentage difference in the turnaround time between r=2 and r=4 decreases as we reduce the average time per I/O operation, i.e., the curve becomes flatter for faster I/O devices. This is because the average unit I/O time has a twofold effect on the turnaround time. When we reduce the average unit I/O time, both the job I/O time and the queueing time for I/O are reduced at the same time. In particular, the queueing time drops by a rather large factor when the average unit I/O time is reduced. This can be seen in each row of Table 16.

One more interesting thing: the queueing time for r=2 with an average unit I/O time of 21 ms is almost 60% larger than that for r=4 with an average unit I/O time of 42 ms (Table 16). However, the average turnaround time for the former case is about 25% better than the average turnaround time for the latter (Figure 34). The first observation can be explained as follows. Although both cases have the "same" I/O power, with four slow I/O devices a job will not suffer any delay unless there are already four or more jobs doing I/O; in other words, a job has a lower probability of being enqueued.
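The claim that a job is rarely enqueued for I/O when r=4 can be checked with a simple binomial model: if each of the (at most) 8 resident jobs is independently doing I/O some fraction f of the time, the chance that more than r of them are doing I/O at once is small for modest f. The values of f below are purely illustrative; the thesis does not quote one.

    # Probability that more than r of n resident jobs are simultaneously doing
    # I/O, assuming each job is independently "in I/O" with probability f
    # (a rough binomial model, not the simulator's detailed behaviour).

    from math import comb

    def p_blocked(n, r, f):
        return sum(comb(n, k) * f**k * (1 - f)**(n - k) for k in range(r + 1, n + 1))

    n = 8                      # monoprogramming with 8 processors: at most 8 resident jobs
    for f in (0.2, 0.3, 0.4):  # illustrative fractions of time spent in I/O
        print(f, p_blocked(n, 2, f), p_blocked(n, 4, f))

Under this model, going from r=2 to r=4 cuts the blocking probability by an order of magnitude or more, which matches the sharp drop between the first two rows of Table 16.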
But using a faster I/O device can reduce the total job I/O time. Apparently, in these two cases what we gain in total I/O time is much more than what we lose in queueing time. This implies that if the cost of an I/O device with an average I/O time of 21 ms (e.g., an IBM 3330 disk unit) is no more than twice the cost of an I/O device with an average I/O time of 42 ms (e.g., an IBM 2314 disk unit), then it might be better to use half as many of the faster ones. We will look further at the effect of the I/O speed in the next section.

Now, let us look at how the system performance reacts when we simultaneously increase the number of processors, memory modules, and I/O devices. Of course, we would expect the system performance, e.g., the average turnaround time, to improve at a much larger rate as we increase them at the same time. But, in order to make a fair comparison of the "capability" of each system, we will also adjust the workload of each system by scaling the job arrival time. For example, if we double the size of a certain system, we will also double the system workload by doubling the job arrival rate.

Figure 35 shows how the average turnaround time reacts when we double the system size from (4,12,2) to (8,24,4), and then to (16,48,8). Of course, we double the job arrival rate every time we double the system size. Notice that in these experiments, when we double the number of modules we also double the total memory size, since we keep the module size the same. In the (4,12,2) system we use 512K bytes of main memory, so we use 1024K and 2048K bytes in the (8,24,4) and (16,48,8) systems respectively.

Figure 35. The Average Turnaround Time for Different System Sizes. ( Mono., SJF, LA=4, ASF(8,24,4)=0.1, 42 ms, full connection; solid and dotted curves for the partitioned, mixed, and distributed schemes at system sizes (p,m,r) = (4,12,2), (8,24,4), and (16,48,8) )

As we can see, the Ta curves (solid curves) drop roughly exponentially as we double the system size, even though we double the job arrival rate at the same time. This means that if we double the system size, we should be able to handle more than twice the workload. In Table 17, we show the total memory bandwidth, the job memory bandwidth, the average number of jobs in the system, and the queueing time for the three system sizes.

    Allocation Scheme     (p,m,r):   (4,12,2)   (8,24,4)   (16,48,8)
    Distributed           B_w^T       3.57       6.66       9.97
                          B_w         2.67       2.79       2.88
                          n           3.04       5.19       9.35
                          q          65.8       11.1        1.4
    Mixed                 B_w^T       3.39       6.60       9.91
                          B_w         1.75       1.75       1.76
                          n           3.50       6.46      12.16
                          q         141.6       27.9        4.2
    Partitioned           B_w^T       3.36       6.57       9.89
                          B_w         1.75       1.75       1.75
                          n           3.38       6.40      12.09
                          q         266.6       37.1        5.8

Table 17. The Total Memory Bandwidth (B_w^T), Job Memory Bandwidth (B_w), Average Number of Jobs in System (n), and Average Queueing Time (q) for Each System Size.

As we can see, the total memory bandwidth increases very rapidly when we double the system size. This is the major reason that the average turnaround time improves so quickly, since the total memory bandwidth is the amount of work being done in a unit of time. Only the job memory bandwidth of the distributed system increases when we enlarge the system size; this is because the degree of interleaving has been doubled, which reduces the memory interference between jobs. But the important thing is that the average number of jobs in every system almost doubles every time we double the system size, and this causes the large increase in the total memory bandwidth. This implies the doubled system has twice the capability of containing jobs in memory.
Of course, this seems rather intuitive, since we also double the memory. However, in a smaller system it is very possible that a few large jobs will occupy the memory and block the other jobs for a long time. This is less likely to happen in a larger system, since it is unlikely that several very large jobs will compete for the memory at the same time. In other words, a larger system has a higher potential of allowing more jobs to enter the memory, so a job will pass through the large system more quickly than the small system. In queueing theory, this is called the diminishing effect. In Table 17, we can see the queueing time decreases very fast when we increase the system size.

In fact, we can show that a double-sized system can handle about 2.8, or roughly 2^1.5, times the workload. This is shown by the dotted curves in Figure 35. The dotted curves are obtained by the following method. Let us fix the performance of the (8,24,4) system and try to bring the performance of the other two systems close to it by adjusting the arrival rate. In order to lower the turnaround time of the (4,12,2) system, we slow down the arrival rate by increasing the arrival scaling factor. Notice that the larger the arrival scaling factor is, the lower the arrival rate will be. In Figure 35, we found that if we use an arrival scaling factor of 0.28 (= 0.1 x 2^1.5), the (4,12,2) system performs roughly the same as the (8,24,4) system. In other words, we make the arrival rate, and hence the workload, of the (4,12,2) system 2.8 times smaller than that of the (8,24,4) system and get almost the same turnaround time. On the other hand, we use an arrival scaling factor of 0.039 (> 0.1 / 2^1.5) for the (16,48,8) system, and its turnaround time increases a little, to the neighborhood of that of the (8,24,4) system.

Therefore, all three systems now perform more or less the same, but the workload ratio is kept roughly at 2^1.5 between two consecutive systems which have a size ratio of 2. This means that the processing power of a double-sized system is about 2^1.5 times larger than that of the original system. If we let c be the size of a system, we may say the processing power of our system grows roughly according to the function c^1.5. Since the cost of a system is directly proportional to its size, we might as well think of c as the cost. Therefore, the processing power P of our system can be formulated as follows:

    P = a * c^1.5 = a * c * sqrt(c)

where a is some proportionality constant. What this result implies is that the performance grows faster than linearly as we increase the size of the system. However, we must point out that the above result only holds in the range shown in Figure 35. When we double the system size again to (32,96,16), the workload it can handle while yielding a similar turnaround time does not grow by as much as 2.7 or 2.8 times the workload of the (16,48,8) system. In fact, the (32,96,16) system can only handle a workload of about 2.3 times that of the (16,48,8) system. Obviously, the arrival rate has a larger effect on the performance than the system size, so c^1.5 does not hold for systems beyond (16,48,8). However, for a general system, (16,48,8) is already a reasonably large size. We can say that the above result holds for the range of system sizes most people will be interested in.

3.2.2 Hardware Speed Effect

In this section, we are going to study the effect of using faster components.
There are two parameters we will look at, namely, the average unit I/O time and the memory-processor speed ratio. We did mention some effects of these two parameters earlier; however, we will now look at this problem from a slightly different angle.

Let us first look at how the memory-processor speed ratio s will affect the system performance. For the convenience of this discussion, we will assume the processor speed to be fixed. So, the larger s is, the slower the memory will be. Figure 36 shows the Ta versus s curves for three different memory allocation schemes. As we can see, all three curves go up slightly more than linearly as we slow down the memory speed. However, the slope of the curve for the distributed system is smaller than those of the other two systems. This means that the memory speed has less effect on the distributed system. We can explain this by using the bandwidth equation and the following example. Assume we have eight memory modules, and three jobs in the memory which require 1, 3, and 4 modules respectively. Let us compare the partitioned system and the distributed system. If we use the random distribution assumption, we can apply Ravi's equation [38] to compute the memory bandwidth. For the distributed system, the bandwidth will be 8[1 - (1 - 1/8)^(3s)], and for the partitioned system, it will be 1 + 3[1 - (1 - 1/3)^s] + 4[1 - (1 - 1/4)^s]. We list their numerical values in Table 18. Notice that, although these values increase as s gets larger, the number of words a processor can fetch from the memory in a certain unit of time, say one microsecond or one processor cycle, is actually reduced. In Table 18, we also list the average number of words each processor can get in one processor cycle. This is done by dividing the bandwidth by 3s, since the memory is s times slower than the processor and there are three jobs in the memory. We can see this "normalized bandwidth" is indeed decreasing when we increase s. This is because the memory cycle time is doubled when we double the speed ratio s; however, the values we get by using Ravi's equation do not double at the same time. Therefore, it in fact takes longer to fetch the same number of words out of the memory if s becomes larger. This is why the average turnaround time increases when we increase s. Moreover, as we can see from Table 18, the normalized bandwidth (number of words per processor cycle) of the partitioned system decreases faster than that of the distributed system. Thus, the average turnaround time of the partitioned system degrades faster than that of the distributed system. For the mixed system, the situation is very similar to the partitioned system except the turnaround time is a little bit better. This is, of course, because the mixed system has a rather similar way of storing the jobs but with a little bit better memory space utilization.

[Figure 36. The Effect of the Memory-Processor Speed Ratio. Ta (sec.) versus s (larger s = slower memory) for PART., MIX., DIST.; M=1024, MONO., p=8, SJF, m=24, LA=4, 42 ms, ASF=0.1, r=4, 800 jobs, FULL connection.]

Table 18. The Total Memory Bandwidths (Accesses per Memory Cycle) and the Number of Words a Processor Can Get per Processor Cycle for 5 Different s Values. (3 Jobs in 8 Memory Modules; each cell gives bandwidth, then words per processor cycle.)

    System        s=1          s=2          s=3          s=4          s=5
    Distributed   2.64  0.88   4.41  0.73   5.60  0.62   6.39  0.53   6.92  0.46
    Partitioned   3.00  1.00   4.42  0.73   5.42  0.60   6.14  0.51   6.65  0.44

However, we can also look at Figure 36 from the opposite angle.
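Before turning Figure 36 around, the bandwidth expressions above are easy to check numerically. The sketch below (ours, not part of the report) recomputes the Table 18 entries, assuming only what the text implies: three active processors, each issuing s independent, uniformly spread requests per memory cycle.

    # Recompute Table 18 from the bandwidth expressions quoted above:
    # three jobs occupy 1, 3, and 4 modules of an 8-module memory.

    def distributed_bw(s, m=8, jobs=(1, 3, 4)):
        # All 3s requests are interleaved over the whole memory: m[1 - (1 - 1/m)^(3s)].
        return m * (1 - (1 - 1.0 / m) ** (len(jobs) * s))

    def partitioned_bw(s, jobs=(1, 3, 4)):
        # Each job is confined to its own k modules:
        # 1 + 3[1 - (1 - 1/3)^s] + 4[1 - (1 - 1/4)^s].
        return sum(k * (1 - (1 - 1.0 / k) ** s) for k in jobs)

    for s in range(1, 6):
        d, p = distributed_bw(s), partitioned_bw(s)
        # Normalized bandwidth: words per processor cycle = bandwidth / 3s.
        print(f"s={s}:  distributed {d:.2f} ({d / (3 * s):.2f})   "
              f"partitioned {p:.2f} ({p / (3 * s):.2f})")

The printed values agree with Table 18 to two decimal places, which confirms how the normalized column of that table was obtained.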
When we increase the memory speed, or decrease s, the average turnaround time of the partitioned system improves faster than that of the other two 163 systems. For example, in this case the Ta of the partitioned system drops from 146 down to 83 as we reduce s from 5 to 2. That is a reduction of 43%. For the mixed system and the distributed system, the reductions are 39% and 12%, respectively. Therefore, the memory speed is a very important factor when we are using the partitioned scheme. After all, the partitioned system might yield a better bandwidth when the speed ratio is very small. We can see this, for example, in the s=l column of Table 18. Recently, the memory technology has provided system designers with faster, cheapter, and higher density semiconductor memories. It is now economically feasible to design a system which operates in the small speed ratio region. This makes the partitioned scheme more attractive to use, since it provides a very high reliability as well as competitive performance The second speed parameter we are going to look at is the average period of time spent on an I/O operation, or what we call the average unit I/O time. We have shown some effect of the average unit I/O time in the last section. Now, let us look at how this parameter affects the system performance. Figure 37 shows the Ta curves versus the average unit I/O time for all three memory allocation schemes. All the curves have roughly the same increasing rate as we increase the average unit I/O time. This rate is larger than linear. We have explained that the reason is that this average unit I/O time has a two fold effect on the average turnaround time since it not only increases the total I/O time but also increases the queueing time indirectly. So, as we use slower I/O devices, the average turnaround time degrades rather quickly. One figure we should pay some attention to is when we halve the 164 Ta (sec.) 180 160 140 120 100 30 60 40 20 11=1024 MONO. p=8 SJF m=24 LA=4 s=4 ASF=0.1 PART, MIX. DIST. Average Unit J I/O Time (ms.) 21 28 35 42 49 Figure 37. The Effect of the Average Unit I/O Time. 165 unit I/O time from 42 ms to 21 ms the average turnaround times all decrease by more than 40%. This is a larger effect than that of halving the memory cycle time. Therefore, using a faster I/O device will have a more sig- nificant improvement on the performance, at least given the I/O-execution time balance of our job load. Needless to say, the effect will be even larger if we are dealing with an I/0-bound job mix. Of course, the type and numbers of I/O devices will depend on the cost and the resulting performance. For example, in Figure 34, we can see that using two 21 ms I/O devices can yield a better turnaround time than using four 42 ms I/O devices, and if the faster I/O device does not cost more than twice of the cost of the slower I/O device it is obvious that the faster I/O will be a better choice. However, it might be extremely expensive to replace a slower I/O device by a faster I/O device, since it may involve the replacement of the I/O controller and some very expensive equipment. So, it is very important to understand the effect of the I/O device before we can decide what to use in a system. 3.2.3 Partial Connection In all the discussions we have up to this point, we assume our system to have a fully connected switching network (full connection in short), e.g., a crossbar network. 
As we described earlier, in this kind of system a processor is physically connected to all the memory modules and can access any module if it is allowed to do so. So, the operating system can freely assign any module to any processor as long as the resource management policy is not violated. However, the cost of a full connection will grow very quickly as we expand the size of the system. This is the 166 price we shall have to pay in order to maintain that availability. Now, let us look at another kind of architecture which is a cheaper and more flexible alternative for interconnecting the processor and the memories, namely, the partial connection network. The best example of a partial connection is the multiport memory network used by the PRIME system, which we showed in Figure 1. We briefly talked about the advantages and the disadvantages of a partial connection network a few times in the early chapters. In this section, we are going to elaborate more about this sub- ject. Or course, the biggest (perhaps the only) disadvantage of the partial connection is performance degradation. The performance degrades when we reduce the connections between the processors and the memory modules. The main reason is that the utilization of the available memory has been seriously restricted by the partial connection. Very often, a job cannot be put into the memory because no processor has enough free space connected to is, although the total unused space is larger than what this job is requesting. This can be explained by a simple diagram. Figure 38 shows a partially connected system with three processors and six two-port memory modules which are interconnected in a uniform way similar to that in the PRIME system, and two jobs a and b are occupying four modules as shown in the figure. Suppose a third job arrives which requires two modules. It cannot enter the memory since the third processor has only one unoccupied module attached to it, although there are two free modules available in the system. Obviously, the memory will be wasted due to this incomplete inter- connection. As a result, the system performance is degraded. The memory waste caused by the partial connection is quite dif- 167 PI 2 3 4 .MEMORY MODULES Figure 38. A Partial Connection Network 168 ferent from that caused by the monoprogramming scheme . The memory waste here is created by the inaccessibility of a processor to a memory module. For example, in Figure 38, the fourth module is wasted since processor 3 does not connect to this module. Therefore, if the processors that connect to a certain module are all assigned jobs, then the unused portion of this module, probably the whole module, is wasted. The memory waste highly depends on the number of processors at- tached to a module. If we reduce the number of processors that can access a module, the probability that part or all of this module will be wasted is increased. Of course, when the memory waste increases the system performance gets worse. In our discussion of the partial connection, we will always assume all the ports of a module are connected to processors and none is left un- used. So, the number of processors connected to a memory module is equiva- lent to the number of ports the module has. We will also assume that all the modules are identical with a number of ports which is no more than the total number of processors in the system. 
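The blocking situation of Figure 38 can be verified mechanically. The sketch below is only an illustration: the figure itself is not reproduced here, so the particular modules assumed for jobs a and b are our own choice, picked to leave processor 3 with exactly one free attached module, as the text describes.

    # Figure 38: 3 processors, 6 two-port modules, uniform (4,4,4) connection.
    # Modules are numbered 1..6; the occupancy below is an assumed arrangement
    # consistent with the text, not taken from the figure itself.
    connected = {1: {1, 2, 3, 4}, 2: {3, 4, 5, 6}, 3: {1, 2, 5, 6}}
    occupied = {1, 2} | {3, 5}      # job a on processor 1, job b on processor 2
    busy = {1, 2}                   # processors already running a job

    def can_admit(modules_needed):
        free = set(range(1, 7)) - occupied
        for p, mods in connected.items():
            if p not in busy and len(mods & free) >= modules_needed:
                return p            # an idle processor sees enough free modules
        return None

    print(sorted(set(range(1, 7)) - occupied))   # two free modules: [4, 6]
    print(can_admit(2))                          # None: the 2-module job is blocked

Processor 3 is idle but sees only one free module, and the other free module is reachable only from the busy processors, so memory is wasted exactly as described above.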
Since in a partially connected system all the processors might not connect to the same number of memory modules, we will use an array of integers to represent such a system, which tells us how many modules each processor is connected to. For example, we will represent the system in Figure 38 by (4,4,4). In fact, the order of these integers is not important since the processor numbers are arbitrarily assigned. Notice that the sum of these integers is equal to the total number of ports in the memory. Of course, these numbers do not reveal the information about how the connections are being made. If the details of the connection are important to the discussion, we will use a method called the "connection matrix" to represent the connection. But in general, the connection will be very uniform, like that in Figure 38.

The connection matrix is a succinct representation of a partial connection. The matrix has p rows and m columns, with each entry indicating the connection between a processor and a memory module. If a connection is made between processor i and module j, we will put a 1 at position (i,j); otherwise, the entry will be 0. For example, the partial connection in Figure 38 can be represented by the following 3 by 6 connection matrix:

    111100
    001111
    110011

Notice that all the column sums are equal to the number of ports of a module, and each row sum is equal to the number of modules connected to the corresponding processor.

Figure 39 shows how the average turnaround time degrades when we reduce the number of ports of each memory module, or equivalently the number of processors connected to a memory module. When the number of ports per module is 8, it is equivalent to the full connection since we are using 8 processors. If we reduce the number of ports to 4, only half of the processors will connect to every module. Here, we use a (12,12,12,12,12,12,12,12) connection. The processors are connected to the memory modules in a uniform way: each processor is connected to 12 consecutive modules, with the leading module being skewed three modules to the right of the previous leading module. Using the connection matrix representation, this connection can be expressed by:

    111111111111000000000000
    000111111111111000000000
    000000111111111111000000
    000000000111111111111000
    000000000000111111111111
    111000000000000111111111
    111111000000000000111111
    111111111000000000000111

[Figure 39. The Performance Degradation of the Partial Connection System. Ta (sec.) versus ports per module for PART., MIX., DIST., under MONO. and MULTI.; M=1024, SJF, p=8, LA=4, m=24, ASF=0.1, 42 ms, 800 jobs, s=2, r=4.]
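These uniform, skewed connections are easy to generate and check by machine. The following sketch is our own illustration (the function name and representation are not from the report); it rebuilds the 4-port matrix above from its skew rule and verifies the row and column sums.

    # Build the (12,12,12,12,12,12,12,12) connection: 8 processors, 24 modules,
    # each processor wired to 12 consecutive modules (wrapping around), with the
    # leading module skewed 3 positions to the right for each successive processor.
    def skewed_connection(p=8, m=24, span=12, skew=3):
        matrix = [[0] * m for _ in range(p)]
        for i in range(p):
            for k in range(span):
                matrix[i][(i * skew + k) % m] = 1
        return matrix

    conn = skewed_connection()
    for row in conn:
        print("".join(map(str, row)))                # reproduces the matrix shown above

    assert all(sum(row) == 12 for row in conn)       # modules per processor
    assert all(sum(col) == 4 for col in zip(*conn))  # ports per module

With span=8 and skew=4 the same generator produces the first six rows of the three-port connection discussed next.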
Again, the connection can be expressed by the fol- lowing connection matrix: 111111110000000000000000 000011111111000000000000 000000001111111100000000 000000000000111111110000 000000000000000011111111 111100000000000000001111 111111111111000000000000 000000000000111111111111 172 Similarly, we use a (4,4,4,4,4,4,12,12) connection when the number of ports is 2. This network is connected in the same way as the last one except now the first six processors are occupying six disjoint groups of modules. Here is the connection matrix for this connection: 1 1 1 1 00000000000000000000 00001 1 1 10000000000000000 000000001 1 1 1 000000000000 0000000000001 1 1 1 00000000 00000000000000001 1 1 1 0000 000000000000000000001 1 1 1 111111111111000000000000 000000000000111111111111 l_ All three of these partial connections are chosen simply because they are very symmetric. Of course, there are many other possible connec- tions in each case, and we will look at some of them later. In Figure 39, we show the turnaround time curves for all six combinations. As we can see, all six curves deteriorate when we reduce the number of ports of each module. However, many interesting and important results are shown in this figure which we are going to point out here. Perhaps the most noticeable result is what happens to the two groups of curves, namely, the monoprogramming curves (the solid ones) and the multiprogramming curves (the dotted ones). When we use full connection (number of ports equal to 8), the multiprogramming results are all better than their corresponding monoprogramming results. But as we reduce the number of ports, the multiprogramming curves get worse rather quickly. For example, when we reduce the number of ports to 2, the distributed system curve in- creases by about 75% which is the worst among all six curves. The mixed system and partitioned system curves increase by 50% and 53%, respectively. 173 In the meantime, the monoprogramming curves increase by relatively small percentages, 39% for the distributed system curve, 21% for the mixed system curve, and only 10.6% for the partitioned system curve. When the number of ports is reduced to 2, all the monoprogramming results are well below the multiprogramming results. The monoprogramming curves win by a margin of roughly 30%. Obviously, multiprogramming is more sensitive to the connec- tion. The degradation of a monoprogramming curve can be explained by the memory waste caused by the reduction of connections. Table 19 shows the word memory utilizations of all the systems in Figure 39. The difference between the utilizations of a partially connected monoprogrammed system and a fully connected monoprogrammed system is, of course, the percentage of memory being wasted due to the reduction of connections. As we can see, in all the monoprogramming rows, the memory utilization is strictly de- creasing, which means the memory waste is increasing. However, the memory waste is rather small. This is why the degradations of the monoprogram- med systems are small. Apparently, the reduction of connections only has little effect on the memory utilization of a monoprogrammed system. Table 19, however, does not tell us the memory waste of a multi- programmed system since all the memory utilizations for mul tiprogrammed sys- tems are increasing. This does not mean that no memory is wasted in a mul ti- programmed system. Some memory, although it might be small, must be wasted when we reduce the number of ports. 
Since a module can only be used by part of the processors, it is very likely that this module will not be used up when the processors that attach to it are all occupied. Therefore, some other factor is causing the increase in the memory utilization in a partial connection, multiprogrammed system.

Table 19. The Word Memory Utilization of All the Systems in Figure 39.

    Number of Ports per Module:       8 (Full)   4      3      2
    Partitioned     Mono.             52.5       52.0   51.5   51.1
    Partitioned     Multi.            52.0       53.5   54.2   64.7
    Mixed           Mono.             53.4       52.7   52.6   51.6
    Mixed           Multi.            53.8       56.2   57.0   68.1
    Distributed     Mono.             51.5       51.0   50.4   50.2
    Distributed     Multi.            52.2       53.5   58.1   66.5

In fact, it is not difficult to see: the memory utilization goes up because every job is now using the memory for a longer time. We have explained a similar phenomenon in Section 3.1.3 when we discussed Table 7. Let us show the average service time, i.e., the time a job spends in the memory, of each multiprogrammed case in Table 20. As we can see, the service time ts increases rather rapidly as we reduce the number of ports. The longer residence of each job in the memory thus results in a higher memory utilization. This is why the memory utilization is increasing instead of decreasing. Apparently, the memory waste due to the partial connection has been covered by this increase. Therefore, we cannot use memory waste to fully explain the degradation of the average turnaround time of a multiprogrammed system.

Interestingly, when we look at the statistics gathered from simulation outputs, we find that the queueing time a job spends waiting for a processor increases almost at the same pace as the average service time increases. In other words, the increase of the service time comes from the increase of the queueing time waiting for a processor. In Table 20, we show this queueing time q together with the average service time. From Table 20, we can see the queueing time for a processor can be as large as 23% of the total service time (19.1/82.2 for the distributed system). Obviously, this queueing increase is a big factor that causes the performance degradation of a multiprogrammed system. However, there is no queueing time for a processor in a monoprogrammed system since each job has its own dedicated processor. The degradation of a monoprogrammed system simply comes from the memory waste. This explains why the monoprogramming curves deteriorate more slowly than their multiprogramming counterparts. The serious queueing delay when the number of ports is small, especially when it is 2, is the major reason why the multiprogramming curves are significantly higher than the monoprogramming curves.

Table 20. The Average Service Time (ts) and Queueing Time for Processors (q) of the Multiprogrammed Systems.

    Number of Ports per Module:       8 (Full)   4      3      2
    Partitioned     ts                68.4       69.6   70.5   81.2
    Partitioned     q                 0.3        1.6    2.6    13.6
    Mixed           ts                70.6       73.3   74.7   87.7
    Mixed           q                 0.4        3.2    4.8    18.1
    Distributed     ts                63.9       70.3   74.5   82.2
    Distributed     q                 0.8        7.2    11.4   19.1

Now, let us explain the cause of this queueing delay. Since in a multiprogrammed system there might be more jobs in the memory than the number of processors, competition for a free processor is bound to happen. If all the processors are busy when a job comes in or returns from I/O, this job certainly will have to wait until some processor becomes free. The situation in a full connection system is simple since a job can be executed by any processor.
So, all the pending jobs will wait in a single queue and get served on a first-come-f irst-serve basis. In a full connection case, the probability that all processors are executing jobs is rather small, since in the first place the probability that more than eight jobs in the memory is not too large, and secondly, a job will spend a significant amount of time in doing I/O. Therefore, the queueing time for a free processor will be small in this case. As we can see in the first column of Table 20, this is indeed the case. Moreover, since the par- titioned system allows the smallest number of jobs in the memory, its queueing time is the smallest among the three allocation schemes. However, the situation is much more complicated in a partial connection network. A job cannot be executed by every processor since not every processor can access all the memory modules this job occupies. In fact, the number of processors that can execute a job is bounded by the num- ber of ports of a memory module. Mery often, a job can only be executed by one particular processor! This is especially true when the number of 178 ports is small. Let us look at an example in Figure 40, where we show a portion of a partial connection, multi programmed system. As we can see, job c can be executed by both processors 3 and 5, but jobs a and b can only be executed by processors 5 and 4, respectively. So, for example, job "a" cannot be executed if processor 5 is executing another job. In other words, a job can only queue for the few processors that can execute it in a partial connection, multiprogrammed system. This certainly will cause a serious queueing delay. If we keep reducing the number of connections this situation will get worse and worse. This is why the queueing time for processor grows so fast when the port number is decreased, which in turn degrades the turnaround time. Therefore, the monoprogramming scheme is more superior than the multiprogramming scheme if we are using partial connection, especially when the number of ports per module is small. This contradicts what happens in full connection systems. Now, let us concentrate on the monoprogramming curves in Figure 39. In the full connection case, we can see the partitioned scheme yields the worst result and the distributed scheme yields the best. We have explained this in terms of memory bandwidth in the first part of this chapter. How- ever, the situation starts changing when we use the partial connection. In the two- port memory connection, it is completely reversed, i.e., the parti- tioned scheme shows the best turnaround time, and the distributed scheme shows the worst. This phenomenon again can be explained in terms of memory bandwidth. Let us reuse the example in Figure 40. In fact, what we show there is the picture of a distributed system if we assume the jobs are • • • 777777/ a c T7L PROCESSORS • • • (MEMORY WASTE) MEMORY MODULES 179 Figure 40. An Example of a Partial Connection, Multi programmed System. 180 allocated memory in the order a, b, and c. The biggest difference we can see is the degree of interleaving each job can have. For example, job a can only be interleaved in three modules if processor 5 does not connect to any other module. However, in a full connection, each job can be inter- leaved into as many modules as the system has. This reduction of the degree of interleaving drastically decreases the memory bandwidth of the distri- buted system. 
If, on the other hand, we use the partitioned scheme or the mixed scheme, each job can also be allocated and interleaved in a similar number of modules, although in general a little bit fewer. That means both the partitioned scheme and the mixed scheme can have bandwidth comparable to the distributed scheme. This is particularly true when the number of ports per module is small. But, the most important thing is, in a partitioned system there is no memory conflict between any two jobs. In the other two schemes, especially the distributed scheme, memory con- flict will occur in those shared modules, which results in the degradation of the memory bandwidth. This is why the partitioned system shows the best turnaround time on the left half of the figure. Therefore, we can come to the following two conclusions about the partial connection. First, monoprogramming is better than multiprogram- ming due to no queueing for processors in the former case. Second, partitioned scheme outperforms the other two schemes due to no memory conflict. Interestingly, both of these results completely reverse the situation in the full connection system. In the first part of this chapter, we kept em- phasizing the advantage of monoprogramming and the partitioned scheme. However, the slightly better performance by multiprogramming and the distributed scheme tends to make these advantages look rather debatable. But, no doubt about 181 it, a partial connection system, monoprogramming and partitioned scheme are better from every aspect. Perhaps the most important result in Figure 39 should be the small degradation of the partitioned, monoprogrammed system. When we re- duce the number of ports from 8 to 2, the curve only increases by 10.6! Hence, we save 75% of the cost of the connection network but sacrifice only 10.6% of the performance. This is a tremendous improvement on the cost- effectiveness. Therefore, from a cost-effectiveness point of view, the partial connection system is a better architecture for system design. Now, let us discuss, from the memory utilization point of view, why such a low degradation can be achieved. Recall in Figure 39, we used a (4,4,4,4,4,4,12,12) connection when each module has only two ports. Each of the first six processors connects to four modules, and no two processors connect to the same module. This essentially partitions the whole system into six disjoint subsystems as far as these six processors are concerned. We can see this from the connection matrix we showed earlier, Of course, these processors then can only handle jobs of size less than or equal to four modules. Table 21 shows the job size distribution of the job mix we are using, where each job is counted toward the number of modules it will re- quire under the partitioned scheme. We can see 81.3% of the jobs are of size less than or equal to four modules (170K bytes). So, most of the jobs can be handled by these six processors. Only the remaining 18.7% of the jobs, i.e., the large jobs, have to be handled by the other two processors, where each of these two processors is connected to 12 modules (51 2K bytes). We let each of these two "large" processors share memory with three of 182 Number of Modules Density 1 2 3 4 5 6 7 8 9 10 £11 .165 .146 .326 .176 .105 .010 .031 .019 .009 .003 .010 Module Size = 42 2 /3 K Bytes Table 21. The Job Size Distribution 183 the other six "small" processors. This arrangement certainly might cause some trouble for the large jobs. 
If a large job is under consideration for entering the memory but none of these large processors have enough room, even though together they have enough free space, we have to delay this job until a processor has gained enough memory by itself. In other words, a bad distribution of the small jobs in the memory might block a large job from entering. For example, if four 4-module jobs have occupied the first, second, fourth, and fifth processors, respectively, a 5-module still cannot enter since none of the last two processors can allocate a chunk of five modules to this job. So, a large job will experience more difficulties than it does in a full connection. This arrangement also has its own advantage. Apparently not all of the jobs running on the small processors will use exactly four modules, the unused space can be chained together by the large processor to make room for another job. Of course, most likely a small job will be chosen again since the space usually might not be large enough for a large job. So, a small job which requires four memory modules or less in fact can be assigned to any proces- sor in the system. Since the small jobs constitute an absolute majority of the job mix, the (4,4,4,4,4,4,12,12) connection should still allow a pretty good memory utilization due to the reason we mentioned above. Table 12 indeed shows that the memory utilization of this connection only degrades a little bit from the utilization of the full connection system. This is the reason why the turnaround time increases by just a small percentage. From Table 21, we can see that 10.5% of the jobs require more than four modules but no more than five modules. People might wonder 184 if the system should perform better if we assign a few more modules to some of the small processors so the can handle this 10.5% of the jobs and alleviate the traffic in front of the two larger processors. We have col- lected the results for several slightly different connections, and we found the results are more or less the same as that of the (4,4,4,4,4,4,12,12) connection, despite the inclusion of some 5-module processors. For example, Table 22 shows the results for two other connections, (3,3,4,4,5,5,12,12) and (3,3,3,5,5,5,11,13), which are very similar to the first connection. Apparently, the memory sharing of a large processor with three 4-module processors can take care of the 5-module jobs very well. But most impor- tantly, we gain some confidence that these results are indeed in a reasonable region. Of course, the job size distribution is a very important factor to the performance of a partial connection. As we said earlier, 81.3% of the jobs are of size less than or equal to four memory modules which is the major reason why the (4,4,4,4,4,4,12,12) connection can have good performance. However, if we increase the size of each job, the performance of this con- nection might degrade rather severely since more jobs now have to enqueue for those two large processors. In Table 23, we show some results of monoprogrammed, partitioned system when we increase the job size by 25% and 50%. We can see the turnaround time of the (4,4,4,4,4,12,12) connection indeed degrades very quickly. It increases by 33% when we increase the job size by 25%. In the last row of that table, we can see that the percen- tage of the jobs with sizes less than or equal to four modules is now reduced down to 75.9%, which obviously is the reason of the performance degradation. 
Table 22. The Average Turnaround Times for Two Different Connections. (Monoprogramming, with All Other Parameters of Figure 39)

    Allocation Scheme   (4,4,4,4,4,4,12,12)   (3,3,4,4,5,5,12,12)*   (3,3,3,5,5,5,11,13)**
    Partitioned         94                    95                     96
    Mixed               97                    90                     94
    Distributed         100                   101                    109

    * Connection matrix:
    111000000000000000000000
    000111000000000000000000
    000000111100000000000000
    000000000011110000000000
    000000000000001111100000
    000000000000000000011111
    111000111100001111100000
    000111000011110000011111

    ** Connection matrix:
    111000000000000000000000
    000111000000000000000000
    000000111000000000000000
    000000000111110000000000
    000000000000001111100000
    000000000000000000011111
    000111111111110000000000
    111000000000001111111111

Table 23. The Effect of Job Size on the Average Turnaround Times of Three Different Connections. (Monoprogramming, Partitioned Scheme)

    Connection                                 Job Size Scaling:  1.00   1.25   1.50
    (4,4,4,4,4,4,12,12)                                           94     126    237
    (3,3,4,4,5,5,12,12)                                           95     119    162
    (2,2,5,5,5,5,12,12)*                                          106    128    137
    % of jobs with size <= 4 memory modules                       81.3   75.9   58.6

    * Connection matrix:
    110000000000000000000000
    001100000000000000000000
    000011111000000000000000
    000000000111110000000000
    000000000000001111100000
    000000000000000000011111
    001111111111110000000000
    110000000000001111111111

If we increase the job size by 50%, we can see the turnaround time increases drastically to 237, which is 152% higher. This is because 41.4% of the jobs now require more than four modules of memory. Apparently, the system is saturated under this circumstance. If we use the (3,3,4,4,5,5,12,12) connection, i.e., assign one more module to each of the fifth and sixth processors, we can see the situation is much better when we increase the job size. It only degrades by 70% if we increase the job size by one-half. In fact, we can see in Table 23 that the (2,2,5,5,5,5,12,12) connection gives us the best result for the enlarged job size. It degrades just 30% for a 50% increase of the job size. In other words, using more 5-module processors can result in a smaller degradation.

Therefore, how to assign memory modules to processors really depends on how the job size is distributed. It is very difficult to formulate an equation and try to solve for an "optimal" solution of a partial connection. The only rule of thumb is to look at the job size distribution and partition the memory ports so that enough processors have sufficient memory space to handle most of the jobs. In other words, try to assign the memory so that no severe bottleneck will be created at any processor. For example, before we scale the job size, four memory modules will be sufficient for a processor since 81.3% of the jobs are smaller than or equal to four memory modules. However, when we increase the job size, we need to connect more modules to some processors so they can handle larger jobs. Our results indeed show that this approach is generally correct. Of course, some of the processors will obtain less memory because the total number of ports is fixed. For more than two ports per memory module, the same kind of approach can also be used, except each processor can then connect to more memory modules. The memory utilization will be better, and hence the performance will be improved.

One more interesting thing about the result shown in Table 23 is that the performance of a partial connection system is very sensitive to the increase in job size.
For example, for a 50% increase, the average turnaround times of these three connections increase by 152%, 70%, and 30%, respectively. If we refer back to Table 12, we can see the turnaround time of a full connection system only degrades 15% when we increase the job size by 50%, which is significantly lower. So, when we are planning to use a partial connection network, we ought to be very careful about the job size distribution and use enough memory in order to achieve a satisfactory level of performance.

Finally, let us redo Figure 35 for the partial connection system, i.e., find out the effect of system size on the system performance. Figure 41 shows how the average turnaround time changes when we double the system size. The solid curves use an arrival rate ratio of 2, just as we did in Figure 35. We again use the (4,4,4,4,4,4,12,12) connection and an arrival scaling factor of 0.1 for the (8,24,4) system. For the (4,12,2) system, we use a (4,4,4,12) connection, which is exactly one-half of the (4,4,4,4,4,4,12,12) connection, and an arrival scaling factor of 0.2. For the (16,48,8) system, we use a (4,4,4,4,4,4,4,4,4,4,4,4,12,12,12,12) connection and an arrival scaling factor of 0.05. So, we again double the workload when we double the system size. As we can see, the Ta curves drop quickly when we double the system size, which is very similar to Figure 35. But, the interesting thing is that the curve of the distributed system is now most sensitive to the reduction of system size. In Figure 35, however, the curve of the partitioned system degrades quickest when we decrease the system size.

[Figure 41. The Average Turnaround Time for Different System Sizes. (Using Partial Connection.) Ta (sec.) versus (p,m,r) = (4,12,2), (8,24,4), (16,48,8) for PART., MIX., DIST.; MONO., SJF, LA=4, ASF(8,24,4)=0.1, 42 ms, s=2.]

Again, we increase the arrival scaling factor of the (4,12,2) system in order to lower its turnaround time. Very surprisingly, when we use 0.28, which is the same scaling factor we used in Figure 35, the turnaround time of the (4,12,2) system drops close to that of the (8,24,4) system. On the other hand, when we decrease the arrival scaling factor of the (16,48,8) system to 0.0395, the turnaround time becomes roughly the same as that of the (8,24,4) system. This implies that our 2^1.5 conjecture still holds in a partial connection system. Of course, the result we get here is based on one particular set of connections. Although the performance of a partial connection system is very sensitive to the connection we use, at least we know it is possible to connect the system so that a double-sized system can carry a workload 2^1.5 times the workload of the original system. Hence, we now gain more confidence in this simple conjecture.

Chapter 4
CONCLUSION

4.1 Summary

In the last chapter, we have discussed several interesting problems about the design of a multiprocessor system. We talked about the performance of multiprogramming and monoprogramming schemes, the advantages and disadvantages of three different memory allocation schemes, the effects of job parameters and hardware characteristics, and the difference between using a full connection network and a partial connection network. We will briefly summarize these results in this section. Tables 24-27 summarize and compare the performances of six system combinations under both full and partial connections.
Each table will show the comparison of one performance parameter. It is rather difficult to order the performance of various systems, so we will only use {bad, fair, good, best} or {high, moderate, low} to indicate their relative performances. However, we do point out the system which yields the best results in order to give the reader an idea which system might be the best choice for each area of performance. Table 24 shows the comparison of the average turnaround time. Of course, the turnaround time of a monoprogrammed system depends heavily on the number of processors. We will assume that there are enough processors, say 8, in the system. Under a full connection, the distributed, multi programmed system has the best turnaround time. As we explained earlier, this is caused by high memory bandwidth and high memory utilization. The distributed, mono- programmed system has the next best turnaround time. Obviously, this is because we are assuming eight processors in the system. If we assume fewer 192 """-^^^^ Connection System ^-^-^^ Full ** Partial Partitioned Bad Good Mono- programming Mi xed Good Good Distributed * Near Best Good Partitioned Fair Fair Multi- programming Mixed Good Fair Distributed Best Bad * Assuming 8 Processors ** Assuming 2 -Port Memory Table 24. Comparison of the Average Turnaround Time. 193 processors, the turnaround time will degrade a little until we reduce the number of processors below four (cf. Figure 32). So, both distributed sys- tems perform very well if we use a full connection network. Both mixed sys- tems also have good turnaround times but are worse than the distributed systems. This is due to a smaller memory bandwidth produced by the mixed scheme. On the other hand, the partitioned systems both perform even worse than the mixed systems, although not by much. This is caused by bad memory utilization and memory waste of the partitioned system. Overall, a multi- programmed system is slightly better than its monoprogrammed counterpart, and the distributed scheme yields the best result. The performance of a full connection system is essentially determined by the memory utilization and the memory bandwidth. Under a partial connection, however, the whole situation is reversed. All the monoprogramming results are better than the multiprogram- ming results when the number of ports is reduced to two. As we explained in the last chapter, the major reason is the queueing time for processors created in the partial connection, multi programmed system. But in a partial connec- tion, monoprogrammed system, there is no queueing time for processors since each job is assigned a dedicated processor. We can see that the partitioned scheme yields the best result in a partial connection, monoprogrammed system. The interesting thing is that it has the worst performance in a full connec- tion, monoprogrammed system. So, the connection has really changed the result. The mixed, monoprogrammed system also shows yery good performance which is similar to the partitioned, monoprogrammed system. In fact, all monoprogrammed results are \/ery close to each other. We can see this from the results shown in Section 3.2.4. On the other hand, all the multi programmed 194 systems perform relatively poorly when we use a partial connection. The most important result is, if we properly interconnect the processors and memories, the performance degradation of a partial connection, monoprogrammed system can be kept to within 10 to 20% of the performance of a full connection system. 
For example, we have shown that the (4,4,4,4,4,4,12,12) connection only creates 10.6% of degradation on the turnaround time of the partitioned, monoprogrammed system. This not only encourages us to use partial connection since it is more cost-effective, but also makes the partitioned scheme and monoprogramming more attractive in operating system design. Table 25 shows the comparison of total memory bandwidth, i.e., the memory bandwidth generated by all active processors. Under a full con- nection, the distributed scheme can yield relatively high memory bandwidth due to the high degree of interleaving each job can enjoy. The memory band- widths of both the partitioned and mixed schemes are lower than that of the distributed scheme since now each job is confined in part of the memory and the degree of interleaving has been reduced. As we said, this is the major reason why the distributed system can have better turnaround time than the other two systems. However, if we use faster memory, i.e., reduce the value of s, the difference between the bandwidths of the distributed and parti- tioned systems will be reduced. Their turnaround times hence become closer to each other. This can be seen in Figure 36. Of course, the total memory bandwidth of a mul ti programmed system is higher than that of its monoprogrammed counterpart since the mul ti prog rammed system on the average can contain more active jobs in the memory. Under a partial connection, the total memory bandwidth of every 195 Systerr Connection Full Partial Partitioned Low Low Mono- programming Mixed Moderate Moderate Distributed High Moderate Partitioned * Moderate Low Multi- programming Mi xed High Moderate Distributed Highest Moderate * Assuming Large m ( Low for Smal 1 m ) Table 25. Comparison of the Total Memory Bandwidth. 196 system will decrease. This is due to the decrease of memory utilization. For a distributed system, the memory bandwidth has been further decreased by the reduction of the degree of interleaving since a job can no longer be interleaved across the whole memory in a partial connection. Now, the total memory bandwidths of the mixed and distributed systems are similar because they have similar capability of containing jobs and similar degree of inter- leaving for each job. One thing we need to explain is the total memory bandwidth of a partial connection system. Intuitively, a multi programmed system should have a higher total memory bandwidth than its monoprogrammed counterpart. This is because a multi programmed system can allow more jobs in the memory at the same time which can cause a higher utilization of processors. How- ever, our simulation result shows that a monoprogrammed system has almost the same total memory bandwidth as a mul tiprogrammed system if we use partial connection. This rather surprising result actually is not difficult to explain. As we said in Section 3.2.3, the major factor that makes a partial connection, mul tipgorammed system have a worse turnaround time than its monoprogrammed counterpart is the queueing time for processors (see Table 20) This queueing time is caused by the fact that in a partial connection system ewery job can only be executed by a few processors which connect to all the memory modules this job is in. In other words, a job will have to wait if the processors that can handle it are all busy, even though some other processors are free. This queueing phenomenon essentially reduces the number of jobs that can be executed at the same time. 
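The rule just stated — a job can execute only on a processor that is connected to every module the job occupies — is easy to express directly. The fragment of connections and job placements below is a hypothetical stand-in in the spirit of Figure 40, not the actual figure data.

    # Processors eligible to execute a job in a partial connection system.
    def eligible_processors(connection, job_modules):
        # connection: processor -> set of modules it is wired to
        return [p for p, mods in connection.items() if set(job_modules) <= mods]

    # Hypothetical fragment echoing Figure 40: job c can run on processors 3 and 5,
    # jobs a and b only on processors 5 and 4 respectively.
    connection = {3: {4, 5, 6, 7}, 4: {6, 7, 8, 9}, 5: {2, 3, 4, 5, 6}}
    jobs = {"a": {2, 3}, "b": {8, 9}, "c": {4, 5}}

    for name, modules in jobs.items():
        print(name, eligible_processors(connection, modules))
    # a [5]   b [4]   c [3, 5]
    # The fewer ports per module, the shorter these lists become, so jobs queue
    # for particular processors even while other processors sit idle.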
Our simulation result shows that while there are more jobs in a multiprograirmed system, on the average both systems have approximately the same number of jobs in execution. The 197 extra jobs in the multi programmed system are either in the processor queue or in the I/O stage. Since the numbers of jobs in execution are roughly the same, both systems thus have similar total memory bandwidth. However, we know the total memory bandwidth represents the amount of work a system can do in one memory cycle time. It actually indicates how good the system throughput is, if we consider the system throughput as the number of jobs that get done in a certain unit of time. The higher the total memory bandwidth is, the faster the jobs can be done, and hence the higher the system throughput will be. So, the partial connection, multiprogrammed system should have the same throughput as its monoprogrammed counterpart does. Our simulation result indeed shows this since both of them have very similar total elapsed time, i.e., the total amount of time to finish all the jobs. In other words, the use of multiprogramming does not improve the system throughput if we are using a partial connection. As we said, the partial connection, multiprogrammed system has a worse average turnaround time due to the occurrence of the queueing time for available processors. Yet, it has the same throughput as the partial connec- tion, monoprogrammed system, which means it can finish the jobs at the same rate. If we look at the input and output of both systems, we can see both systems have the same arrival and departure rates. The only fact that can cause the difference of the average turnaround times is apparently the order these jobs get done. Although we are using the same scheduling algorithm in both systems to select jobs for execution, the queueing time for availa- ble processors in a partial connection, multiprogrammed system might delay the execution of some jobs and allow some other jobs to be processed faster. For example, a job that comes into the memory first and requires smaller 19S CPU time than any other job in the memory might not be finished first if it has to compete for a processor with some other jobs all the time. In a monoprogrammed system, this will not happen since the time a job will finish once it eneters the memory solely depends on its CPU and I/O time require- ments. In other words, the effect of the scheduling algorithm will be re- duced by the queueing delay in a multiprogrammed system. This is why a par- tial connection, monoprogrammed system has a better average turnaround time. Therefore, the average turnaround cannot always tell us how fast the system is doing work. Only the total memory bandwidth can indicate how good the system throughput is. Let us now summarize the memory utilization of these systems in Table 26. As we can see, the multiprogrammed systems have better memory utilization than the monoprogrammed systems, and the full connection systems have better memory utilization than the partial connection systems. This is what we would expect. From all the data we collected, the mixed and distributed systems both show rather similar memory utilizations. This is because both systems have the same capability of containing jobs in the memory, as we have mentioned several times. The partitioned system, however, has a significantly lower memory utilization, which is caused by the memory waste created during the whole-module allocation process. 
This is the major reason why the partitioned system yields the worst turnaround time when we are using a full connection network. Of course, a partial connection system has lower memory utilization than a full connection system since only some of the processors are connected to a memory module. So, if all the processors connected to a certain module are busy, then the unused portion of this module will be wasted. Or, if the 199 Systerr Connection Full Partial Partitioned Bad Bad Mono- programming Mixed Good Fair Distributed Good Fair Partitioned Good Fair Multi- programming Mi xed Best Good Distributed Best Good Table 26. Comparison of the Memory Utilization 200 unoccupied memory is split between several processors and none of these processors has by itself large enough space for the next job, then the unoccupied memory again will be wasted. This is what causes the performance to degrade. Fortunately, if we can use a good connection by considering the job size distribution, it is possible to keep the memory waste, and hence performance degradation, very low. The other performance parameter we often mentioned in the last chapter is the job memory bandwidth, which is the bandwidth each processor gets to execute a job. It is obtained by dividing the total memory band- width by the number of jobs in memory. As it turns out, the job memory band- width of the mixed and partitioned systems are only affected by the proces- sor-memory speeed ratio and the number of memory modules m. They remain essentially unchanged when we change the other system parameters. This is not surprising, since under these two schemes most or all of a job is iso- lated and prevented from the influences of the other jobs. Once a job gets into the memory, the speed ratio will affect the memory bandwidth this job can get since s determines the number of requests the processor can generate per memory cycle. On the other hand, m will determine the degree of inter- leaving for a job which also affects the bandwidth. However, the job memory bandwidth of the distributed system will be affected by almost eyery system parameter. Of course, it will be affected by the speed ratio s and the number of memory modules m. In addition, it will also be affected by parameters like job arrival rate, the average amount of time per I/O operation, the average number of I/O requests per job, monoprogramming or multiprogram- ming, and so on. All these parameters will affect the number of jobs in execution at the same time, which in turn will affect the memory bandwidth 201 due to mutial interference. Furthermore, the connection network also has a very significant effect on the job memory bandwidth. As we said before, a partial connection might drastically reduce the degree of interleaving of a job and will seriously decrease the job memory bandwidth. Overall, a full connection distributed, monoprogrammed system will yield the highest job memory bandwidth. This can be seen in Table 7 where we show some numerical values of the job memory bandwidth. In Table 27, we list the system which produces the best result in each performance area, under either a full or a partial connection. If we use full connection, the distributed, multi programmed system shows the best turnaround time, the largest total memory bandwidth, and the highest memory utilization. Only the distributed, monoprogrammed system displays the best job memory bandwidth. 
On the other hand, if we use partial connection, the partitioned, monoprogrammed system now shows the best turnaround time. Both the distributed, multiprogrammed system and the distributed, monoprogrammed system display the best total memory bandwidth. The best memory utilization and the best job memory bandwidth are still obtained by using the distributed (or mixed), multiprogrammed system and the distributed, monoprogrammed system respectively.

The full connection, distributed, multiprogrammed system seems to be a better choice, since it gives the minimum turnaround time. However, the partial connection, partitioned, monoprogrammed system is more cost-effective. Especially when the system size is large, the use of a partial connection network can reduce the system cost significantly. Moreover, a partially connected system is easier to maintain and expand. While we are adding or deleting a memory module or a processor, only very few connections have to be altered, and the rest of the system can be kept untouched and go on operating. So, a partial connection system also has the advantage of high availability and expandability.

Table 27. Systems with Best Performance.

    Performance             Full Connection                         Partial Connection
    Turnaround Time         Distributed, Multiprogrammed            Partitioned, Monoprogrammed
    Total Memory Bandwidth  Distributed, Multiprogrammed            Distributed, Both
    Memory Utilization      Distributed or Mixed, Multiprogrammed   Distributed or Mixed, Multiprogrammed
    Job Memory Bandwidth    Distributed, Monoprogrammed             Distributed, Monoprogrammed

All the performance measures, in particular the turnaround time, are very sensitive to the job mix we are using. We have shown that the turnaround time and the memory utilization will increase rather rapidly when we increase the arrival rate, the job sizes, or the I/O time. The reason is that these parameters can easily push the system into saturation. Therefore, when we are designing a system, we should carefully study the job mix we are dealing with.

One of our most interesting results is the 2^1.5 workload relationship between two systems that have a size ratio of 2. Our simulation shows that, when we double the system size, we can handle 2.7 to 2.8, or roughly 2^1.5, times the original workload. This is true for both the full and partial connection systems. So, our conjecture is that the system size C (or the cost) and the workload it can handle P (or the processing power) maintain a P = a C^1.5 relationship. Of course, this conjecture has been shown to hold only for systems of size up to 16 processors and 48 memory modules. As we said in Section 3.2.3, this factor will be reduced to about 2.3 when we double the system size again to (32,96,16). So, we believe the improvement factor of 2.8 would approach 2 as the system gets very large.

4.2 Some Design Problems

4.2.1 Address Interleaving

As we said in the last section, a partial connection, partitioned, monoprogrammed system is the most cost-effective choice of system design. However, we pointed out that we will need a new scheme of generating physical addresses if we want to use interleaving to get the best possible memory bandwidth. Of course, when the memory-processor speed ratio s is small, say 1 or 2, there will not be much difference whether we use interleaving or not.
For s=2, even if we store a program vertically inside a memory module, quite often we might still be able to access more than one word if the data and the instruction we can fetch simultaneously are in different modules. If we use interleaving, i.e., store a program horizontally across several memory modules, we might only get a little better chance of accessing two words without conflict. So, it might not be worth it to implement the interleaving scheme when s is small. When s is larger, however, the interleaving scheme will show a much better bandwidth since several instructions and data can be accessed at the same time. It is more desirable to use interleaving under this circumstance.

If we use interleaving in a partitioned system, two problems arise that make the generation of physical addresses very tough. First, the number of memory modules allocated to a job is variable, depending on the size of this job. This implies that the degree of interleaving will be different for each job. Consequently, a processor must be able to adjust its address mapping mechanism to cope with the changing degree of interleaving. Second, the modules a job gets will in general be scattered all over the memory and might not be adjacent to each other. Therefore, it will cause some trouble to locate an instruction or operand. If we horizontally interleave a program, the next instruction might be several modules away from the current instruction. So, we will not be able to get the address of the next instruction by simply adding one to the current module number. Apparently, we need more hardware and a new algorithm in the instruction decoding unit of a processor in order to generate a physical address properly. Let us propose a simple and feasible design which can solve this problem.

Figure 42 shows the logic diagram of the design we are proposing. The logical address register contains the logical address we want to transform, and the final physical address will be in the physical address register. The hardware between them is used to do the transformation. The physical address consists of two parts, namely, a module number x and an in-module word address w, which will be obtained by the following process.

[Figure 42. Address Mapping for the Partitioned System. Logic diagram: logical address register, shift register of n = L/2 bit halves, module mapping table (logical module number to real module number), and physical address register.]

First, let us point out one thing which will affect the way we interleave a program, and hence affect the memory bandwidth. Assume that the program counter is L bits long. If we interleave a program successively into all the memory modules, say c of them, we must be able to perform a "quotient-remainder of c" operation on these L bits, called QR_c(L), in order to find the module and word corresponding to this address. However, c is a variable which is determined by the size of the job currently running on this processor. This means every processor should be provided with the hardware to perform the QR operations for all possible c values if we want to interleave a program in the normal way. This is not economically attractive, since it implies that we must build several QR circuits inside each processor for address decoding. Therefore, we must seek some other method to interleave a program. The method we suggest is the following: if a job requires a power of two modules, or some number of modules for which QR hardware exists, then we will interleave the program in the normal way. Otherwise, we will
Figure 42. Address Mapping for Partitioned System.

The method we suggest is the following. If a job requires a power of two modules, or some number of modules for which QR hardware exists, then we will interleave the program in the normal way. Otherwise, we will partition the modules into a power of two groups, each having the same number of modules, such that QR hardware exists for this group size. For example, assume the processor has the hardware to do a QR3 operation and a job requires six modules. We will partition the modules into two groups with three modules in each group. However, if the number of modules a job needs is not a multiple of 3, we will have to grant the job some extra memory to make the number a multiple. Now, we will use the last g bits of the logical address to determine the proper group. These g bits are called "Group Bits." If there are only two groups, g will be 1. In so doing, we actually achieve a double interleaving, i.e., we not only interleave the successive addresses into different modules, but also interleave them into different groups. Figure 43 shows the result of interleaving a 6-module job if we use the last bit to indicate the group. We will call this a 3-3 interleaving. The important thing is that this method still allows us to fully interleave a program across all the modules even though we do not have the appropriate QR6 circuit.

Figure 43. The Interleaving of a 6-Module Job (group bit 0 selects group 1; group bit 1 selects group 2).

The first (left) L-g bits of the program counter will be fed into a shift register, where we will perform the QR operation. Since g is a variable which depends on the number of groups we form for the job, we need a shift register here in order to shift the logical address g bits to the right. Of course, if g is zero, the whole content of the logical address register will be gated into the shift register without shifting. So, the shift register must also be L bits long. We now perform the QR operation on the content of the shift register. The remainder of this operation will tell us the correct module within a group. This remainder, together with the group bits, will give us the logical module number we are looking for. The quotient will give us the correct address inside the module.

Now, let us describe how we do the QR3 operation. Of course, we can use a combinational circuit to perform the operation. For example, Gajski and Vora [49] have a very nice design of a modulo 3 circuit. However, it might take a large number of gates to implement the circuit when L is large, and we also need to determine the quotient. So, we choose another design using read-only memories (ROM's). If L is small, say 10 to 12, we can use one ROM to find the remainder and quotient of 3. However, we are using 1024K bytes of memory in our system, which means L is about 18 to 20. We would then have to use a very large ROM, with a number of bits on the order of 2^20! This is apparently very expensive. So, we will use the design shown in Figure 42, which requires seven ROM's of reasonable size, one small integer adder, and an incrementer. Of course, we have to spend a little longer time to do the operation.

In fact, our design is good for any base. Let us use c to represent the base of the QR operation. We first break the shift register into two equal halves, each having n = L/2 bits. The contents of the right and left halves will be called a and b respectively. In other words, the content of the shift register can be expressed as a + b·2^n. Also, let us further decompose b into uc + v, i.e., let b = uc + v.
The remainder and quotient of the QR operation can then be found to be:

    remainder = (a + b·2^n) mod c
              = [a mod c + (b·2^n) mod c] mod c
              = [r1 + r2] mod c,

and

    quotient = (a + b·2^n) ÷ c
             = ⌊a/c⌋ + u·2^n + ⌊v·2^n/c⌋ + ⌊(r1 + r2)/c⌋,

where r1 = a mod c, r2 = (b·2^n) mod c, ÷ represents integer division, and ⌊ ⌋ is the floor function.

ROM 2 contains the result of r1 = a mod c and ROM 5 contains the result of r2 = (b·2^n) mod c. The outputs of these two ROM's will be used as the address to ROM 7, which contains the result of (r1 + r2) mod c. The output of ROM 7, i.e., the remainder, coupled with the group bits, will tell us the logical module number. Each word of ROM 1 contains a result of ⌊a/c⌋, together with a bit that tells whether a is all ones or not. Hence, all the words in ROM 1 have a zero in the last bit position except the last word. The use of this bit will be explained very soon. ROM 3 contains the result of u, which will be shifted to the left by n bits. The term ⌊v·2^n/c⌋ on the right-hand side of the quotient equation will be called the "Corrector" term. Since v has only c possible values, the number of possible values for the corrector term is at most c. If c is small, the corrector term can only take a few possible values, so we can either hardwire them or use a small number of registers to store them. ROM 4 will be used to select the corrector value we should use. The outputs of ROM 2 and ROM 5, i.e., r1 and r2, will also be fed into ROM 6, which gives the result of ⌊(r1 + r2)/c⌋. It is easy to see that 0 <= r1 + r2 < 2c, so ⌊(r1 + r2)/c⌋ is either 0 or 1. Hence, the output of ROM 6 is only 1 bit long, and it can be used as the carry-in to the adder.

The adder is only n bits long; it adds ⌊a/c⌋, ⌊v·2^n/c⌋, and ⌊(r1 + r2)/c⌋ together. These three terms are n-1, n, and 1 bits long respectively. The remaining term, u·2^n, will be fed into an incrementer, which will increment by one if the adder generates a carry. The reason we use an n-bit adder and an n-bit incrementer instead of a 2n-bit adder is that the summing of these four terms is rather special. Figure 44 shows the length of each term and how they align when they are summed together. As we can see, the only chance that u will be affected is when the other three terms produce a carry. So, we only need an incrementer for the left n bits. In fact, the carry can be generated in advance by using the last bit of ROM 1, ⌊(r1 + r2)/c⌋, and the output of ROM 4, so we do not have to wait for the propagation delay of the adder. This can be shown by the following example. Let us assume c = 3 and n = 10. Then ⌊a/c⌋ will be 9 bits long. Since v can only be 0, 1, or 2, the output of ROM 4 is only 2 bits long. The corrector term can be shown to be one of the following three values: 0000000000, 0101010101, or 1010101010 (all in base 2). The last word of ROM 1 contains 101010101, which is the largest value of ⌊a/c⌋ and is the only word that will generate a carry, provided the corrector term is 1010101010 and ⌊(r1 + r2)/c⌋ = 1. The last bit of ROM 1 tells whether ⌊a/c⌋ is 101010101 or not, and the output of ROM 4 tells whether the corrector term is 1010101010 or not. So, we can AND them together to see whether we should increment u or not. The results of the adder and incrementer will be combined to form the quotient we are looking for.
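Since this decomposition is easy to get wrong, here is a small sketch (present-day Python, purely illustrative; the names a, b, u, v, r1, and r2 follow the text, and the function is not the original hardware) that carries out the computation exactly as the ROM network would and checks it against ordinary integer division.

    # Illustrative Python sketch of the QR_c computation described above.
    def qr(address, c, n):
        # "address" is the content of the shift register, at most 2n bits wide
        a = address & ((1 << n) - 1)       # right half (low n bits)
        b = address >> n                   # left half (high n bits)
        u, v = b // c, b % c               # decompose b = u*c + v
        r1 = a % c                         # ROM 2
        r2 = (v << n) % c                  # ROM 5: (b*2^n) mod c = (v*2^n) mod c
        remainder = (r1 + r2) % c          # ROM 7: module number within the group
        carry = (r1 + r2) // c             # ROM 6: either 0 or 1
        corrector = (v << n) // c          # corrector term selected by ROM 4
        quotient = (a // c) + (u << n) + corrector + carry   # adder plus incrementer
        assert (quotient, remainder) == divmod(address, c)   # sanity check
        return quotient, remainder         # (in-module word address, module in group)

Running qr over the possible shift-register contents reproduces divmod exactly, which is what the assert verifies.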
The size of ROM 1 is 2^n x n bits, the sizes of ROM 2, ROM 4, ROM 5, and ROM 7 are all 2^n x ⌈log2 c⌉ bits, the size of ROM 3 is 2^n x (n-1) bits, and ROM 6 is only 2^n x 1 bit. In fact, ROM 6 and ROM 7 can be put together in one ROM.

Figure 44. The Length of Each Component in Quotient Computation.

If L is 20, n is 10, and c is 3, we need 20 1Kx1 bit ROM's and one 16x4 bit ROM. So, the hardware is actually very cheap. We can see from Figure 42 that it takes at most two ROM cycles and one addition cycle to do a QR operation. This is in general less than most arithmetic operation times, so our design will not cause any serious problem for the address decoding.

As we mentioned earlier, the memory modules a job gets might be scattered all over the memory and might not be adjacent to each other. So, we need to transform the logical module number obtained above into the physical module number. This is what the Module Mapping Table shown in Figure 42 will do. The Module Mapping Table is in fact a cache memory. When we allocate memory to a job, we record the physical module numbers in the table in the correct order, so that they can be retrieved later. Hence, the Module Mapping Table acts just like the page table used in a paging system. After the physical module number has been retrieved, it will be appended to the in-module word address to form the final physical address. Of course, a job that requires some power of two modules does not need the QR operation, since the module number and the in-module word address can be obtained simply by breaking the logical address into two parts: the lower g bits (cf. Figure 42) indicate the logical module number, which will be used by the Module Mapping Table to retrieve the physical module number, and the upper L-g bits will be gated directly to the physical address register. The two multiplexors (MUX's) in Figure 42 are used to decide which result should be used.

The only drawback of this scheme is that sometimes we have to waste some memory in order to make it work. For example, if we can only do the QR3 operation and a job requires five modules, we must allocate six modules to this job and use the interleaving scheme shown in Figure 43. Table 28 shows the actual number of modules each job will be allocated and the interleaving scheme to be used. As we can see, 5-module and 7-module jobs will be granted six modules and eight modules respectively. Obviously, this design will further increase the memory waste originally present in a partitioned system. The situation is even worse for 9-module, 10-module, and 11-module jobs, since all of them must be allocated 12 modules to use 3-3-3-3 interleaving, i.e., the modules will be partitioned into four groups with three modules in each group. We cannot use 3-3-3 interleaving for 9-module jobs, since we need a power of two groups in order to use the last few bits as the group bits. (Recall, however, Table 21, which indicates that for our job mix very few jobs require more than six modules.)

Job Size   Number of Modules Required   Interleaving   g   Memory Waste
   1                  1                       1        0       No
   2                  2                       2        1       No
   3                  3                       3        0       No
   4                  4                       4        2       No
   5                  6                      3-3       1       Yes
   6                  6                      3-3       1       No
   7                  8                       8        3       Yes
   8                  8                       8        3       No
   9                 12                    3-3-3-3     2       Yes
  10                 12                    3-3-3-3     2       Yes
  11                 12                    3-3-3-3     2       Yes
  12                 12                    3-3-3-3     2       No

Table 28. The Number of Modules Required and Interleaving Scheme for Each Job Size (with only a QR3 Circuit).

If we also implement a QR5 circuit in the processor, we can improve the situation stated above. In particular, the 5-module jobs will no longer need an extra module, since they can be interleaved directly across five modules. In Table 29, we show the result with the addition of a QR5 circuit.

Job Size   Number of Modules Required   Interleaving   g   Memory Waste
   1                  1                       1        0       No
   2                  2                       2        1       No
   3                  3                       3        0       No
   4                  4                       4        2       No
   5                  5                       5        0       No
   6                  6                      3-3       1       No
   7                  8                       8        3       Yes
   8                  8                       8        3       No
   9                 10                      5-5       1       Yes
  10                 10                      5-5       1       No
  11                 12                    3-3-3-3     2       Yes
  12                 12                    3-3-3-3     2       No

Table 29. The Number of Modules Required and Interleaving Scheme for Each Job Size (with both QR3 and QR5 Circuits).
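The allocation rule behind Tables 28 and 29 can be summarized in a few lines. The sketch below (Python, illustrative only; the rule is inferred from the two tables rather than quoted from a stated algorithm) grants the smallest number of modules, at or above the request, that can be split into a power of two equal groups whose size is either 1 or a base for which QR hardware exists.

    # Illustrative Python sketch of the allocation rule of Tables 28 and 29.
    # qr_bases is the set of group sizes with QR hardware, e.g. {3} or {3, 5};
    # a group size of 1 (a pure power-of-two allocation) never needs a QR circuit.
    def allocate(requested, qr_bases):
        total = requested
        while True:
            g = 0
            while total % (1 << g) == 0:
                size = total >> g              # modules in each of the 2**g groups
                if size == 1 or size in qr_bases:
                    scheme = str(total) if size == 1 else "-".join([str(size)] * (1 << g))
                    return total, scheme, g, total > requested   # last value is the waste flag
                g += 1
            total += 1                         # grant extra memory and try again

For example, allocate(5, {3}) gives (6, '3-3', 1, True) and allocate(9, {3, 5}) gives (10, '5-5', 1, True), matching the corresponding rows of Tables 28 and 29.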
After the generation of a physical address, the processor will send it to all the memory modules attached to its processor bus. Inside each module, there should be some identifying hardware so that the destination module will pick up the address and the other modules will reject it. This can easily be done by using a comparator and an identity register which contains the module number.

4.2.2 I/O Connection

The other problem we need to discuss is the connection between processors and I/O devices. Recall from Figure 1 that the PRIME system uses an External Access Network (EAN) to provide the communication paths between processors and external devices. The network is essentially a crossbar switch, except that it also allows two processors to set up a path between themselves. This is done by connecting both processors to a free switch node, i.e., a node that does not connect to any external device [15]. This network is easy to control and allows simultaneous use of all the I/O devices. The probability of access conflict, i.e., more than one processor accessing the same device, will be small if the number of I/O devices is large enough. For example, in the systems we are simulating, the results show that the average number of jobs in the I/O stage is less than the number of I/O devices. This implies that in general a job does not need to wait for an I/O device if we use a network like the EAN to interconnect the processors and I/O devices.

However, the cost of this kind of network will be very high if the system size is large. This is the typical disadvantage of a crossbar-like network. In our simulation model, we choose instead a common bus structure, which is shown in Figure 45. Each processor in this structure is connected to a common bus. The number of common buses we should use is determined by the traffic between processors and I/O devices; usually, this is proportional to the system size. The I/O devices are partitioned into groups, and the I/O devices in one group are connected to an I/O bus. These I/O buses are interconnected with the common buses via a small crossbar switch. The switch allows a processor to access any I/O device.

Figure 45. The Interconnection Network between Processors and I/O Devices.

The cost of this design is apparently much lower than the cost of a connection network like the EAN in the PRIME system. If we use n common buses and ℓ I/O buses, the total cost will be the cost of n+ℓ buses plus the cost of an n by ℓ switch. If n and ℓ are small, this cost will be very low. In addition, when we increase the number of processors, we can simply connect an additional processor to a common bus if the I/O traffic does not increase too much. In an EAN-like network, the extra processor would cause a significant increase in the network size.
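To put rough numbers on this comparison, the following lines (Python; the values of p, d, n, and ℓ are hypothetical and are not taken from the simulations) count switch points when the EAN-like network is treated as a full processor-by-device crossbar.

    # Hypothetical example: p processors, d I/O devices, n common buses, ell I/O buses.
    p, d = 16, 32
    n, ell = 4, 2
    crossbar_points = p * d      # EAN-like network: one crosspoint per processor-device pair
    bus_switch_points = n * ell  # small n-by-ell switch (plus the n + ell buses themselves)
    print(crossbar_points, bus_switch_points)   # 512 versus 8

The bus structure trades this large saving in switching hardware for the possibility of bus contention, which is analyzed next.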
Although the sharing of a common bus by a number of processors can reduce the number of buses we need and the size of the switch, bus contention might occur if more than one processor connected to the same bus wants to access I/O devices at the same time. The contention can be serious if the average number of I/O requests for a job is large or if too many processors are connected to one bus. On the other hand, bus contention might also occur on an I/O bus if more than one processor is trying to access the I/O devices connected to the same I/O bus. Both of these bus contentions will result in queueing of I/O requests, which will cause some delay to a job. Of course, we can keep the bus contention small by making n and ℓ large enough. In order to find out how many buses we should use, let us analyze how n and ℓ affect the bus contention occurring in our connection network. We will derive the probability that a processor can successfully access the I/O device it wants without being blocked due to bus contention.

Let us assume a to be the probability that a certain processor is performing an I/O operation. Roughly speaking, the ratio of the I/O time to the total service time can be thought of as a. Therefore, a processor-bound job mix has a small value of a and an I/O-bound job mix has a large value. We will assume a to be the same for all active jobs, or processors. Of course, 1-a is the probability that a processor will not issue an I/O request. We also assume each common bus has p/n processors attached to it. Obviously, if a processor wants to make a successful access, the required common and I/O buses must both be free, so the probability of a successful access is the product of the probabilities of these two events. The probability of the first event is (1-a)^(p/n - 1), which is the probability that all the other p/n - 1 processors sharing the same common bus are not doing I/O. The probability of the second event is

    Σ_{i=0}^{n-1} C(n-1, i) [1 - (1-a)^(p/n)]^i [(1-a)^(p/n)]^(n-1-i) (1 - 1/ℓ)^i ,

where C(n-1, i) is the binomial coefficient and each term in the summation is the probability that exactly i of the remaining n-1 common buses are busy but none of these i requests are accessing the I/O bus we are interested in. This summation is actually the expansion of the following expression:

    [(1-a)^(p/n) + (1 - 1/ℓ)(1 - (1-a)^(p/n))]^(n-1) = [1 - 1/ℓ + (1/ℓ)(1-a)^(p/n)]^(n-1).

Hence, the probability of successful access Ps can be expressed as follows:

    Ps = (1-a)^(p/n - 1) [1 - 1/ℓ + (1/ℓ)(1-a)^(p/n)]^(n-1).

In Table 30, we show some numerical values of Ps for different a, n, and ℓ values. Here we use eight processors (p = 8).

ℓ = 2
  n \ a     0.5    0.4    0.3    0.2    0.1
   2       0.07   0.12   0.21   0.36   0.60
   3       0.08   0.13   0.22   0.37   0.61
   4       0.12   0.19   0.29   0.44   0.67
   8       0.13   0.21   0.32   0.48   0.70

ℓ = 3
  n \ a     0.5    0.4    0.3    0.2    0.1
   2       0.09   0.15   0.26   0.41   0.65
   3       0.13   0.20   0.30   0.45   0.67
   4       0.21   0.29   0.40   0.55   0.74
   8       0.28   0.37   0.48   0.62   0.79

ℓ = 4
  n \ a     0.5    0.4    0.3    0.2    0.1
   2       0.10   0.17   0.28   0.44   0.67
   3       0.15   0.23   0.34   0.50   0.70
   4       0.27   0.36   0.47   0.60   0.78
   8       0.39   0.48   0.58   0.70   0.84

Table 30. The Probability of Successful Access to an I/O Device.

As we can see, when a is small, Ps is rather large even for moderate values of n. If we use the definition stated above, we can see that the a value for our job mix is about 0.3, since our simulation results show that the I/O time for a job is roughly 30% of the total service time. If we look in the 0.3 column, we need four common buses and four I/O buses in order to obtain a probability near 0.5.
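The expression for Ps is easy to check numerically. The sketch below (Python, illustrative only; like the derivation, it assumes the p processors divide evenly over the n common buses) reproduces the entries of Table 30.

    # Illustrative check of the blocking formula above; reproduces Table 30 (p = 8).
    def ps(a, n, ell, p=8):
        # a: probability a processor is doing I/O; n: common buses; ell: I/O buses
        per_bus = p / n                              # processors sharing one common bus
        own_bus_free = (1 - a) ** (per_bus - 1)      # the other processors on our bus are idle
        io_bus_free = (1 - 1/ell + (1/ell) * (1 - a) ** per_bus) ** (n - 1)
        return own_bus_free * io_bus_free

    print(round(ps(0.3, 4, 2), 2))   # 0.29, as in Table 30
    print(round(ps(0.3, 8, 4), 2))   # 0.58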
Even with eight common buses, we can only achieve a probability near 0.6. It seems that this design is not very attractive, due to the high blocking probability. However, when we are doing an I/O operation, we do not have to occupy the buses all the time. We can release the buses during the seek and rotational latency time of an I/O operation and let some other processors use the buses. In other words, a processor will only occupy the common and I/O buses when the data or address is being transferred. This effectively reduces the value of a and hence increases the probability of successful access. For example, if the data transfer time is only one-third of an I/O transaction time, a can be reduced from 0.3 to 0.1 and we can get a very high probability of a successful I/O access. Of course, we need to implement a smart controller to manage the use of these buses. In fact, if we assume a bus will only be occupied during data transfer, our model of Figure 45 is almost equivalent to the L-M memory organization model proposed by Briggs [50]. The analytic result of his model can be slightly modified to fit our model or used as an approximation of our model.

In our simulations, we used p common buses and two I/O buses, i.e., n = p and ℓ = 2. From all the simulation results, we can see that the queueing time for I/O devices is relatively low. Apparently, the structure of Figure 45 is quite good. If we use faster I/O devices, we can also reduce the a value; hence, we can use fewer buses and maintain the same performance.

4.3 Further Problems

Multiprocessors have been an important subject in computer design for quite a while. Recently, new technology has permitted us to consider systems with large numbers of autonomous processors. Most of the work done previously in this area has concentrated on speeding up a single program through the use of multiple processors, or on providing a multiprogramming environment. This in turn has led to complex memory-processor and interprocessor communication schemes. We have shown in this thesis that multiprogramming is not necessary for high throughput and low turnaround time, and that a simpler architecture is indeed a viable design alternative, producing good performance and expandability, and capable of high reliability and availability. However, there are still a number of areas which need further study.

We discussed the design of several components of a system (e.g., addressing hardware), but many of the details of the processor, memory, and I/O systems need more work. Some of this design is straightforward, but some requires better models before we fully understand the tradeoffs involved. For example, in determining the actual distribution and connection of memories to processors, we have been unable to demonstrate a model which correctly predicts the best distribution or connection. But we have shown that the distribution and connection do cause significant changes in performance.

In this thesis, we have purposely omitted consideration of interactive job loads and virtual memory. An interactive environment imposes new problems, both in connecting terminals to the system and in handling the large number of small tasks. This type of environment should be investigated to determine whether or not it would necessitate significant changes to the architecture or to our conclusions.

Finally, a very important research area concerns reliable operating systems. Conceptually, it is easier to design an operating system with centralized control.
But this approach leaves us open to total system failure if the control hardware fails. In Chapter 1, we briefly described the design philosophies used by the PRIME, C.mmp, and NonStop systems. The C.mmp and NonStop systems essentially let each subsystem own a copy of the operating system. This prevents the failure of the operating system of one subsystem from affecting the operations of the other subsystems. However, this duplication does occupy a significant amount of memory. The PRIME system, on the other hand, partitions the operating system into a central control monitor and external control monitors (ECM's) and distributes these ECM's to different subsystems. This distributed approach, of course, can reduce the memory requirement. However, it does create some problems when the control subsystem (the one running the central control monitor) goes down; for example, how to save all the system tables and pass control to the subsystem that is taking over the central control monitor. Complete distribution of the central control, or minimization of the central control so that it could reside in a highly reliable component, is an interesting and important research area.

References

[1] Unger, S. H. "A Computer Oriented toward Spatial Problems," Proceedings of the IRE, Vol. 46, No. 10, pp. 1744-1750, October 1958.

[2] Leiner, A. L., W. A. Notz, J. L. Smith, and A. Weinberger. "PILOT, A New Multiple Computer System," Journal of the ACM, Vol. 6, No. 3, pp. 313-335, 1959.

[3] "Vocabulary for Information Processing," American National Standard X3, December 1970.

[4] Enslow, P. H., Jr. Multiprocessors and Parallel Processing, John Wiley and Sons, Inc., New York, 1974.

[5] Enslow, P. H., Jr. "Multiprocessor Organization - A Survey," ACM Computing Surveys, Vol. 9, No. 1, pp. 103-129, March 1977.

[6] Anderson, J. P., S. A. Hoffman, J. Shifman, and R. J. Williams. "D825 - A Multiple Computer System for Command and Control," AFIPS Conference Proceedings of 1962 FJCC, Vol. 22, pp. 86-96, 1962.

[7] Bell, C. G. and A. Newell. Computer Structures: Readings and Examples, McGraw-Hill, New York, 1971.

[8] Thurber, K. J. and L. D. Wald. "Associative and Parallel Processors," ACM Computing Surveys, Vol. 7, No. 4, pp. 215-255, December 1975.

[9] Barnes, G. H., R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, and R. E. Stokes. "The ILLIAC IV Computer," IEEE Transactions on Computers, Vol. C-17, pp. 746-757, August 1968.

[10] Slotnick, D. L., W. C. Borck, and R. C. McReynolds. "The SOLOMON Computer," AFIPS Conference Proceedings of 1962 FJCC, Vol. 22, pp. 97-107, 1962.

[11] Fung, L. "A Massively Parallel Processing Computer," Proceedings of the Symposium on High Speed Computer and Algorithm Organization, Department of Computer Science, University of Illinois, April 1977.

[12] Baskin, H. B., B. R. Borgerson, and R. Roberts. "PRIME - A Modular Architecture for Terminal-Oriented Systems," AFIPS Conference Proceedings of 1972 SJCC, Vol. 40, pp. 431-437, May 1972.

[13] Wulf, W. A. and C. G. Bell. "C.mmp - A Multi-Mini-Processor," AFIPS Conference Proceedings of 1972 FJCC, Vol. 41, Part II, pp. 765-777, 1972.

[14] "Seven Tough Problems in On-Line Data Base Systems and How Tandem's NonStop System Solves Them," Datamation, 1977.

[15] Quatse, J. T., P. Gaulene, and D. Dodge. "The External Access Network of a Modular Computer System," AFIPS Conference Proceedings of 1972 SJCC, Vol. 40, pp. 783-790, May 1972.

[16] Ferrari, D.
"Architecture and Instrumentation in a Modular Interactive System," IEEE Compute r, pp. 25-29, November 1973. [17] Fabry, R. S. "Dynamic Verification of Operating System Decisions," Communications of the ACM , Vol. 16, No. 11, pp. 659-668, November 1973. [18] Ravi, C. V. "On the Issue of Physical Memory Sharing in the MCS System," Document No. W-37.0/CSRP, Computer Systems Research Project, University of California at Berkeley, July 1971. [19] Bell, C. G., et al . "C.mmp - The CMU Multiminiprocessor Computer," Department of Computer Science, Carnegie-Mellon University, August 1971 [20] Wulf, W. A., E. Cohen, W. Corwind, A. Jones, R. Levin, C. Pierson, and F. Pollack. "HYDRA: The Kernel of a Multiprocessor Operating System," Communi cations of the ACM , Vol. 17, No. 6, pp. 337-345, June 1975. [21] Abate, J., H. Dubner, and S. B. Weinberg. "Queueing Analysis of the 227 IBM 2314 Disk Storage Facility," Journal of the ACM , Vol. 15, No. 4, pp. 577-589, October 1968. [22] Oney, W. C. , "Queueing Analysis of the Scan Policy for Moving-Head Disks," Journal of the ACM , Vol. 22, Mo. 3, pp. 397-412, July 1975. [23] Fuller, S. H. and F. Baskett. "An Analysis of Drum Storage Units," Journal of the ACM , Vol. 22, No. 1, pp. 83-105, January 1975. [24] Bhandarkar, D. P. "On the Performance of Magnetic Bubble Memories in Computer Systems," IEEE Transactions on Computers , Vol. C-24, No. 11, November 1975. [25] Coffman, E. G. , Jr. and P. J. Denning. Operating Systems Theory , Prentice- Hall, Englewood Cliffs, 1973. [26] Kleinrock, L. Queueing Systems, Volume 2: Computer Applications , John Wiley and Sons, New York, 1976. [27] Gaver, D. P. "Probability Models for Multiprogramming Computer Systems," Journal of the ACM , Vol. 14, pp. 423-438, 1967. [28] McKinney, J. M. "A Survey of Analytical Time-Sharing Models," Computing Surveys, Vol. 1, pp. 105-116, 1969. [29] Binder, R. , N. Abramson, F. F. Kuo, A. Okinaka, and D. Wax. "ALOHA Packet Broadcasting - A Retrospect," AFIPS Conference Proceedings , 1975 NCC, Vol. 44, pp. 203-215, 1975. [30] Mallach, E. G. "Job-Mix Modeling and System Analysis of an Aerospace Multiprocessor," IEEE Transactions on Computers , Vol. C-21, No. 5, pp. 446-454, May 1972. [31] Avi-Itzhak, B. and D. P. Heyman. "Multiprogramming Computer Systems," Operations Researc h, Vol. 21, pp. 1212-1230, 1973. 228 [32] Konheim, A. G. and M. Reiser. "A Queueing Model with Finite Waiting Room and Blocking," Journal of the ACM , Vol. 23, No. 2, April 1976. [33] Brown, R. M., J. C. Browne, and K. M. Chandy. "Memory Management and Response Time," C ommunications of the ACM , Vol. 20, No. 3, pp. 153-165, March 1977. [34] Franta, W. R. "The Mathematical Analysis of the Computer System Models as a Two-Stage Cyclic Queue," A cta Informa tica 6, pp. 187-209, 1976. [35] Jackson, J. R. "Jobshop-1 ike Queueing Systems," M anagement Science , Vol. 10, No. 1, pp. 131-142, October 1963. [36] Gordon, W. J. and G. F. Newell. "Closed Queueing Systems with Exponential Servers," Operations Research , Vol. 15, pp. 254-265, 1 967. [37] Browne, J. C, K. M. Chandy, R. M. Brown, T. W. Keller, D. F. Towsley, and C. W. Dissly. "Hierarchical Techniques for the Development of Realistic Models of Complex Computer Systems," Proceedings of IEEE , Vol. 63, No. 6, pp. 966-976, June 1975. [38] Ravi, C. V. "On the Bandwidth and Interference in Interleaved Memory Systems," IEEE Transactions on Computer s, Vol. C-21, pp. 899-901, August 1972. [39] Chang, D. Y. "Analysis and Design of Interleaved Memory Systems," M.S. 
Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, Report No. 75-747, August 1975.

[40] Hellerman, H. Digital Computer System Principles, McGraw-Hill, New York, pp. 228-229, 1967.

[41] Burnett, G. J. and E. G. Coffman, Jr. "A Combinatorial Problem Related to Interleaved Memory Systems," Journal of the ACM, Vol. 20, No. 1, pp. 39-45, January 1973.

[42] Chang, D. Y., D. J. Kuck, and D. H. Lawrie. "On the Effective Bandwidth of Parallel Memories," IEEE Transactions on Computers, Vol. C-26, No. 5, pp. 480-490, May 1977.

[43] Kuck, D. J. The Structure of Computers and Computations, Volume 1, John Wiley and Sons, New York, 1977.

[44] Knuth, D. E. The Art of Computer Programming, Volume II: Seminumerical Algorithms, Addison-Wesley, Reading, Massachusetts, 1969.

[45] Flynn, M. J. "Some Computer Organizations and Their Effectiveness," IEEE Transactions on Computers, Vol. C-21, pp. 948-960, September 1972.

[46] Sastry, K. V. and R. Y. Kain. "On the Performance of Certain Multiprocessor Computer Organization," IEEE Transactions on Computers, Vol. C-24, No. 11, pp. 1066-1073, November 1975.

[47] Bhandarkar, D. P. "Analysis of Memory Interference in Multiprocessors," IEEE Transactions on Computers, Vol. C-24, No. 9, pp. 897-908, September 1975.

[48] Baskett, F. and A. J. Smith. "Interference in Multiprocessor Computer Systems with Interleaved Memory," Communications of the ACM, Vol. 19, No. 6, pp. 327-334, June 1976.

[49] Gajski, D. D. and C. Vora. "High Speed Modulo 3 Generator," submitted for publication.

[50] Briggs, F. A. "Memory Organizations and Their Effectiveness for Multiprocessing Computers," Ph.D. Thesis, Report No. R-768, UILU-ENG 77-2215, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, May 1977.

Appendix A

Assume we have p processors referencing m memory modules numbered from 0 to m-1. Each processor generates s references in every memory cycle. Let a be the probability that the next reference will access the next module in sequence, i.e., a = Pr{r_(i+1) = (r_i + 1) mod m}, and let any other module be referenced with probability (1-a)/(m-1). We will call the former case a "sequential transition" and the latter a "nonsequential transition." Also, let p^(l)_{ik} be the probability that the i-th processor generates a reference to module k in the l-th position with no j occurring in the first l-1 positions, i.e., the probability of the event shown in Figure 14.

In the first position, assuming we know all the reference probabilities p_{ik}, we have

    p^(1)_{ik} = p_{ik}.

The probability that no j shows up is, trivially, 1 - p^(1)_{ij}. In the second position, considering the two kinds of transition and the definition of p^(2)_{ik}, we get

    p^(2)_{ik} = (1 - p^(1)_{ij} - p^(1)_{i,k-1}) (1-a)/(m-1) + p^(1)_{i,k-1} a,    for k not equal to j+1,

    p^(2)_{i,j+1} = (1 - p^(1)_{ij}) (1-a)/(m-1).

The first term is the probability of a nonsequential transition and the second term is the probability of a sequential transition. The probability of no j in the first two positions is then

    1 - p^(1)_{ij} - p^(2)_{ij}
        = (1 - p^(1)_{ij}) [1 - (p^(1)_{i,j-1}/(1 - p^(1)_{ij})) a - (1 - p^(1)_{i,j-1}/(1 - p^(1)_{ij})) (1-a)/(m-1)].

The p^(l)_{ij} for l > 1 can be obtained by the same argument, and the bandwidth equation is then

    BW = m - Σ_{j=0}^{m-1} Π_{i=1}^{p} (1 - Σ_{l=1}^{s} p^(l)_{ij}).

We can solve the above combinatorial problem by another method, namely, Markov chain analysis. Figure 46 shows the Markov chain for this problem. Since we are only interested in finding the probability that no j will show up in s-1 transitions, we simply make state j transfer back to itself with probability 1.
Hence we have an absorbing Markov chain. The transition matrix T of this chain has the value a in the entry taking each state k (k not equal to j) to state (k+1) mod m, the value (1-a)/(m-1) in every other entry of row k, and a single 1 in the diagonal entry of the absorbing row j (with zeros elsewhere in that row).

If we let π^(l+1)_k be the probability that the Markov chain will be in state k after l transitions, then we can find π^(l+1) from the following relation:

    π^(l+1) = T^t π^(l),

where T^t is the transpose of T and π^(l) = [π^(l)_1, π^(l)_2, ..., π^(l)_m] is the state probability vector. If we consider the request generation to be the state transition in our Markov chain, then the probability that no j will show up after generating s requests is equal to 1 - π^(s)_j, given π^(1) = [p_{i1}, p_{i2}, ..., p_{im}]. Actually, this method is exactly the same as the previous one, except that we are using a matrix expression instead of a recursive expression. Obviously, the latter method is neater and hence preferable.

BIBLIOGRAPHIC DATA SHEET

1. Report No.: UIUCDCS-R-77-908
4. Title and Subtitle: Further Results Regarding Multiprocessor Systems
5. Report Date: October 1977
7. Author(s): Donald Yi-Chung Chang
8. Performing Organization Report No.: UIUCDCS-R-77-908
9. Performing Organization Name and Address: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois 61801
11. Contract/Grant No.: US NSF MCS76-81686
12. Sponsoring Organization Name and Address: National Science Foundation, Washington, D.C.
13. Type of Report and Period Covered: Doctoral Dissertation

16. Abstract: The recent developments of inexpensive but powerful LSI microprocessors and extremely high density semiconductor memory chips have led to the design of large computer systems containing a large number of processors and memory modules. Many systems have been built with many processors interconnected with a large number of memories, e.g., the PRIME system, the C.mmp system, and the Tandem NonStop system. All these systems have one common feature, i.e., to increase the system throughput by the simultaneous operation of several processors. We give a brief description of these multiprocessor systems in both hardware and software aspects. Each system has a different architecture and its own advantages and disadvantages. They provide us with many valuable examples of system design. However, a lot of questions arise in this area which are yet to be solved. For example, is multiprogramming better than monoprogramming for system operation? Are there better ways to interconnect processors and memories? How does the workload affect system performance? In this thesis, we shall try to answer these questions in order to get a better understanding of multiprocessor system design. Because the systems we study are highly complex, the simulation technique is used to collect the data we need to answer our questions. We shall give a detailed description of our simulator and present a lot of simulation results. We also discuss some logic design problems concerning the real system design.

17. Key Words and Document Analysis: Interleaved memories; Multiprocessors; Performance of multiprocessors; Processor interconnection
18. Availability Statement: Release Unlimited
19. Security Class (This Report): UNCLASSIFIED
20. Security Class (This Page): UNCLASSIFIED
21. No. of Pages: 238