Digitized by the Internet Archive in 2013
http://archive.org/details/loadregulationdi537fitz

UIUCDCS-R-72-537

LOAD REGULATION AND DISPATCHING IN A NETWORK OF COMPUTERS

BY

JAMES T. FITZGERALD

August 1972

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

*This work was supported in part by Contract No. NSF GJ 28289 and was submitted in partial fulfillment of the requirements for the Master of Science degree, August 1972.

ACKNOWLEDGMENT

I would like to express my sincere gratitude to the National Science Foundation, whose Grant No. GJ 28289 financially supported this work, and to my adviser, Professor Edward K. Bowdon, Sr., whose friendship and guidance supported me in other ways. A special thanks also goes to Mrs. Gayanne Carpenter, who typed this paper so beautifully. But my biggest thanks must go to my family, for whom this work was done. First, to my wife, Debbie, without whose encouragement I would never have finished, and secondly, to my daughter, Shannon, who encouraged me just by being here.

PREFACE

This paper is aimed at developing tools to control efficiently the flow of jobs and job traffic in a network of computers.
Input of jobs to each center is controlled by predetermined information based on probabilities and stored in table form. These probabilities are developed mathematically, predicated on the assumption that the input rate is a random variable capable of assuming any size. The table is then extended to handle the dispatching of jobs that must be rerouted between different centers in the network, and an efficient controller is thus developed.

TABLE OF CONTENTS

ACKNOWLEDGMENT
PREFACE
1. INTRODUCTION
2. THE LOAD REGULATOR
2.1. Description of the Network and Load Regulator
2.2. The Probability With a Known λ
2.3. The Probabilities
2.4. Calculation of the α Change Factor
3. THE DISPATCHER
3.1. An Introduction to the Problem
3.2. The Communication Links
3.3. General Considerations for Rerouting
3.4. The Algorithm
3.5. The Load-Regulator-Dispatcher
4. CONCLUSION
APPENDIX
LIST OF REFERENCES

1. INTRODUCTION

"For which of you, wishing to build a tower, does not sit down first and calculate the outlays that are necessary, whether he has the means to complete it? Lest, after he has laid the foundation and is not able to finish, all who behold begin to mock him, saying, 'This man began to build and was not able to finish!'"
St. Luke 14:28-29

The Objectives

Two thousand years ago the importance of scheduling was recognized. Having enough material to start and finish a job while not having too much unused upon completion introduced the art of resource allocation. Since then, much attention has been paid to the area of schedules and to the algorithms which alleviate or at least somewhat subdue the problems associated with scheduling jobs, or tasks, in a particular environment. The amount of attention is due to the fact that an ever increasing complexity in the type and number of jobs demanded by a mechanically expanding society necessitates their fast and efficient completion.
One environment which by its very nature demands painfully high levels of efficiency, and the one towards which we turn our attention, is the area of network computers. The term network computers can mean different things to different people. It could be the connection of two or more processors at a single computer installation. Or it could mean the connection by telephone wires of many geographically distant and distinct single machines. We will combine both of these ideas in our definition. We will use the term network computer to mean the connection by some communication facility of geographically distant and distinct computing centers, each of which has a number of processors and peripheral devices.

In recent years, schedulers have turned their attention to the area of network computers in an attempt to use the tremendous computing power of such a system more efficiently. Their efforts have generated many priority assignment rules and scheduling algorithms effecting fast and efficient throughput of jobs at a particular center. They have had a tendency, though, to concentrate on just one center, forgetting, perhaps, that this is only a small part of the entire picture. While we concede the importance of this work, we feel that some scheduling problems as they relate to the entire network deserve some attention.

One area which has been all but neglected is that of load leveling or load regulation for the entire network. It is not unreasonable to expect that all jobs once accepted at a center will be run to completion at that center. It may happen, however, that because of mismanagement at or failure of a center, some jobs already accepted into the network cannot be run at the center intended. Should the user then be called and told that his job will be delayed or not run at all? We think not. Management would frown if every effort was not made to meet our obligations.
We would like then to have these types of jobs run at another center in our network so as still to meet the deadlines. We would, however, still like to enjoy the benefits of making some profit even though, through our own fault, it will be reduced. Towards this goal we develop in Chapter 2 a load regulation scheme meeting our objectives, the most important of which is to minimize, as much as possible, the chances of having this overloaded condition. Upon discovering that an overloaded condition exists at one center, we must choose another center that is capable of running the excess jobs. This is the object of Chapter 3.

2. THE LOAD REGULATOR

2.1. Description of the Network and Load Regulator

We speak of load regulation in a computer network as the scheduling and routing of jobs so that available resources are used efficiently to implement fast completion of all jobs. In a multiprocessor system such as we deal with, each processor should be loaded all the time to be used to best advantage. A processor sitting idle while another has a full load or is even overloaded is not only sub-optimal but also gross mismanagement of an expensive asset, machine time. Jobs originally scheduled to be run on a machine now overloaded should be rerouted. We could ask, at this point, what happens to incoming jobs when the entire system is already running at full capacity, but we will leave this as a management consideration. Our concern will be with the efficiency of our full system.

The tool that we will use in this controlling function we will dub, oddly enough, the load regulator. We wish, then, to construct a feasible and workable load regulator for the entire network: a regulator that will periodically check each of the nodes in our system and immediately be able to determine if that particular center is in danger of being overloaded or is, in fact, already overloaded.
After this status check, we would like our load regulator to take the proper action, if necessary, to alleviate the impending or present congestion. We will allow our regulator to make the decision either to inhibit or reduce further input to the center or to let the input flow continue at the present rate. We will not, at this point, force our regulator to be concerned with the rerouting of jobs whose entry it has refused.

Our network will consist of n computing centers with a varying number of processors and peripheral equipment. Each processor in a center draws its work from a common queue relevant only to that center (i.e. an idle processor in another center may not draw work from the queue of an overloaded center but rather must wait for it to be rerouted to it). We will assume that the input to the queue of a particular center is a Poisson process with an unknown random mean λ. Due to the randomness, we will allow λ to range from zero to infinity; despite the unbounded input rate, we never really expect an infinite number of jobs to be present at one time.

Our load regulation will consist, then, of periodically sampling this random variable, the queue size, and, based on predetermined information stored in table form, making the proper decision. Again, since our queue size is a random variable, the information stored in the decision table must be based on some estimate, as accurate as possible. Toward this goal, then, we will develop analytical tools based on the probabilities of the queues reaching critical threshold levels and then bound these probabilities by some arbitrary criterion ε which will be small.
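The sampling-and-lookup cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table layout (keyed by the previous and current queue samples), the helper names, and the placeholder table values are all hypothetical, and the rule used to fill the table is invented for the example rather than derived from the probabilities developed below.

```python
from typing import Dict, Tuple

# Hypothetical decision table: (previous sample, current sample) -> multiplier.
# 1.0 = leave the input rate alone, 0.0 = halt input, in between = reduce it.
DecisionTable = Dict[Tuple[int, int], float]


def build_placeholder_table(capacity: int) -> DecisionTable:
    """Fill a table with made-up values: block input once the queue is over half full."""
    return {
        (prev, curr): 1.0 if curr <= capacity // 2 else 0.0
        for prev in range(capacity + 1)
        for curr in range(capacity + 1)
    }


def scan_network(previous: Dict[str, int], current: Dict[str, int],
                 table: DecisionTable) -> Dict[str, float]:
    """One sampling instant: look up the multiplier ordered for every center."""
    return {name: table[(previous[name], q)] for name, q in current.items()}


table = build_placeholder_table(capacity=5)
orders = scan_network({"A": 1, "B": 4}, {"A": 2, "B": 5}, table)
print(orders)  # {'A': 1.0, 'B': 0.0}
```

The point of the sketch is that, once the table exists, each scan is a pure lookup; all of the probabilistic work goes into computing the table entries in advance.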
We begin by characterizing our computer network of n centers by the following:

(1) Assume that the input rate to any center is a Poisson process with unknown random mean λ,
(2) Assume that the service time in each center is an exponential random variable with average service rate μ. Then the average service time is 1/μ,
(3) Assume that each center has a finite storage capacity c,
(4) The queue size of each center is sampled at discrete instants of time Kδ, K = 0, 1, 2, ...,
(5) A delay time Δ is associated with the load regulator, where Δ is the time elapsed between the issuing of a control order and its implementation,
(6) Each center has a common queue of work relevant only to that center,
(7) Let q(t) be the number of jobs in the queue at time t; then the criterion of estimation is to keep

Prob [queue size at time t > c] < ε for all t > 0

where ε is a given small real number rather than zero.

For ease of computation, we will assume that each node is of equal stature (i.e. each center in the network has approximately equal computing power, equal speed, and on the average equal work loads). Therefore, we can speak of the relationship between λ, μ, c, and Δ as being the same for each center. Since computing centers and their effect on the entire network, unless identical, should not be considered so, this equating of the λ, μ, c, and Δ may seem like an unrealistic approach. Intuitively it is, until we realize that a center far below the capacity, speed, and overall computing power of the others would have a smaller queue and, therefore, accept fewer jobs, jobs that were shorter, and jobs that required less sophisticated computing power. In this sense, then, the λ, μ, c, and Δ of a large center in the network would be relatively and proportionally equal to those of the smaller center. Since we are speaking of probabilities relative to each separate center, this approach suits our purposes.
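To make assumptions (1), (2), and (7) concrete, the following Python sketch estimates Prob [q(t) > c] for a single center by straightforward simulation. It is an illustration only, not part of the analysis: it treats λ as a fixed, known number (the text takes λ to be random), it lets the simulated queue grow past c so that exceedances can be counted, and the function names and default values are hypothetical.

```python
import random


def simulate_queue(lam, mu, t_end, q0=0, rng=None):
    """Follow one sample path and return the number of jobs present at time t_end."""
    if rng is None:
        rng = random.Random()
    t, q = 0.0, q0
    while True:
        # Total event rate: arrivals always, service completions only if jobs are present.
        rate = lam + (mu if q > 0 else 0.0)
        if rate == 0.0:
            return q  # nothing can ever happen
        t += rng.expovariate(rate)
        if t > t_end:
            return q
        if rng.random() < lam / rate:
            q += 1  # arrival
        else:
            q -= 1  # service completion


def estimate_overflow(lam, mu, c, t_end, runs=10000, seed=1):
    """Monte Carlo estimate of Prob[q(t_end) > c], starting from an empty queue."""
    rng = random.Random(seed)
    over = sum(simulate_queue(lam, mu, t_end, rng=rng) > c for _ in range(runs))
    return over / runs


print(estimate_overflow(lam=2.0, mu=0.2, c=5, t_end=3.0))
```

A simulation of this kind is a convenient cross-check on closed-form probabilities, but it does not replace them: the regulator's table is meant to be computed once, analytically, and then only referenced.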
For each center, then, to use just the one decision table in the load regulator is just a matter of a normalization factor with respect to the different centers. To dispose of the problem of a consistent Δ, we will assume that the load regulator occupies roughly the geographic center of our network and, therefore, the communication delay times are equal. Our network, therefore, will roughly assume the shape of a ring (see Figure 2.1.).

[Figure 2.1. The load regulator surrounded by a ring of center queues.]

2.2. The Probability With a Known λ

Before proceeding with the development of the probabilities for our system, we may do well to look at the probabilities for a similar system when λ is known. Predicating our system on an unknown λ, as will be shown in section 2.3, necessitates using complicated mathematics to solve long and messy equations. Morse [1] shows us that these messy equations are not needed if the input rate is defined or can be approximated closely enough to suit the purpose. When λ is known, the probability of the queue at a center being overloaded can be defined as follows. Let P_n = the probability of n units in the system. Then,

\[
P_n = \frac{1 - \lambda/\mu}{1 - (\lambda/\mu)^{c+1}} \left(\frac{\lambda}{\mu}\right)^{n}, \qquad n = 1, 2, \ldots, c
\]

where c is the capacity of the queue. In our case, the probability of the queue being overloaded is just the probability that there are c+1 in the queue. Therefore,

\[
P_{c+1} = \frac{1 - \lambda/\mu}{1 - (\lambda/\mu)^{c+2}} \left(\frac{\lambda}{\mu}\right)^{c+1}
\]

which will simplify to

\[
P_{c+1} = \frac{\lambda^{c+1}(\mu - \lambda)}{\mu^{c+2} - \lambda^{c+2}} .
\]

We will see that this is much simpler to work with as we now move on to describe the probabilities when λ is unknown.

2.3. The Probabilities

We begin our analysis by developing the probabilities which we will need. The following approach to the derivations is due mainly to Bailey [2] and Saaty [3]. Assume that initially q(0) = 0 and that our first discrete sampling time (Kδ, K = 0, 1, ...) which will have any meaning is K = 1. Therefore, our first sample is taken at time 1·δ or δ and the size of the sample is m.
We wish then to define the probability that there are m jobs in the queue at time δ given that there were none in the queue at time 0. We can state this formally as

Prob [q(δ) = m | q(0) = 0].

Remembering that the input to each center is a Poisson process with unknown random mean λ and that the service time is an exponential random variable with an average service time of 1/μ, we define the following:

\[
\operatorname{Prob}[\text{1 arrival in time } \Delta t] = \lambda\,\Delta t \tag{1}
\]
\[
\operatorname{Prob}[\text{more than 1 arrival in time } \Delta t] = 0 \tag{2}
\]
\[
\operatorname{Prob}[\text{1 service completion in time } \Delta t] = \mu\,\Delta t \tag{3}
\]
\[
\operatorname{Prob}[\text{more than 1 service completion in time } \Delta t] = 0 \tag{4}
\]

where Δt is a very small time interval and (2) and (4) tend to zero in the limit as Δt → 0, since the actual probabilities are O((Δt)²) (read: on the order of (Δt)²), which is too small to be of any significance.

If we let

P_m(t) = Prob [m jobs in the system at time t]

then

P_m(t + Δt) = Prob [m in the system at time t and during the interval t to t + Δt no jobs arrived and no jobs were serviced]
+ Prob [m-1 in the system at time t and during the interval t to t + Δt one job arrived]
+ Prob [m+1 jobs in the system at time t and during the interval t to t + Δt one job was serviced].

Therefore,

\[
P_m(t+\Delta t) = [1 - (\lambda+\mu)\Delta t]\,P_m(t) + P_{m-1}(t)\,\lambda\Delta t + P_{m+1}(t)\,\mu\Delta t + O((\Delta t)^2), \qquad m \ge 1 \tag{5}
\]

and

\[
P_0(t+\Delta t) = P_0(t)\,(1 - \lambda\Delta t) + P_1(t)\,\mu\Delta t + O((\Delta t)^2). \tag{6}
\]

From the first equation (5),

\[
P_m(t+\Delta t) = P_m(t) - (\lambda+\mu)\Delta t\,P_m(t) + P_{m-1}(t)\,\lambda\Delta t + P_{m+1}(t)\,\mu\Delta t
\]
\[
P_m(t+\Delta t) - P_m(t) = -(\lambda+\mu)\Delta t\,P_m(t) + P_{m-1}(t)\,\lambda\Delta t + P_{m+1}(t)\,\mu\Delta t
\]
\[
\frac{P_m(t+\Delta t) - P_m(t)}{\Delta t} = -(\lambda+\mu)P_m(t) + \lambda P_{m-1}(t) + \mu P_{m+1}(t), \qquad m \ge 1
\]
\[
\lim_{\Delta t \to 0} \frac{P_m(t+\Delta t) - P_m(t)}{\Delta t} = \frac{dP_m(t)}{dt}
\]

Therefore,

\[
\frac{dP_m(t)}{dt} = -(\lambda+\mu)P_m(t) + \lambda P_{m-1}(t) + \mu P_{m+1}(t), \qquad m \ge 1 \tag{7}
\]

and for m = 0 we get from (6)

\[
\frac{dP_0(t)}{dt} = -\lambda P_0(t) + \mu P_1(t). \tag{8}
\]

For P_m(t) the corresponding probability generating function is

\[
\Pi(z,t) = \sum_{m=0}^{\infty} P_m(t)\, z^m . \tag{9}
\]

Multiplying (7) and (8) by z^{m+1} and z, respectively, summing over all values of m, and then using (9) shows that Π(z,t) satisfies the partial differential equation

\[
z\,\frac{\partial \Pi(z,t)}{\partial t} = (1-z)\,\{(\mu - \lambda z)\,\Pi(z,t) - \mu P_0(t)\}. \tag{10}
\]

We will continue the derivation under the assumption that initially there are i jobs in the system we are looking at, where i ≥ 0. If initially we have t = 0 and a total of i jobs in the system, we have the initial condition

\[
\Pi(z,0) = z^{i}. \tag{11}
\]

We use the Laplace transform with respect to time,

\[
\phi^*(s) = \int_0^{\infty} e^{-st}\,\phi(t)\,dt, \qquad \operatorname{Re}(s) > 0 \tag{12}
\]

and its inverse,

\[
\phi(t) = \frac{1}{2\pi i}\int_{C-i\infty}^{C+i\infty} e^{st}\,\phi^*(s)\,ds . \tag{13}
\]

Applying (12) to (10) and using (11) gives

\[
\Pi^*(z,s) = \frac{z^{i+1} - \mu(1-z)\,P_0^*(s)}{sz - (1-z)(\mu - \lambda z)} . \tag{14}
\]

It follows from the definition of the Laplace transform of Π(z,t) that Π*(z,s) must converge everywhere within the unit circle |z| = 1, provided Re(s) > 0. Thus in this region, zeros of both the numerator and the denominator on the right-hand side of (14) must coincide. The zeros α_K(s) of the denominator are obtained from the equation

\[
\alpha_K(s) = \frac{(\lambda+\mu+s) \pm [(\lambda+\mu+s)^2 - 4\lambda\mu]^{1/2}}{2\lambda}, \qquad K = 1, 2 . \tag{15}
\]

By Rouché's theorem the denominator [sz - (1-z)(μ - λz)] has only one zero in the unit circle, and it can be seen from (15) that |α_2(s)| < |α_1(s)|. Thus the numerator on the right-hand side of (14) must vanish when z = α_2(s). Hence,

\[
P_0^*(s) = \frac{\alpha_2^{\,i+1}}{\mu(1-\alpha_2)}
\]

and (14) can be written as

\[
\Pi^*(z,s) = \frac{z^{i+1} - [(1-z)\,\alpha_2^{\,i+1}/(1-\alpha_2)]}{-\lambda (z-\alpha_1)(z-\alpha_2)} . \tag{16}
\]

Multiplying the numerator by (1 - α_2) and factoring, we get

\[
\Pi^*(z,s) = \frac{(z-\alpha_2)(z^{i} + \alpha_2 z^{i-1} + \cdots + \alpha_2^{\,i}) - z\alpha_2 (z-\alpha_2)(z^{i-1} + \alpha_2 z^{i-2} + \cdots + \alpha_2^{\,i-1})}{\lambda \alpha_1 (z-\alpha_2)(1 - z/\alpha_1)(1-\alpha_2)}
\]

and since

\[
(1 - z/\alpha_1)^{-1} = \sum_{K=0}^{\infty} (z/\alpha_1)^K ,
\]

adding and subtracting α_2^{i+1}, we get

\[
\Pi^*(z,s) = \frac{1}{\lambda\alpha_1}\,(z^{i} + \alpha_2 z^{i-1} + \cdots + \alpha_2^{\,i}) \sum_{K=0}^{\infty} (z/\alpha_1)^K + \frac{\alpha_2^{\,i+1}}{\lambda\alpha_1(1-\alpha_2)} \sum_{K=0}^{\infty} (z/\alpha_1)^K .
\]

P_m*(s) is the coefficient of z^m in the expansion; hence, using α_1 α_2 = μ/λ, for m ≥ i

\[
P_m^*(s) = \frac{1}{\lambda}\left[\alpha_1^{\,i-m-1} + \frac{\mu}{\lambda}\,\alpha_1^{\,i-m-3} + \left(\frac{\mu}{\lambda}\right)^{2}\alpha_1^{\,i-m-5} + \cdots + \left(\frac{\mu}{\lambda}\right)^{i}\alpha_1^{-m-i-1} + \left(\frac{\lambda}{\mu}\right)^{m+1} \sum_{K=m+i+2}^{\infty} \left(\frac{\mu}{\lambda\alpha_1}\right)^{K}\right].
\]

Now the inverse is

\[
P_m(t) = \frac{1}{2\pi i}\int_{C-i\infty}^{C+i\infty} e^{st}\,P_m^*(s)\,ds
\]

and we get

\[
\begin{aligned}
P_m(t) = \frac{e^{-(\lambda+\mu)t}}{\lambda}\Big[ & (\sqrt{\lambda/\mu})^{m-i+1}\,(m-i+1)\,t^{-1}\,I_{m-i+1}(2\sqrt{\lambda\mu}\,t) \\
& + \frac{\mu}{\lambda}\,(\sqrt{\lambda/\mu})^{m-i+3}\,(m-i+3)\,t^{-1}\,I_{m-i+3}(2\sqrt{\lambda\mu}\,t) + \cdots \\
& + \left(\frac{\mu}{\lambda}\right)^{i}(\sqrt{\lambda/\mu})^{m+i+1}\,(m+i+1)\,t^{-1}\,I_{m+i+1}(2\sqrt{\lambda\mu}\,t) \\
& + \left(\frac{\lambda}{\mu}\right)^{m+1} \sum_{K=m+i+2}^{\infty} (\sqrt{\mu/\lambda})^{K}\,K\,t^{-1}\,I_K(2\sqrt{\lambda\mu}\,t) \Big]
\end{aligned}
\]

where I_n is a modified Bessel function of the first kind. Substituting

\[
\frac{2\nu}{z}\,I_\nu(z) = I_{\nu-1}(z) - I_{\nu+1}(z)
\]

and simplifying, we get

\[
P_m(t) = e^{-(\lambda+\mu)t}\Big[ (\sqrt{\mu/\lambda})^{i-m}\,I_{m-i}(2\sqrt{\lambda\mu}\,t) + (\sqrt{\mu/\lambda})^{i-m+1}\,I_{m+i+1}(2\sqrt{\lambda\mu}\,t) + (1-\lambda/\mu)(\lambda/\mu)^{m} \sum_{K=m+i+2}^{\infty} (\sqrt{\mu/\lambda})^{K}\,I_K(2\sqrt{\lambda\mu}\,t) \Big] \tag{17}
\]

which is the equation we sought. Finally, substituting 0 for i (initially we have zero jobs) and δ for our time, we get

\[
\operatorname{Prob}[q(\delta) = m \mid q(0) = 0] = e^{-(\lambda+\mu)\delta}\Big[ (\sqrt{\mu/\lambda})^{-m}\,I_{m}(2\sqrt{\lambda\mu}\,\delta) + (\sqrt{\mu/\lambda})^{1-m}\,I_{m+1}(2\sqrt{\lambda\mu}\,\delta) + (1-\lambda/\mu)(\lambda/\mu)^{m} \sum_{K=m+2}^{\infty} (\sqrt{\mu/\lambda})^{K}\,I_K(2\sqrt{\lambda\mu}\,\delta) \Big]. \tag{18}
\]

The above probability (18) defines our chances of finding m jobs in the queue at any of the centers in our network given that there were none to start with. We have shown, however, that q(0) = 0 is not a necessary restriction and that we may start with any number in the system.

We now know the probability of finding m jobs in the queue at time δ. What is even more important for us to know, at this point, is whether or not m is greater than c (i.e. are there more jobs in the queue than the system can handle). If m > c we would want either to prohibit further input of jobs to that node or at least to reduce it until the overloaded condition was no longer present. We are, therefore, interested in knowing the probability, upon finding m jobs in the queue at time δ, that the queue will exceed the capacity for the center.
Formally, we want

Prob [q(t) > c | q(δ) = m].

Since λ is a random variable it is allowed to vary over the range λ ∈ (0, ∞), and even the best means to estimate λ will often be grossly in error. Rather than depend on this somewhat unstable statistic, an average value of the quantity under consideration, Prob [q(t) > c], will be derived, and this estimate will minimize, though by no means correct, the error inherent in this calculation. Let P_{q(δ,m,0)}(λ) be the normalized Prob [q(δ) = m | q(0) = 0] such that

\[
\int_0^{\infty} P_{q(\delta,m,0)}(\lambda)\,d\lambda = 1 ;
\]

then from the previous probability derivation with substitution it should be clear that for t > δ, we have

\[
\begin{aligned}
\operatorname{Prob}[q(t) > c \mid q(\delta) = m] = \sum_{n=c}^{\infty} \int_0^{\infty} & P_{q(\delta,m,0)}(\lambda)\; e^{-(\lambda+\mu)(t-\delta)} \Big\{ (\sqrt{\mu/\lambda})^{m-n}\,I_{n-m}[2\sqrt{\lambda\mu}\,(t-\delta)] \\
& + (\sqrt{\mu/\lambda})^{m-n+1}\,I_{n+m+1}[2\sqrt{\lambda\mu}\,(t-\delta)] \\
& + (1-\lambda/\mu)(\lambda/\mu)^{n} \sum_{K=n+m+2}^{\infty} (\sqrt{\mu/\lambda})^{K}\,I_K[2\sqrt{\lambda\mu}\,(t-\delta)] \Big\}\,d\lambda .
\end{aligned} \tag{19}
\]

We might note that 1 - Prob [q(t) > c | q(δ) = m] is the probability that the queue size is within efficient limits.

2.4. Calculation of the α Change Factor

Before continuing with the probabilities we will look at some general considerations that must be taken into account. Remembering that the load regulator has a decision delay time of Δ, we note that no change in the input rate (either a blocking or reducing change) can be achieved before Δ + δ. This is because our first sampling time of any importance is Kδ, where K = 1. Thus if the probability in equation (19) exceeds ε at any time in the range (δ, Δ+δ), the load regulator will not change λ in time to correct it. Furthermore, it should be noted that if no change is ordered at time δ, then the earliest possible time for effecting a change would be Δ+2δ (Δ time units later than the next observation). Therefore, the present input rate λ should be maintained if and only if

Prob [q(t) > c | q(δ) = m] < ε for all t in the range (δ, Δ+2δ).
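As a numerical aside, the transient probability of equation (17), of which (18) is the special case i = 0, can be evaluated directly. The Python sketch below does so for a fixed, known λ > 0; it does not perform the averaging over λ that equation (19) calls for. The function names, truncation lengths, and sample parameters are assumptions made for illustration, and the Bessel function is summed from its power series so that only the standard library is needed. For m < i the sketch uses the identity I_{-n} = I_n for integer n.

```python
import math


def bessel_i(k, x, terms=60):
    """Modified Bessel function of the first kind I_k(x), integer k >= 0, by power series."""
    half = x / 2.0
    term = 1.0
    for n in range(1, k + 1):
        term *= half / n  # builds (x/2)^k / k!
    total = term
    for j in range(1, terms):
        term *= half * half / (j * (j + k))  # ratio of successive series terms
        total += term
    return total


def transient_prob(m, i, lam, mu, t, tail=60):
    """Prob[q(t) = m | q(0) = i] for fixed lam > 0, in the form of equation (17)."""
    r = math.sqrt(mu / lam)
    x = 2.0 * math.sqrt(lam * mu) * t
    start = m + i + 2
    series = sum(r ** k * bessel_i(k, x) for k in range(start, start + tail))
    return math.exp(-(lam + mu) * t) * (
        r ** (i - m) * bessel_i(abs(m - i), x)
        + r ** (i - m + 1) * bessel_i(m + i + 1, x)
        + (1.0 - lam / mu) * (lam / mu) ** m * series
    )


def prob_exceeds(c, i, lam, mu, t):
    """Prob[q(t) > c | q(0) = i] for fixed lam, as one minus the mass on 0..c."""
    return 1.0 - sum(transient_prob(m, i, lam, mu, t) for m in range(c + 1))


# Sample values: mu and the elapsed time are illustrative, lam is an assumed fixed rate.
print(prob_exceeds(c=5, i=3, lam=0.5, mu=0.2, t=5.0))
```

A quick sanity check on such an implementation is that the probabilities over all m sum to one, and that at very small t almost all of the mass sits at m = i.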
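If a change is necessary it should be made so as to guarantee that the criterion ε is met only for t in the range (δ, Δ+2δ), since at time 2δ we may order another change if needed. We conclude, therefore, that we wish to order a change in the input rate λ if and only if

Prob [q(t) > c | q(δ) = m] > ε, for t ∈ (Δ+δ, Δ+2δ).

If by chance

Prob [q(t) > c | q(δ) = m] > ε, for t ∈ (δ, Δ+δ),

which could happen, then our criterion will be violated.

If a change is necessary and ordered, we will then change λ to αλ, for 0 ≤ α ≤ 1, where α = 1 means no change, α = 0 means a complete halt to all input to that center, and α in between is some reduction factor. Assume that

Max Prob [q(t) > c | q(δ) = m], for t in the range (Δ+δ, Δ+2δ),

occurs at time t = t_0 and that we need a change (i.e. the probability is greater than ε). We then compute the change factor α in the following way.

For q(0) = 0

Let θ = t_0 - (Δ + δ); then

\[
\begin{aligned}
\operatorname{Prob}[q(t_0) > c \mid q(\delta) = m,\ \lambda \text{ was changed to } \alpha\lambda \text{ at } \Delta+\delta] = \sum_{n=c}^{\infty} \int_0^{\infty} \sum_{m_1=0}^{\infty} & \operatorname{Prob}[q(\Delta+\delta) = m_1 \mid q(\delta) = m] \\
& \cdot \operatorname{Prob}[q(t_0) = n \mid q(\Delta+\delta) = m_1]\; P_{q(\delta,m,0)}(\lambda)\,d\lambda .
\end{aligned} \tag{20}
\]

From this we get

\[
\begin{aligned}
\sum_{n=c}^{\infty} \int_0^{\infty} \sum_{m_1=0}^{\infty}
& \Big\{ e^{-(\lambda+\mu)\Delta} \Big[ (\sqrt{\mu/\lambda})^{m-m_1}\,I_{m_1-m}(2\sqrt{\lambda\mu}\,\Delta) + (\sqrt{\mu/\lambda})^{m-m_1+1}\,I_{m_1+m+1}(2\sqrt{\lambda\mu}\,\Delta) \\
& \qquad + (1-\lambda/\mu)(\lambda/\mu)^{m_1} \sum_{K=m+m_1+2}^{\infty} (\sqrt{\mu/\lambda})^{K}\,I_K(2\sqrt{\lambda\mu}\,\Delta) \Big] \Big\} \\
\cdot\ & \Big\{ e^{-(\alpha\lambda+\mu)\theta} \Big[ (\sqrt{\mu/\alpha\lambda})^{m_1-n}\,I_{n-m_1}(2\sqrt{\alpha\lambda\mu}\,\theta) + (\sqrt{\mu/\alpha\lambda})^{m_1-n+1}\,I_{n+m_1+1}(2\sqrt{\alpha\lambda\mu}\,\theta) \\
& \qquad + (1-\alpha\lambda/\mu)(\alpha\lambda/\mu)^{n} \sum_{K=n+m_1+2}^{\infty} (\sqrt{\mu/\alpha\lambda})^{K}\,I_K(2\sqrt{\alpha\lambda\mu}\,\theta) \Big] \Big\}\; P_{q(\delta,m,0)}(\lambda)\,d\lambda = \varepsilon
\end{aligned} \tag{21}
\]

the derivation being similar to what we did before.

For q(0) ≠ 0

Here we wish to make our decision at some time Kδ, with some number of jobs i already present in the center when we checked at time [K-1]δ. Let q(Kδ) = m, and let q([K-1]δ) = i. Then if no change was ordered at time [K-1]δ, from (17) we have that

\[
\operatorname{Prob}[q(K\delta) = m \mid q([K-1]\delta) = i] = e^{-(\lambda+\mu)\delta}\Big[ (\sqrt{\mu/\lambda})^{i-m}\,I_{m-i}(2\sqrt{\lambda\mu}\,\delta) + (\sqrt{\mu/\lambda})^{i-m+1}\,I_{m+i+1}(2\sqrt{\lambda\mu}\,\delta) + (1-\lambda/\mu)(\lambda/\mu)^{m} \sum_{k=m+i+2}^{\infty} (\sqrt{\mu/\lambda})^{k}\,I_k(2\sqrt{\lambda\mu}\,\delta) \Big]. \tag{22}
\]

If a change was ordered at time [K-1]δ, we have

\[
\begin{aligned}
\operatorname{Prob}[q(K\delta) = m \mid q([K-1]\delta) = i] = \sum_{m_1=0}^{\infty}
& \Big\{ e^{-(\lambda+\mu)\Delta} \Big[ (\sqrt{\mu/\lambda})^{i-m_1}\,I_{m_1-i}(2\sqrt{\lambda\mu}\,\Delta) + (\sqrt{\mu/\lambda})^{i-m_1+1}\,I_{m_1+i+1}(2\sqrt{\lambda\mu}\,\Delta) \\
& \qquad + (1-\lambda/\mu)(\lambda/\mu)^{m_1} \sum_{k=m_1+i+2}^{\infty} (\sqrt{\mu/\lambda})^{k}\,I_k(2\sqrt{\lambda\mu}\,\Delta) \Big] \Big\} \\
\cdot\ & \Big\{ e^{-(\alpha(K-1)\lambda+\mu)(\delta-\Delta)} \Big[ (\sqrt{\mu/\alpha(K-1)\lambda})^{m_1-m}\,I_{m-m_1}(2\sqrt{\alpha(K-1)\lambda\mu}\,(\delta-\Delta)) \\
& \qquad + (\sqrt{\mu/\alpha(K-1)\lambda})^{m_1-m+1}\,I_{m+m_1+1}(2\sqrt{\alpha(K-1)\lambda\mu}\,(\delta-\Delta)) \\
& \qquad + \Big(1-\frac{\alpha(K-1)\lambda}{\mu}\Big)\Big(\frac{\alpha(K-1)\lambda}{\mu}\Big)^{m} \sum_{k=m+m_1+2}^{\infty} (\sqrt{\mu/\alpha(K-1)\lambda})^{k}\,I_k(2\sqrt{\alpha(K-1)\lambda\mu}\,(\delta-\Delta)) \Big] \Big\}
\end{aligned} \tag{23}
\]

where Δ < δ and α(K-1) denotes the change factor ordered at time [K-1]δ.

Again, because of the random λ, we will minimize the error in computing Prob [q(t) > c] by normalizing. Let P_{q(Kδ,m,i)}(λ) be the normalized Prob [q(Kδ) = m | q([K-1]δ) = i] such that

\[
\int_0^{\infty} P_{q(K\delta,m,i)}(\lambda)\,d\lambda = 1 ;
\]

then for t > Kδ, we have

\[
\begin{aligned}
\operatorname{Prob}[q(t) > c \mid q(K\delta) = m,\ q([K-1]\delta) = i] = \sum_{n=c}^{\infty} \int_0^{\infty} & P_{q(K\delta,m,i)}(\lambda)\; e^{-(\lambda+\mu)(t-K\delta)} \Big\{ (\sqrt{\mu/\lambda})^{m-n}\,I_{n-m}[2\sqrt{\lambda\mu}\,(t-K\delta)] \\
& + (\sqrt{\mu/\lambda})^{m-n+1}\,I_{n+m+1}[2\sqrt{\lambda\mu}\,(t-K\delta)] \\
& + (1-\lambda/\mu)(\lambda/\mu)^{n} \sum_{k=n+m+2}^{\infty} (\sqrt{\mu/\lambda})^{k}\,I_k[2\sqrt{\lambda\mu}\,(t-K\delta)] \Big\}\,d\lambda .
\end{aligned} \tag{24}
\]

As before, we wish to change λ if and only if

Prob [q(t) > c | q(Kδ) = m, q([K-1]δ) = i] > ε

for t in the range (Kδ + Δ, [K+1]δ + Δ). Assume that the maximum of this probability over that range occurs at t = t_0. Now let θ = t_0 - (Kδ + Δ). Then

\[
\begin{aligned}
\sum_{n=c}^{\infty} \int_0^{\infty} \sum_{m_1=0}^{\infty}
& \Big\{ e^{-(\lambda+\mu)\Delta} \Big[ (\sqrt{\mu/\lambda})^{m-m_1}\,I_{m_1-m}(2\sqrt{\lambda\mu}\,\Delta) + (\sqrt{\mu/\lambda})^{m-m_1+1}\,I_{m_1+m+1}(2\sqrt{\lambda\mu}\,\Delta) \\
& \qquad + (1-\lambda/\mu)(\lambda/\mu)^{m_1} \sum_{k=m+m_1+2}^{\infty} (\sqrt{\mu/\lambda})^{k}\,I_k(2\sqrt{\lambda\mu}\,\Delta) \Big] \Big\} \\
\cdot\ & \Big\{ e^{-(\alpha\lambda+\mu)\theta} \Big[ (\sqrt{\mu/\alpha\lambda})^{m_1-n}\,I_{n-m_1}(2\sqrt{\alpha\lambda\mu}\,\theta) + (\sqrt{\mu/\alpha\lambda})^{m_1-n+1}\,I_{n+m_1+1}(2\sqrt{\alpha\lambda\mu}\,\theta) \\
& \qquad + (1-\alpha\lambda/\mu)(\alpha\lambda/\mu)^{n} \sum_{k=n+m_1+2}^{\infty} (\sqrt{\mu/\alpha\lambda})^{k}\,I_k(2\sqrt{\alpha\lambda\mu}\,\theta) \Big] \Big\}\; P_{q(K\delta,m,i)}(\lambda)\,d\lambda = \varepsilon .
\end{aligned} \tag{25}
\]

We can see that solving (21), (23), and (25) for α is at best a tedious and complicated operation. We can console ourselves with the fact that the α change factors need only be computed once and then stored away in the table of the load regulator; after this it is just a referencing operation.

It could also be noted, at this point, though it should be obvious, that for each value of m observed at Kδ in the range (0, c) a decision with regard to the input rate λ can be computed from the previous decision value of i observed at [K-1]δ.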
We can then generate our table in the following form:

q(K)    q(K-1)    α Change Factor
 0        0         α(0)
          1         α(1)
          .          .
          .          .
          c         α(c)
 .        .          .
 .        .          .
 c        0         α(0)
          .          .
          .          .
          c         α(c)

For purposes of illustration, a table was generated according to the following criteria:

(1) Storage capacity, c = 5
(2) Average service time = 5 units of time; therefore, μ = 0.2
(3) Queue scan at Kδ, K = 0, 1, 2, ... with δ = .05 units of time
(4) Delay time Δ = .045 units of time
(5) Keep the probability that the queue size exceeds c bounded below ε, where ε = .01.

The first column of the table, q(K), is the observed sample at time Kδ; the second column of the table, q(K-1), is the observed sample at (K-1)δ. The third column of the table is the required input multiplicative change factor α.

We can see from looking at the table that, with the parameters chosen as they were, the load regulation scheme's decision seems highly dependent on the present observed queue size. This is evidently the case since, with only a few exceptions, the first half of the possible queue samples (0, 1, and 2) requires an α change of 1 (i.e. no change), while the last half (3, 4, and 5) requires prohibition of all further inputs. Upon expanding this table for bigger queues we will find that this is also the case. We can reasonably expect no change when the queue is empty to approximately half full, some reduction in input around half full, and complete prohibition of input from approximately half full to full.

q(K)    q(K-1)    α Change Factor
 0        0         one
          1         one
          2         one
          3         one
          4         one
          5         one
 1        0         one
          1         one
          2         one
          3         one
          4         one
          5         one
 2        0         .54
          1         .87
          2         one
          3         one
          4         one
          5         one
 3        0         zero
          1         zero
          2         zero
          3         zero
          4         zero
          5         .32
 4        0         zero
          1         zero
          2         zero
          3         zero
          4         zero
          5         zero
 5        0         zero
          1         zero
          2         zero
          3         zero
          4         zero
          5         zero

c = 5    μ = 0.2    ε = .01    Δ = .045    δ = .05

Table 2.1.

We can also see that for servicing an entire network of multi-processing computer centers the size of the table (i.e.
the amount of memory storage in the load regulator) is not at all excessive. With the addition of the zero capacity possibility, our table has (c + 1)² entries. With the relative symmetry of one and zero around the middle of the table, it is not unreasonable to expect that even this figure could be reduced. We may reduce it in the following manner using the data from Table 2.1. We will let an entry of the form A,B, where A and B are real numbers, define a range from the first q(K-1) value to the last, for the same q(K), that require the same α change factor. For example, in Table 2.1. for q(K) equal to zero we have for q(K-1) from zero to five an α change factor of one (no change). Then for q(K) equal to zero the table entry of q(K-1) will be 0,5. The entire Table 2.1. could then be decreased in size to

q(K)    q(K-1)    α Change Factor
 0       0,5        one
 1       0,5        one
 2       0,0        .54
         1,1        .87
         2,5        one
 3       0,4        zero
         5,5        .32
 4       0,5        zero
 5       0,5        zero

Table 2.2

A total of nine entries now define the entire Table 2.1., a seventy-five percent reduction. A simple hashing algorithm can now be applied to find the correct q(K-1) range. While we do not expect this kind of reduction in all tables that would be generated, we would expect some. In fact, all we can say is that the number of entries in the new table is bounded by (c + 1) and (c + 1)². Our intuition, however, tells us that the number will be far less than (c + 1)². With these ideas in mind, we will now move on to discuss what happens to the jobs that are refused at some center by our load regulation scheme.

3. THE DISPATCHER

3.1. An Introduction to the Problem

In speaking of our load regulation scheme, we touched lightly on some of the problems inherent in a working network computer. We also paid lip service to the fact that there are some problems which must be solved by management and not by our scheme or any other (i.e. what happens to jobs when the entire system is full and no center can accept them).
One problem which we chose to lay aside before, but which now requires our attention, is the rerouting of jobs from a full or an overloaded center to one less busy. We will handle this by incorporating into our load regulation table other decision making material adequate to handle this redistribution of the work load efficiently. We will then call our scheme the load-regulator-dispatcher.

We note at this point that we could have made things considerably easier on ourselves in the beginning. If all jobs were initially routed to and dispatched from the load regulator instead of a common queue at each center, our problem would already be solved. The load regulator would know at each discrete sampling time the sizes of all the queues in the network and would route new jobs to centers it knew could handle them. But this would have been a costly if not an unrealistic approach; the time and the money wasted to send a job, possibly hundreds of miles, before it is even started, negate all the advantages of this idea. It would also restrict the type of jobs we would accept, as the cost of transmitting, running, and re-transmitting a short, fast job may outweigh the reward we would receive from it. We will, therefore, use the load-regulator-dispatcher to reroute only the jobs whose entry was refused at one of the centers. We begin by discussing the communication links that hold our network together.

3.2. The Communication Links

In constructing our network (Figure 2.1.), we omitted description of the communication links connecting the centers in our network. We did this because the question of inter-center communications was not relevant to our discussion of load regulation. It sufficed to say, at the time, that jobs refused entry at one center would have to be rerouted to another center.
Since it is our intention in this chapter to advocate a workable dispatcher to handle this rerouting problem, we next consider the question of communication channels between our centers. For purposes of clarification and completeness, we will talk about three possible communication configurations, the last of which we choose for our network.

The simplest way of forming a communications network is to provide each center with a communications line connecting it with every other center. Figure 3.1. shows our system if this approach is used. Assuming that each communications link is bidirectional (i.e. a link from center 1 to center 3 implies a link from center 3 to center 1), a network with n centers has n(n-1)/2 links. This configuration is optimal from a communications standpoint as it allows one center to communicate directly with all others but, unfortunately, it is only practical for a network with a very small number of centers. The cost of the many lines when n is large is prohibitive and forces us to seek a less optimal but more economical solution.

[Figure 3.1.]

Our obvious line of attack, therefore, is to eliminate as many connections as possible while still maintaining efficient communications in the network. The minimum number of communication links that will allow our network to function (i.e. one center has the capability to communicate with all others, though not necessarily by direct means) is achieved by one bidirectional line between geographically adjacent centers. This situation is portrayed in Figure 3.2.

[Figure 3.2.]

If a job is refused in a network that is connected in this manner, it cannot always be directly transmitted to the center that has accepted it, if any. The i-th center has direct links with only the (i-1)st and the (i+1)st centers.
If a job is to travel from center $i$ to a center $j$ that is not connected to center $i$ by a bidirectional communication line, it must travel through the intermediary center or centers. When transmitting a job, our wish is to minimize the number of centers that we have to pass through to reach a particular accepting center. We, therefore, define the following.

Let negative traffic flow be the transmission of a job clockwise in Figure 3.2 and positive traffic flow be the transmission of a job counterclockwise in Figure 3.2. If a job to be rerouted from a refusing center $i$ to an accepting center $j$ may travel in either the positive or the negative direction so as to minimize the number of transmissions, then center $j$ can be reached in not more than $n/2$ transmissions according to the following rules (see Table 3.1).*

For:                                        Direction of Job Travel
$j > i$ and $(j-i)/n < 1/2$                   Positive
$j > i$ and $(j-i)/n \ge 1/2$                 Negative
$j < i$ and $-1/2 \le (j-i)/n < 0$            Negative
$j < i$ and $-1 < (j-i)/n < -1/2$             Positive

TABLE 3.1.

*Centers are numbered arbitrarily but consecutively 1, 2, ..., $n$ in a ring formation with the $n$-th node connected to the first node. If the centers are numbered in some other manner, these rules do not hold.

This scheme works well and efficiently (only $n$ transmission lines) when $n$ is a small number. When $n$ gets larger, the cost to transmit a job the maximum number of times ($n/2$) may be more than the worth or the reward of the job itself.

We also encounter a question of reliability in this configuration. We want our network to be such that we can depend on it. If one node connector fails in this scheme, it may drastically hinder the running of the entire network. Since the sending of refused jobs depends so critically on finding the minimum transmission path, we cannot tolerate a breakdown in the communications between any two centers.
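The direction rules of Table 3.1 can be sketched as a small function. This is only an illustration under stated assumptions: centers are numbered 1 to n around the ring, positive flow is taken to visit centers in increasing index order, and the function names are hypothetical.

```python
from fractions import Fraction


def ring_direction(i: int, j: int, n: int) -> str:
    """Direction of travel from refusing center i to accepting center j (Table 3.1)."""
    if i == j:
        raise ValueError("refusing and accepting centers must differ")
    ratio = Fraction(j - i, n)  # exact arithmetic avoids rounding at the 1/2 boundary
    if j > i:
        return "positive" if ratio < Fraction(1, 2) else "negative"
    return "negative" if ratio >= Fraction(-1, 2) else "positive"


def ring_transmissions(i: int, j: int, n: int) -> int:
    """Number of transmissions needed when the job follows the chosen direction.

    Assumption: positive flow moves toward higher-numbered centers.
    """
    if ring_direction(i, j, n) == "positive":
        return (j - i) % n
    return (i - j) % n
```

With eight centers, a job refused at center 1 and accepted at center 6 travels in the negative direction and needs three transmissions rather than five.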
To illustrate this point drastically, picture the situation where the node between the refuser and the acceptor fails. Instead of two transmissions using the fallen node, we now must use $n-2$ transmissions to reach the acceptor, at a much greater cost. And if two centers failed at the same time, unless they were adjacent, it would mean a complete isolation of at least one center from the rest. Therefore, we will turn our attention to a scheme that uses more transmission lines to effect faster and more reliable communications between the geographically distant centers.

The scheme we deem feasible and propose to use is the connection, by one bidirectional line, of a center with its two adjacent centers and, in turn, with their adjacent centers (see Figure 3.3).

Figure 3.3.

As can easily be seen, we have added only $n$ more lines, for a total in this configuration of $2n$. This may not seem like an addition significant enough to increase our efficiency, but we have proposed this design for three reasons:

(1) Reaching the $j$-th center farthest from center $i$ can be achieved twice as fast.

(2) We are talking about computer centers with multi-processor capacities, not about inexpensive equipment. Each center is a large financial investment and, therefore, the number of centers is bounded by the availability of funds. We feel a network of sixteen centers to be sufficiently large for any purposes. Our scheme will work well with this number of centers.

(3) Should one center fail, there is still a communication line maintaining a possible minimum path in the direction from the sender, through the failed node, to the receiver.

In the light of this scheme, we find that the maximum number of transmissions from center $i$ to center $j$ is $\lceil n/4 \rceil$* if $n$ is not exactly divisible by 4 and $n/4$ otherwise.
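The properties claimed for this configuration can be checked with a short breadth-first search. The sketch below assumes that each center is linked to the centers one and two positions away on either side, uses 0-based center numbers for convenience, and treats $\lceil n/4 \rceil$ as an upper bound on the number of transmissions; all names are hypothetical.

```python
from collections import deque
from math import ceil


def double_ring(n: int) -> dict[int, set[int]]:
    """Adjacency sets for n centers, each linked to centers 1 and 2 steps away on both sides."""
    return {i: {(i + step) % n for step in (-2, -1, 1, 2)} for i in range(n)}


def link_count(links: dict[int, set[int]]) -> int:
    """Number of bidirectional lines (each line appears in two adjacency sets)."""
    return sum(len(neighbors) for neighbors in links.values()) // 2


def hops_from(source: int, links: dict[int, set[int]], failed=frozenset()) -> dict[int, int]:
    """Breadth-first search: fewest transmissions from source to every reachable center."""
    distance = {source: 0}
    pending = deque([source])
    while pending:
        current = pending.popleft()
        for neighbor in links[current]:
            if neighbor not in distance and neighbor not in failed:
                distance[neighbor] = distance[current] + 1
                pending.append(neighbor)
    return distance


def worst_case_hops(n: int) -> int:
    """Largest hop count from center 0; by symmetry this holds for every center."""
    return max(hops_from(0, double_ring(n)).values())
```

For sixteen centers the sketch reports 32 lines and a worst case of four transmissions, and with center 1 marked as failed, center 0 still reaches center 2 in one transmission over the second line.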
It may be noted that the rules governing the positive or negative flow of a job to minimize the number of transmissions are the same as in Table 3.1. We feel that this transmission linkage is adequate to handle the inter-center communications we desire. We will next discuss some general considerations and then describe an algorithm to be used for the rerouting.

*Read: the ceiling of $n/4$, meaning the next integer higher than $n/4$.

3.3. General Considerations for Rerouting

Before formally stating the rerouting algorithm, we shall consider the reasons why the job was refused entry to a center in the first place. We will also discuss under what conditions a center will accept work that another has turned down. It may happen that a job is refused at one center and no other center can accept it at the time; is the job then lost to the system entirely, or should it be resubmitted at a later time? We will now focus our attention on these and other questions of interest.

From Chapter 2 we see that the load regulation scheme used refuses entry to the incoming job solely on the basis of the queue size probabilities developed there. We feel, at this point, that a more thorough discussion of the idea of critical queue size is in order. It is implied, though not intended, that the decision to accept or reject work rests strictly on the number of jobs that are held in the queue, and that when the queue reaches the critical number, the load regulator inhibits further inputs to the center. While the queue size is one of the factors in determining whether or not to accept work, it is not the only one. Another factor in the decision is the expected execution time of the jobs waiting in the queues. If this expected execution time is large, then incoming jobs joining this queue may have to wait for a long time. It may be reasonable under these circumstances to assume that the center has reached its saturation point and to inhibit any further input to it.
In this light, we would inhibit input even though the number of jobs is less than the maximum queue size. We, therefore, observe that the decision to accept work is a function of the execution time as well as the number in the queue. We must remember, however, that the critical queue size in our equations is dependent only on the number of jobs. We must, therefore, express the queue size in terms of the number of equivalent average jobs, rather than just the number of jobs actually in the queue, so that our load regulator will still work effectively. We do this in the Appendix. In summary, then, we have the equivalent number of average jobs, ENJ, given by

$ENJ = (TET_i / CRIT_i) \cdot CAP_i$

where

$CAP_i$ = the maximum queue capacity of the $i$-th center
$TET_i$ = the total execution time for the $i$-th center
$CRIT_i$ = the critical execution time of the $i$-th center

Another reason for refusal of a job is that the initial center is just not equipped to handle it. In this case, the job must be rerouted to a center capable of running it. For ease of understanding in the formation of the rerouting algorithm, we will assume that this does not happen: any job entering the network is capable of being run at the center where it is input initially or at any other in the system. Again, we will leave it to management to decide what happens to those jobs that cannot be run at some center.

With the discussion of the refusing center completed, we can talk about the center, if any, that will receive the job. It would be expedient, at this point, to say that the circumstances governing a center accepting a job from another center are exactly opposite to the reasons the first center refused it (i.e. it has room in the queue and plenty of time in which to run it). But it would also be incomplete. The biggest factor in the acceptance of a job another center refused is the cost involved.
In particular, the receiver must be able to ascertain whether the network will still profit despite the cost of transmission and retransmission of results between the refuser and the acceptor. If one center refuses a job and no other center can accept it, one of two things can happen:

(1) the refusing center can try to make room for the job by sending one or more jobs to another center (at a profit, of course), or

(2) the job is lost to the network and must be forgotten or re-submitted at a later time.

Money, here as in most areas of any interest, is the governing force.

With these considerations in mind, we will allow the user to place a time estimate on his job and to assign to it a priority number between 1 and some $x$ (the higher the number, the higher the priority), used to position his job in the queue. We assume here that an incoming job with some priority $p$, which causes an overload and forces the load regulator to inhibit further inputs, will not displace jobs of priority less than $p$ unless it cannot be run elsewhere. Now, we will finally consider the algorithm.

3.4. The Algorithm

Let

$CAP_i$, $i = 1, 2, \ldots, n$ = the queue capacity at the $i$-th center

$AT_i$, $i = 1, 2, \ldots, n$ = the average execution time of jobs that enter the $i$-th center

$CRIT_i$, $i = 1, 2, \ldots, n$ = the critical execution time of the $i$-th center ($CAP_i \cdot AT_i$)

$T_{\ell i}$, $\ell = 1, 2, \ldots, k$, $i = 1, 2, \ldots, n$ = the expected execution times of the $k$ entries in the $i$-th center's queue

$TET_i$ = the total expected execution time for the $i$-th center

$M_i$ = the present size of the $i$-th queue

$REW_{\ell i}$ = the reward for doing the $\ell$-th job at the $i$-th center (here reward is defined to be the profit for running this job; if a job is transmitted to another center, the reward is decreased by the cost of transmitting the job and getting back the results)

$\ell$ = the position in the queue of the $\ell$-th job, where the priority of the $\ell$-th job is greater than or equal to the priority of the $(\ell+1)$-th job and jobs are run in the order $1, 2, \ldots, \ell, \ell+1, \ldots, k$

$w_i = \left( \sum_{\ell=1}^{m_i} rew_{\ell i} \right) / m_i$ = the weight in importance of the $i$-th center, where $\sum_{\ell=1}^{m_i} rew_{\ell i}$ is in the range $(m_i \cdot 1,\ m_i \cdot x)$ and $w_i$ is in the range $(1, x)$

(Remember that the user specifies the worth of his job by giving it a number between 1 and $x$. We can, therefore, get from the $w_i$ an estimate of the kind of work a center is doing; a center with $w_i = 2$ would not seem to be doing useful work, while a center with $w_i \gg x/2$ would be doing very useful work. We will assume that $x$ is the same for all centers. Heuristically, it seems that $w_i = x/2$ for all $i$, $i = 1, 2, \ldots, n$, would be the best for the network, since a center with $w_i \gg x/2$ is doing the more important work and is a bigger risk to the network if it should fail. If jobs are distributed equally in importance, then the risk is about the same for all centers in the network. We mention this here, but we will not attempt to level the work loads so that the $w_i$ are relatively equal.)

$ET_{job}$ = the estimated time to run the refused job

$P_{job}$ = the priority of the job (between 1 and $x$)

$PROF_{job}$ = the profit on the job

$C_{ij}$ = the cost to transmit a job from the $i$-th center to some center $j$ (this cost is computed by using the shortest path from center $i$ to center $j$ according to the communication paths described in Section 3.2)

$TIME_j$ = the total estimated time to process all jobs already in center $j$, plus the estimated time to run the refused job ($TET_j + ET_{job}$)

Then the algorithm proceeds as follows for a job $\ell$ that must be rerouted from center $i$ to center $j$:

STEP I: Obtain the $w_i$, $i = 1, 2, \ldots, n$, from the table.

STEP II: Choose the center $j$ not yet considered with the smallest remaining $w_j$ and compute $TIME_j = TET_j + ET_{job}$. If no more $w_j$ remain, GO TO STEP V.

STEP III: If $TIME_j > CRIT_j$, RETURN TO STEP II; otherwise compute $PROF_{job} = REW_{\ell i} - C_{ij}$.

STEP IV: If $PROF_{job} < 0$, RETURN TO STEP II; otherwise insert the job into the queue of this $j$-th center in the following way: given that the last job already in the queue that has a priority equal to $P_{job}$ is found in the $q$-th position, insert the job in the $(q+1)$-th position and displace the lower priority jobs, if any, by $ET_{job}/AT_j + 1$ positions* in the queue; then STOP.

STEP V: If the job has a higher priority than some jobs in the queue, insert the job into the $i$-th center as in STEP IV and then GO TO STEP II to reroute the job or jobs that had to be displaced.

*Performed in integer arithmetic because we still want our queue size in terms of time if necessary.

Generally, this algorithm tends to choose the center that is doing less useful work than the others, in the hope of increasing the $w_j$ for this center and making it more useful to the network.

3.5. The Load-Regulator-Dispatcher

We have thus far discussed the communications scheme for our network, some of the reasons why a job might be refused and need these communication paths, and finally the algorithm by which we accomplish the rerouting. In order to achieve our goal of a workable load-regulator-dispatcher, we need to finish construction of the decision table. We built the load regulator part of the table in Chapter 2 with $2(c+1)$ entries; we need now to add the dispatcher part to the table.

The following information is necessary for the algorithm and can be divided into two categories: information stored in the load-regulator-dispatcher permanently, and information local to each center that comes to the load-regulator-dispatcher as parameters of the refused job.

LOCAL INFORMATION

A. (1) $ET_{job}$  (2) $P_{job}$  (3) $REW_{\ell i}$
   these are job parameters carried by each refused job.

B. (1) $CAP_i$  (2) $AT_i$  (3) $T_{\ell i}$
   local statistics used for updating the information stored in the load-regulator-dispatcher.

LOAD-REGULATOR-DISPATCHER INFORMATION

(1) $TET_i$  (2) $CRIT_i$  (3) $w_i$  (4) $C_{ij}$  (5) $M_i$
information used by the algorithm and updated as the characteristics of a center change.

Our load-regulator-dispatcher table then takes on the following form (see Table 3.2). The number of entries in the table is now $2(c+1) + 4c$, a number which is well within the realm of feasibility, especially when all centers, with the use of a normalizing factor, can use the same table. We also expect that, with the same analysis as in Chapter 2, we can reduce this number. Our load-regulation table, with the addition of this information, becomes a load-regulation-dispatching table, the generation of which was our goal.

[TABLE 3.2. The load-regulator-dispatcher table.]

4. CONCLUSION

In the preceding chapters, we discussed briefly the ideas of load regulation and dispatching in a network of computers. We developed analytical tools which we used to form a decision table in memory, a decision table that adequately handled the inhibition or reduction of input to any center in our network. We then extended our load regulator into a load-regulator-dispatcher. We, therefore, have the tool for control of our network that we sought to find. The only limitation to this scheme would be one of memory space, and how much of this precious resource management is willing to lay aside for this purpose. We have shown that the memory needs of the load-regulator-dispatcher are not excessive in terms of the job that it does for us.
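Looking back at Section 3.4, the search for an accepting center (Steps I through IV) can be sketched as follows. This is an illustration under stated assumptions rather than the dispatcher itself: the `Center` record, the `cost` dictionary keyed by (refuser, acceptor), and both function names are hypothetical, and the insertion rule is read as placing the job after the last queued job whose priority is at least as high.

```python
from dataclasses import dataclass


@dataclass
class Center:
    w: float     # weight in importance, w_i
    tet: float   # total expected execution time, TET_i
    crit: float  # critical execution time, CRIT_i


def choose_center(refuser, centers, et_job, rew_job, cost):
    """Return the index of the accepting center, or None if no center qualifies.

    A None result corresponds to the situation that Step V has to resolve.
    """
    # Steps I and II: read the weights and visit candidates by increasing w.
    candidates = sorted(
        (j for j in centers if j != refuser), key=lambda j: centers[j].w
    )
    for j in candidates:
        time_j = centers[j].tet + et_job         # Step II: TIME_j = TET_j + ET_job
        if time_j > centers[j].crit:             # Step III: center j is saturated
            continue
        prof_job = rew_job - cost[(refuser, j)]  # Step III: PROF_job = REW - C_ij
        if prof_job < 0:                         # Step IV: rerouting would lose money
            continue
        return j
    return None


def insert_position(queue_priorities, p_job):
    """0-based index at which a job of priority p_job joins a queue sorted by
    non-increasing priority (assumed reading of the Step IV insertion rule)."""
    position = 0
    while position < len(queue_priorities) and queue_priorities[position] >= p_job:
        position += 1
    return position
```

The sketch stops at the first center that passes both the time test and the profit test, which mirrors the preference for the center with the smallest weight.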
It could be noted, at this point, that the load regulation part of our table could be used in a single center for control of jobs on the different processors, and future research may look at this. The algorithms that do the dispatching in our network could also be refined to a degree that management would have a hard time finding fault with these ideas for control. We have, in this paper, then, razed the wall that stood in the way of using this type of controller. It now remains only to sidestep or climb over the rubble and debris left behind to reach a usable load-regulator-dispatcher.

APPENDIX

We wish to express the queue size in terms of the number of equivalent average jobs in the queue so that our load regulator will still work efficiently. If we let

$m_i$ = the actual number of jobs in the $i$-th center's queue

$CAP_i$ = the capacity of the queue of the $i$-th center

$AT_i$ = the average processing time of jobs that are run at the $i$-th center (a figure gathered over a representative period of time)

$T_{\ell i}$, $\ell = 1, 2, \ldots, m_i$ = the expected execution times of the $m_i$ entries in the $i$-th queue

$TET_i$ = the total execution time for the $i$-th center

$CRIT_i$ = the critical execution time of the $i$-th center

$ENJ$ = the equivalent number of average jobs in the queue

then

$CRIT_i = CAP_i \cdot AT_i$ (a reasonable upper bound)

and

$TET_i = \sum_{\ell=1}^{m_i} T_{\ell i} + AT_i$    (A-1)

where we add $AT_i$ to represent the execution time of jobs already in process at the $i$-th center. Finally, the equivalent queue size, that is, the number of equivalent average jobs (ENJ), is given by

$ENJ = (TET_i / CRIT_i) \cdot CAP_i$    (A-2)

We can also see at this point that if

$TET_i / CRIT_i > 1$

we should inhibit further inputs to the $i$-th center.

EXAMPLE

Let

$CAP_i = 100$
$AT_i = 50$ units of time

then

$CRIT_i = CAP_i \cdot AT_i = 5{,}000$ units of time.

Suppose that the following jobs 1 to $m$ ($m = 20$) are in the queue:

Job No.    Units of Expected Execution Time, $T_{\ell i}$
   1          150
   2           25
   3           40
   4           90
   5          110
   6           30
   7           10
   8           75
   9           50
  10          175
  11           45
  12           55
  13           20
  14           90
  15          200
  16           70
  17           60
  18          100
  19          220
  20           80

Then applying (A-1) we find

$TET_i = 1695 + 50 = 1745$

and from (A-2) we get

$ENJ = (1745 / 5000) \cdot 100 = 34.9 \approx 35$

This implies that for this situation, when our queue size is polled at a discrete sampling time $KS$, for some $K$, the equivalent queue size sent back to the load regulator should be representative of the number of average jobs (35) and not the actual figure (20).

We should also note that if

$AT_i \cdot m \ge TET_i$

(i.e. the expected total execution time for $m$ jobs is less than or equal to the average), then the equivalent queue size sent back to the load regulator should be $m$. Therefore, we stipulate that the lowest number that can represent the equivalent queue size is the actual size of the queue in terms of the number of jobs (i.e. $CAP_i \ge ENJ \ge m_i$). We do this because it is too hard to change the hardware configuration of the queue, which specifies that each job should occupy a specific physical space (i.e. four words). It would, therefore, be unwise to attempt to cram four or five short jobs into the physical space generally occupied by one or two.
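The calculation of the Appendix can be reproduced with a few lines of code. The following is a minimal sketch, assuming that the result of (A-2) is rounded to the nearest whole job (the example reports 35 for 34.9) and that the value is then held between the actual queue size and the queue capacity; the function names are hypothetical.

```python
def equivalent_number_of_jobs(times, cap, at):
    """Equivalent number of average jobs (ENJ) for one center.

    times: expected execution times T of the jobs now in the queue
    cap:   queue capacity CAP of the center
    at:    average execution time AT of jobs run at the center
    """
    crit = cap * at                # CRIT = CAP * AT
    tet = sum(times) + at          # (A-1): queued work plus the job in process
    enj = round(tet * cap / crit)  # (A-2), rounded to a whole job (assumption)
    # Stipulated bounds: CAP >= ENJ >= actual number of jobs in the queue.
    return min(cap, max(len(times), enj))


def should_inhibit(times, cap, at):
    """True when TET / CRIT > 1, i.e. the center should refuse further input."""
    return (sum(times) + at) > cap * at


example_queue = [150, 25, 40, 90, 110, 30, 10, 75, 50, 175,
                 45, 55, 20, 90, 200, 70, 60, 100, 220, 80]
```

Calling `equivalent_number_of_jobs(example_queue, cap=100, at=50)` on the twenty jobs of the example yields 35, while a queue of twenty very short jobs is reported at its actual size of 20.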