; ■ . ■;.- • : ' 0'855 I! a n Ik m n iyi^f^ Report No. UIUCDCS-R-76-830 NSF-0CA-MCS73-07980 A03-000024 INTERPROCESSOR CONNECTIONS- CAPABILITIES, EXPLOITATION AND EFFECTIVENESS ^ by Kuo Yen Wen October 1976 !^^<(.>twork 2.2. 1 £ontrol_S true turGs_fgr_Ome£a_Not work 2.2.1. 1 Source/Dostindtion_Ta2_MGtho^ •"he omcTn network is attractive not orly because of its low qate complexity, bat also b<^'caus€ of its control simplicity. There are a numhor of ways to control the omeqi network. T\\^ most fandamental laethoci is the destination taq method[''-i] which uses K destinatioii tags, each of log N bits, ^ach source port has a tag which r'^-inresents the destination port numbers the data element intends to reach. log N stages will be required to set the network, and as each stage is set, the data elements will be switched accordingly. This control method can he used to pass all omoqa-passa ble ii^XiBMl^ t ions . A more general method is the source tagging met hoi [U]. Instead of using the destination tags, source tags are used. One source tag is associated with each output and represents the input to which that output port will be connected. The source tagging method c<\n pass all oneqa- passa.ble connections, including t h<=' one- to-many- connections that Cannot be realize-i by th(^ destination tai methoa. However, the iiiain drawback is that the network control has to be set stage by stage fLora the last stage to the first stage before the data elements can be passed through the network, flenca an extra C (log N) gate delays will be nee.ied Cor a conection. Details of these two control iJ methods are dosccibed in [ U ], A inodifiGd destination tag method that will allow certain broadcasting (one-to-many) connections will be presented here. Using this modified method, he will tio away with the extra ©(log.N) gate delays needed for the source taoging method. however, the broadcasting furctions will be limited to only one- to-power-of-two elemients. For example, the following connection will not be realizable by this method, cut can he realized using the source tag rn thod. Example Source Hestinition 1 2 2 3 For this modified tag method, instead of allowing only or T for each of the log^N tag bits, we allow '0', •1', •♦• or '-' for each of the log.N destination tag characters. So a source/iest inat ion pair may now look like: (0 1 1 ■) 1, * 1 1 *) This pair will represent a one-to-four broadcasting function of: (0 1 1 1, C 1 1 0) (0 1 1 1, C 1 1 1) (0 1 1 1, 1 1 1 0) (0 1 1 \ 1 1 1 1) The tag characteu •*' takes on the value of all possible binary diqits, while 'C and ' '' » still have the original meaning. For completeness, we have to use the tag character: '-' to specify no connection. So for a complete Source/destination set, we might have: Fyample (0 0, * 0) (0 1, 1 1) (1 0, 1) (1 n - -) source port destination port 2 1 3 2 1 It should bG clear by nOw that this modiiied tag method can only be used if the power cf two (say, 2 ) destination ports that a source port is connected to have the same (log. N-h) bits in their tag bit representations. This modified destination tag inethod also provides a great notational advantage over the source tag method when we have to describe a one-to-pcwe r-of- two broadcasting functions. As can be seen in later chapters, most of the common broadcasting functions are of this type and can be easily described oy this modified tag method. 
SIS "I 10 2.2.1.2 Co 1 iiiin_ Con trol_ Method To control an omega nctwoclc of size N kN for any omega passable E§£iDJiil3.ili2Q/ ^^ would require Nlog N/2 control bits. TSach switching element will either be set to the 'cross over* or *strdiqht through' state according to the value of the corresponding control bit. Suppose we are willing to sacrifice some of the capabilities of the network in order to further simplify the control structure. If we use only log N control bits, each controlling a complete stage of switching elements, then we will have the column control method. An omega network utilizing column control method turns out to be exactly the same as Batcher's scrambling/unscrambling network[ 12 !• As pointed out in [12], the scrambling/ uncramblinq network can be constructed with 1 eg N levels of selections and perfect shuffles, just as an onrega network. Let s.. he the 1th most significant bit of the bit •J representation at the ith source port and d— be the ith most significant hit of the bit representa tior. of the ith destination port. Mso lat p. be the jth irost significant J control bit. Then for any permutation to be realizable by Vj=1...1og2N, Vi=1 . . ..N. where ® is the exclusive-or operation. More details can be found in [ 12 1. the column control method, dj. = Sj. © p: Since there is a total of only 2**(log N) (=N) 11 different sots of p. '3, we can pass at most i: rjistinct permutations using this method. Using the aigumonts in [12], we can see that if the (i, j) th element of an NxN matrix is stored in the itb position of memory module i®j (0twork using this column control method (either by building five stages of shut f le-exchange s or recycling a one stage network five times) , and also that we set p to CI 111. , The permutation that we get at the output will be (5 7 1 3 4 6 2) for an 6x8 network. However, the fourth most significant bit of p will force a shuttle and then 'exclusive or' with the most significant i>it of the source 12 c 2 2 3 3 7 7 5 1 a 2 3 2 7 3 5 7 2 1 a 8 1 2 6 1 3 5 <5, 4 1 6 -> 1 3 H 2 1 3 6 7 1 5 6 U 5 6 3 1 7 6 5 1 U 6 6 3 5 7 U 5 C U 2 7 7 7 f) 5 4 a 2 Fiqure 2.1.1 Intecmedidte Patterns using p=C1111 s y S S E S S c a u a 5 '■) 5 1 '4 A 5 u 1 7 2 1 5 6 7 U 1 3 5 1 2 7 6 3 U 2 6 5 1 7 a 5 6 2 7 1 3 6 f. 3 7 1 2 3 6 7 7 3 3 3 2 2 2 Figure 2.1.2 IntcrmGdiate patterns asing p=0 11®1 1 r=1 01 plus two shuffles at the end T'igure 2. 1 13 taqs ^qain. A similar etfect will b« produced hy the least significant bit of p on the second most significant bit of the source tags. The net effect will then be equivalent to setting p=0 1 1©1 1 C (= 10 1 ) and adding two extra shuffles at the end. Note that the output of Figure 2.1.2 is also (5 7 1 3 U '^ 2) . So by setting p=01111, we only result in the 2-shuffled version of setting p=101. However, it should be notsd that the Nlog N upper limit on number of passable permutations onlv applies to the column control method. Foe any individual switch control irethod (such as the source/ destination tag method and the ROM method), the upper linit depends on the nnmher of stages. The perinutation capabilities of the log N stag^^ column controlled omega network are well delirod by [12], and we will not rePeat his results here. However, we can increase the capabilities of a column controlled network to allow certain broadcasting functions by usi rg two control bits, bo and b, t in each column. In Chapter 3, it will be shown what broadcasting functions can be realized by this method. 
The switching functions for various values of b^ and b, are shown in Figure 2.2. u^ Ihl 14 ^0 t>, Action 111 ustration upper hcoadcast straight pass cross Over lower broadcast C \ ► 1 - — ^ — ► 1 ^ ► X — ► ► 1 1 ^ y ► Figure 2,2 15 2.2.1.3 PC iM _Con t ro l.He t hod To implement the source/destination tag method, we either use a fast method [13] which would require SNlog^ N (<^* (loGfj^"^^ ^■^^ qates, or a slower method [T^] wnich needs (Ud* 1 1 ) Nlog N gates but requires the use of strobes at each stage to pass the tag bits along. The column control method also needs the use of stroties if a one-stage shuffle-exchange network is used* So the propagation delay through the network is On the order of log N clocks. In this section, we will propose another control method which can eliminate the use of the strobes (clockings) without paying too much penalty in gate counts. This POM control method provides a faster m--»thod to evaluate the control function at each of th^ Nloq N/2 switches simultaneously and these functions are imposed on all stages of the network at the same time. 3o instead of taking log N clocks through th-^ network, we would only require a couple of clocks for the source data to be routed through. Ihis oreatly reduces the network delay for the processing system. This method does not pass as many permutations as the source/ destination tag method, but it will pass many of the more common ones. It car pass all shift, flip, (c-i) , and odd-ordered vector unscrambling permutations in any power of two partitions. Ho again assume, as in Section 2.2.1.^;, that a control bit value of 1 will set d switching element to the 16 •cLOss over* state and the value of will set it to the •straight throuqh' state. The basic idea is to fetch N/2 control bits from each of log^ N ROM's, accocding to which permutation function is called for. The array of log N x N/2 control bits is called the control iritrix ax\d can be imposed on the omega network to facilitate the corresponding permutation function. For ex-imple, the control matrix for a 1-shift permutation in a UxU oirega network is: Imposing it on an omega network, we g€t: o O -o 1 O 2 o 3 fee will iirmediatcly get the following permutation : 17 -■ 1 -• 2 - 3 -■ > 1 •> 2 > 3 •> , which is a 1-shift permutation. In order to miniinizc the amount of ROf! space required for different families of periratations, as many common characteristics are recognized frorr the control patterns as possible. Then, by using sone extra logical operations, i X I 20 shift distance control pattern c 1 1 1 1 1 1 1 1 shift c lontrol distance pattern 4 1 1 1 1 5 1 1 1 1 1 1 1 1 1 6 1 1 1 1 1 1 7 1 1 1 1 1 1 1 igure 2.4 Control Patterns for Shift Peciriutations N = 8 21 space of ( U2+U+. . . N/2) bits =N(N-1)/2 bits, with the help of some additional loqic elements. R similar phenomenon to that in the shift patterns can be seen for odd-ordered vector uascramblir.g permutations. The control patterns for various odd-ordered vector unscramblinq for N=16 are shown in Figure 2.5, Let P P/s* • • p be the bit representation of the order of Unscramblinq. Then P.P^....p« . will be used as an address to fetch the basic pattern for the first column, P2.,,.p^_, as the address to fetch the pattern for the second column, and so on. The output, however, does net need to be exclusive-ored with p. (as in the case of shift patterns) to produce the correct patterns. A possible organization of the control system is shown in Fiqure 2.6. 
Using microprogramiring, we can set k,k2....k^ to 3,S2....Sj^ for shifts, to c^c^'-'-c^ for (c-i) permutations, and to Op^ pg unscrairbling. Pn-. for odd-ordered vectoc The basic control patterns have to be generated and input into the POI's. The basic shift patterns can be generated quite easily. There will be N/2'* entries in the jth RCM from the lof t ( 1< j :> ■s iN 22 control order pattern 1 c (5 3 1 1 1 1 1 1 1 1 1 5 1 1 1 1 1 1 1 1 7 1 1 1 1 1 1 1 1 1 1 1 1 1 control order pattern 9 1 1 1 1 11 1 1 1 1 1 1 1 ■ 1 1 13 1 3 1 t -4 .f^ ^^' *D' -0 1 1 Q 1 1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Figure 2.5 Control Patterns for p-unscrambling 23 J3 C o CM +J ^ rO Z Q. Sk Z3 O o 4-> c: o o a: C7> O C < CsJ s- 3 cn .(£ > » — r^ ^ E +j o 'f— O) +-> > n3 -o Q. CD J- OJ C +J -o -M M- $_ • •^ o • JZ 1 • (O Q. CM 24 p-ordered vactor unscrambling patterns, it requires a bit irore work. For the first stage pattern, the ith control bit (0 ax+cy = bx+dv mm rn Rl) x+a = y+rt --> x=y m m P2) x=y --> ax = ay m m F;3) If a is prime to ir (i,e. gcd (a,rii) =1 ) y then ax = ay --> x=y m m 26 FU) ax = ay < — > x=y am m ES) If 0 x=y am TO m P6) x=Y and xf^y --> x ^ y El through B6 can be found in [U]. We will also present Leir.ma 2.1 which is extended from Lemma 1 of [ft! and some of the above number theory results. Lemma., 2. 1 Let 0 «i =7, and X2=yj,. an Proof : a) to prove ax, ^-Xg e ^y, +y — >x,=y, and ^z'^iz an Lemma 1 of [ U ]. proved in b) to provt^ ^1 ^Yi '^'^d X2 = y2 — >dx, +X2 = ay, ^l^' assume x,=y, an and X2=Y2 • By TRU, ax = ay,- Since 0<:Lr.,'i^<^, then by P5, we an have X- = y„ . So by RC, we qet ax,+x. = ay +y,, QED. 2 ~ '2 an I -^2 an 1 -'2 To simplify the proof of the partition theorems, we need to restate Theorem 2 in [U], which dictates whether a given connection is omega-passable or not. Lemma_ 2.2 (Equivalent Statement of Theorem 2 in [ U ]) Given a set of desired input-output connection P = {(5. ,D.) I 0^ , and so on, A pictorial illustration of Pm is shown in Figure 2.7. >i*r I « re > I IS > ■ M ik' With all these preliminary definitions and lemmas, we can present the first cf the oireqa partition theorems. 28 Partition //O 1 2 o o 3 4 Partition n 5 6 o o 7 o 8 o Partition n 9 10 o 11 o 12 o Partition in 13 14 o o 15 o Flip 3-order unscrambling 1-shift 1-to-many broadcast Figure 2.7 A Partitioned permutation 29 Theore iT 2. 1 Let nLt^L ^"^ ^t^^fA- ' Olii^' ^^^ l^*" N=Lx«, then ^N^^N^^L^fPo'P \-,>- We first present a simple sketch proof in Proof 1. then we present a more rigorous proof in Proof 2» Proof 1 Assume n.^P|^. By lemina 2.2, there exist S, -SpM+tpq , D.=dpM+epq, S2~Su^**-MV» ^l~^^^*^o\/ ^^^ ^ such that S, ^ S2 and X D, S D2. Let m=logM, n=loqN, b=logL, and x=loqX. It X>Mr pictorially we have: ^2 4— b- Sp -».«— — m — ► tpcj Su t^jv — . , „ , ■* [J^A 1± 'UV I D2 m c;; i^» b*-ir-x Here the trailing x bits of S, and S2 are equal, but the leadinq (b+ir-x) bits are not equal, and the leadinq (b^m-x) bits of D| and D2 are equal. Since x— fc+ni-x Since the leading (b+m-x) bits of D, and D2 are equal and ni>x, we have dp=du and ^pa— ^uv* Since n^fPL ^^^ dp=du, p has to be equal to u. So Sp=Sy . This implies that tpa=tuv since S,^ S^ . S,s S^ and Xep(^=dyM>eu^ Ey Lemma 2.1 and (a), Sp ^ Sy or tp^ 5^ t^^^ (2.1) If XiH, let A=X/M=2*. By Lemira 2.1 and (b) , Sp« s^ and tp<^= t^^^ (2.2) From (2.1) and (2.2) we get Spl^Sy (2.3) and Sp= Su (2.4) By detinition, (c) implies X dpM+ey^ s X N dqM + ey^ d^M since epq»eyy to denote ele:nent (k,K,y) of an axbxc array. 
Here 0 > (*, x,y) symbolizes the mapping ot element (k,x,y) to eleirents (0 ,x,y) , ( 1 , x , y) , . .. . (a- 1 , x,y) . Now we can show six extensions of the broadcast theorems. '^or constant k and for all values of x,y,*, (i) Q^^^ t ((k,x,y) --->(*, x , y) } (ii) n^bc t ( ---> ( x , * , y) < h, a, c>} (iii)n3^J^ t r(k,x,y) >(x, y , *) < b, c, a >} (^v) Habc t f ('^.,x,y) Xy , x, *) } iff a>c (V) r^abc ^ f(k,x,y) —->{*, y,x) } (vi) Habc ^ f (k,x,v) ---> (y , *,x) } Proof: Let a' = loq(a), b' = log(h), c'^log{c), and r'=log{r) C^) £} 1'f(k) >(*)}# d 1-tc-many broadcasting function, and X^bc t{(^#Y) > (x , y) ) , an identity. Therefore, (i) is proved by applying Theorem 2.1. 39 (ii) First .*^ will prove: 12 gb t f C^/X) > (x, *) } . Assume it is falsp, it implies that there exist Xf,X;,r,D,q _ r such that x.^K.f kb*-x. = kh+x. and x.a + p E X;a + q. 10 I TT J ' ab J If c>b, we have x. ^ x, and x- = x: , a contca diction. If r (x , *) } . Using Theorem 2.1 agairi, noting thatQ^tf(Y) >(y)}r (ii) is proved . (iii) Proof ot (iii) is similar to that of fl^^f f (k,x)— >(x,*)} in (ii). (iv) For this case, we can represent the tags (s. ,d.) (s ; ,d •. ) as follows: a' b • s . , k , X . c' Y; J L c Y X . I a* P and *f^ I iii (1 'J L J I '-i I ~t — r' a« <-b'+c' -r ' Fach tag is divided into 3 parts of lengths a', b' and c' respectively. If a>c, assume Habc^ f C^/XfY) > (y , x, *) } . This irrplies that there exist s. ,d.,s.,d', and r such I I J J that S; 5E s; , and the least significart r' bits of s. ' abc J ' 40 (v) rhir, case can be represented pictoriallj as: a' b' C b» X . d : d»+b«+c'-r Tt b>c,we can pick (s. rd. ) F, (Sj,dj) such that p=q, yi=Yj' and the ir^ost significant (b'-c') bits of Xj S Xj equal, with the least significant c* bits unique. Then i s : . By letting c=c, we can see that s, = Sj. Also abc we have the most significant [ a' +c' + ( b' - c') ] bits of dj F, d; being equal, which is equal to a' + t'*c'-r'. Hence we lust show that ^2 abc ^ fC^ . x, y ) > (*r y#x) } . v. =y. and If c>b. We pick (s.,d.) & (s-^d:) such that p=q, X. ^y- . Then S; ^ S; . We also pick r = c, then s- E s- . The number of leading bits of dj Fj qj that are the same = a'+r' >a'+b' =a • +b ' + (c ' -c • ) . Hence we can see that QQ^^^f{a,^,y)<^.^,c> — > (*,y,x)}. (vi) We can represent this case as: b' X • I I ^' I L ^1 X . 1j a'+b'+c»-r If a • + L' +c' >?c' +a' (i.e. b'>c'), we pick y. = y • r p=q and the 41 and Sj are equal, while the most siqnificant (a • fh •♦c'-r' ) bits of d; and dj are equal. If r>c, then from s. and s . , we can see that Yj ~Vj • Also, the least significant (r'-c') c£ x. and X: are equal. From d. and d ; , we can see that the most siqnificant (a ' +1 • +c ' -r' -c' ) bits of x. and x: ace equal. These add up to a'+b'-c' I'its of x. and x,, and is qreater than or equal to b' if a>c. So x. =X;. This contradicts s. ^ s;. a be If rc, Og^^^t f (k , x, y) (y ,x, *) } . f a > (y ,x , *) } . If ab>c, we pick (s.,d.) and (s,,d;) such that y, =y; and p -q . Then x.=Xj except for the most siqrif leant (c'-i') _ r. bits. Then for r'=a'+b', wo have s. = s; and d; = d-, , I y. J ' abc J and since s. ^ s- , we show that Qa^c "^ f C^* x, 7) abc > (y,x, *) } . If ab - — > (y , x , *) } . en I IS •0 i 42 last (b*-c') bits of x 5 x equal. Then there will be conflict if r'=b«. This implies Q abc ^H^ ,x ,y) - — > (Y,*,x) } . If a' + b* +c' <2c'+a» (i.p. b' - — > (y, ♦rX )} . 
These broadcasting theorenrs are alsc essential in ostablishinq th^ rocurr^nce alcjorithtrs in Chapter 3 ani are some of the more important properties of the cmoqa network. 43 2 . 2 . t| Gen eral Admissibility It is veil known that the omega network can only pass a sirall fraction of the total number of M permutations (N** (N/2) /N!) • To improve the permutation ability and to understand better the relationship between permutations and the owega and inverse omega networks, we would like to extend some of the results of Pease[16]. Without relabelling, the indirect binary n-cube array described in [16] is actually an inverse omega network[4] after rearrangement of the switches. He can extend his results to the omega network in a very simple manner. Let n-i n-2 + the index p be represented as (P, 2 +P22 •»-....?„,, 2*p^) instead of (P|+2p2 + ...2 P^) » as in Formula 1 of [16]. Then we can use exactly the same theorems as in [16], except now we should note that the index bits are reversed. iZ Let X and y be expanded in binary notation as (X, ,X2 ,. .. ,x^) and (y, #y2 # • • • ^y^) with x, and y, being the most significant bits, x„ and y^ the least significant. The function describing a permutation P can be written as a set of functions Y; = P; (X,,X #X^) . (2.1U) + This notation is consistent with other sections of this thesis. 44 The principal theorem for the omega network will then be: Theoreir 2. U F is adirissible by an omega network if and only if the functions (2.1U) defining P can be written in the form Yf = xj®f; (y, ,...,yi., ,Xf^ ^r^) (2.15) for 1 = Xj^f ,' (y^, ...,y,- + , rX,_, ,...,x,) for 1®Pn-2^n®Pn-i'^n-,®S„_2 (x^_, ,X^) n-i where Sj is the carry from the lower ordered bits to the ith bit. It is immediately obvious that this is an upper triangular permutation defined in Definition 2.3. Hence by Theorem 2,1, we can see that all compositions of shifts and odd-ordered unscrambling permutations are omega passable. Defining x=(x, x^ ...x„) , and y= (y , J^ •••Yn^ " ^ permutation is linear if there exists an n x n nonsingular binary matrix, P, such that y = P • X (2.19) f ^: C5 It 1^ \i :s 48 Extendinq Pease* resalt to our notations, a linear permutation is oinega passable if P can be decomposed into the matrix product LU, vhere L is a lover unit triangular matrix and is an upper unit triangular matrix. (Dnit means that all coefficients on the main diagonal are ones) . By analogy, a linear permutation is inverse omega passable if P can be decomposed into OL. One important result from [ 16 ] is that any nonsinqular P can be decomposed into L,0L2« This implies that P can be decomposed as L,0 and Lq* So P can be passed by two omega passes. It can also be decomposed as L| and UI2. Hence it can be passed by two inverse omega passes. This is a very significant result because it increases the permutation ability of the omega network. The perfect shuffle permutation is a good example. Although the omega network is made up of stages of perfect shuffles, a perfect shuffle permutation cannot be passed by a fixed log N stages omega network. However, it is a linear permutation. So it can be passed by two omega passes. A diagram showing the permutation abilities of the omega and inverse omega networks is shown in Figure 2.8. 
49 n Passable U Passable Others CD i-i rt c 13 (D i-j rt •-< LU deco- mposable without pivoting matrix Langular 5 ->• C M 03 UL deco mposable without Divot ine assable) 2-fi-passable (or 2- -^^"■'•-p linear permutations •ii; f;i r- IS Figure 2.8 Permutation Abilities of Omega Network and Inverse Omega Network 50 2. 3 Batcher_Net wock _and „, the 5h uf f le_Cgnnection Cnc network that bears a great reseirblence of the omeqa network is the Batcher's merging network. The only structural difference is that instead of using tag bit comparison at each of the switching elemerts, the Batcher merger compares the magnitude of the two whole tag words at each switch t-o dotetrr.ine which of the two output ports to select . Pefore we continue the discussion, w€ first define the order set of a set of N elements as the relative ordering of the elements, "^or example the order set of (8,12,17,3,9) is {1,3,U,n,2). It can te observed that the Batcher's merging ilqorithm for a set of distinct elements is equivalent to the omega tag routing algorithm for the corresponding order set. It shoul-i, then, be obvious that the cmoga partition theorems also applies to the Batcher merger. Now we can deduce an alternate proof of Stone's implementation [R] of the bitonic sorter on the perfect shuffle network. The basic idea of an NxN Batcher bitonic sorter (i'l being a power of 2) is quite simple. Given two sorted sequences of l^inqth L each, if the first sequence is in 51 ascending order whilo the second sequence is in descendin>j order, then the ?Lx2L Eatcher merger will sort the bitonic sequence tor'^ed by the juxtaposition ol the two sequences. A bitonic sorter consists of loq N stages of meiging nettfor"'CS, where stage i ( '' £ i S lo92'^) consists of N/2' bitonic mergers of size 2' x 2'. (Some switching elenents will need to have their outputs reversed in order to produce the required descending order.) By the extexision of the omega partition theorem we can use an NxN Bitonic merger to 'simulate' N/2 bitonic mergers of size 2' x 2', by setting the switches in the first (log^ N-i) columns straight through. Hence, the sorting algorithm in [ 8 ] iirpienren ts the Batcher bitonic sorter on a perfect shuffle network. C) C3 52 2 . U Benes^Network andthe^ Shu f f le, Connection It can be proved that the binary Beres network of size NxN (whefe N is a powec of 2) is equivalent to a cascade of an inverse omeqa network and an omega network , with the middle two columns of switching elements collapsed into one. The best known control time for the Benes network [17] requires on the order of Nxlog^N operations. However, the control time for an omega network is 0(log2^N), so for all omega passable or inverse omega passable permutations, the control time is only on the order of log2N. 53 Sinc3 many connection networks ire sh uf f le-fcasod, if we build a one stage perfect shuffle network to interconnect all the proc'-»ssors, we can simulate any of these networks by cycling sufficient nuirbor of times through the network. lloroover, thn complexity of the network will only be 0(M), which will be a great deal totter than that of other networks. One o^'vious shortcoming ot the omega retwork is its inability to pass some permutations. In this section, we are searching for the best strategies to pass any permutation through a one stage rocyclic perfect shuffle network. By recycling a one stage perfect shuffle network a sufficient number of times, we hope to oe able to pass any general permutation. 
Lang [13] proposed to use queues at the outputs of the switching elements, and then cycle the network for as many times as it i.s needed for each ot the log ^ N steps. Following this strategy, the number of shuffle cycles required in the worst case is found *-o be =2j2N -3 for log N being odd, =3Jn -3 for loqjN being even. The length of the queues can grow to: for log N being odd. a * Ml ill 111 I for log N being even. 54 Lang's algorithm is good in general. However, the building of two 0(jN)-long queues into each switching element certainly complicates the design of the switching elements. Hence in another possible strategy, we prooose routing algorithms that do not reguire to use queues at the switches. The routing strategy is similar to that used in the Destination Tag Method. Every input port will generate a logjN bit tag representation of its destination and push it (together with the data) through the network. The switching functions of the switchirg elements are still the sa.ne. Flowever, in case of conflict at any switch/ one input will be honored while the other has to be switched to an undesired output port and restart from the most significant tag bit. At any given stage, the bit positions to be examined for each tag may le different. So we need a bit count associated with each tag to indicate which bit position will have to be examined. Aft-^r the last tag bit of each input has been examined, it will be stored away in a register and taken off the network. The conflict rf'solution can be: a) gatp straight. b) honor the input furthest away from its destination. c) honor tht^ input nearest to the destination. Sirulatiori usir.g random permutations shows that, on the average, all three resolutions are iu£t about equally 55 effective. Instead of restarting from the most significant tag bit for the data at the wrong output port, we can use a built-in table to determine which bit to examine. In the discussori below, we let n equal to loy^N, For the destination tag method, assume tnere is no conflict at any stage. We can observe that at stage k (k=2 to n) # the data word whose destination tag is d,d2....dn will be at switching eleirent i (with binary representation i|i2»»««in-|)* where d,d2...d|^ =in-i<»»»»in-i • Hence by comparing the destination tags with the switch tox numbers, wt; can find out how far a data word with certain destination tag is frcir its destination. A more useful information is which tag hit d,^(1 o 2: +-) J- OJ 4-> r— C »4- O ««- lO CO $- > O CO c cu +-> o O) <4- S- O) Q. O) CD fO +-> (>0 ^: o o -o O) s- ■M O in O cu CJ5 in O 10 e4 O CM ift in sa[0/C3 >|jom:^9n ^0 uaqiunfj 58 ii NETgOPK_LfTILI2ATIQN IN , F& RALLEL PROCESS ING , SYSTEMS 3» "^ Introduction To build a rreaningful processing system, we have to be able to handle efficiently most of the application demands of the users. In this section, we will investigate the alignment requirements of some common operations or algorithms and with what efficiency they can be handled by the alignment networks. Array operatiors are probably the most common type of operations found in ordinary Fortran programs and they have the most potential for high speedup and efficiency. So the most important criterion of a good parallel processing system is the efficient handling of array operations. Budnik and Kuck [19] and Lawrie [U] discussed ways of organizing the memories to allow conflict-free access to various slices of arrays. Linear skewing is a standard technigue. 
However, the data output will sometimes form a p-ordered vector, which cannot be unscrambled by means of a simple shifter, [4] discussed the alignment reguirements for some of the most common types of array accessings. In ordinary programs, operations that are not scalar nor array operations very likely belong to the class of "^ What we mean by array operations are the obvious type of vector operations found in programs, not the type we obtain by carefully rearranging the operation seguences of a particular algorithm. 59 recurrence operations. Recurrence operations, if not treated properlvr will degrade a parallel processing system to a serial machine. Kogge and Stone [20], Heller [21], and Chen and Kuck [22] have shown various algorithms to speed up recurrence operations. Section 3.2 will discuss the adaptation of various recurrence solving algorithms onto parallel processing systems. It will be shewn that, with careful planning, the alignnrent reguirement £ can be greatly simplified. flenre, we would not n.Be'i to use a full crossbar. Insteadf a simpler alignment network, such as an omega network, will suffice. The adaptation of recurrence operatiors onto parallel processina actually serves as a good example of how a wall known comoutation algorithir can he tailored according to the limited number of available perirutations of the alignent network to minimi7e alignment time. In the extreme cases, the alignirent network may have only limited number of connections (lik^ the Illiac IV shifter cr a one-stagci perfect shuffle network). To obtain any general permutation, the network has to be recycled many times. For example, the one-stage perfect shuffle network described in Section 2.5 may reguire (Jn) aiignnont steps before we can start on a processing step. By carefully rearranging some of the operation seguences in normal algorithms and by assigning intermediate storage patterns in a deliberate fashion, we can hopofully reduce the number of alignments per processing step n Hi CO ,1:4 Cm ^3 £3 n ii 60 down to a con-Gtant (not dependent on H) . Pease [10] and Stone [8] showed how the Fast Fourier Transform can be done efficiently on a multiprocessing system interconnected with the perfect ?5huffle connection. In Section 3.3 we are going to show how matrix multiplication can be done in a more efficient way in a multiprocessing system with a certain class of conn«^ction networks. The number of alignment steps is shown to be reduced ty a factor of Jn or IcgoN. The algorithms described in this section are i^i such (ietails that they can be easily micropro crammed into the respective parallel processing systems. The intermediate storage and alianment patterns are all clearly specified. Masks are needed occasionally to prohibit some processors from doing the prescribed operations at some steps. Throughout this section, we are considering parallel orocessinq systems structured like that in Figure 3.1. The central control unit is not shown in the figure, but is actually the master unit ot the array of processors. It sends the ir icroinstruc tions to all the processors together with the masks. Each processor will address only its own memory. If the data words obtained need to be sent to (different processors, they will he gated to the Alignment Send Register (ASR) . After the roc,uired alignment is done, thf>y will he returned to the Alignment Peceive Register (AR R) . fcith this archit.^cture, wo can align internal registers as well as memory r<-^gisters. 
61 Si R A S 1 1 1 ^ * — -► D R 1 1 ____ A R R l^ 1 1 1 1 1 ALIGNMENT NETWORK j) Memory Data Register \SR Alignment Send Register \RR Alignment Receive Register !>^- Memory Module i 1'^. Processor i Figure 3.1 A Parallel Processor System Configuration 62 3.? Adcipt.ation, of _Recai:i:ence_ Solvers Chen and Kuck [22] provided many good algorithms to handle recurrence systems. To actually implement these algorithms on a parallel processing system, ore would require some cateful partitioning of the recurrence system, and a qood, uniform way to allocate the initial and intermediate data so as to minimize the data routing time and the amount of intermediate storage space. The solution of a ? recurrence system is actually equivalent to the solution of a bandod unit lower triangular matriv system with matrix size n x n and the number of nonz^^ro bands =m*-1- In general, he have to solve (3.1) for X to gt t the recurrence results. A X (3. 1) where A is a lower triangular matrix with 1's on the miin diagonal ari m. more nonzero subdiagonals. [23] and [ 2U ] reorganized some of the recurrence algorithms into partitioned matrix notations to simplify understanding. According to the number of processors available and thi-^ values of m and n, we have to use different recurrence solving algorithms for higher efficiency. In general, there are three major algorithms to handle recurrence systems. Ihe first algorithm uses a limitei number of processors, and evolves from [23] and Algorithm 5 63 of [25]. The second algorithm assumes the presence of a large numjuer of processors, but will do the folding when the number of available orccessors is less than the upper bound. It evolves from [2^4] and Algorithm 2 of [22]. The third algorithm is similar tc the second algorithm €xcept it uses a less parallel method in solving the small full recurrance systems in the initialization stage. The number of processors used will be between that of the first and the second algorithms. Given D(the number of processors), and values of m and n, the execution time for the first algorithm = ^mn (m+2) /p + log^(p/m) => (m +^m/2+ 1 ) -ni'^-9m/2-2 , for pn, for p=mn, for pmn. 2 for p>m n. ■eMt 1 .1,1 C3 n = (loq n (2 + log m) -loggm (log2rr+ 1 ) /2) m n/p , for pq) , RA=h+1. (vi) broadcast into riqht R2. (vii) multiply R1 ancl E2 into R3, (viii) aad r3 to ACC. c) noqate ACT. d) store left ACC into right half fi, RA=h, c) store riqht ACC into right half h, SA=h+1. D. Done. Staqe U : 1. Form k m-Partitions. 2. Repeat for h=0 to n/p-2; a. Perform an rr x n iratrix transpose frorr c;|FA: (hm) to m(ti-H)-l) to H. b. Fetch from f, RA=h, into ACC . c. Repeat for l=0 to m-1; ii Ii 70 (i) fetch from H, RA=i# into R1. (ii) fetch V element from meirory i. (iii) ' broadcast to the right neighbor m-partition into P?. • (iv) irultiply R^ and R2 into R3. (v) subtract R3 from ACC. d. Store ACC into y, PA=h. 3. Done. An al ysis : Throughout the algorithm, partitions of size m or 2r (=2 m) are used. By Corollary 2.1, we can see that the omega network (or some of the full permutatior. networks) is capable of performing the necessary alignments because of its partition ability. As tor the different kinds of alignment patterns that are required within the partitions, we have right and left shifts, flips and 1-to-many broadcasting. All of these patterns can be passed by the omega network. One of the noteworthy patterns can he found in step C,5.e of Stage 3. 
The broadcasting function has the fcrm {(k,x) > (X , *) <2, r>} , That this function can be passed by the omega network is proved in part (ii) of Section 2.2.1.2 of this thesis. Tn step 2'C. (iii) of Stage U, the connection function can be passed by the oirega network, by virtue of Theorem 2.1, after setting F^ to a 1-shift permutation and 71 Pj^'s to 1-to-many broadcasting The m X pi matrix transpose in step A cf Stage 3 and step 2. a of Stage ^i can be implemented as a 'subroutine' that takes (m-1) steps. Assume element (i/j) of matrix K is stored in memory -j with relative address i {0. The summation of partial products are done by shifts (of 2 .r.n) and add, 0»,o)RC»*,i) R(*,nH) f (*) M?^ M^ Mr.. ^iqure 3.3 Initial Array Storage G consists of the h right hand columns and H ^ is the ith column of the left hand matrix, L, (1 and are indexed from A{0) to A (0-1). For calculation of G , we will first broadcast Hi to the left half Rl»s. Then we broadcast I elements to the (j+n will be left half R2»s. Then the partial results of G in R3 (0,0,*,x,y) <2,n/4r,r,2n,n>. The summation of partial results are done by shifts (of 2 *n ) and add, 0 n ^ ^ ' '^ ^ .2r '2i+l t=n+l-(i+l)2r ,(j+l) i=1.2,...,(n/2r-l) Figure 3.4 mP"*"^ ^ Calculation lis t2 78 The algorithm is shown belov: Algopjlthin: A. Repeat for j = to (log2n-2) ; 1. Let r=2**j, r •=max (r/«t,1) , r'»»inax (r/2, 1 ) , r=2r". 2. Declare S(PA=G) as Q<2,n/U, 2n,n>. Declare S (BA=M) as il1<2,n/2r • ,n/r,r • ,r, n>. Declare S(FA=P!) as W2<2,n/2r" ,n/2r, r",2r,n>. Then for i=0 to (n/2r-1) , (J) G <2n,n> is in Q(0,0,*,*) y^'^' <2n,r> is in Q (0,0,*,#r> (r-1) ) Mg"^.^ is in W1 ( 1,0, 2i,0,*,*) , M^^^ IS immaterial. (J) "21 + 1 is in W 1 (1 , 0,2i>1,0, *,*) , T^^^ is in Wl (1,0,2i,0 ,*,#r* (2i M^j"^'^ <2r,n> is in W2 ( 1,0,i,0,*,*) . 3. Declare P as P1<2,n/Ur,r ,2n,n>. Declare P also as P2<2,n/r,n/2r,r,f ,n>. Then G calculation uses P1 (0,0, *,*,*) , while n^'!*^^ calculation uses P2 (1 ,0,i, ♦,*,*) . «». Fetch »\^^ {*,*) from W1 ( 1,0, 1 ,0,*,*) . 5. Broadcast D (1 ,0, 1 ,0, x,y)* to R1 (0,0, x,»,y) of PI, Vx,v. '^ D is memory data register. Declaration always follows whatever is to be fetched or stored. 79 6. Fetch Wgi + i (*f*) from W1 (1,0, 2i+1, 0, ♦,*) . 7. Broadcast D (1,0, 2i*1 ,0,t ,y) to R1 (1 , 0,i ,x, ♦, y) of P2, Vx,y, 8. Fetch Y^^' (*,*) from Q (0 ,0 ,♦, #r* (r-1) ) . 9. Broadcast D(0,0,x,y) to B2 (0, 0,y,x,*) of P1,Vx, and vyef*r*(r-1)} . 10. Fetch T^^V (*»♦) from Wl (1,0,2i,0#*, #r«-( 2i+ 1) r-1) . 11. Broadcast D (1,0, 2i,0, x,y) to R2 (1,0,i,y ,x, *) of P2, Vx, and V y eflr+(r-1) } . 12. Multiply R1 and R2 into R3. 13. Repeat for g=0 to (1-1); a) Set RU=0. 2 - a _ b) Declare P as P3<2,n / (2** (q*2) r) , 2, 2 ,r,n>. c) Left shift R3 ( 1,*, 1,0, *,*) of P3 by 2 rn into R4. d) Declare P as PU<2,n/2** (q + 3) ,2,2'^,2n,n>. e) Left shift R3 (0,* ,1,0 ,*,*) of PU by 2^.2n^ into Ra. f) Add P3 and RU into R3. la. Fetch M 2; (♦,♦) from Hi ( 1,0,2i,0, *,♦) . 15. Right shift D (l , 0,2i ,0,x,y) by (ir n/2) to R2 (1,0,i,x,y) ,Vx,y. 16. Fetch G^"'' (*,*) from Q(0,0,*,*). 17. Transfer D(0,0,x,y) to R2 (0,0,0, x, y) of P1,Vx,y. '^ This step will be skipped when j=0* ^n^ Transfers need no alignment. i0» 1 <• I ■■3 -1* c: ;c3 C3 80 18. Add F2 and B3 into R2. 19. Transfer R2 (0,0, 0,x, y) of PI to D(0,0,x,y) of Q. 20. Store D(0,0,*,*) to G^"*"^'^ (♦#♦)• 21. Fetch H^^li (*,*) from Wl (1,0,2i>1 ,0, ♦,*) . 22. Right shift D (1, 0,2i+1,0 ,x, y) of 81 to E (1,0,i,0,x*r,y) of H2 by (ir^n/2-r^n/U* m) V K,y. 23. 
Transfer R2 (1 ,0,i,0, x,y) of P2 to (1,0 ,i,0,x,y) of W2, Vx,y. 2a. store D(1,0,i,0,x,y) into M fj+i) (x,y) Vx,y. B. For i = loqgn-l; 1. Let r=n/2. 2. Declare P as P5. 3. Declare S(PA=G) as Q1. Declare S(RA=«) as H3<2, 8,n/8,n/2,n>. Then G ^ <2n»n/2> is in Q1(0,*,*), Y^^^ <2n,n/2> is in Q1 (0, *, # (n/2) ♦(n/2-1) ) , (J) (J) is in W3 (1 , 1 ,0,*, *) . U. Fetch «7 (*,*) from H3 (1, 1 » 0,*,*) . 5. Broadcast D(1,1,0,x,y) to R1 (x,*,y) of P5, Vx,y. 6. Fetch Y^"^^ (♦,*) from Q1 (0, *,# (n/2) ♦ (n/2-1) ) . 7. Broadcast D(0,x,y) to B2(y,x,*) of P5, Vx# and Vyef* (n/2) ♦(n/2-1)). 8. Multiply R1 and R2 into R3. 9. Repeat for q=0 to j-1; a) Set r4=0. b) Declare P as P6. 81 c) Left shift R3 (♦, 1 ,0, *,*) of P6 by 2 .2n.n into Ri», d) Add R3 and RU into R3. 10. Fetch G^"^^ (*,*) from Ql(0,*,*). 11. Transfer D(0,x,y) to E2(0,x,y) of P5, V x,y. 12. Add R2 and R3 into R2. 13. Transfer R2(0,x,y) of P5 to D(0,x,y) of Q1,Vr,y. lU. Store D(0,*,*) to G^*'*'^ (*»♦). C. Done. A nalysis ; Steps A. 5 and A. 7 use a broadcasting function that is omega passable. We first apply part (ii) of the broadcasting theorems which shows that { (K:, x,y) <2n,rrn>---> (x,* »y) } is omega passable. Then we can apply the omega partition theoreni to allow for the shift in partitions. The broadcasting function in Step A. 9 and A. 11 are of the form f (!Cr*^#y) > (y, x,*) } . They are also omega passable because of Part (iv) of the broadcasting theorem (notice that a>c since a=c=n) ) , and the omega Partition Theorem, Step A. 13 is the repetitive shifts and adds described earlier in this section. The broadcasting function in Step B.5 is of the form f(k,x,y )<2n,n/2,n> > (x,*,y) } and that in Step B.7 is of the form f C^^x^y) > (y,x,*) ) . Both ace passable by omega network. es -A is IS is 82 The operation times for this algorithir are: Fetch : lloq^nrk Store' : 21og2n-1 Align ! (loq^n) ♦lloggn+l Processor: log n (log2n*3)/2 83 3.2.3 Usin£_^an_^_P£ocesso£S This -Algorithm is derived from [ 2^^ ]. We will solve 2 ? with p=n^ ru However, if the numter of availaole processors is less than this, we will have to use foldinq. The thp'oretical processor bound found in [22] is m (m* 1) n/?-m3 . However, if m ^n6 n are powers of two, for d 2 to be li power of two also, we have to use p-m (2n) n/2 = m n. The matrix L arid the vector f can he viritten in the for.Ti, Lo Rg L2 R m ' my i = ft f2 fn s-i .1 >• ira C 5 -It 3 r: .>>> a (J) (j) At step -j + 1 (0< j, while . , (1,i,0,*,0) <2,n/r,m/2,c,m>. f^-^ (0 in two steps: and fj at (1,i,*,0) <2,n/2,2,2> respectively. We then require two fetches and aliqns to route them to (0) (0) at (0,1,':^,*,*) <2,n/2,1,2,2> anl f. at (1,i,0,*,0) <2,n/2, 1 ,2, 2> respectively. 3. If m>2, W2 will use the method described in Section ~ 3 The Gj calculation will be done in (0, i ,*) <2,n/m,m /2> 87 and 1 calculation be done in (1 # i» ♦) <2 ,n/ni ,nn /2> for 0, Resulting Gj and t; will be in where 3j and f j were. For stage 2, we want G ^?^ to he in (0,i,0,*,*) (0) <2, n/r., m/2,Trir nn> in row rrajor fashion and t j to be in {1,i,C,*,0) <2,n/rr.,ir./2,m,m>. Hence we want to route (0,i,C,x,y) to (C,i,0,y,x), and also (0,1,1,0,7.) to (1,i,G,z,0) Vx,y,z. Both routes are linear permutations and can le realized by the omega networVc in two passes, due to the results described in Section 2. 2. U, Stage 2 : A, Repeat for i - '^ to log (n/2ir.) ; 1. Let r=2**i.m. 2. Declare S(p.'\=G or f) as M<2 , n/2r ,m, r,m >. Then for C is in M (C , i, ,*, *) , G ^ faeirg all O's. r,m> is in M ( G ,i , m/2 , * ,*) , 2i+l f ^l? 
is in iv n,i,C,*,0) , (j) ^ ii+i '^'^^ i^ i" M(1,i,in/2,*,C) , rest of !•: C ,*,*,*, *) = 0. 3. Declare P as '^1<2,n/2r,n,r,m>. Its ili is 88 CJ) U. Fetch G 2j+, [*,*) from M (0,i ,iri/2, *, *) . 5. Broadcast C (0,i ,iii/2 , x, y) to B1 (*, i, y,x ,*) of P1, Vx,y. 6. Fetch G^^^ (»m+(r-m),*) and f j**;^ (#in4(r-iT)) from M (*,i,0,#in+ (r-m) ,*) . 7. nroddcast D(z,i,C,x,y) to E2 (z,i ,x, *, y) of Pi, VyrZ# and Vx€f#n+ (r-m) ) . fi. Multiuly PI and F2 into E3. 9. Repeat for q= to (loggm-l) ; a) Declare P as F2. b) Left shift R3 (*, 1 ,0, ♦, ♦) of P2 by 2 rm into RU. c) Add ^3 and R4 into P3. 10. Set P'4=0. (0) 11. Fetch ^ 2i + | <*^ ^^°'" M(1,i,rr,/2,*,0) . 12. Left shift D (1 , i ,ir/2 ,* , C) into FU by rmV2. 13. Subtract R3 froir Ru into F4. 14. I^ight shift RU ty rir into D. 15. 3toce D(*,i,1,*,*) into M (* , i, 1 , * , *) . B. Done. ^na l^s is : The broadcasting function in Step A. 5 of Stage 2 is of the form f (l' , x, y) > (y # x , *) ) and is omega passable. The broadcasting function in Step A, 7 of Stage 2 I 89 is of the form [ (k , x#y) > (x, *,y) } and is also omega passable. The operation times for Stage 1 cdn be easily obtainad by substitutirg n by m in the operation times listed in section 3.2.2. However, we have to add tour alignment passes for the linear permutation functions described in the last part of Stage 1 if an omega network is used. If a crossbar (or any other full permutation network) is used, we only need to add two passes. The operation times for stage 1 are then: Fetch : ^log^m-U Store : 2iog-m-1 Align : (log-m) +mog2m+1 Ptocessor: log m (log-m+3) /2 For ^tage 2, the operation times are: Fetch : 11ogj^{n/m) Store log^Cn/m) ml a* Align : log (n/m) +log2(n/m) log^m Processor: 2iog (n/m) + log (n/ir) log.m The total times for this algorithm arc then: Fetch : 31og n+Ulog^rr-U Store Align loq^n+loggm-l loggia* log^m + loa -n + 31og m* 1 Ptocessor: log. n.logm- (lcg,p ) /2 + 21og„n-log_m/2 90 ^.2.'* 0sin(i_a_J1O(]erate_NuOTber__o£_Pcocessors "his algorithm is deLived from [2a], For the matrix multiplication, however, this implementation does not use the loqsum method. Instead, it uses the uore efficient parallel-product serial-surr metLod. The preliminary discussion of this algorithm will be similar to that of Section 3.2.3. For each staqe (j+1), {0< j. 3) Pepeat for h=0,1; 1. Transfer L (i, 1 , x) /nA=h , to Ll(i,j,x), V i,i. and V X such that m-j ^ rm^/2 H -rm ^ |#— rm — J( A3) ^2i •2rm -^ ^2i+l ^2i V. ? (J+1) -/ ^ ,..*a ■ ?! (l,i, *.*,*) <2,n/2r,m.r,m> Figure 3.5 Storage Map at Step j of Stage 2 94 G. Multiply F1 and R2 into R3. f . Fetch f (i,i,0) , V i, j. g. Transfer D(i,j,0) to R2. h. Add R2 and r3 into R3. i. Tran3f9r n3 to D. j. Store D(i,j,0) into f(i,j,0), V i,j, C) Done. Staqe 2: A) Repeat for j=0 to log(n/rT)-1; 1. Set r = 2''m. 2. Declare all arrays ris . 3. Fetch f (i,k,0) , RA^I , i i,k. U. Transfer ^{i,k,0) to hCC. "i. Repeat for q = to m-l; a) fetch :{(i,k,q),FA, Vi,k. b) left shift D by q to R1. c) fetch f (i,r-iT+q,0) , :^A=0, V i. d) broadcast D (i,r-Tn*q , C) to R2(i,*,0). a) multiply PI and R2 into R3. f) subtract R3 from ACC, 6. Fetch f (i, 1,0) , RA=0, V i,j. 7. Transfer D(i,-i,C) to R1. B. Transfer P1(i,j,0) to D, V j and V i even, 9. Swap ACCd*-),!)) in 2rm-part i tions to D, V j and 95 10, 11, 12. 13. ia. 16. 17. IB. n. 20. 21. 22. 23. 24. V i eVen. StouG D to f, RA=0. Swap ni(i,j,0) in 2rir-pirt itionii to D, V j and V i o'ld. Transfer ACC(i,j,0) to D, V i odd, and V j. Store D to f, FA=1. If i=loT (n/rr) -1 ^ then qoto B. 
2 Set ACC=0, Repeat for q= to it- 1 ; a) fetch F(i,k,q), P.A = 1, ¥ i,!c. b) broadcast 0(i,k,q) to R1(i,k,*). c) fetch ? (i,r-ir + q,k) , PA^O, V i,k. d) broadcast >) (i, r-m-^^/K) to I^2(i,*,k). e) inultiply "1 and F2 into R3. f) subtract R3 from ACC. Fetch P (i , i , k) , r. A=0, V i,j,k. Transfer n(i, -],>;) to HI, . Transfer R1(i,1,k) to D, V i,k and V i even. Swap ACC(i,-j,K;) in 2riit- partitions to D, V j,k and V i even. Store b to H, RA=0. Swap Pl(i,j,k) in 2rir-p irtit ions to D, V j,k and y i odd . Transfer ACC(i,j,k) to , V i odd, and V i,k. Store D to R, ?A=1. ■^! -.1 22 C3 n) Done. 96 Jlnalxs is : All of the =iliqnment functions used in this algorithm can be easily shown to be omega passable, by the simple application of the omeya Partition Theorems. Steps 9, 11, 20, and 72 of Stage 2 show the swap operation described earlier in this section. The total times for this algorithm are: Fetch: log2(n/ir) (Um + 3) Store: Uloq (n/m) ♦Um-2 Align: log (n/m) (Um + U) +6rr. Processor: 4m, loq (n/m) ■♦•6m 97 3 . 3 Matrix, Hill tipXication_oii_a^Par§ll9l_PCQcessinq_?Y^t em A Fortran code section that performs matrix multiplication is as follows: 10 DO 10 T=1,N DO 10 J=%N DO 10 K=1,N 5(I,J) =A(I,J) +B(I,K)*C (K,J) An efficient way to perform the calculation would be to compile the product by rows (parallel on J) as shown below, DO 10 T=1,N DO 10 K = '',N 10 A(I,*) =A (I,*)+B(I,K)*C (K,*) 2 This algorithm will require (H ) shifts to align the operand matrices. A one-stage perfect shuffle network simulating an omega network will take log^ N steps per shift, and the Illiac IV type of switch will take (Jn) steps per 2 2 r~" shift on the average." So a total of 0(N log.N) or O(NJN) routing steps are required for matrix multiplication. However, using the algorithm which follows, we need only n(N^) steps. 101 ■Ail - jt3 'ill \\ We first need to define the following two notions: NQtation: If G is a permutation of some input set, G 98 implies i consecutive applications of the permutation G to the input set. 2sfi!liti2Q* * G-permutation is defined as a permutation G 2 3 M such that G, G , G ,...,G are distinct and form a group with G =1, the identity permutation. Fvery G permutation can be uniquely represented as a cycle (io,i, ,. . .Xm., ) where G (io) =i| ,G (i, ) =i2#. . #G (i m., ) ^i^, . Two obvious G-permutations are the ♦1 shift permutation and the -1 shift permutation. In general, ^k shift and -k shift permutations will be G-permutations if k is relatively prime to N. some nonshifting G-permutations can be found using a perfect shuffle based permutation. The G-permntations have a general form of: G(i) = [2i*b(i) ]mod N, where h (i) = b(i+N/2) V i=0.. .N/2-1, and b (i) = or 1 V i. A list of all {b (i) ,i=C.. ..N/2-1} that will give G-permutations for N=U and 8 and the corresponding G-permutations are listed in Table 3.1. 99 Size b(i) G-permutation u 1 1 (0 13 2) 8 110 1 (0137652a) 10 11 (0 1 2 5 3 7 6 U) Table 3. 1 Assume we want to multiply two matrices A and B to form C and that they are all of size NxN. The first method uses N processors and requires that the storage scheme for the matrices be 1-skew and 1-skip. The storage pattern is shown in Figure 3.7, Each processor will have a corresponding memory from which it can fetch data. Any data a processor wants but not in its own memory will have to be routed from the other processors. This algorithm also calculates the relative address (RA) for each array it references . Memoi •y c N 1 2 3 (0, 0) (0, 1) (0, r2) (0 r3) (1. -3) (1. 
^) (1l 1) (1, ,2) (2. 2) (2, 3) (2, 0) (2< r1) (3, 1) (3, ?) (3, -3) (3 rO) Figure 3."^ 1-skew 1-skip Storage Scheme Fach processor has a wired-in processor port number. .-t 5C) J0I .1 <• ,J3I 15 100 PPN (0 < PPN < N-1 ). T is a temporary array. Rlaorithm: ft) Repeat for TC = to N-1; 1. fetch A, Rft=ir, into R1. 2. set IR = (PPN-IC) mod N. 3. repeat for IT = to N-1; a. fetch B, RA = IR, into P2. b. multiply R1 and R2 into P3. c. K-permute IP. d. store T, PA= (PPN-IR,mod N) from R3. e. G-permute R1. U. set P1=0. 5. repeat for IT = to N-1; a. fetch T, RA=IR, into R2. b. add R1 and R2 into R2. c. G-permute R1. d. G-permute IR. 6. store C(RA=IC) from PI. B) Done. The significance of this result is that for certain one stage networks, if there exists a G-permutation, then each intermediate routing will take only 0(1) time instead of 101 OfloqgN) time or (Jn) time. This greatly reduces the alignment time for the system. There are many variations of this algorithm, each for a different memory skewing scheme. Two of them can be ased for a parallel processing system with twice the number of memory modules. They will be presented in Appendix A. 2 There is another algorithm which uses N processors. 2 2 However, it works only for a N xN network with tN shift and ♦1 shift connections. This algorithm takes a total of N+2 memory fetches and N+1 memory stores . The total number of alignment reguests is 3N and the total number of arithmetic operations is 2N. Hence the alignment time matches in order of magnitude with the memory and arithmetic operations. The initial storage scheme is simple. All matrices are stored in a linear manner, i.e., element (i,j) will be stored in memory (Ni+j) • R.lc[orithm: ■iCD 1:J i| •a :s A) Fetch B into R1, P) Repeat for | =0 to N-1; 1. store R1 into T, PA=j' 2. left shift PI by N, C) Fetch A into P1. D) Set ACC=0. SM: 102 E) Repeat for 1 =0 to N-1; 1. Fetch T, RA=(PPN-i-j)modN# into R2. 2. multiply R1 and R2 into R3. 3. add R3 to ACC. U. riqht shift Rl by N into R2. 5. transfer R2(i,0) to Rl (i,0) . 6. left shift F1 by 1. F) Store ACC into C. The above two algorithms show how computations can be tailored to fit a simple network so as to minimize the routing times. 103 a. PROCESSOR,. SYSTEM SIHDLMIO N.,TECHNIQOES a, 1 Introduction In order to evaluate the true effectiveness of a parallel architecture, we irust hypothesi2e a compiler capable of compiling ordinary programs into code which most effectively utilizes the architecture, especially the data alignment capabilities. The resulting code could then be simulated and the important performance measures determined. This is the objective of our Analyzer/Simulator project. It involves the simulation of program execution on some proposed parallel processing systems. The front end of this project is a program analyzer which accepts Fortran source programs, and by detailed analysis of the control and data dependencies it produces a highly parallelized version of the original program (see [26]). Next, this parallelized version is input to another program, the Pesource Request Generator (PRG) , which atteirpts to compile the parallelized progranr into simulatable code. The code is a set of machine resource requests with data dependencies embedded in it. A machine resource can be a scalar or array processor, an alignment network, or the whole bank of array memories. 
The task of the PPG is to decide on the best way to slice the comoutation specified by each instruction node, based on the size of the matrices, the number of available processors, the matrix storage scheme, and the type of alignment 5 I mil III \\ :? IS 104 network. Finally, the output of the RPG is input to a simulator capable of siirulating a wide variety of architectures. Here the time required, utilization of various resources, and speedup and efficiency of the program's execution in the qiven parallel processing system will be calculated. Machine organization parameters can be specified by the user. These parameters include the storage schenie, the alignment network, the processor and memory speeds, the number of array memories, and the number of processors in the array processor system. A block diagram showing the general organization of the software is shown in Figure u.l. The Program Analyzer is described elsewhere [26] and we will not discuss it here. In this section we will describe the F EG and machine simulation. Some experimental results will also be presented. In Section U.2 we will discuss the input data structure ani available machine parameters for the REG. In Section U,3 we will describe the output of the simulator in the form of performance measures. Then in Sections 4.4 through U.6, some of the algorithms and strategies of the RRG will be described. Finally, in Section 4.7, we will discuss some of the preliminary results of the initial set of exoeriments. 105 0) o to c s- ^ • r- T3 (0 C ZJ 3 O 03 CO > LU 1 OJ to O 4-> S- to =J +J S- to (TJ Z3 OJ S- o 3 OJ (/) 3- C OJ (U (U a: c i; C3 to O rO •!- •f- M- .£: •!- CJ o I- (V > c C o •r— +-) rO N •r- c m o o +J OJ N >^ rd c CD s- o O i. •U. Oi. 106 U . 2 sijTul atoc Inpu t Specifications 4.2.1 Input Instruc tion Nod es The most easily recognizable form of parallelism is typified by a matrix addition shown belo". BO 10 1=1, N DC 10 J=1,f1 10 A(I»J) =B (I,J) -^C (I,J) The Program Analyzer will determine what the dependency limitations are for each program segment (in this case, there are none), and then break them into machine-code-like instruction nodes. Each instruction node will provide all the information concerning the operator, the two operands and the result. After the Fortran Analyzer phase, all parallel DO loop indices are distributed into each instruction node. The DO loop limits are noriralized to start with and have increments of 1 only. We first assume that there are n active DO Loop indices in a particular instruction. The -j-th DC loop index, I:, inay have an upper limit, Uj , as a function of I, ,1^ . . . . , Ij.( . Assuming the function is a linear function, the U: 's can be represented by a nx{n+1) matrix, D, such that 107 '". 1 K "2 • = r 1 • ."". ' ' n+1 I2 In 1 Note that except for the last coluirn, the D matrix is strictly lower triangular. In a Similar fashion, each k-dimensional array (with linear^ subscripts) being referenced in a node with n active no loop indices, will have a corresponding k by (n*1) coefficient matrix C, Let E; be the subscript expression of the jth dimension, then -1 1 ^ ^2 • = } • .^n. 1 . n+1 II l2 • • ^ In ^ 1 ,».'J I Definition-- Let there be n memory units. A p-ordered N-vector (mod M) is defined as a vector of N elements whose i-th logical element is stored in memory unit pi+c (mod M) where c is an arbitrary constant. 
108 The idea of a p-ordered N-vector is V€ry useful in finding the number of cycles required to access a vector or to aliqn it usinq certain alignment networks. Usinq a qeneraliz^d skewing scheme as in [Lawrie 1 ], for an array with k dinrensions, we will have (m, ,in2f . . . . #m,^) skewing. Assuming an array operand in an instruction node has k dimensions and n active indices, then we define an order vector, V, of n+ 1 elements as: < n + ^ > « k > ^ I V J= [m, m2..,mj n+1 For an array element defined by any particular set of values fl, =h, , I2=h2 , . . . 1^=^^ } , the element will be stored in memory port z, where 109 z = C 'h,' h2 • • hn 1 In Addition, the importance of the order vector lies on the fact that for any partition of the array formed by running the jth active index parallel, the partition is a Vj -ordered vector (rrod M), where H is the number of ireniories. When the order and nuirber of elements of a partition are calculated, the number of cycles required to access and aliqn the vector can be easily determined. For Fortran statements that cannot be easily dispatched as array or scalar operations, they will be grouped as recurrences nodes. Each node represents a R system {c.f.[22]). Each R system will be broken into as many smaller recurrence units as possible. Information such as the number of smaller recurrence units and the values of n and m for the units can be found in a recurrence node. With this information, we can determine which is the best recurrence solving algorithm to use and its corresponding execution time. ..•:^i: C9 .1 <• C3 110 U.2.2 Machine Parameters ■ In the parallel architecture that He simulate we assume that the resources can operate in an overlapping (or pipelining) fashion. However, we still honor the dependency between different instruction nodes. Each resource will have its own resource queue to hold the waiting requests. Hence one node may ne using the alignment network while an independent node can start fetching its operands from the memory system. It is impossible to siirulate every known parallel architecture. So we concentrate on two classes of architectures. The first class is shown in Figure ^,2 and the second in Figure U.3. Note that the one in Figure 4.2 resembles that assumed in Chapter 3. The second type has two alignment networks, one for input to the processing system and the other for output to the memory system. This class can be chosen by setting the parameter option M_PAPRM. TW0_2^L_NfT to 1. The scalar men^ory and scalar processor in Figure 4.2 and U. 3 are optional and can be chosen by setting M__PARAM.SM and/or M_PARAM.SP to 1 's. The number of processing elements in the processing array and the number of memories can be selected using the parameters ?!_PARAM- NU«_PPOC and M_PARAM. Ndm_M EM. Ill " — f ALIGNMENT NETWORK r.\ — » % ^'°/ rs\ "i vv • • • - • 9 9 • ""1 (^,\ ^N-l -^ ^-if ^ Data Path Control Path ■3 :;!r !2 Figure 4.2 Machine Configuration A 112 Data Path Control Path Figure 4.3 Machine Configuration B 113 The skewing system chosen can be specified by assigning values to the array M_PAEAM. WEM_Ma E. For example to get (1,1) skewing, we would put the numters 0,0,0,1,1 into the hEm_MAP array. As for the alignment network, right now we can choose any one of the four possible networks by setting M_PARAM. AN_TYPe to the appropriate value. 
   AN_TYPE = 1   crossbar
           = 2   omega network
           = 3   +-1, +-sqrt(N) shift network (Illiac type)
           = 4   +-1 shift network

The memory cycle time can be specified by using M_PARAM.MCYCLE_TIME, and the scalar memory time can be specified using M_PARAM.S_MEM_TIME.

To allow for pipelining, we have two separate time fields for each resource request. The first is RT, which contains the time that must elapse before another request for the same resource can be started. The other is IT, which contains the time required to finish processing the request. If a particular resource is pipelined, then RT will be the pipeline segment time and IT will be the length of the whole pipe. In this case, IT is greater than or equal to RT. For a memory request, however, RT will be the cycle time and IT will be the access time. In this case, IT is less than or equal to RT.

The alignment times are specified by M_PARAM.A_CT_IN and M_PARAM.A_CT_OUT. The processing times can be assigned by the user using the array M_PARAM.OP_TIME. The elements are the times required for simple assignment, addition, subtraction, multiplication, and division, respectively. We also allow the users to define their own built-in function and user-defined function times in M_PARAM.BUILTIN_TIME and M_PARAM.USERFCN_TIME respectively.

A sweeping index is defined to be the active index that is to be run in parallel in order to produce the desired partition. One option that the user has is to declare what he wants as the sweeping index. The other is to let the simulator choose the best index in terms of execution times. To choose the first option, the user sets the switch M_PARAM.SWPOPT and sets the array M_PARAM.SWEEP_INDX to the running indices required.

For standard algorithmic procedures, such as recurrence handling, the operation times for various resources have been calculated in Chapter 3. Thus we just need to substitute into the formulas for the operation times rather than perform a detailed simulation of the algorithm. However, we will then be missing certain overlap parallelism. This overlap can be assigned by the user using M_PARAM.OVERLAP. In appropriate cases, IT will be set to OVERLAP*RT/100, and will be less than RT.

In this section, we have discussed the various options that are available to the users. By setting all the appropriate options, the user has defined a machine configuration that he wants to study. In the next section, we will discuss the outputs available from this simulator.

4.3 Simulator Outputs

The output of the simulator is a set of performance measures. One such measure is Tp, the time required for simulated execution of the program graph from the Program Analyzer in the specified machine organization using p processors. If T1 is the execution time for the same program graph on one processor, then we define another measure, the speed factor, Fp, as T1/Tp.

In addition, the simulator calculates measures of the utilizations of various system resources. The utilization of each resource is broken down into several separate utilizations: U_A, U_S and U_IF. First, U_A, the array duty cycle, is the percentage of time that at least one processor is performing a computation. However, whenever an array operation is being performed, only some of the processors may be actually doing useful work. This is measured by the slicing utilization, U_S. For example, to add two 30-element vectors together using 20 processors would require two steps. The first step would form the first 20 sums and would use all 20 processors, resulting in a slicing utilization, U_S, of 100%. The second step would form the last 10 sums using only 10 processors and would result in U_S=50%. The overall U_S would then be 75%. Finally, some processors are turned off because of IF statements in the original programs, and this is measured by U_IF. For example, assume that in the following program, 1/3 of the B(I) are less than zero:

      DO 10 I=1,30
   10 IF (B(I).GE.0) A(I)=A(I)+B(I)

Then U_IF=67%. Thus, using 20 processors on this program, U_A might be 80%, for example, because the processors are waiting for memory access or data alignment. Of this 80% of the time, only 75% of the processors could be used because of the difference between the number of processors and the array size (U_S=75%), and of these 75% of the processors, only 67% are turned on (U_IF=67%). Thus, the total average processor duty cycle, U_T, is equal to U_A*U_S*U_IF = 80%*75%*67% = 40%. By separating the components of processor utilization in this way we can determine the source of processor inefficiencies.
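The following toy Fortran fragment (our own illustration; the numbers are taken from the examples above) shows how the three factors combine into the total duty cycle:

      REAL UA, US, UIF, UT
C     ARRAY DUTY CYCLE FROM THE EXAMPLE IN THE TEXT.
      UA = 0.80
C     SLICING UTILIZATION: 30 SUMS IN 2 SLICES OF 20 PROCESSORS.
      US = 30.0/(2.0*20.0)
C     FRACTION OF PROCESSORS TURNED ON UNDER THE IF STATEMENT.
      UIF = 0.67
C     TOTAL AVERAGE PROCESSOR DUTY CYCLE, ABOUT 40 PERCENT.
      UT = UA*US*UIF
      WRITE (6,*) UA, US, UIF, UT
      END

Each factor isolates one source of inefficiency, so their product is the fraction of total processor-time spent doing useful work.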
4.4 Sweeping Indices

As described in Section 4.2.2, when an instruction node can be swept by more than one index, there are two options for the user to define the sweeping index. One is to specify it in the parameter M_PARAM.SWEEP_INDX. The other is to let the RRG choose the best index for each individual node.

When there are other indices having upper limits that depend on the sweeping index, we need to modify the C and D matrices before the sweeping is allowed. In other words, an instruction node can be swept on an index I_i if and only if, in the D matrix, the i-th column has all zeroes. For example, if the instruction node looks like:

      DO I = 0,N-1
      DO J = 0,I+k
      A(I,J) = 0

then

   C = [1 0 0]        D = [0 0 N-1]
       [0 1 0]            [1 0  k ]

To sweep on index I, we need to expand the node into two nodes:

      DO J = 0,k-1
      DO I = 0,N-1
      A(I,J) = 0
and
      DO J = 0,N-1
      DO I = 0,N-J-1
      A(I+J,J+k) = 0

Now there are two sets of C and D matrices. They are:

   C_1 = [1 0 0]      D_1 = [0 0 N-1]
         [0 1 0]            [0 0 k-1]

   C_2 = [1 1 0]      D_2 = [0 -1 N-1]
         [0 1 k]            [0  0 N-1]

Note that after this transformation, the first columns of D_1 and D_2 are all zeroes. Hence the two transformed nodes can be swept on index I.

In general, given a node

      DO I=0,N-1
      DO J=0,hI+k
      A(I,J)=0

for which we want to sweep on index I, the original loop will have to be transformed into:

      DO J=0,hN-h+k
      DO I=0,f(J)
      A(I,J)=0

The first problem here is what the equation for f(J) should be in general. If h is not equal to 1, f(J) can contain many modulo functions, which are nonlinear and cannot be represented easily in our linear D matrices. Another problem is that if h is large, many of the vectors A(*,J) will be small vectors, which can seriously degrade the efficiency of a parallel system. So the solution we picked is to do this kind of transformation only if h=1.
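A small sketch of the sweepability test (our own illustration with hypothetical names, not the RRG's actual code): a node can be swept on index I_i exactly when column i of its D matrix is all zeros.

      LOGICAL FUNCTION CANSWP(D, N, I)
C     .TRUE. IF THE NODE CAN BE SWEPT ON INDEX I, I.E. IF NO
C     UPPER LOOP LIMIT DEPENDS ON I: COLUMN I OF THE N BY
C     (N+1) LIMIT MATRIX D MUST BE ALL ZEROS.
      INTEGER N, I, J
      INTEGER D(N, N+1)
      CANSWP = .TRUE.
      DO 10 J = 1, N
         IF (D(J,I) .NE. 0) CANSWP = .FALSE.
   10 CONTINUE
      RETURN
      END

For the example above, CANSWP applied to the original D gives .FALSE. (column 1 contains the 1 coming from U_J = I+k), while after the expansion it gives .TRUE. for both D_1 and D_2.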
Consider a more general case than the one shown above:

      DO I=0,N-1
      DO J=0,I+k
      A(pI+qJ+r,xI+yJ+z)=0

i.e.

   C = [p q r]        D = [0 0 N-1]
       [x y z]            [1 0  k ]

We can split the node into two parts:

      DO I=0,N-1
      DO J=0,k-1
      A(pI+qJ+r,xI+yJ+z)=0
and
      DO I=0,N-1
      DO J=0,I
      A(pI+qJ+r+qk,xI+yJ+z+yk)=0

The first part thus has:

   C_1 = [p q r]      D_1 = [0 0 N-1]
         [x y z]            [0 0 k-1]

The second part is equivalent to:

      DO J=0,N-1
      DO I=J,N-1
      A(pI+qJ+r+qk,xI+yJ+z+yk)=0

and is also equivalent to:

      DO J=0,N-1
      DO I=0,N-J-1
      A(pI+(p+q)J+r+qk,xI+(x+y)J+z+yk)=0

or

   C_2 = [p p+q r+qk]      D_2 = [0 -1 N-1]
         [x x+y z+yk]            [0  0 N-1]

Hence C_2 = C_1 X, where

   X = [1 1 0]
       [0 1 k]
       [0 0 1]

Therefore, V_1 = [m_1 m_2 ... m_k] C_1 and V_2 = [m_1 m_2 ... m_k] C_2 will form a pair of order vectors that can determine if the matrix can be swept by the index I. In general, if we reduce U_h from I_j-dependent to I_j-independent, X will be an (n+1) x (n+1) matrix with ones on the diagonal, another 1 at position (j,h), and a k at position (h,n+1).

4.5 Array Slicing

When an instruction node represents a larger operation than the processor system can handle, the array operands in the node have to be sliced. Let us define the required number of slices as S. The slicing utilization, U_S, discussed in Section 4.3, is defined as the percentage of the amount of a resource that is being utilized. These are the two most important quantities to be discussed in this section.

When the upper index bounds are all independent (i.e. the first n columns of D are all 0's), it is easy to find S and U_S:

   S = ⌈N/p⌉ * ∏ (D(i,n+1)+1),  the product taken over i=1,...,n, i≠s

   U_S = N/(⌈N/p⌉*p) * 100%

where I_s is the sweeping index, N = D(s,n+1)+1, and p is the number of processors. After transforming the loop as discussed in Section 4.4, no upper index bound will be dependent on I_s. If, however, the upper bound of I_s depends on an index h, we have to calculate S and U_S differently. If we have this kind of instruction node:

      DO h=0,N-1
      DO I=0,ah+b
      instruction

a rough estimate of S can be calculated as follows: the average upper bound of I_s is aN/2+b. Hence

   S = ⌈(aN/2+b)/p⌉ * N

and

   U_S = N*(aN/2+b)/(S*p) * 100%

If a=1, we have an upper triangular system and we can find more accurate values for S and U_S. Let us first consider S for a purely triangular system (b=0). Breaking it into ⌈N/p⌉ sets of columns, the first set will contribute N to S, the second set will contribute (N-p) to S, and so on. So

   S = Σ (N-pi), summed over i=0,...,⌈N/p⌉-1
     = ⌈N/p⌉ * (N - p(⌈N/p⌉-1)/2)

Now let us return to the original triangular system, with

   M = N+b-⌈b/p⌉*p = N-(p-b) mod p.

The first half contributes N*⌈b/p⌉ to S. The second half is a purely triangular system of size M x M, and thus contributes ⌈M/p⌉*(M - p(⌈M/p⌉-1)/2) to S. Therefore

   S = N*⌈b/p⌉ + ⌈M/p⌉*(M - p(⌈M/p⌉-1)/2).

The total number of elements is N(N+1+2b)/2. Hence

   U_S = N(N+1+2b)/(2Sp) * 100%.
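The closed forms above can be packaged as follows (our own Fortran sketch with hypothetical names, for the a=1 node DO h=0,N-1 / DO I=0,h+b; it illustrates the formulas of this section, not the simulator's code):

      SUBROUTINE TRISLC(N, B, P, S, US)
C     NUMBER OF SLICES S AND SLICING UTILIZATION US (PERCENT)
C     FOR A TRIANGULAR NODE, PER THE FORMULAS OF SECTION 4.5.
      INTEGER N, B, P, S, MT, KB, KM
      REAL US
C     KB = CEILING(B/P); MT IS THE SIZE OF THE PURELY
C     TRIANGULAR SECOND HALF, MT = N+B-KB*P.
      KB = (B + P - 1)/P
      MT = N + B - KB*P
      KM = (MT + P - 1)/P
      S  = N*KB + KM*MT - P*(KM*(KM-1))/2
      US = REAL(N)*REAL(N+1+2*B)/(2.0*REAL(S)*REAL(P))*100.0
      RETURN
      END

For b=0 this reduces to the purely triangular count S = ⌈N/p⌉(N - p(⌈N/p⌉-1)/2).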
When S is greater than one, we will have to replicate the resource requests S times. However, in order to save simulation time, we devise the following procedure. We first observe that, in general, each slice will follow the same general pattern: a fetch, an alignment, a processor cycle, then another alignment and finally a store. When there are more slices, they will be duplicates of this sequence, but with slight time displacements:

   F A P A S

The middle part of the operation will be the concurrent operation of these F-A-P-A-S sequences, each working on a different slice; it is repeated (S-4) times. To expedite simulation, we put an implied DO loop around this middle part. This DO loop will be simulated repeatedly until no other request node is using any system resource. Then one more iteration will be done to figure out the iteration time. This time is then multiplied by the number of remaining iterations to find the total time. This method will reduce the amount of parallelism slightly; however, it reduces the simulation time greatly.

Only one DO loop can be active at any time for any one level specified, in order for the above simulation method to work. This can be achieved by generating a resource request (for that level) at the DO node. The resource will be released at the END node, when the required iterations have been finished. This effectively locks out any other independent DO loop activity which would interfere with timing the "last" iteration. After the resource is released, another DO can be activated by being granted that resource.

4.6 Resource Time Calculation

After we have found S and the various utilizations, we proceed to find the resource times of the various resources needed by that particular instruction node. Scalar memory and scalar processor times are simple to calculate and we will not elaborate on them. However, recurrence and ordinary vector operations need further explanation. Shift networks present a different set of calculations and will be treated in a separate section.

4.6.1 Recurrence Handling

For recurrence nodes, we have analyzed in Chapter 3 the conditions under which certain recurrence solving algorithms should be used. We have also found the corresponding resource times for each algorithm. Hence, we can save a lot of simulation time by simply putting in the corresponding resource times when a recurrence node is encountered. This way, we assume that once a recurrence node is encountered, we preempt the machine to do just the recurrence. Usually there is overlap between the various resource times, i.e., the sum of all the resource times will be greater than the total execution time. To account for this effect, we set the total execution time to a constant parameter, OVERLAP, multiplied by the sum of the resource times. This OVERLAP can be found by first writing the recurrence solving algorithm in Fortran and then running it through the Analyzer/Simulator. The average OVERLAP is calculated over various array sizes and machine configurations. This method will not give us the true value for the execution time of each recurrence calculation. However, it will give us a moderately reliable estimate.

4.6.2 Vector Operations (using crossbar or omega network)

For a vector operation node using crossbar or omega networks, the resource times are easy to calculate once the order vectors (described in Section 4.2.1) for the operands are calculated. For memory accesses, if the order for a particular operand on a particular sweep is p, then consecutive elements are found p memory modules apart. Hence we need g = gcd(p,M) memory fetches before we can fetch the entire slice. In general, the number of memory cycles required to access a p-ordered N-vector (defined in Section 4.2.1) stored in M memories is ⌈N*g/M⌉.

After each memory cycle, the time required for aligning a p-ordered vector using a crossbar, and for aligning before storing into a p-ordered vector using an omega network, is equal to 1 network cycle. Nevertheless, to fetch a p-ordered vector slice using the omega network, we also need g network cycles. Hence the corresponding total numbers of network cycles are:

1) crossbar -- ⌈N*g/M⌉.
2) omega -- g*⌈N*g/M⌉ for fetching, ⌈N*g/M⌉ for storing.
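These counts can be sketched as follows (our own illustration; the names are hypothetical, and g = gcd(p,M) is computed by the Euclidean algorithm):

      SUBROUTINE CYCLES(N, P, M, NMEM, NXBAR, NOMF, NOMS)
C     CYCLE COUNTS FOR A P-ORDERED N-VECTOR STORED IN M
C     MEMORIES.  NMEM = MEMORY CYCLES = CEILING(N*G/M).
C     NETWORK CYCLES: NXBAR FOR A CROSSBAR, NOMF FOR AN
C     OMEGA FETCH, NOMS FOR AN OMEGA STORE.
      INTEGER N, P, M, NMEM, NXBAR, NOMF, NOMS
      INTEGER G, IA, IB, IT
      IA = P
      IB = M
   10 IF (IB .EQ. 0) GO TO 20
      IT = MOD(IA, IB)
      IA = IB
      IB = IT
      GO TO 10
   20 G = IA
      NMEM  = (N*G + M - 1)/M
      NXBAR = NMEM
      NOMF  = G*NMEM
      NOMS  = NMEM
      RETURN
      END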
4.6.3 Illiac Type Shift Networks

For processing systems with N processors, if the interprocessor connections are ±√N-shifts and ±1-shifts, we call the alignment network the Illiac type shift network. For this type of network, when a uniform shift permutation is requested, the resource time is easy to calculate. Let s be the shift distance required. We first set ss = min(s, N-s), where N is the number of processors. Also let n = √N. The shift time required is

   ⌊ss/n⌋ + (ss mod n)            for (ss mod n) ≤ n/2
   ⌊ss/n⌋ + 1 + n - (ss mod n)    for (ss mod n) > n/2      (4.1)

where the second case takes one extra n-shift and then backs up with 1-shifts from the other end.

When a triangular type array is accessed and the shift distance for each slice is different, we have to find the average shift time for the array. Let A(x) be the average shift time for shift distances 1,2,...,x. If x = Nk+w, where N is the number of processors, then

   A(x) = (kN*D(N) + w*D(w)) / (Nk+w)                       (4.2)

where D(w) is the average shift time for shift distances 1,2,...,w (w ≤ N). To find wD(w), write w = zn+y, so that the distances 1,...,w span z complete n-partitions plus y positions of the next one. Within a single n-partition,

   yD(y) = y(y+1)/2                   for 0 < y ≤ n/2        (4.3)
   yD(y) = ny - y(y-1)/2 - N/4        for n/2 < y ≤ n        (4.4)

where the second case accounts for the extra shift needed to reach the other end of the n-partition. The times required to get to the respective n-partitions are 0, 1, ..., (n/2)-1, (n/2)-1, (n/2)-2, ..., 2, 1. Let wf be the total number of n-shifts we have to do for all distances in the first z n-partitions; it is obtained by summing this sequence, giving

   wf = nz(z-1)/2                     for 0 < z ≤ n/2        (4.5)

with a symmetric expression for z > n/2. Combining the pieces,

   wD(w) = z*nD(n) + yD(y) + wf - q                      (4.6, 4.7)

where q = 0 for w ≤ N/2 and q = n/2 for w > N/2, since for w > N/2 the shift time at N/2 has been overcounted. Now we want to find D(N). For w = N we have z = n and y = 0, and the expression collapses to

   ND(N) = nN/2 - N/4 - n/2                                  (4.8)

Now A(x) in (4.2) can be found easily once ND(N) and wD(w) are known.

If a random shift is required, then an average time of n/2 will be used. If a broadcast function is needed, then we will use the worst case result of n. When the permutation is other than a shift or broadcast, we will apply Orcutt's result of 8(√N - 1) for omega passable permutations [27].
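Equation (4.1) can be sketched as follows (our own Fortran, not the simulator's code; NR stands for n = √N):

      INTEGER FUNCTION ISHTIM(S, N, NR)
C     SHIFT TIME OF EQUATION (4.1) FOR AN ILLIAC TYPE NETWORK
C     OF N PROCESSORS WITH +-1 AND +-NR SHIFTS, NR = SQRT(N).
      INTEGER S, N, NR, SS, R
      SS = MIN(S, N-S)
      R = MOD(SS, NR)
      IF (R .LE. NR/2) THEN
         ISHTIM = SS/NR + R
      ELSE
         ISHTIM = SS/NR + 1 + (NR - R)
      END IF
      RETURN
      END

For example, with N=64 and NR=8, a shift of 13 costs one 8-shift plus five 1-shifts going forward, but only two 8-shifts and three 1-shifts going through the next partition, so the function returns 5.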
4.7 Experimental Results

Our initial experiments deal with the effects of the following architectural parameters:

1) The number of array processors, and the speed of the processors relative to the array memory system. Initially, the processors will be restricted to a single group of processors operating from a single instruction stream (SIMD).

2) The presence or absence of an independent scalar processor and/or memory. The absence of a scalar processor forces scalar operations to be performed by the array processors.

3) The memory system, including the array storage scheme (1-skew, etc.), and the number of memories (power of two or prime).

4) The type of alignment network: a) crossbar, b) omega network, c) ±1, ±√p shifter (Illiac IV), d) ±1 shifter.

These parameters will be studied for a large variety of application programs, and in addition the size of the application programs (i.e. the array sizes) will be varied in order to produce families of performance figures.

The tables below present some preliminary results of experiments on three programs. We would like to stress at this point that these results are preliminary. The three programs can hardly be construed as representative of any large population of applications. The first program, ADVV, is a 4-point relaxation scheme. ADVV was chosen because of its highly parallel nature. The second program, ELMBAK, forms the eigenvectors of a real matrix by back transforming those of the corresponding upper Hessenberg matrix. ELMBAK is reasonably complicated, but has no recurrences. The third program, SLEQ1, is a Gauss-Jordan reduction program. SLEQ1 was chosen because it contains a recurrence relation (an R<18,1> system). We present the results of these three programs only as an indication of the types of results we expect from our experiments, and as an illustration of how to interpret the results. The complete tables of experimental results for these three programs are shown in Appendix B. Some of the more interesting figures are grouped together in Tables 4.1 through 4.4.

Table 4.1 shows the speed factor, Fp = T1/Tp, and the processor utilization U_T using 16 processors, 17 memories, a crossbar alignment network, skewed storage, and a separate scalar processor and memory. The results are presented as a function of N, the data array sizes.

[Table 4.1: speed factor Fp and processor utilization U_T versus array size N for ADVV, ELMBAK and SLEQ1; the entries are illegible in this copy.]
As we can see, the crossbar and omega networks performed egually well. The Illiac network perforired somewhat better, at least for ADVV and ELKBAK, This is due to two facts. First, the Illiac network was set to operate four times faster than the other networks. This reflects the difference in the complexity of the networks. Second, we were able to "compile" the programs using very simple alignm.ent requirements which could be easily handled by all three networks. The lack of difference between staight storage (0,1) and skewed storage (1,1) is also a reflection of this second foint. He were able to compile the program.s so they only needed access to rows, and thus they do not benefit from skewed storage. However, we do not believe this result will held for larger. r! r •"I ,ti^t I. » n 140 Illiac IV Straight Skewed 1384 (9.9) 1282 (2.8) -K 1384 (9.9) 1261 (2.8) ^C Omega Straight Skewed 1416 (9.7) 1760 (2.0) 2644 (4.5) 1416 (9.7) 1760 (2.0) 2644 (4.5) ssbar Skewed 1416 (9.7) 1760 (2.0) 2644 (4.5) Crc Straight 1416 (9.7) o o v^ . r-, CM iH -^^ 2644 (4.5) e CO 00 o 1 < X W rH o- w CO 1 .^.S c >-l o o a 3 »-. 4-1 0) 0) 4-1 c c •H iJ c » c 4-1 00 •H U tH CO to •H iH t/5 rH D M O •H r- ^ 5 to > -a di 00 c •H to to i-t 3 60 O x— ^ ^-1 CL til 4-1 v_/ OJ ^^ o 4-1 4-1 O u c to QJ > T3 CO CU x: OJ a dj en *> • X) ^ C o CO •H ^ «t /— • 3 a. vO H rH en II dJ •* O u OJ »^^ C 6 OJ •H l-l 4-1 u • 3 c tn u o -l 4-1 tu 3 x: w CJ u c 0) w •H ^ (0 w 00 4J c c • CM •H O U) • 3 u c (U o .i«i rH -H 0) tn O- 4J .H W U XI "O J ~ 8%, - -, 9% 8%, 9% Table 4,4 The effect on execution time, Tp, of a scalar memory and scalar processor. The percentage figures are scalar memory utilization and scalar processor utilization respectively. i:!J •fl ^3 144 such things as subscript calculation. The above discussion is based Only on limited amount of experimental results. It can only be regarded as an illustration of how to interpret the results. According to which "benchmark" programs a user is interested in, he can conduct experiments using that set of programs. In the end, he will then be able to determine on the kind of machine configurations that is most suitable for him. 145 5. CONCLUSION This thesis concerns the utilization and effectiveness of interprocessor connection networks for parallel (SLID) type computers. The problems concerning interconnection networks can be divided into three areas: capabilities, exploitation, and effectiveness. Capabilities include network properties and network control methods. One of the networks that we have examined closely is the omeqa network. The omeqa network is one of the more attractive multistage networks, it is moderate in network coirplexity and quite powerful in its permutation capabilities. If we concentrate on only seme of the more common permutations, we can further reduce the complexity of its control algorithms. Three different control methods are shown in this thesis. He have discussed a significant new property of the omega n<=*twork, the partitioning property. We showed that a large size omega network can be regarded as a conglomeration of irany smaller size omega networks, each passing a different smaller omega-passable connection function. This partitioning property of the omega network proves to be vital tor the efficient handling of many computation algorithms. We also discussed another important property of the omega network, the broadcastirg ability. 
He showed the conditions under which a 2-dimensional data array can be broadcast to a J-dimensional data array using the IrtpJT ; J Co is It 146 omega network. This data transfer ability is necessary for example in certain matrix multiplication and recurrence solving algorithms, in Chapter 2, we were also able to extend the capabilities of the omega network further using the concept of linear permutations. The shuffle connection is the basis of many interconnection networks, like the omega network, the Batcher network, and the binary Eenes network. Because of this similarity, we can apply some of the properties of one network to the others, and hence increase further Understanding of such networks. Some such extensions are shown in Section 2.3 and 2.U. Because of the great simplicity in gate counts, the one stage perfect shuffle exchange network is also carefully examined. Algorithms were presented in Section 2.5 to show how such network can be used for performing' permutations. with the old and new knowledge we have acguired on network capabilities, we would like to apply them to some common computations. Recurrence solving algorithms and matrix multiplication algorithms are two examples that we used in Chapter 3. The efficient handling of recurrence operations is essential in parallel processing systems because the parallel system will be degraded to a serial machine otherwise. With careful planning, we were able to simplify the alignment requirements of various recurrence 147 solving algorithms. So, instead of a full crossbar* wg can now use a simple alignment network, such as the omega network. In Section 3.3, we show how a comrron computation algorithm (such as matrix multiplication), if detected, can be adapted onto a parallel processing systeir equipped with only a one stage network. Hence Chapter 3 has been dedicated to techniques for exploiting various interconnection networks. To evaluate the tru© effectiveness of a particular interconnection network, we have to determine the effectiveness on real programs of a parallel processing system equipped with such a network. This can be achieved with the help of the Analyzer/Simulator project currently being developed. The program analyzer first generates a highly parallelized version of the program. Then the RRG will compile it into suitable pseudo machine code from the information about the parallel processing systems that the user defines. This pseudo compilation is dore based on the capabilities of the architecture to be studied, including the type of interconnection network used. From the results of the simulator, we can determine how well does an interconnection network work- mfV' ■-'!', *• ,■(?" c > . IK' :s :» «2 '-5 •a with the methodology described in this thesis, the true effectiveness of an inte rprocessor connection can then be determined. 148 We conclude this thesis by giving the following topics that are worthwhile for further research: 1) IS there a set of basic permutation patterns for the omega ROM control method from which other useful permutation patterns can be generated by doing logical o^'prations on some members of the set? For example, how can we generate the control pattern of a k-shifted p-ordered spread permutation from that of a k-shift permutation and that of a p-ordered permutation? 2) Finding the analytic bounds and averages for the time required to pass a permutation using a one stage perfect shuffle network is a worthwhile project. 
3) We are able to adapt recurrence solving algorithms to a parallel processing system equipped with a (log n)-stage network. A highly significant result would be to show how recurrence algorithms could be handled by a one-stage network (perhaps coupled shuffle and shift connections). This would lead to a very cheap and effective parallel architecture.

4) Control unit times should be carefully added to the simulator. Subscript calculations and register and scalar usages should be accounted for more accurately.

LIST OF REFERENCES

[1] K. J. Thurber, "Interconnection Networks -- A Survey and Assessment," AFIPS Conference Proceedings, vol. 43, pp. 909-919, May 1974.

[2] K. E. Batcher, "Sorting Networks and Their Applications," Proceedings of the 1968 SJCC, pp. 307-314.

[3] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York, 1965.

[4] D. H. Lawrie, "Access and Alignment of Data in an Array Processor," IEEE Transactions on Computers, pp. 1145-1155, December 1975.

[5] T. Feng, "Data Manipulating Functions in Parallel Processors and Their Implementations," IEEE Transactions on Computers, pp. 309-318, March 1974.

[6] G. H. Barnes, et al., "The Illiac IV Computer," IEEE Transactions on Computers, pp. 746-757, August 1968.

[7] R. C. Swanson, "Interconnections for Parallel Memories to Unscramble p-ordered Vectors," IEEE Transactions on Computers, pp. 1105-1115, November 1974.

[8] H. S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Transactions on Computers, pp. 153-161, February 1971.

[9] S. W. Golomb, "Permutations by Cutting and Shuffling," SIAM Review, vol. 3, no. 4, pp. 293-297, October 1961.

[10] M. C. Pease, "An Adaptation of the Fast Fourier Transform for Parallel Processing," JACM, pp. 252-264, April 1968.

[11] L. R. Goke and G. J. Lipovski, "Banyan Networks for Partitioning Multiprocessor Systems," 1st Annual Computer Architecture Conference, Gainesville, Florida, December 1973, pp. 21-28.

[12] K. E. Batcher, "The Multi-Dimensional Access Memory in STARAN," submitted to IEEE Transactions on Computers.

[13] K. Y. Wen and D. H. Lawrie, "Omega Network Control Plane Implementation," Dept. of Computer Science, Univ. of Illinois, Urbana-Champaign, unpublished memo, Sept. 1976.

[14] D. H. Lawrie, "Memory-Processor Connection Networks," Univ. of Illinois, Urbana-Champaign, Computer Science Report 557, Feb. 1973.

[15] T. Lang and H. S. Stone, "A Shuffle-Exchange Network with Simplified Control," IEEE Transactions on Computers, pp. 55-65, January 1976.

[16] M. C. Pease, "The Indirect Binary n-Cube Microprocessor Array," submitted for publication.

[17] D. C. Opferman and N. T. Tsao-Wu, "On a Class of Rearrangeable Switching Networks," Bell System Technical Journal, vol. 50, pp. 1579-1618, May-June 1971.

[18] T. Lang, "Interconnections Between Processors and Memory Modules Using the Shuffle-Exchange Network," IEEE Transactions on Computers, pp. 496-503, May 1976.

[19] P. Budnik and D. J. Kuck, "The Organization and Use of Parallel Memories," IEEE Transactions on Computers, pp. 1566-1569, December 1971.

[20] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations," IEEE Transactions on Computers, pp. 786-792, August 1973.

[21] D. Heller, "On the Efficient Computation of Recurrence Relations," Inst. Comput. Appl. Sci. Eng. (ICASE), June 1974.

[22] S. C. Chen and D. J. Kuck, "Time and Parallel Processor Bounds for Linear Recurrence Systems," IEEE Transactions on Computers, pp. 701-717, July 1975.
[23] S. C. Chen, D. J. Kuck, and A. H. Sameh, "Practical Parallel Triangular System Solvers," in preparation.

[24] A. H. Sameh and R. P. Brent, "Solving Triangular Systems on a Parallel Computer," Univ. of Illinois, Urbana-Champaign, Computer Science Report 766, November 1975.

[25] S. C. Chen, "Speedup of Iterative Programs in Multiprocessing Systems," Univ. of Illinois, Urbana-Champaign, Computer Science Report 694, Jan. 1975.

[26] B. R. Leasure, "Compiling Serial Languages for Parallel Machines," M.S. Thesis, University of Illinois, 1976.

[27] S. E. Orcutt, "Implementation of Permutation Functions in ILLIAC IV-Type Computers," IEEE Transactions on Computers, pp. 929-936, September 1976.

APPENDIX A

We will first show a multiplication algorithm that takes N processors and 2N memories. The skewing scheme is √N+1-skew 2-skip. Memories 2i and (2i+N+√N+1) mod (2N) are connected to processor i. All even RA's refer to memory 2i and all odd RA's refer to memory (2i+N+√N+1) mod (2N). An illustration of the memory map for N=4 is shown in Figure A.1.

      P0       P1       P2       P3
    M0  M7   M2  M1   M4  M3   M6  M5
    (0,0)    (0,1)    (0,2)    (0,3)
    (1,2)    (1,3)    (1,0)    (1,1)
    (2,1)    (2,2)    (2,3)    (2,0)
    (3,3)    (3,0)    (3,1)    (3,2)

   Figure A.1  √N+1-Skew 2-Skip Scheme

Algorithm:
A. Repeat for IC=0 to N-1;
   1. Fetch A, RA=IC, into R1.
   2. Set k=[(√N+1)*IC] mod (2N).
   3. If k is odd then k=(k-N-√N-1) mod (2N).
   4. k=k/2.
   5. IR=(PPN-k) mod N.
   6. Repeat for IT=0 to N-1;
      a. fetch B, RA=IR, into R2.
      b. multiply R1 and R2 into R3.
      c. G-permute IR.
      d. J=(PPN-⌊(√N+1)IR/2⌋) mod N.
      e. if IR is odd then J=(J+N/2) mod N.
      f. store R3 into T, RA=J.
      g. G-permute R1.
   7. Set R1=0.
   8. Repeat for IT=0 to N-1;
      a. fetch T, RA=IR, into R2.
      b. add R1 and R2 into R1.
      c. G-permute R1.
      d. G-permute IR.
   9. Store R1 into C, RA=IC.
B. Done.

Another algorithm also uses 2N memories, except that now it uses a 1-skew 2-skip storage scheme. For this scheme, memories 2i and (2i+N+1) mod (2N) are connected to processor i. The memory scheme for N=4 is illustrated in Figure A.2.

      P0       P1       P2       P3
    M0  M5   M2  M7   M4  M1   M6  M3
    (0,0)    (0,1)    (0,2)    (0,3)
    (1,2)    (1,3)    (1,0)    (1,1)
    (2,3)    (2,0)    (2,1)    (2,2)
    (3,1)    (3,2)    (3,3)    (3,0)

   Figure A.2  1-Skew 2-Skip Scheme

Algorithm:
A. Repeat for IC=0 to N-1;
   1. Fetch A, RA=IC, into R1.
   2. Set k=IC; if IC is odd then k=(IC-N-1) mod (2N).
   3. k=k/2.
   4. IR=(PPN-k) mod N.
   5. Repeat for IT=0 to N-1;
      a. fetch B, RA=IR, into R2.
      b. multiply R1 and R2 into R3.
      c. G-permute IR.
      d. J=(PPN-⌊IR/2⌋) mod N.
      e. if IR is odd then J=(J+N/2) mod N.
      f. store R3 into T, RA=J.
      g. G-permute R1.
   6. Set R1=0.
   7. Repeat for IT=0 to N-1;
      a. fetch T, RA=IR, into R2.
      b. add R1 and R2 into R1.
      c. G-permute R1.
      d. G-permute IR.
   8. Store R1 into C, RA=IC.
B. Done.

APPENDIX B

The following twelve tables are the experimental results of three Fortran programs: ADVV, ELMBAK, and SLEQ1. The column AN shows which alignment network is chosen. The column (M=) indicates whether the number of memories is prime or not. The column SKEW shows the skewing scheme being chosen. Finally, the column SPEED (P/A/M) gives the relative speeds of the processor, alignment network and memory. As for the headings, the number of processors is given as P=xx. SP/SM indicates whether the switches SP and SM (c.f. Section 4.2.2) are turned on or not.
For the table on Execution Time (Tp) and Speed Factor (Fp), the main number in each entry is Tp and the number in parentheses is Fp. The remaining tables show utilization measures of the various system resources. They are given in the order: array memory, input alignment network, output alignment network, vector processor system, scalar memory and scalar processor. When the entries in some row j are the same as those of row i, they are marked "same as row i".

[The twelve tables of Appendix B are illegible in this copy.]

VITA

Kuo Yen Wen was born in Shanghai, China in November 1949. He received his B.S. degree in Electrical Engineering and Computer Science in 1971, and his M.S. degree in Computer Science in 1974, both from the University of Illinois, Urbana-Champaign. From 1971 to 1976, he was a research assistant in the Department of Computer Science. He was associated with the Illiac III project from 1971 to 1973 and with the Machine and Software Organization project since 1973. He was the coauthor, with Prof. Duncan H. Lawrie, of a paper, "Effectiveness of Various Processor/Memory Interconnections," presented at the 1976 International Conference on Parallel Processing. He is a member of the Institute of Electrical and Electronic Engineers.

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-76-830
Title and Subtitle: Interprocessor Connections -- Capabilities, Exploitation and Effectiveness
Author: Kuo Yen Wen
Report Date: October 1976
Performing Organization: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois 61801
Contract/Grant No.: US NSF MCS73-07980 A03
Sponsoring Organization: National Science Foundation, Washington, D.C.
Type of Report: Doctoral Dissertation
Abstract: Recently, some research interest has centered around interprocessor connections for SIMD type parallel machines. However, we still lack a methodology for evaluating various networks. In this paper, we first present some new results on network properties. Then we show how to exploit various networks in ordinary computations. Finally we describe how we can apply the theoretical results to predict the performance of some networks in a real program environment, which is the true measure of network effectiveness.
Key Words: Array Slicing; Computation Adaptation; Network Control; Permutation Capabilities; SIMD Type Machine; Simulation
Availability Statement: Release Unlimited
Security Class: UNCLASSIFIED
No. of Pages: 176
00 00 00 CO CO 00 00 00 CO CO 00 CO ' "^ ^.^ ^ ^ _ _ _ 3 ^- r— r— f— ^— ^— LU A 9t A «k M •« :^ 1 r— o , — , — o CO "" ^ " ^^"^ ' " — ' "■ — ' O) II E J^ -^ Js (U o S- o to (U to O •r- s- ro O c o +J N ID a> c •r- U 00 LD r— CM CO to 00 CTi r— CM 1^ P = 64 SP/SM=(1, 1— CO ^1- CO r— ^ .— CO .— ^ .— ^ 1 CO 1 *3- 1 CO 1 «:»■ O CM LO CM PN. 00 r— CM (v. 00 .8, 7, 15, 8, 12,10, 11.12, 8, 7. 15, 8. 12. 0. 11,12, O 00 CO LO CM r— COLO CO LO CO CO CM CO CM ^ «^ CO CM «d- ^ ^ 32.29. 3 58, 8, 4 47,21. - 44,12. 3 32.14. - 59, 8. 4 1 CO O CM 1 ^ O CM O CM ^1— CM 1— CO CO ^ ^ 30,27 58, 8, CO CM cn CO ^1— CM ^O CO CM 00 "id- ^ CO LO 32, 0, 59, 8, o CO CO CM CO CM ^ CO CO CO ^ CM «:*• 1 CO 1 ^ 1 CO 1 «* tx> M r^ 1 C3^ I CO r— O C3^ r— O ** CO ^ ^ 29,28 56, -, o 1 II i s: •r- CM J»4 CM CM < X oi 1 — r 161 Q i. «2 I— CM CO LO CO CTl O I— CM CO • •o U3r- O an •• ^-s 1— •% «:f ,^ , — * - — » , — » > — » ,. s , — » ^ — ., • O 1— o ^ en en O LO o r^ r^ ,— II lO o in ^ CM 1— 00 r>. CM en r^ r«» r- cn II z: r^ • <^ • o . CM • r^ • CM . r^ • oo r— CM I— CM r— CM I— CM !>-. CM r— CM r^ CM o. r— o^ O^ CO en CM LO O LO «;!- CM r- 00 r*. «* 00 LO CO co en ^— r^ • ^ . o . CM • r<- • CM • LO . •>_^ 1— CsJ 1— CM 1— CM r^ en ^- ^- CO r- CM i^ cvj f— CM r^ CM ^_^ o "!:r 1— CM CO O cn CM ^ cn .— 00 CM «t LT) en LD CM in 1— S S 2 s 2 cn Ln <^ r-. r^. Ln CO 00 r— 00 . LO • o . o o o o o CO • o . CO • cn . ^--«' ^— r— .— CM •— CM s- s- s- S- s- r— CM CO CM 1— CM 1^ CM ■ — ^ — ' * — ' "~^ — ' VD ' — II Cl. ^^ ^.^ ^-^ ,— ^ ,-^ ^-^ ,_^ ^-^ 1 — O r- o •* CO en CvJ LO en r- O •^ «« VO O LD «:;*• CM r— l/l CO 00 00 l/) 00 rv. "s:!- CO LO 00 CO cn o r-s . ^ • o . fO (O - 1— . * I— CM 1— CM V) 1/1 LO I/) LO 1— CM CO CM f— CM r^ CM o* •* o , — , , — , , — , , — » < — , * — » * s '!t O r- O viD CO CM Ln CO Ln Ln Ln r^ Ln cn II O VO O 00 CO LO CO O en o en en CM cn II s: CM . cn • CO • r> • o • r^ • 1^ • 00 C\J r— f-^ f-^ r— r— r— CM I— CM r— r— r^ f— o. to ^^ ^ O T~ 00 00 •^ CO «* 00 CO >* CO •vi- CO •^ Ul — . ■^^ -~^ ■■^^ -^ — ^ — . •*^ UJ < 00 •^ •* 00 •vf CO CO ^ CM 1 — CM f— o_ ^^^ *^^ ■*«v^ ^ ^"v^ "•»N^ "^*^ ^N^ ^*v^ *** — ' — ^ o-> Q. CO 00 CO CO 00 CO 00 00 CO CO 00 CO ^ ^ ^ ^ ^ ^ ^ ^^ ^ ^ 3 r — 1 — 1 — f^ I^ ^— UJ » *k •k «H A 0k iii^ 1— , — o 1— ^- o 00 ^■■"^ '^■^ ^~^ ' ^ ^ OJ II E -ili Ji^ .:x^ ^~ Q. CM CM CM z CO G "^ <: X 1— ( r— CM CO «* Ln LO I^ 00 cn o , CM 162 CO t- cn o Cl. s- o o o <0 a. CO -o c 0) o +-> o X CD a. vo II CO C\J ^ LT) O r- C\J ,— 1 — 1 — C\J CM «« *k A •» •* f^ 0 f— n- r— CM r— en CT> 1— CM VO ';r Lf) r— ^ n- CO CVJ ^ Ln O r- OJ r— r— t— CM C\J #1 «i A •* •* •* CT» en I— CM VO ' ^ Ln 1— ^ 1— ^ OvJ "5r «>r 1— O CM r— 1^ t— CM CM «i •« «% A •» « «^ 1 O 1 O 1 ^ CO C\J #1 •% O CO a\ en 1— r— VO IT) UD 1— LD 1— CO CM ':^ I o CM VO CM VO LO CM 1— CM t— r- I— CM I— ^ 1 vo 1 vn 1 II ^ CM CO II ^ 9% n •* A •I A 00 CM CM CO r^ r-v I— Q. Q. to r>> CO 00 CO in IT) 00 Q_ OO 00 o » CM r^ I— r^ I— ko CM I vo I r^ <— CM CM I cn I CM I— I I I I CM I— CM I— r^ o lo CM vo «!:a- I I 00 CM 00 I— rN. I — vo CM I I I I cn p^ vo ^ 00 o vo o vo CO LD LO I I CM I CM cn I 00 CO 00 00 rs. CM vo ^ CO I o I 1 — T— CM LO cn o^d- r>. vo 00 CM CO 00 CM I 00 I CM I— 00 cn o CO 00 CO r>. vo 00 CM 00 00 CM tz- CM LO ^O 163 1 r^ 1 CO CM 1 r«. 1 00 CM CM 1— Q ^ CM I— cn rs. vo «3- r^ CO vo 1— LO r— LO CM CO CO vo 1— vo CM in CM CO E s- O i~ a. o to > o 00 E (d S- cn o i~ cl. titeM 3 :c "2 CO cn o f— CM , — ^ o , o A ^ r- ^— «« «!r •^^ «o lO II (^ o s 1 — II oo Q. 
o o VD I— ^— ^ r— •» , • o «k «d- O n— CVJ I— O CO o o ^ r— VjD O IT) >£) o O o O 1 r^ t n" •* •* «« o «o CT> O CO o r— r~~ A «% A •« LO WD LO VO 164 o o CO f— to o CM I— CVJ C^i CM VO o o o o 1 1— 1 r— «t 0i «o *o ^ o r^ o CO 1— CM 1— •« X O^ •t •» CM CO CM CO CM U3 CM V£> II o. in o r— o r^ o r— o 1 o o 1 o o 2 o o CVi 1 2 o o 5 o S- o s- o « CO 1 5 o CO •« OJ 1 o J- r^ CO f— U3 cr> CO en CO CT) CO r— KO I/) (T3 «r> I • o ^ o CM r— O 1— CO CO ir> I cr> i E p- fO r> I— I— CO to O) II CL. o o II tn D- 00 c/i a. en CO CO 1— iO «5f U3 UJ CO CO LU .. Ol O r-l o CO CO 00 CO CO E •r- S- O- CQ X VO O (/) CM f— fO CM r— CM CO .— I E CM (d •< *> to en r— I— CO to rt3 to n3 lO IT3 O) E fO to E m to (U to cr> LO CM LT) O r— CM CM a I I 00 I CM LO LD VD 'd- CO ^ 00 CO «::f CO «* CO «* CO CO ^ CM 00 CO CO CO 00 00 CO CM to I I «;r o CO I— CM I— CM CO I I O I E CO n3 •> •> to cr> r— I— CO 00 CM I— fO CM r— CM CO I I O I E CO fO « » to CTl (— r- CO I I CM I LT) C\J LD LD to 00 ^ CM •— 00 CO CQ re CT) O L- O- i- o 0) o s- o i- Z3 CD CO to r^ 00 cn CM vo — - II II :s: — r- C\ CVJ CVJ I— LD r— r- vo II o CVJ C\J CO OJ c\j c*- ^ CM CVJ LO ro 1— >— 1— c\ *X) CM ^ LO LO 1— I — CM CM CM r— I— ro CM CM ro ^ CM OO ,— 1— CM r^ CM vo p— o vD CM r^ en c ^ CM CM <;r oo o r— I— CM *£> I ^ I lO O VO CM r^ CT> o II a. o o II CM CO t VD CM ^ ID LTJ 1 — O r-» CM CO (T> ^ I CM I CO I VO I ^ I— VO CJ> o ^ CM *;}■ O •— CM 1^ I VO I i— VO o o •^ I 'S' I r^ CM VO I— ** I ^ I VO I r>. I t^ OVO CMOO I r— I LD I Or— «;d- «:J- ^ n CO CM n CM CM CO CO 00 00 00 <;!• 00 CO (U E •r- S- CO I r- I CM cn 2 O O) r^ I— o CO CO CM CO CO 00 ^ 00 '^J- 00 00 2 o CO 00 00 CM CO X r— CM LO o to (U O) E E rtJ ro lO to , 2. - ,17,17 1 00 CM CM ^ 1 CM 2, - 14.28 «;*• CM CO CO 'd- CM CO CO , 8. - ,17,17 , 7, - ,14,28 LO 4. - 14.28 LO 00 CM CO LO 00 CsJ «d- 1 LD 1 r>. CM . VO 1 r^ CM CO 1 r^ 1 LO "S- 1 LO r«^ CO CO LO 00 CO ro 00 f— 1 1 LO 1 1 1 LO cyi CM LO r— n— LO cr> CM lO 165 II II 00 I r^ I LO 00 CO ^ ^ I CO I I I I LO I «;3- I LO CT> CO LO I I CM I cy> i r^r^s. r>>cM lo vo vor— ^CM CO^ <::1- CM CO^ 00 'd- 00 "sa- 00 00 CM cs 00 ^ CM I— 00 CO 00 CM 00 00 OQ E i- o> o i~ S- o to O) u 3 o CO O $- O N ■o O r- CM ^O - — II II ^ 00 Q-, -^ O- 00 ^ CO 00 • CM CTt CO . 00 • CM en o Ln ^ in o — vo "^ tn ^ wo ^ CM ' CM ' 1— ^— ^ • CM «X) c— O • V£3 ta- ^ • CM >X> r— O . II O- CO CO CM ^ 00 (X) • CO • CM • C\J >-^ CM — ■■ I— •>-- s CO CO CTl • CM • r«. ^ UD CO CM 1—^ — - O KD • CM ^ LD O UD 00 <— ^ • . V^ • VD "^ LD ^ LO ^ CM ^-^ CM — ' I II o o co o 2 o o to "d- CM LO 166 ^ ir> CO r— ^ . (JD • l£5 'd- LO 'd- CM ^— ' ■ ' CO CO CM ^*- CO cr> (£) . CO • CM • r^ ^ r^ ^ U3 CO CM ^-> CM — -' ■ — ' Q UJ O- oo r>. CM LO CM LO^> LO CM I — O O • CO • •^ CM 1 — CM LO^> CO'^^ to to o s- CM ^ CO • 1^ ^ CM — ' in 2 O 5 o to O VO as • LO ^ CM ' CO CO CO CT> QJ wo • CM r^ «^ wo CO (Tj CM ,— ^~> 10 CM 1— o r^ • CO • LO CM 1 — CM LO^^ CO to ro to ro UJ 00 (O S- cn o i~ o. i- o o 4-> u ro t3 (U (U Q- 00 -o c ro CM "^ O) 0) 0) CO • E E fc= r^ *:!■ ro ro rO CM to to to LU 00 c UJ -J CO .. ol t3 r-l O Qi •• a. s: O) CL. CO X CO ■* 00 '=^ CO CO CO CO CO CO CM 00 00 00 LO CM O • ^ CM LO — ■> 00 «* CO CO 0) E o •r— u 00 00 CO si- 00 0) zn •r- U- CM a CO LO to 00 CTl 1 — CM I OJ O .— .— CO CO ^B^ I— |COC\J .— OJ r— C^0 c\j p^ c\j r-» CO o a. 00 cr> 00 cvj r>>. CO ^ r^ ^00 CO en en o CvJ I— CM •— c\j 00 ■— r^ >X3 ID CT» O LO O CM «— r— I— ^ ID CM I— CM 00 CM 00 I— rv. vo 00 in CM 00 CT> CO 00 CT> CM «;r CT> CT» CO ^ cn <0 CO "Si- r— CM (D CO LT) LO r>. 
r»» r^ r — cm O^ CO (T> CO CO VD CM 00 ^ CO CT» CO r^ li) CTl O ^ 1— CM t— CM CM 00 r— P*, CM 00 cn CO VO CO r^ i£> 00 I— CM CT> CO CO <^ en I CM CVJ CM CO CM 00 LO CO CO pv. o ^ o II 11 Q- CO a. 00 CM 00 CT» ^ 00 I CO *i- CO «;*• I P% 00 CM LO I LO CO LO I LT) LO LO Ps LO 1—00 cTi *;r cn ^ CO p> CT> 1 CM CM CM CO LO CO Pn LO C\J 00 P-s f— CT> ^ P^ 00 00 I C^O CM LO I LO LO I LD LO LO 1—00 CT> «::}• 00 r^ "^ I CO I CM I— CM I CM CO I LO CO t CO LO I LO O CO CO LO O p> cn ^ cy> «::j- 00 Ps. Ui < UJ 00 3i U' 4 00 00 00 CO «* CO 00 OJ CO X CM CM I CM CO I LO LO I LO O CO O P> cT> ^ 00 r«>. CO «* 00 "^^r 00 00 tA O 2 o to O) E 0) 00 CO CO 00 CM o (/1 (TJ dJ (A 167 LO O CM 00 CO ^ cn en CO LO o CM 00 CO «;f cn en CO ':d- cn CO r^ p^ CT> CO LO CM 00 CO ^ CO cn ^ «:3- I CO P> LO cn <;f CO I CO I CO 0% A CO LO cn "d- LO o LO 2 o 2 o Q) CC (/) 3 O •r- S- o > ■M CM E cn o O- .jbdOa. ;8 CM c; I— CM CO LO LO 00 cn o I— CM ,_ o A CO o «^ r— 1 — ^_^ •< « II ro o II s: CM o 00 •t n^ a. >^ i^~ •k D- f— CM oo O ro O ^ — s CM r— f-^ «« •» «k r- O r— LT) O 23, 36,1 WD II O O K£3 I— CO o CO I— 00 CM o o o I— o <:1- o CM 168 OJ o CM O (^ •« *^ I I— «£) o CM ro $_ CO I CM •< «o I— o IT) I— 2 o o o <:a- O o o CM I— CM 1— ^ «« ^ •« CM O <* o LT) O in o m> r— •* r^ ^ " CM •• CM VO ^ o CM O CM O r^ •> (T> •> "^ 1 ^ 1 9\ «l S- A t\ j£ CM WO o O WD O CM CO s- CM CO S- <^ 1 WD 1 CM " CM •> «o «o CM O n5 N •r— U 00 CO 0) cn I 3^ n\ gpi ^_^ r— M <— o 1 1 — 1 CO CO o .— vo CO O CO to r- O ^ r- ro CO CO f — r— I — , — VO ^-^ •> •> « #k «t «> 0k 0k 0k 0k 0k 0k A 0k •« A II LD r^ CO r>» CO o CO 00 cr> r^ to 00 1— p>. CM CO II z: r— ^— r-~ CO r— CO CO 00 A ^ A «t A m 0k 0k •^ •» 0k 0k 0k M 0k 0k cu -^ m CT> uo cr> CO r— r^ ^ vo r^ to «:a- in r^ r^ ^ o. r^ oo vo o CO o Ln vo r>. o vo ^o CO O to to ^ o ^"^ r-^ r— ^— r— r— r— ^— ^— r— A » A #1 «« M •» •* «s «« 0k 0k 0k 0k 0k 0k •t CO 00 VO 00 U3 P«» CO 00 r>» r^ «?^ 00 CO r^ r«. CO n— CO I — CO CO CO CO CO ^-^ M •« A A A •« #t A 0k 0k #k 0k 0k 0t 0k 0k f— «;!■ r- ^ r-^ CO CO ^ CO CO o ^ to CO CO «;1- CM r— CO 1— r— CO CO 1— 1— CO CO r- I— CO «^ CO t— LO to r^ V£> CTl CO cn un vo r^ en to to r^ CTl to to CO CD , (•'^^ ^mm ^.^ ^^ o •% A «« «t 0k 0k 0k 0k A A A 0k A A A « A r— 1 in 1 <■£> 1 1 — 1 r^ 1 CO 1 CO 1 to 1 ^- CO f— CO CO CO CO CO v.- ^ •« 0^ M «« 0k 0k 0k 0k W\ #k •> M 0k 0k S M •> s 5 5 o CO O CO t^ CO 1— CO CO CO O CO r^ CO o CO CO o o o VO CO 1— CO t— I— CO CO 1— I— CO CO 1— 1— CO s- CM r- S- $- s_ II Q. U3 1 CO I ID 1 r^ 1 to 1 CO 1 to 1 ^ 1 -"-^H r— " •> 0k 0k «t 0k 0k 0k A A •1 A 0k 0k •* •» A CsJ 00 <£) 00 vo r-o, CO CO r^ rv. ^ 00 CO r>. to 00 o CO r— cu CO CO CO CO *•— > 0\ 0* 0i 0t •k 0k 0k 0k •* •* 0k 0k 0k 0k 0k 0k rrr ^ r- ^ r>N «:J- CO «d- o^^ o ^ r-N CO to CO Lf> m to 10 CO I— CO t— f— CO CO I— r— CO CO 1— 1— CO <0 CO 1— fO rxj «o ^O 1 CO 1 LO I rs. 1 to I r-N 1 to 1 CO 1 ^""^ O •» •« A ffh •> 0k 0k 0k «l «t 0k 0k 0k 0k 0k 0k •« r— 1 LO 1 to I 1 — 1 r>N 1 CO I CO 1 to 1 o CO 1 — CO CO CO CO CO SI) CO (O •» 0k CO «;1- E E re CO 1 — CO 1— r— CO CO 1— I— CO CO f— 1— CO (/) CM 1— lO to 10 ^ o" M r- 1 vo ■ O 1 ^ 1 CO 1 ^ 1 CO 1 r^ 1 ^ O r*~ ^_ » — ' •I 0i #1 «k •* A 0k 0k M A 0k 0k 0k 0k 0k 0k 11 11 ^— 1 ^— 1 »^ 1 KO 1 r— 1 r^ 1 CO 1 ^ 1 21 ^ CO CO v 00 CO ^ CO r% r^ CO CO ^ 1^ O 00 (Tt CO a. CO CO CO CO CO CO ^ ^ CO «* <:!• ^ CO «* ^ ^ CO _ U4 ^.^ CO CO ^ 00 ^ 00 "* CO CO '^ CO ^ UJ < 00 ^ •^ CO ^ 00 ^ CO ^ ^ 00 ^ 00 Q- CO CO 00 00 CO 00 CO CO 00 00 00 00 ^^ ^.^ ^.^ _ LJ ' — ■ r— F— ^— , — •« 0k A 0k 0k ^ r— ,— o f— o CO "" "—^ CU II E -iiJ .^ z: si Q. 
CO CO < CO X cs r— CJ CO ^ ID to r^ 00 (y> o _ CO 169 I oo E o i- o o I. o (/) o: (/) 3 O •r- S- (O O (O N -o o C_) 0) C7» ^ al a •i .J '■'T- 3 ;i *) '.iT^^'^a. C3 1 e3 r. 3» It Sm 170 VITA Kuo Yon Wen wis born in Shanghai, China on November m, 19U9. He received his B.S. degree in Electrical Engineering and Coirputer Science in 1971, and his M.S. degree in Computer Science in 197u, both from University of Illinois, nrbara-Chairpaign. From 1971 to 1576, he was a research assistant in the Department cf Computer Science. He was associated with the Illiac III pro-ject from 1971 to 1973 and with the Machine and Software Organization project since 1973. He was the coauthor with Prof. Duncan H. Lawrie of a paper, "Effectiveness of Various Processor/Memory Interconnections," presented at the 197f International Tonference on Parallel Processing. He is a member of the Institute of Electrical and Electronic Engineers. I IIIOCRAPHICDATA 1. Report No eport No. UIUCDCS-R-76-830 lie and Subtitle Interprocessor Connections-- Capabmties, Exploitation and Effectiveness y[hor(s) JO Yen Wen Irforming Organization Name and Address diversity of Illinois at Urbana-Champaign Ijpartment of Computer Science itana, Illinois 61801 t, )ori soring Organization Name and Address litional Science Foundation lishington, D. C. 3. Recipient'a Acceaaion No. S- Report Date October, 1976 4. t> Performing Organization Rept. ''°- UIUCDCS-R-76-830 10. Proi«ct/Taak/Work Unit No. 11. Contract /Grant No. US NSF MCS73-07980 A03 13. Type of Report 8t Period Covered Doctoral Dissertation 14. i. jpplementary Notes I. bstracts licently, some research interests has centered around interprocessor connections for ::MD type parallel machines. However, we still lack a methodology for evaluating irious networks. In this paper, we first present some new results on network I'operties. Then we show how to exploit various networks in ordinary computations. Inally we describe how we can apply the theoretical results to predict the lirformance of some network in a real program environment, which is the true measure r network effectiveness. '• ey Words and Document Analysis. 17a. Descriptors i /,Tay Slicing ('inputation Adaptation ftwork Control Irinutation Capabilities !MD Type Machine Itmulation 'fc|dentifiers/Open-Ended Te 'e,:OSATl Field/Group ••jailability Statement fjlease Unlimited '^ NTIS-3B (10-70) 19. Security Class (This Report) 5I£J U:iC:LASSIFIED cunty Class (Thi; 20. Security Class ( 1 his Page UNCLASSIFIED 21- No. of Pages 176 22. Price USCOMM-OC 40329-P7I ■if 'Si -'M 3 «2 «3 I ms^ ■irtCTP Ik ..-ran e 3 3 5» 'S -a ->' m' m^ t ■M' ..my. TJ^^ Ml Mk JAN ^ 9 197b UNIVERSITY OF ILLINOIS-URBANA 510e4IL6Rno C002 no 830835(1976 Implementation ol the language CLEOPATRA 3 0112 088403073