; ■ . ■;.- • : ' 0'855 I! a n Ik m n iyi^f^ Report No. UIUCDCS-R-76-830 NSF-0CA-MCS73-07980 A03-000024 INTERPROCESSOR CONNECTIONS- CAPABILITIES, EXPLOITATION AND EFFECTIVENESS ^ by Kuo Yen Wen October 1976 !^^<(.>twork 2.2. 1 £ontrol_S true turGs_fgr_Ome£a_Not work 2.2.1. 1 Source/Dostindtion_Ta2_MGtho^ •"he omcTn network is attractive not orly because of its low qate complexity, bat also b<^'caus€ of its control simplicity. There are a numhor of ways to control the omeqi network. T\\^ most fandamental laethoci is the destination taq method[''-i] which uses K destinatioii tags, each of log N bits, ^ach source port has a tag which r'^-inresents the destination port numbers the data element intends to reach. log N stages will be required to set the network, and as each stage is set, the data elements will be switched accordingly. This control method can he used to pass all omoqa-passa ble ii^XiBMl^ t ions . A more general method is the source tagging met hoi [U]. Instead of using the destination tags, source tags are used. One source tag is associated with each output and represents the input to which that output port will be connected. The source tagging method c<\n pass all oneqa- passa.ble connections, including t h<=' one- to-many- connections that Cannot be realize-i by th(^ destination tai methoa. However, the iiiain drawback is that the network control has to be set stage by stage fLora the last stage to the first stage before the data elements can be passed through the network, flenca an extra C (log N) gate delays will be nee.ied Cor a conection. Details of these two control iJ methods are dosccibed in [ U ], A inodifiGd destination tag method that will allow certain broadcasting (one-to-many) connections will be presented here. Using this modified method, he will tio away with the extra ©(log.N) gate delays needed for the source taoging method. however, the broadcasting furctions will be limited to only one- to-power-of-two elemients. For example, the following connection will not be realizable by this method, cut can he realized using the source tag rn thod. Example Source Hestinition 1 2 2 3 For this modified tag method, instead of allowing only or T for each of the log^N tag bits, we allow '0', •1', •♦• or '-' for each of the log.N destination tag characters. So a source/iest inat ion pair may now look like: (0 1 1 ■) 1, * 1 1 *) This pair will represent a one-to-four broadcasting function of: (0 1 1 1, C 1 1 0) (0 1 1 1, C 1 1 1) (0 1 1 1, 1 1 1 0) (0 1 1 \ 1 1 1 1) The tag characteu •*' takes on the value of all possible binary diqits, while 'C and ' '' » still have the original meaning. For completeness, we have to use the tag character: '-' to specify no connection. So for a complete Source/destination set, we might have: Fyample (0 0, * 0) (0 1, 1 1) (1 0, 1) (1 n - -) source port destination port 2 1 3 2 1 It should bG clear by nOw that this modiiied tag method can only be used if the power cf two (say, 2 ) destination ports that a source port is connected to have the same (log. N-h) bits in their tag bit representations. This modified destination tag inethod also provides a great notational advantage over the source tag method when we have to describe a one-to-pcwe r-of- two broadcasting functions. As can be seen in later chapters, most of the common broadcasting functions are of this type and can be easily described oy this modified tag method. 
SIS "I 10 2.2.1.2 Co 1 iiiin_ Con trol_ Method To control an omega nctwoclc of size N kN for any omega passable E§£iDJiil3.ili2Q/ ^^ would require Nlog N/2 control bits. TSach switching element will either be set to the 'cross over* or *strdiqht through' state according to the value of the corresponding control bit. Suppose we are willing to sacrifice some of the capabilities of the network in order to further simplify the control structure. If we use only log N control bits, each controlling a complete stage of switching elements, then we will have the column control method. An omega network utilizing column control method turns out to be exactly the same as Batcher's scrambling/unscrambling network[ 12 !• As pointed out in [12], the scrambling/ uncramblinq network can be constructed with 1 eg N levels of selections and perfect shuffles, just as an onrega network. Let s.. he the 1th most significant bit of the bit •J representation at the ith source port and d— be the ith most significant hit of the bit representa tior. of the ith destination port. Mso lat p. be the jth irost significant J control bit. Then for any permutation to be realizable by Vj=1...1og2N, Vi=1 . . ..N. where ® is the exclusive-or operation. More details can be found in [ 12 1. the column control method, dj. = Sj. © p: Since there is a total of only 2**(log N) (=N) 11 different sots of p. '3, we can pass at most i: rjistinct permutations using this method. Using the aigumonts in [12], we can see that if the (i, j) th element of an NxN matrix is stored in the itb position of memory module i®j (0twork using this column control method (either by building five stages of shut f le-exchange s or recycling a one stage network five times) , and also that we set p to CI 111. , The permutation that we get at the output will be (5 7 1 3 4 6 2) for an 6x8 network. However, the fourth most significant bit of p will force a shuttle and then 'exclusive or' with the most significant i>it of the source 12 c 2 2 3 3 7 7 5 1 a 2 3 2 7 3 5 7 2 1 a 8 1 2 6 1 3 5 <5, 4 1 6 -> 1 3 H 2 1 3 6 7 1 5 6 U 5 6 3 1 7 6 5 1 U 6 6 3 5 7 U 5 C U 2 7 7 7 f) 5 4 a 2 Fiqure 2.1.1 Intecmedidte Patterns using p=C1111 s y S S E S S c a u a 5 '■) 5 1 '4 A 5 u 1 7 2 1 5 6 7 U 1 3 5 1 2 7 6 3 U 2 6 5 1 7 a 5 6 2 7 1 3 6 f. 3 7 1 2 3 6 7 7 3 3 3 2 2 2 Figure 2.1.2 IntcrmGdiate patterns asing p=0 11®1 1 r=1 01 plus two shuffles at the end T'igure 2. 1 13 taqs ^qain. A similar etfect will b« produced hy the least significant bit of p on the second most significant bit of the source tags. The net effect will then be equivalent to setting p=0 1 1©1 1 C (= 10 1 ) and adding two extra shuffles at the end. Note that the output of Figure 2.1.2 is also (5 7 1 3 U '^ 2) . So by setting p=01111, we only result in the 2-shuffled version of setting p=101. However, it should be notsd that the Nlog N upper limit on number of passable permutations onlv applies to the column control method. Foe any individual switch control irethod (such as the source/ destination tag method and the ROM method), the upper linit depends on the nnmher of stages. The perinutation capabilities of the log N stag^^ column controlled omega network are well delirod by [12], and we will not rePeat his results here. However, we can increase the capabilities of a column controlled network to allow certain broadcasting functions by usi rg two control bits, bo and b, t in each column. In Chapter 3, it will be shown what broadcasting functions can be realized by this method. 
The switching functions for various values of b^ and b, are shown in Figure 2.2. u^ Ihl 14 ^0 t>, Action 111 ustration upper hcoadcast straight pass cross Over lower broadcast C \ ► 1 - — ^ — ► 1 ^ ► X — ► ► 1 1 ^ y ► Figure 2,2 15 2.2.1.3 PC iM _Con t ro l.He t hod To implement the source/destination tag method, we either use a fast method [13] which would require SNlog^ N (<^* (loGfj^"^^ ^■^^ qates, or a slower method [T^] wnich needs (Ud* 1 1 ) Nlog N gates but requires the use of strobes at each stage to pass the tag bits along. The column control method also needs the use of stroties if a one-stage shuffle-exchange network is used* So the propagation delay through the network is On the order of log N clocks. In this section, we will propose another control method which can eliminate the use of the strobes (clockings) without paying too much penalty in gate counts. This POM control method provides a faster m--»thod to evaluate the control function at each of th^ Nloq N/2 switches simultaneously and these functions are imposed on all stages of the network at the same time. 3o instead of taking log N clocks through th-^ network, we would only require a couple of clocks for the source data to be routed through. Ihis oreatly reduces the network delay for the processing system. This method does not pass as many permutations as the source/ destination tag method, but it will pass many of the more common ones. It car pass all shift, flip, (c-i) , and odd-ordered vector unscrambling permutations in any power of two partitions. Ho again assume, as in Section 2.2.1.^;, that a control bit value of 1 will set d switching element to the 16 •cLOss over* state and the value of will set it to the •straight throuqh' state. The basic idea is to fetch N/2 control bits from each of log^ N ROM's, accocding to which permutation function is called for. The array of log N x N/2 control bits is called the control iritrix ax\d can be imposed on the omega network to facilitate the corresponding permutation function. For ex-imple, the control matrix for a 1-shift permutation in a UxU oirega network is: Imposing it on an omega network, we g€t: o O -o 1 O 2 o 3 fee will iirmediatcly get the following permutation : 17 -■ 1 -• 2 - 3 -■ > 1 •> 2 > 3 •> , which is a 1-shift permutation. In order to miniinizc the amount of ROf! space required for different families of periratations, as many common characteristics are recognized frorr the control patterns as possible. Then, by using sone extra logical operations, i X I 20 shift distance control pattern c 1 1 1 1 1 1 1 1 shift c lontrol distance pattern 4 1 1 1 1 5 1 1 1 1 1 1 1 1 1 6 1 1 1 1 1 1 7 1 1 1 1 1 1 1 igure 2.4 Control Patterns for Shift Peciriutations N = 8 21 space of ( U2+U+. . . N/2) bits =N(N-1)/2 bits, with the help of some additional loqic elements. R similar phenomenon to that in the shift patterns can be seen for odd-ordered vector uascramblir.g permutations. The control patterns for various odd-ordered vector unscramblinq for N=16 are shown in Figure 2.5, Let P P/s* • • p be the bit representation of the order of Unscramblinq. Then P.P^....p« . will be used as an address to fetch the basic pattern for the first column, P2.,,.p^_, as the address to fetch the pattern for the second column, and so on. The output, however, does net need to be exclusive-ored with p. (as in the case of shift patterns) to produce the correct patterns. A possible organization of the control system is shown in Fiqure 2.6. 
Using microprogramiring, we can set k,k2....k^ to 3,S2....Sj^ for shifts, to c^c^'-'-c^ for (c-i) permutations, and to Op^ pg unscrairbling. Pn-. for odd-ordered vectoc The basic control patterns have to be generated and input into the POI's. The basic shift patterns can be generated quite easily. There will be N/2'* entries in the jth RCM from the lof t ( 1< j :> ■s iN 22 control order pattern 1 c (5 3 1 1 1 1 1 1 1 1 1 5 1 1 1 1 1 1 1 1 7 1 1 1 1 1 1 1 1 1 1 1 1 1 control order pattern 9 1 1 1 1 11 1 1 1 1 1 1 1 ■ 1 1 13 1 3 1 t -4 .f^ ^^' *D' -0 1 1 Q 1 1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Figure 2.5 Control Patterns for p-unscrambling 23 J3 C o CM +J ^ rO Z Q. Sk Z3 O o 4-> c: o o a: C7> O C < CsJ s- 3 cn .(£ > » — r^ ^ E +j o 'f— O) +-> > n3 -o Q. CD J- OJ C +J -o -M M- $_ • •^ o • JZ 1 • (O Q. CM 24 p-ordered vactor unscrambling patterns, it requires a bit irore work. For the first stage pattern, the ith control bit (0 ax+cy = bx+dv mm rn Rl) x+a = y+rt --> x=y m m P2) x=y --> ax = ay m m F;3) If a is prime to ir (i,e. gcd (a,rii) =1 ) y then ax = ay --> x=y m m 26 FU) ax = ay < — > x=y am m ES) If 0 x=y am TO m P6) x=Y and xf^y --> x ^ y El through B6 can be found in [U]. We will also present Leir.ma 2.1 which is extended from Lemma 1 of [ft! and some of the above number theory results. Lemma., 2. 1 Let 0 «i =7, and X2=yj,. an Proof : a) to prove ax, ^-Xg e ^y, +y — >x,=y, and ^z'^iz an Lemma 1 of [ U ]. proved in b) to provt^ ^1 ^Yi '^'^d X2 = y2 — >dx, +X2 = ay, ^l^' assume x,=y, an and X2=Y2 • By TRU, ax = ay,- Since 0<:Lr.,'i^<^, then by P5, we an have X- = y„ . So by RC, we qet ax,+x. = ay +y,, QED. 2 ~ '2 an I -^2 an 1 -'2 To simplify the proof of the partition theorems, we need to restate Theorem 2 in [U], which dictates whether a given connection is omega-passable or not. Lemma_ 2.2 (Equivalent Statement of Theorem 2 in [ U ]) Given a set of desired input-output connection P = {(5. ,D.) I 0^ , and so on, A pictorial illustration of Pm is shown in Figure 2.7. >i*r I « re > I IS > ■ M ik' With all these preliminary definitions and lemmas, we can present the first cf the oireqa partition theorems. 28 Partition //O 1 2 o o 3 4 Partition n 5 6 o o 7 o 8 o Partition n 9 10 o 11 o 12 o Partition in 13 14 o o 15 o Flip 3-order unscrambling 1-shift 1-to-many broadcast Figure 2.7 A Partitioned permutation 29 Theore iT 2. 1 Let nLt^L ^"^ ^t^^fA- ' Olii^' ^^^ l^*" N=Lx«, then ^N^^N^^L^fPo'P \-,>- We first present a simple sketch proof in Proof 1. then we present a more rigorous proof in Proof 2» Proof 1 Assume n.^P|^. By lemina 2.2, there exist S, -SpM+tpq , D.=dpM+epq, S2~Su^**-MV» ^l~^^^*^o\/ ^^^ ^ such that S, ^ S2 and X D, S D2. Let m=logM, n=loqN, b=logL, and x=loqX. It X>Mr pictorially we have: ^2 4— b- Sp -».«— — m — ► tpcj Su t^jv — . , „ , ■* [J^A 1± 'UV I D2 m c;; i^» b*-ir-x Here the trailing x bits of S, and S2 are equal, but the leadinq (b+ir-x) bits are not equal, and the leadinq (b^m-x) bits of D| and D2 are equal. Since x— fc+ni-x Since the leading (b+m-x) bits of D, and D2 are equal and ni>x, we have dp=du and ^pa— ^uv* Since n^fPL ^^^ dp=du, p has to be equal to u. So Sp=Sy . This implies that tpa=tuv since S,^ S^ . S,s S^ and Xep(^=dyM>eu^ Ey Lemma 2.1 and (a), Sp ^ Sy or tp^ 5^ t^^^ (2.1) If XiH, let A=X/M=2*. By Lemira 2.1 and (b) , Sp« s^ and tp<^= t^^^ (2.2) From (2.1) and (2.2) we get Spl^Sy (2.3) and Sp= Su (2.4) By detinition, (c) implies X dpM+ey^ s X N dqM + ey^ d^M since epq»eyy to denote ele:nent (k,K,y) of an axbxc array. 
Here 0 > (*, x,y) symbolizes the mapping ot element (k,x,y) to eleirents (0 ,x,y) , ( 1 , x , y) , . .. . (a- 1 , x,y) . Now we can show six extensions of the broadcast theorems. '^or constant k and for all values of x,y,*, (i) Q^^^ t ((k,x,y) --->(*, x , y) } (ii) n^bc t ( ---> ( x , * , y) < h, a, c>} (iii)n3^J^ t r(k,x,y) >(x, y , *) < b, c, a >} (^v) Habc t f ('^.,x,y) Xy , x, *) } iff a>c (V) r^abc ^ f(k,x,y) —->{*, y,x) } (vi) Habc ^ f (k,x,v) ---> (y , *,x) } Proof: Let a' = loq(a), b' = log(h), c'^log{c), and r'=log{r) C^) £} 1'f(k) >(*)}# d 1-tc-many broadcasting function, and X^bc t{(^#Y) > (x , y) ) , an identity. Therefore, (i) is proved by applying Theorem 2.1. 39 (ii) First .*^ will prove: 12 gb t f C^/X) > (x, *) } . Assume it is falsp, it implies that there exist Xf,X;,r,D,q _ r such that x.^K.f kb*-x. = kh+x. and x.a + p E X;a + q. 10 I TT J ' ab J If c>b, we have x. ^ x, and x- = x: , a contca diction. If r (x , *) } . Using Theorem 2.1 agairi, noting thatQ^tf(Y) >(y)}r (ii) is proved . (iii) Proof ot (iii) is similar to that of fl^^f f (k,x)— >(x,*)} in (ii). (iv) For this case, we can represent the tags (s. ,d.) (s ; ,d •. ) as follows: a' b • s . , k , X . c' Y; J L c Y X . I a* P and *f^ I iii (1 'J L J I '-i I ~t — r' a« <-b'+c' -r ' Fach tag is divided into 3 parts of lengths a', b' and c' respectively. If a>c, assume Habc^ f C^/XfY) > (y , x, *) } . This irrplies that there exist s. ,d.,s.,d', and r such I I J J that S; 5E s; , and the least significart r' bits of s. ' abc J ' 40 (v) rhir, case can be represented pictoriallj as: a' b' C b» X . d : d»+b«+c'-r Tt b>c,we can pick (s. rd. ) F, (Sj,dj) such that p=q, yi=Yj' and the ir^ost significant (b'-c') bits of Xj S Xj equal, with the least significant c* bits unique. Then i s : . By letting c=c, we can see that s, = Sj. Also abc we have the most significant [ a' +c' + ( b' - c') ] bits of dj F, d; being equal, which is equal to a' + t'*c'-r'. Hence we lust show that ^2 abc ^ fC^ . x, y ) > (*r y#x) } . v. =y. and If c>b. We pick (s.,d.) & (s-^d:) such that p=q, X. ^y- . Then S; ^ S; . We also pick r = c, then s- E s- . The number of leading bits of dj Fj qj that are the same = a'+r' >a'+b' =a • +b ' + (c ' -c • ) . Hence we can see that QQ^^^f{a,^,y)<^.^,c> — > (*,y,x)}. (vi) We can represent this case as: b' X • I I ^' I L ^1 X . 1j a'+b'+c»-r If a • + L' +c' >?c' +a' (i.e. b'>c'), we pick y. = y • r p=q and the 41 and Sj are equal, while the most siqnificant (a • fh •♦c'-r' ) bits of d; and dj are equal. If r>c, then from s. and s . , we can see that Yj ~Vj • Also, the least significant (r'-c') c£ x. and X: are equal. From d. and d ; , we can see that the most siqnificant (a ' +1 • +c ' -r' -c' ) bits of x. and x: ace equal. These add up to a'+b'-c' I'its of x. and x,, and is qreater than or equal to b' if a>c. So x. =X;. This contradicts s. ^ s;. a be If rc, Og^^^t f (k , x, y) (y ,x, *) } . f a > (y ,x , *) } . If ab>c, we pick (s.,d.) and (s,,d;) such that y, =y; and p -q . Then x.=Xj except for the most siqrif leant (c'-i') _ r. bits. Then for r'=a'+b', wo have s. = s; and d; = d-, , I y. J ' abc J and since s. ^ s- , we show that Qa^c "^ f C^* x, 7) abc > (y,x, *) } . If ab - — > (y , x , *) } . en I IS •0 i 42 last (b*-c') bits of x 5 x equal. Then there will be conflict if r'=b«. This implies Q abc ^H^ ,x ,y) - — > (Y,*,x) } . If a' + b* +c' <2c'+a» (i.p. b' - — > (y, ♦rX )} . 
These broadcasting theorenrs are alsc essential in ostablishinq th^ rocurr^nce alcjorithtrs in Chapter 3 ani are some of the more important properties of the cmoqa network. 43 2 . 2 . t| Gen eral Admissibility It is veil known that the omega network can only pass a sirall fraction of the total number of M permutations (N** (N/2) /N!) • To improve the permutation ability and to understand better the relationship between permutations and the owega and inverse omega networks, we would like to extend some of the results of Pease[16]. Without relabelling, the indirect binary n-cube array described in [16] is actually an inverse omega network[4] after rearrangement of the switches. He can extend his results to the omega network in a very simple manner. Let n-i n-2 + the index p be represented as (P, 2 +P22 •»-....?„,, 2*p^) instead of (P|+2p2 + ...2 P^) » as in Formula 1 of [16]. Then we can use exactly the same theorems as in [16], except now we should note that the index bits are reversed. iZ Let X and y be expanded in binary notation as (X, ,X2 ,. .. ,x^) and (y, #y2 # • • • ^y^) with x, and y, being the most significant bits, x„ and y^ the least significant. The function describing a permutation P can be written as a set of functions Y; = P; (X,,X #X^) . (2.1U) + This notation is consistent with other sections of this thesis. 44 The principal theorem for the omega network will then be: Theoreir 2. U F is adirissible by an omega network if and only if the functions (2.1U) defining P can be written in the form Yf = xj®f; (y, ,...,yi., ,Xf^ ^r^) (2.15) for 1 = Xj^f ,' (y^, ...,y,- + , rX,_, ,...,x,) for 1®Pn-2^n®Pn-i'^n-,®S„_2 (x^_, ,X^) n-i where Sj is the carry from the lower ordered bits to the ith bit. It is immediately obvious that this is an upper triangular permutation defined in Definition 2.3. Hence by Theorem 2,1, we can see that all compositions of shifts and odd-ordered unscrambling permutations are omega passable. Defining x=(x, x^ ...x„) , and y= (y , J^ •••Yn^ " ^ permutation is linear if there exists an n x n nonsingular binary matrix, P, such that y = P • X (2.19) f ^: C5 It 1^ \i :s 48 Extendinq Pease* resalt to our notations, a linear permutation is oinega passable if P can be decomposed into the matrix product LU, vhere L is a lover unit triangular matrix and is an upper unit triangular matrix. (Dnit means that all coefficients on the main diagonal are ones) . By analogy, a linear permutation is inverse omega passable if P can be decomposed into OL. One important result from [ 16 ] is that any nonsinqular P can be decomposed into L,0L2« This implies that P can be decomposed as L,0 and Lq* So P can be passed by two omega passes. It can also be decomposed as L| and UI2. Hence it can be passed by two inverse omega passes. This is a very significant result because it increases the permutation ability of the omega network. The perfect shuffle permutation is a good example. Although the omega network is made up of stages of perfect shuffles, a perfect shuffle permutation cannot be passed by a fixed log N stages omega network. However, it is a linear permutation. So it can be passed by two omega passes. A diagram showing the permutation abilities of the omega and inverse omega networks is shown in Figure 2.8. 
49 n Passable U Passable Others CD i-i rt c 13 (D i-j rt •-< LU deco- mposable without pivoting matrix Langular 5 ->• C M 03 UL deco mposable without Divot ine assable) 2-fi-passable (or 2- -^^"■'•-p linear permutations •ii; f;i r- IS Figure 2.8 Permutation Abilities of Omega Network and Inverse Omega Network 50 2. 3 Batcher_Net wock _and „, the 5h uf f le_Cgnnection Cnc network that bears a great reseirblence of the omeqa network is the Batcher's merging network. The only structural difference is that instead of using tag bit comparison at each of the switching elemerts, the Batcher merger compares the magnitude of the two whole tag words at each switch t-o dotetrr.ine which of the two output ports to select . Pefore we continue the discussion, w€ first define the order set of a set of N elements as the relative ordering of the elements, "^or example the order set of (8,12,17,3,9) is {1,3,U,n,2). It can te observed that the Batcher's merging ilqorithm for a set of distinct elements is equivalent to the omega tag routing algorithm for the corresponding order set. It shoul-i, then, be obvious that the cmoga partition theorems also applies to the Batcher merger. Now we can deduce an alternate proof of Stone's implementation [R] of the bitonic sorter on the perfect shuffle network. The basic idea of an NxN Batcher bitonic sorter (i'l being a power of 2) is quite simple. Given two sorted sequences of l^inqth L each, if the first sequence is in 51 ascending order whilo the second sequence is in descendin>j order, then the ?Lx2L Eatcher merger will sort the bitonic sequence tor'^ed by the juxtaposition ol the two sequences. A bitonic sorter consists of loq N stages of meiging nettfor"'CS, where stage i ( '' £ i S lo92'^) consists of N/2' bitonic mergers of size 2' x 2'. (Some switching elenents will need to have their outputs reversed in order to produce the required descending order.) By the extexision of the omega partition theorem we can use an NxN Bitonic merger to 'simulate' N/2 bitonic mergers of size 2' x 2', by setting the switches in the first (log^ N-i) columns straight through. Hence, the sorting algorithm in [ 8 ] iirpienren ts the Batcher bitonic sorter on a perfect shuffle network. C) C3 52 2 . U Benes^Network andthe^ Shu f f le, Connection It can be proved that the binary Beres network of size NxN (whefe N is a powec of 2) is equivalent to a cascade of an inverse omeqa network and an omega network , with the middle two columns of switching elements collapsed into one. The best known control time for the Benes network [17] requires on the order of Nxlog^N operations. However, the control time for an omega network is 0(log2^N), so for all omega passable or inverse omega passable permutations, the control time is only on the order of log2N. 53 Sinc3 many connection networks ire sh uf f le-fcasod, if we build a one stage perfect shuffle network to interconnect all the proc'-»ssors, we can simulate any of these networks by cycling sufficient nuirbor of times through the network. lloroover, thn complexity of the network will only be 0(M), which will be a great deal totter than that of other networks. One o^'vious shortcoming ot the omega retwork is its inability to pass some permutations. In this section, we are searching for the best strategies to pass any permutation through a one stage rocyclic perfect shuffle network. By recycling a one stage perfect shuffle network a sufficient number of times, we hope to oe able to pass any general permutation. 
Lang [13] proposed to use queues at the outputs of the switching elements, and then cycle the network for as many times as it i.s needed for each ot the log ^ N steps. Following this strategy, the number of shuffle cycles required in the worst case is found *-o be =2j2N -3 for log N being odd, =3Jn -3 for loqjN being even. The length of the queues can grow to: for log N being odd. a * Ml ill 111 I for log N being even. 54 Lang's algorithm is good in general. However, the building of two 0(jN)-long queues into each switching element certainly complicates the design of the switching elements. Hence in another possible strategy, we prooose routing algorithms that do not reguire to use queues at the switches. The routing strategy is similar to that used in the Destination Tag Method. Every input port will generate a logjN bit tag representation of its destination and push it (together with the data) through the network. The switching functions of the switchirg elements are still the sa.ne. Flowever, in case of conflict at any switch/ one input will be honored while the other has to be switched to an undesired output port and restart from the most significant tag bit. At any given stage, the bit positions to be examined for each tag may le different. So we need a bit count associated with each tag to indicate which bit position will have to be examined. Aft-^r the last tag bit of each input has been examined, it will be stored away in a register and taken off the network. The conflict rf'solution can be: a) gatp straight. b) honor the input furthest away from its destination. c) honor tht^ input nearest to the destination. Sirulatiori usir.g random permutations shows that, on the average, all three resolutions are iu£t about equally 55 effective. Instead of restarting from the most significant tag bit for the data at the wrong output port, we can use a built-in table to determine which bit to examine. In the discussori below, we let n equal to loy^N, For the destination tag method, assume tnere is no conflict at any stage. We can observe that at stage k (k=2 to n) # the data word whose destination tag is d,d2....dn will be at switching eleirent i (with binary representation i|i2»»««in-|)* where d,d2...d|^ =in-i<»»»»in-i • Hence by comparing the destination tags with the switch tox numbers, wt; can find out how far a data word with certain destination tag is frcir its destination. A more useful information is which tag hit d,^(1 o 2: +-) J- OJ 4-> r— C »4- O ««- lO CO $- > O CO c cu +-> o O) <4- S- O) Q. O) CD fO +-> (>0 ^: o o -o O) s- ■M O in O cu CJ5 in O 10 e4 O CM ift in sa[0/C3 >|jom:^9n ^0 uaqiunfj 58 ii NETgOPK_LfTILI2ATIQN IN , F& RALLEL PROCESS ING , SYSTEMS 3» "^ Introduction To build a rreaningful processing system, we have to be able to handle efficiently most of the application demands of the users. In this section, we will investigate the alignment requirements of some common operations or algorithms and with what efficiency they can be handled by the alignment networks. Array operatiors are probably the most common type of operations found in ordinary Fortran programs and they have the most potential for high speedup and efficiency. So the most important criterion of a good parallel processing system is the efficient handling of array operations. Budnik and Kuck [19] and Lawrie [U] discussed ways of organizing the memories to allow conflict-free access to various slices of arrays. Linear skewing is a standard technigue. 
However, the data output will sometimes form a p-ordered vector, which cannot be unscrambled by means of a simple shifter, [4] discussed the alignment reguirements for some of the most common types of array accessings. In ordinary programs, operations that are not scalar nor array operations very likely belong to the class of "^ What we mean by array operations are the obvious type of vector operations found in programs, not the type we obtain by carefully rearranging the operation seguences of a particular algorithm. 59 recurrence operations. Recurrence operations, if not treated properlvr will degrade a parallel processing system to a serial machine. Kogge and Stone [20], Heller [21], and Chen and Kuck [22] have shown various algorithms to speed up recurrence operations. Section 3.2 will discuss the adaptation of various recurrence solving algorithms onto parallel processing systems. It will be shewn that, with careful planning, the alignnrent reguirement £ can be greatly simplified. flenre, we would not n.Be'i to use a full crossbar. Insteadf a simpler alignment network, such as an omega network, will suffice. The adaptation of recurrence operatiors onto parallel processina actually serves as a good example of how a wall known comoutation algorithir can he tailored according to the limited number of available perirutations of the alignent network to minimi7e alignment time. In the extreme cases, the alignirent network may have only limited number of connections (lik^ the Illiac IV shifter cr a one-stagci perfect shuffle network). To obtain any general permutation, the network has to be recycled many times. For example, the one-stage perfect shuffle network described in Section 2.5 may reguire (Jn) aiignnont steps before we can start on a processing step. By carefully rearranging some of the operation seguences in normal algorithms and by assigning intermediate storage patterns in a deliberate fashion, we can hopofully reduce the number of alignments per processing step n Hi CO ,1:4 Cm ^3 £3 n ii 60 down to a con-Gtant (not dependent on H) . Pease [10] and Stone [8] showed how the Fast Fourier Transform can be done efficiently on a multiprocessing system interconnected with the perfect ?5huffle connection. In Section 3.3 we are going to show how matrix multiplication can be done in a more efficient way in a multiprocessing system with a certain class of conn«^ction networks. The number of alignment steps is shown to be reduced ty a factor of Jn or IcgoN. The algorithms described in this section are i^i such (ietails that they can be easily micropro crammed into the respective parallel processing systems. The intermediate storage and alianment patterns are all clearly specified. Masks are needed occasionally to prohibit some processors from doing the prescribed operations at some steps. Throughout this section, we are considering parallel orocessinq systems structured like that in Figure 3.1. The central control unit is not shown in the figure, but is actually the master unit ot the array of processors. It sends the ir icroinstruc tions to all the processors together with the masks. Each processor will address only its own memory. If the data words obtained need to be sent to (different processors, they will he gated to the Alignment Send Register (ASR) . After the roc,uired alignment is done, thf>y will he returned to the Alignment Peceive Register (AR R) . fcith this archit.^cture, wo can align internal registers as well as memory r<-^gisters. 
61 Si R A S 1 1 1 ^ * — -► D R 1 1 ____ A R R l^ 1 1 1 1 1 ALIGNMENT NETWORK j) Memory Data Register \SR Alignment Send Register \RR Alignment Receive Register !>^- Memory Module i 1'^. Processor i Figure 3.1 A Parallel Processor System Configuration 62 3.? Adcipt.ation, of _Recai:i:ence_ Solvers Chen and Kuck [22] provided many good algorithms to handle recurrence systems. To actually implement these algorithms on a parallel processing system, ore would require some cateful partitioning of the recurrence system, and a qood, uniform way to allocate the initial and intermediate data so as to minimize the data routing time and the amount of intermediate storage space. The solution of a ? recurrence system is actually equivalent to the solution of a bandod unit lower triangular matriv system with matrix size n x n and the number of nonz^^ro bands =m*-1- In general, he have to solve (3.1) for X to gt t the recurrence results. A X (3. 1) where A is a lower triangular matrix with 1's on the miin diagonal ari m. more nonzero subdiagonals. [23] and [ 2U ] reorganized some of the recurrence algorithms into partitioned matrix notations to simplify understanding. According to the number of processors available and thi-^ values of m and n, we have to use different recurrence solving algorithms for higher efficiency. In general, there are three major algorithms to handle recurrence systems. Ihe first algorithm uses a limitei number of processors, and evolves from [23] and Algorithm 5 63 of [25]. The second algorithm assumes the presence of a large numjuer of processors, but will do the folding when the number of available orccessors is less than the upper bound. It evolves from [2^4] and Algorithm 2 of [22]. The third algorithm is similar tc the second algorithm €xcept it uses a less parallel method in solving the small full recurrance systems in the initialization stage. The number of processors used will be between that of the first and the second algorithms. Given D(the number of processors), and values of m and n, the execution time for the first algorithm = ^mn (m+2) /p + log^(p/m) => (m +^m/2+ 1 ) -ni'^-9m/2-2 , for pn, for p=mn, for pmn. 2 for p>m n. ■eMt 1 .1,1 C3 n = (loq n (2 + log m) -loggm (log2rr+ 1 ) /2) m n/p , for pq) , RA=h+1. (vi) broadcast into riqht R2. (vii) multiply R1 ancl E2 into R3, (viii) aad r3 to ACC. c) noqate ACT. d) store left ACC into right half fi, RA=h, c) store riqht ACC into right half h, SA=h+1. D. Done. Staqe U : 1. Form k m-Partitions. 2. Repeat for h=0 to n/p-2; a. Perform an rr x n iratrix transpose frorr c;|FA: (hm) to m(ti-H)-l) to H. b. Fetch from f, RA=h, into ACC . c. Repeat for l=0 to m-1; ii Ii 70 (i) fetch from H, RA=i# into R1. (ii) fetch V element from meirory i. (iii) ' broadcast to the right neighbor m-partition into P?. • (iv) irultiply R^ and R2 into R3. (v) subtract R3 from ACC. d. Store ACC into y, PA=h. 3. Done. An al ysis : Throughout the algorithm, partitions of size m or 2r (=2 m) are used. By Corollary 2.1, we can see that the omega network (or some of the full permutatior. networks) is capable of performing the necessary alignments because of its partition ability. As tor the different kinds of alignment patterns that are required within the partitions, we have right and left shifts, flips and 1-to-many broadcasting. All of these patterns can be passed by the omega network. One of the noteworthy patterns can he found in step C,5.e of Stage 3. 
The broadcasting function has the fcrm {(k,x) > (X , *) <2, r>} , That this function can be passed by the omega network is proved in part (ii) of Section 2.2.1.2 of this thesis. Tn step 2'C. (iii) of Stage U, the connection function can be passed by the oirega network, by virtue of Theorem 2.1, after setting F^ to a 1-shift permutation and 71 Pj^'s to 1-to-many broadcasting The m X pi matrix transpose in step A cf Stage 3 and step 2. a of Stage ^i can be implemented as a 'subroutine' that takes (m-1) steps. Assume element (i/j) of matrix K is stored in memory -j with relative address i {0. The summation of partial products are done by shifts (of 2 .r.n) and add, 0»,o)RC»*,i) R(*,nH) f (*) M?^ M^ Mr.. ^iqure 3.3 Initial Array Storage G consists of the h right hand columns and H ^ is the ith column of the left hand matrix, L, (1 and are indexed from A{0) to A (0-1). For calculation of G , we will first broadcast Hi to the left half Rl»s. Then we broadcast I elements to the (j+n will be left half R2»s. Then the partial results of G in R3 (0,0,*,x,y) <2,n/4r,r,2n,n>. The summation of partial results are done by shifts (of 2 *n ) and add, 0 n ^ ^ ' '^ ^ .2r '2i+l t=n+l-(i+l)2r ,(j+l) i=1.2,...,(n/2r-l) Figure 3.4 mP"*"^ ^ Calculation lis t2 78 The algorithm is shown belov: Algopjlthin: A. Repeat for j = to (log2n-2) ; 1. Let r=2**j, r •=max (r/«t,1) , r'»»inax (r/2, 1 ) , r=2r". 2. Declare S(PA=G) as Q<2,n/U, 2n,n>. Declare S (BA=M) as il1<2,n/2r • ,n/r,r • ,r, n>. Declare S(FA=P!) as W2<2,n/2r" ,n/2r, r",2r,n>. Then for i=0 to (n/2r-1) , (J) G <2n,n> is in Q(0,0,*,*) y^'^' <2n,r> is in Q (0,0,*,#r> (r-1) ) Mg"^.^ is in W1 ( 1,0, 2i,0,*,*) , M^^^ IS immaterial. (J) "21 + 1 is in W 1 (1 , 0,2i>1,0, *,*) , T^^^ is in Wl (1,0,2i,0 ,*,#r* (2i M^j"^'^ <2r,n> is in W2 ( 1,0,i,0,*,*) . 3. Declare P as P1<2,n/Ur,r ,2n,n>. Declare P also as P2<2,n/r,n/2r,r,f ,n>. Then G calculation uses P1 (0,0, *,*,*) , while n^'!*^^ calculation uses P2 (1 ,0,i, ♦,*,*) . «». Fetch »\^^ {*,*) from W1 ( 1,0, 1 ,0,*,*) . 5. Broadcast D (1 ,0, 1 ,0, x,y)* to R1 (0,0, x,»,y) of PI, Vx,v. '^ D is memory data register. Declaration always follows whatever is to be fetched or stored. 79 6. Fetch Wgi + i (*f*) from W1 (1,0, 2i+1, 0, ♦,*) . 7. Broadcast D (1,0, 2i*1 ,0,t ,y) to R1 (1 , 0,i ,x, ♦, y) of P2, Vx,y, 8. Fetch Y^^' (*,*) from Q (0 ,0 ,♦, #r* (r-1) ) . 9. Broadcast D(0,0,x,y) to B2 (0, 0,y,x,*) of P1,Vx, and vyef*r*(r-1)} . 10. Fetch T^^V (*»♦) from Wl (1,0,2i,0#*, #r«-( 2i+ 1) r-1) . 11. Broadcast D (1,0, 2i,0, x,y) to R2 (1,0,i,y ,x, *) of P2, Vx, and V y eflr+(r-1) } . 12. Multiply R1 and R2 into R3. 13. Repeat for g=0 to (1-1); a) Set RU=0. 2 - a _ b) Declare P as P3<2,n / (2** (q*2) r) , 2, 2 ,r,n>. c) Left shift R3 ( 1,*, 1,0, *,*) of P3 by 2 rn into R4. d) Declare P as PU<2,n/2** (q + 3) ,2,2'^,2n,n>. e) Left shift R3 (0,* ,1,0 ,*,*) of PU by 2^.2n^ into Ra. f) Add P3 and RU into R3. la. Fetch M 2; (♦,♦) from Hi ( 1,0,2i,0, *,♦) . 15. Right shift D (l , 0,2i ,0,x,y) by (ir n/2) to R2 (1,0,i,x,y) ,Vx,y. 16. Fetch G^"'' (*,*) from Q(0,0,*,*). 17. Transfer D(0,0,x,y) to R2 (0,0,0, x, y) of P1,Vx,y. '^ This step will be skipped when j=0* ^n^ Transfers need no alignment. i0» 1 <• I ■■3 -1* c: ;c3 C3 80 18. Add F2 and B3 into R2. 19. Transfer R2 (0,0, 0,x, y) of PI to D(0,0,x,y) of Q. 20. Store D(0,0,*,*) to G^"*"^'^ (♦#♦)• 21. Fetch H^^li (*,*) from Wl (1,0,2i>1 ,0, ♦,*) . 22. Right shift D (1, 0,2i+1,0 ,x, y) of 81 to E (1,0,i,0,x*r,y) of H2 by (ir^n/2-r^n/U* m) V K,y. 23. 
Transfer R2 (1 ,0,i,0, x,y) of P2 to (1,0 ,i,0,x,y) of W2, Vx,y. 2a. store D(1,0,i,0,x,y) into M fj+i) (x,y) Vx,y. B. For i = loqgn-l; 1. Let r=n/2. 2. Declare P as P5. 3. Declare S(PA=G) as Q1. Declare S(RA=«) as H3<2, 8,n/8,n/2,n>. Then G ^ <2n»n/2> is in Q1(0,*,*), Y^^^ <2n,n/2> is in Q1 (0, *, # (n/2) ♦(n/2-1) ) , (J) (J) is in W3 (1 , 1 ,0,*, *) . U. Fetch «7 (*,*) from H3 (1, 1 » 0,*,*) . 5. Broadcast D(1,1,0,x,y) to R1 (x,*,y) of P5, Vx,y. 6. Fetch Y^"^^ (♦,*) from Q1 (0, *,# (n/2) ♦ (n/2-1) ) . 7. Broadcast D(0,x,y) to B2(y,x,*) of P5, Vx# and Vyef* (n/2) ♦(n/2-1)). 8. Multiply R1 and R2 into R3. 9. Repeat for q=0 to j-1; a) Set r4=0. b) Declare P as P6. 81 c) Left shift R3 (♦, 1 ,0, *,*) of P6 by 2 .2n.n into Ri», d) Add R3 and RU into R3. 10. Fetch G^"^^ (*,*) from Ql(0,*,*). 11. Transfer D(0,x,y) to E2(0,x,y) of P5, V x,y. 12. Add R2 and R3 into R2. 13. Transfer R2(0,x,y) of P5 to D(0,x,y) of Q1,Vr,y. lU. Store D(0,*,*) to G^*'*'^ (*»♦). C. Done. A nalysis ; Steps A. 5 and A. 7 use a broadcasting function that is omega passable. We first apply part (ii) of the broadcasting theorems which shows that { (K:, x,y) <2n,rrn>---> (x,* »y) } is omega passable. Then we can apply the omega partition theoreni to allow for the shift in partitions. The broadcasting function in Step A. 9 and A. 11 are of the form f (!Cr*^#y) > (y, x,*) } . They are also omega passable because of Part (iv) of the broadcasting theorem (notice that a>c since a=c=n) ) , and the omega Partition Theorem, Step A. 13 is the repetitive shifts and adds described earlier in this section. The broadcasting function in Step B.5 is of the form f(k,x,y )<2n,n/2,n> > (x,*,y) } and that in Step B.7 is of the form f C^^x^y) > (y,x,*) ) . Both ace passable by omega network. es -A is IS is 82 The operation times for this algorithir are: Fetch : lloq^nrk Store' : 21og2n-1 Align ! (loq^n) ♦lloggn+l Processor: log n (log2n*3)/2 83 3.2.3 Usin£_^an_^_P£ocesso£S This -Algorithm is derived from [ 2^^ ]. We will solve 2 ? with p=n^ ru However, if the numter of availaole processors is less than this, we will have to use foldinq. The thp'oretical processor bound found in [22] is m (m* 1) n/?-m3 . However, if m ^n6 n are powers of two, for d 2 to be li power of two also, we have to use p-m (2n) n/2 = m n. The matrix L arid the vector f can he viritten in the for.Ti, Lo Rg L2 R m ' my i = ft f2 fn s-i .1 >• ira C 5 -It 3 r: .>>> a (J) (j) At step -j + 1 (0< j, while . , (1,i,0,*,0) <2,n/r,m/2,c,m>. f^-^ (0 in two steps: and fj at (1,i,*,0) <2,n/2,2,2> respectively. We then require two fetches and aliqns to route them to (0) (0) at (0,1,':^,*,*) <2,n/2,1,2,2> anl f. at (1,i,0,*,0) <2,n/2, 1 ,2, 2> respectively. 3. If m>2, W2 will use the method described in Section ~ 3 The Gj calculation will be done in (0, i ,*) <2,n/m,m /2> 87 and 1 calculation be done in (1 # i» ♦) <2 ,n/ni ,nn /2> for 0, Resulting Gj and t; will be in where 3j and f j were. For stage 2, we want G ^?^ to he in (0,i,0,*,*) (0) <2, n/r., m/2,Trir nn> in row rrajor fashion and t j to be in {1,i,C,*,0) <2,n/rr.,ir./2,m,m>. Hence we want to route (0,i,C,x,y) to (C,i,0,y,x), and also (0,1,1,0,7.) to (1,i,G,z,0) Vx,y,z. Both routes are linear permutations and can le realized by the omega networVc in two passes, due to the results described in Section 2. 2. U, Stage 2 : A, Repeat for i - '^ to log (n/2ir.) ; 1. Let r=2**i.m. 2. Declare S(p.'\=G or f) as M<2 , n/2r ,m, r,m >. Then for C is in M (C , i, ,*, *) , G ^ faeirg all O's. r,m> is in M ( G ,i , m/2 , * ,*) , 2i+l f ^l? 
is in iv n,i,C,*,0) , (j) ^ ii+i '^'^^ i^ i" M(1,i,in/2,*,C) , rest of !•: C ,*,*,*, *) = 0. 3. Declare P as '^1<2,n/2r,n,r,m>. Its ili is 88 CJ) U. Fetch G 2j+, [*,*) from M (0,i ,iri/2, *, *) . 5. Broadcast C (0,i ,iii/2 , x, y) to B1 (*, i, y,x ,*) of P1, Vx,y. 6. Fetch G^^^ (»m+(r-m),*) and f j**;^ (#in4(r-iT)) from M (*,i,0,#in+ (r-m) ,*) . 7. nroddcast D(z,i,C,x,y) to E2 (z,i ,x, *, y) of Pi, VyrZ# and Vx€f#n+ (r-m) ) . fi. Multiuly PI and F2 into E3. 9. Repeat for q= to (loggm-l) ; a) Declare P as F2. b) Left shift R3 (*, 1 ,0, ♦, ♦) of P2 by 2 rm into RU. c) Add ^3 and R4 into P3. 10. Set P'4=0. (0) 11. Fetch ^ 2i + | <*^ ^^°'" M(1,i,rr,/2,*,0) . 12. Left shift D (1 , i ,ir/2 ,* , C) into FU by rmV2. 13. Subtract R3 froir Ru into F4. 14. I^ight shift RU ty rir into D. 15. 3toce D(*,i,1,*,*) into M (* , i, 1 , * , *) . B. Done. ^na l^s is : The broadcasting function in Step A. 5 of Stage 2 is of the form f (l' , x, y) > (y # x , *) ) and is omega passable. The broadcasting function in Step A, 7 of Stage 2 I 89 is of the form [ (k , x#y) > (x, *,y) } and is also omega passable. The operation times for Stage 1 cdn be easily obtainad by substitutirg n by m in the operation times listed in section 3.2.2. However, we have to add tour alignment passes for the linear permutation functions described in the last part of Stage 1 if an omega network is used. If a crossbar (or any other full permutation network) is used, we only need to add two passes. The operation times for stage 1 are then: Fetch : ^log^m-U Store : 2iog-m-1 Align : (log-m) +mog2m+1 Ptocessor: log m (log-m+3) /2 For ^tage 2, the operation times are: Fetch : 11ogj^{n/m) Store log^Cn/m) ml a* Align : log (n/m) +log2(n/m) log^m Processor: 2iog (n/m) + log (n/ir) log.m The total times for this algorithm arc then: Fetch : 31og n+Ulog^rr-U Store Align loq^n+loggm-l loggia* log^m + loa -n + 31og m* 1 Ptocessor: log. n.logm- (lcg,p ) /2 + 21og„n-log_m/2 90 ^.2.'* 0sin(i_a_J1O(]erate_NuOTber__o£_Pcocessors "his algorithm is deLived from [2a], For the matrix multiplication, however, this implementation does not use the loqsum method. Instead, it uses the uore efficient parallel-product serial-surr metLod. The preliminary discussion of this algorithm will be similar to that of Section 3.2.3. For each staqe (j+1), {0< j. 3) Pepeat for h=0,1; 1. Transfer L (i, 1 , x) /nA=h , to Ll(i,j,x), V i,i. and V X such that m-j ^ rm^/2 H -rm ^ |#— rm — J( A3) ^2i •2rm -^ ^2i+l ^2i V. ? (J+1) -/ ^ ,..*a ■ ?! (l,i, *.*,*) <2,n/2r,m.r,m> Figure 3.5 Storage Map at Step j of Stage 2 94 G. Multiply F1 and R2 into R3. f . Fetch f (i,i,0) , V i, j. g. Transfer D(i,j,0) to R2. h. Add R2 and r3 into R3. i. Tran3f9r n3 to D. j. Store D(i,j,0) into f(i,j,0), V i,j, C) Done. Staqe 2: A) Repeat for j=0 to log(n/rT)-1; 1. Set r = 2''m. 2. Declare all arrays ris . 3. Fetch f (i,k,0) , RA^I , i i,k. U. Transfer ^{i,k,0) to hCC. "i. Repeat for q = to m-l; a) fetch :{(i,k,q),FA, Vi,k. b) left shift D by q to R1. c) fetch f (i,r-iT+q,0) , :^A=0, V i. d) broadcast D (i,r-Tn*q , C) to R2(i,*,0). a) multiply PI and R2 into R3. f) subtract R3 from ACC, 6. Fetch f (i, 1,0) , RA=0, V i,j. 7. Transfer D(i,-i,C) to R1. B. Transfer P1(i,j,0) to D, V j and V i even, 9. Swap ACCd*-),!)) in 2rm-part i tions to D, V j and 95 10, 11, 12. 13. ia. 16. 17. IB. n. 20. 21. 22. 23. 24. V i eVen. StouG D to f, RA=0. Swap ni(i,j,0) in 2rir-pirt itionii to D, V j and V i o'ld. Transfer ACC(i,j,0) to D, V i odd, and V j. Store D to f, FA=1. If i=loT (n/rr) -1 ^ then qoto B. 
2 Set ACC=0, Repeat for q= to it- 1 ; a) fetch F(i,k,q), P.A = 1, ¥ i,!c. b) broadcast 0(i,k,q) to R1(i,k,*). c) fetch ? (i,r-ir + q,k) , PA^O, V i,k. d) broadcast >) (i, r-m-^^/K) to I^2(i,*,k). e) inultiply "1 and F2 into R3. f) subtract R3 from ACC. Fetch P (i , i , k) , r. A=0, V i,j,k. Transfer n(i, -],>;) to HI, . Transfer R1(i,1,k) to D, V i,k and V i even. Swap ACC(i,-j,K;) in 2riit- partitions to D, V j,k and V i even. Store b to H, RA=0. Swap Pl(i,j,k) in 2rir-p irtit ions to D, V j,k and y i odd . Transfer ACC(i,j,k) to , V i odd, and V i,k. Store D to R, ?A=1. ■^! -.1 22 C3 n) Done. 96 Jlnalxs is : All of the =iliqnment functions used in this algorithm can be easily shown to be omega passable, by the simple application of the omeya Partition Theorems. Steps 9, 11, 20, and 72 of Stage 2 show the swap operation described earlier in this section. The total times for this algorithm are: Fetch: log2(n/ir) (Um + 3) Store: Uloq (n/m) ♦Um-2 Align: log (n/m) (Um + U) +6rr. Processor: 4m, loq (n/m) ■♦•6m 97 3 . 3 Matrix, Hill tipXication_oii_a^Par§ll9l_PCQcessinq_?Y^t em A Fortran code section that performs matrix multiplication is as follows: 10 DO 10 T=1,N DO 10 J=%N DO 10 K=1,N 5(I,J) =A(I,J) +B(I,K)*C (K,J) An efficient way to perform the calculation would be to compile the product by rows (parallel on J) as shown below, DO 10 T=1,N DO 10 K = '',N 10 A(I,*) =A (I,*)+B(I,K)*C (K,*) 2 This algorithm will require (H ) shifts to align the operand matrices. A one-stage perfect shuffle network simulating an omega network will take log^ N steps per shift, and the Illiac IV type of switch will take (Jn) steps per 2 2 r~" shift on the average." So a total of 0(N log.N) or O(NJN) routing steps are required for matrix multiplication. However, using the algorithm which follows, we need only n(N^) steps. 101 ■Ail - jt3 'ill \\ We first need to define the following two notions: NQtation: If G is a permutation of some input set, G 98 implies i consecutive applications of the permutation G to the input set. 2sfi!liti2Q* * G-permutation is defined as a permutation G 2 3 M such that G, G , G ,...,G are distinct and form a group with G =1, the identity permutation. Fvery G permutation can be uniquely represented as a cycle (io,i, ,. . .Xm., ) where G (io) =i| ,G (i, ) =i2#. . #G (i m., ) ^i^, . Two obvious G-permutations are the ♦1 shift permutation and the -1 shift permutation. In general, ^k shift and -k shift permutations will be G-permutations if k is relatively prime to N. some nonshifting G-permutations can be found using a perfect shuffle based permutation. The G-permntations have a general form of: G(i) = [2i*b(i) ]mod N, where h (i) = b(i+N/2) V i=0.. .N/2-1, and b (i) = or 1 V i. A list of all {b (i) ,i=C.. ..N/2-1} that will give G-permutations for N=U and 8 and the corresponding G-permutations are listed in Table 3.1. 99 Size b(i) G-permutation u 1 1 (0 13 2) 8 110 1 (0137652a) 10 11 (0 1 2 5 3 7 6 U) Table 3. 1 Assume we want to multiply two matrices A and B to form C and that they are all of size NxN. The first method uses N processors and requires that the storage scheme for the matrices be 1-skew and 1-skip. The storage pattern is shown in Figure 3.7, Each processor will have a corresponding memory from which it can fetch data. Any data a processor wants but not in its own memory will have to be routed from the other processors. This algorithm also calculates the relative address (RA) for each array it references . Memoi •y c N 1 2 3 (0, 0) (0, 1) (0, r2) (0 r3) (1. -3) (1. 
^) (1l 1) (1, ,2) (2. 2) (2, 3) (2, 0) (2< r1) (3, 1) (3, ?) (3, -3) (3 rO) Figure 3."^ 1-skew 1-skip Storage Scheme Fach processor has a wired-in processor port number. .-t 5C) J0I .1 <• ,J3I 15 100 PPN (0 < PPN < N-1 ). T is a temporary array. Rlaorithm: ft) Repeat for TC = to N-1; 1. fetch A, Rft=ir, into R1. 2. set IR = (PPN-IC) mod N. 3. repeat for IT = to N-1; a. fetch B, RA = IR, into P2. b. multiply R1 and R2 into P3. c. K-permute IP. d. store T, PA= (PPN-IR,mod N) from R3. e. G-permute R1. U. set P1=0. 5. repeat for IT = to N-1; a. fetch T, RA=IR, into R2. b. add R1 and R2 into R2. c. G-permute R1. d. G-permute IR. 6. store C(RA=IC) from PI. B) Done. The significance of this result is that for certain one stage networks, if there exists a G-permutation, then each intermediate routing will take only 0(1) time instead of 101 OfloqgN) time or (Jn) time. This greatly reduces the alignment time for the system. There are many variations of this algorithm, each for a different memory skewing scheme. Two of them can be ased for a parallel processing system with twice the number of memory modules. They will be presented in Appendix A. 2 There is another algorithm which uses N processors. 2 2 However, it works only for a N xN network with tN shift and ♦1 shift connections. This algorithm takes a total of N+2 memory fetches and N+1 memory stores . The total number of alignment reguests is 3N and the total number of arithmetic operations is 2N. Hence the alignment time matches in order of magnitude with the memory and arithmetic operations. The initial storage scheme is simple. All matrices are stored in a linear manner, i.e., element (i,j) will be stored in memory (Ni+j) • R.lc[orithm: ■iCD 1:J i| •a :s A) Fetch B into R1, P) Repeat for | =0 to N-1; 1. store R1 into T, PA=j' 2. left shift PI by N, C) Fetch A into P1. D) Set ACC=0. SM: 102 E) Repeat for 1 =0 to N-1; 1. Fetch T, RA=(PPN-i-j)modN# into R2. 2. multiply R1 and R2 into R3. 3. add R3 to ACC. U. riqht shift Rl by N into R2. 5. transfer R2(i,0) to Rl (i,0) . 6. left shift F1 by 1. F) Store ACC into C. The above two algorithms show how computations can be tailored to fit a simple network so as to minimize the routing times. 103 a. PROCESSOR,. SYSTEM SIHDLMIO N.,TECHNIQOES a, 1 Introduction In order to evaluate the true effectiveness of a parallel architecture, we irust hypothesi2e a compiler capable of compiling ordinary programs into code which most effectively utilizes the architecture, especially the data alignment capabilities. The resulting code could then be simulated and the important performance measures determined. This is the objective of our Analyzer/Simulator project. It involves the simulation of program execution on some proposed parallel processing systems. The front end of this project is a program analyzer which accepts Fortran source programs, and by detailed analysis of the control and data dependencies it produces a highly parallelized version of the original program (see [26]). Next, this parallelized version is input to another program, the Pesource Request Generator (PRG) , which atteirpts to compile the parallelized progranr into simulatable code. The code is a set of machine resource requests with data dependencies embedded in it. A machine resource can be a scalar or array processor, an alignment network, or the whole bank of array memories. 
The task of the PPG is to decide on the best way to slice the comoutation specified by each instruction node, based on the size of the matrices, the number of available processors, the matrix storage scheme, and the type of alignment 5 I mil III \\ :? IS 104 network. Finally, the output of the RPG is input to a simulator capable of siirulating a wide variety of architectures. Here the time required, utilization of various resources, and speedup and efficiency of the program's execution in the qiven parallel processing system will be calculated. Machine organization parameters can be specified by the user. These parameters include the storage schenie, the alignment network, the processor and memory speeds, the number of array memories, and the number of processors in the array processor system. A block diagram showing the general organization of the software is shown in Figure u.l. The Program Analyzer is described elsewhere [26] and we will not discuss it here. In this section we will describe the F EG and machine simulation. Some experimental results will also be presented. In Section U.2 we will discuss the input data structure ani available machine parameters for the REG. In Section U,3 we will describe the output of the simulator in the form of performance measures. Then in Sections 4.4 through U.6, some of the algorithms and strategies of the RRG will be described. Finally, in Section 4.7, we will discuss some of the preliminary results of the initial set of exoeriments. 105 0) o to c s- ^ • r- T3 (0 C ZJ 3 O 03 CO > LU 1 OJ to O 4-> S- to =J +J S- to (TJ Z3 OJ S- o 3 OJ (/) 3- C OJ (U (U a: c i; C3 to O rO •!- •f- M- .£: •!- CJ o I- (V > c C o •r— +-) rO N •r- c m o o +J OJ N >^ rd c CD s- o O i. •U. Oi. 106 U . 2 sijTul atoc Inpu t Specifications 4.2.1 Input Instruc tion Nod es The most easily recognizable form of parallelism is typified by a matrix addition shown belo". BO 10 1=1, N DC 10 J=1,f1 10 A(I»J) =B (I,J) -^C (I,J) The Program Analyzer will determine what the dependency limitations are for each program segment (in this case, there are none), and then break them into machine-code-like instruction nodes. Each instruction node will provide all the information concerning the operator, the two operands and the result. After the Fortran Analyzer phase, all parallel DO loop indices are distributed into each instruction node. The DO loop limits are noriralized to start with and have increments of 1 only. We first assume that there are n active DO Loop indices in a particular instruction. The -j-th DC loop index, I:, inay have an upper limit, Uj , as a function of I, ,1^ . . . . , Ij.( . Assuming the function is a linear function, the U: 's can be represented by a nx{n+1) matrix, D, such that 107 '". 1 K "2 • = r 1 • ."". ' ' n+1 I2 In 1 Note that except for the last coluirn, the D matrix is strictly lower triangular. In a Similar fashion, each k-dimensional array (with linear^ subscripts) being referenced in a node with n active no loop indices, will have a corresponding k by (n*1) coefficient matrix C, Let E; be the subscript expression of the jth dimension, then -1 1 ^ ^2 • = } • .^n. 1 . n+1 II l2 • • ^ In ^ 1 ,».'J I Definition-- Let there be n memory units. A p-ordered N-vector (mod M) is defined as a vector of N elements whose i-th logical element is stored in memory unit pi+c (mod M) where c is an arbitrary constant. 
108 The idea of a p-ordered N-vector is V€ry useful in finding the number of cycles required to access a vector or to aliqn it usinq certain alignment networks. Usinq a qeneraliz^d skewing scheme as in [Lawrie 1 ], for an array with k dinrensions, we will have (m, ,in2f . . . . #m,^) skewing. Assuming an array operand in an instruction node has k dimensions and n active indices, then we define an order vector, V, of n+ 1 elements as: < n + ^ > « k > ^ I V J= [m, m2..,mj n+1 For an array element defined by any particular set of values fl, =h, , I2=h2 , . . . 1^=^^ } , the element will be stored in memory port z, where 109 z = C 'h,' h2 • • hn 1 In Addition, the importance of the order vector lies on the fact that for any partition of the array formed by running the jth active index parallel, the partition is a Vj -ordered vector (rrod M), where H is the number of ireniories. When the order and nuirber of elements of a partition are calculated, the number of cycles required to access and aliqn the vector can be easily determined. For Fortran statements that cannot be easily dispatched as array or scalar operations, they will be grouped as recurrences nodes. Each node represents a R system {c.f.[22]). Each R system will be broken into as many smaller recurrence units as possible. Information such as the number of smaller recurrence units and the values of n and m for the units can be found in a recurrence node. With this information, we can determine which is the best recurrence solving algorithm to use and its corresponding execution time. ..•:^i: C9 .1 <• C3 110 U.2.2 Machine Parameters ■ In the parallel architecture that He simulate we assume that the resources can operate in an overlapping (or pipelining) fashion. However, we still honor the dependency between different instruction nodes. Each resource will have its own resource queue to hold the waiting requests. Hence one node may ne using the alignment network while an independent node can start fetching its operands from the memory system. It is impossible to siirulate every known parallel architecture. So we concentrate on two classes of architectures. The first class is shown in Figure ^,2 and the second in Figure U.3. Note that the one in Figure 4.2 resembles that assumed in Chapter 3. The second type has two alignment networks, one for input to the processing system and the other for output to the memory system. This class can be chosen by setting the parameter option M_PAPRM. TW0_2^L_NfT to 1. The scalar men^ory and scalar processor in Figure 4.2 and U. 3 are optional and can be chosen by setting M__PARAM.SM and/or M_PARAM.SP to 1 's. The number of processing elements in the processing array and the number of memories can be selected using the parameters ?!_PARAM- NU«_PPOC and M_PARAM. Ndm_M EM. Ill " — f ALIGNMENT NETWORK r.\ — » % ^'°/ rs\ "i vv • • • - • 9 9 • ""1 (^,\ ^N-l -^ ^-if ^ Data Path Control Path ■3 :;!r !2 Figure 4.2 Machine Configuration A 112 Data Path Control Path Figure 4.3 Machine Configuration B 113 The skewing system chosen can be specified by assigning values to the array M_PAEAM. WEM_Ma E. For example to get (1,1) skewing, we would put the numters 0,0,0,1,1 into the hEm_MAP array. As for the alignment network, right now we can choose any one of the four possible networks by setting M_PARAM. AN_TYPe to the appropriate value. 
   AN_TYPE = 1   crossbar
           = 2   omega network
           = 3   +-1, +-sqrt(N) shift network (Illiac type)
           = 4   +-1 shift network

The memory cycle time can be specified by using M_PARAM.MCYCLE_TIME, and the scalar memory time can be specified using M_PARAM.S_MEM_TIME.

To allow for pipelining, we have two separate time fields for each resource request. The first is RT, which contains the time that must elapse before another request for the same resource can be started. The other is IT, which contains the time required to finish processing the request. If a particular resource is pipelined, then RT will be the pipeline segment time and IT will be the length of the whole pipe. In this case, IT is greater than or equal to RT. For a memory request, however, RT will be the cycle time and IT will be the access time. In this case, IT is less than or equal to RT.

The alignment times are specified by M_PARAM.A_CT_IN and M_PARAM.A_CT_OUT. The processing times can be assigned by the user using the array M_PARAM.OP_TIME. The elements are the times required for simple assignment, addition, subtraction, multiplication, and division, respectively. We also allow the users to define their own built-in function and user-defined function times in M_PARAM.BUILTIN_TIME and M_PARAM.USERFCN_TIME respectively.

A sweeping index is defined to be the active index that is to be run in parallel in order to produce the desired partition. One option that the user has is to declare what he wants as the sweeping index. The other is to let the simulator choose the best index in terms of execution times. To choose the first option, the user sets the switch M_PARAM.SWPOPT and sets the array M_PARAM.SWEEP_INDX to the running indices required.

For standard algorithmic procedures, such as recurrence handling, the operation times for various resources have been calculated in Chapter 3. Thus we just need to substitute into the formulas for the operation times rather than perform a detailed simulation of the algorithm. However, we will then be missing certain overlap parallelism. This overlap can be assigned by the user using M_PARAM.OVERLAP. In appropriate cases, IT will be set to OVERLAP*RT/100, and will be less than RT.

In this section, we have discussed the various options that are available to the users. By setting all the appropriate options, the user has defined a machine configuration that he wants to study. In the next section, we will discuss the outputs available from this simulator.

4.3 Simulator Outputs

The output of the simulator is a set of performance measures. One such measure is Tp, the time required for simulated execution of the program graph from the Program Analyzer in the specified machine organization using p processors. If T1 is the execution time for the same program graph on one processor, then we define another measure, the speed factor, Fp, as T1/Tp.

In addition, the simulator calculates measures of the utilizations of various system resources. The utilization of each resource is broken down into several separate utilizations: U_A, U_S and U_IF. First, U_A, the array duty cycle, is the percentage of time that at least one processor is performing a computation. However, whenever an array operation is being performed, only some of the processors may be actually doing useful work. This is measured by the slicing utilization, U_S. For example, to add two 30-element vectors together using 20 processors would require two steps. The first step would form the first 20 sums and would use all 20 processors, resulting in a slicing utilization, U_S, of 100%. The second step would form the last 10 sums using only 10 processors and would result in U_S=50%. The overall U_S would then be 75%. Finally, some processors are turned off because of IF statements in the original programs, and this is measured by U_IF. For example, assume that in the following program, 1/3 of the B(I) are less than zero:

      DO 10 I=1,30
   10 IF (B(I).GE.0) A(I)=A(I)+B(I)

Then U_IF=67%. Thus, using 20 processors on this program, U_A might be 80%, for example, because the processors are waiting for memory access or data alignment. Of this 80% of the time, only 75% of the processors could be used because of the difference between the number of processors and the array size (U_S=75%), and of these 75% of the processors, only 67% are turned on (U_IF=67%). Thus, the total average processor duty cycle, U_T, is equal to U_A*U_S*U_IF = 80%*75%*67% = 40%. By separating the components of processor utilization in this way we can determine the source of processor inefficiencies.
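The following toy Fortran fragment (our own illustration; the numbers are taken from the examples above) shows how the three factors combine into the total duty cycle:

      REAL UA, US, UIF, UT
C     ARRAY DUTY CYCLE FROM THE EXAMPLE IN THE TEXT.
      UA = 0.80
C     SLICING UTILIZATION: 30 SUMS IN 2 SLICES OF 20 PROCESSORS.
      US = 30.0/(2.0*20.0)
C     FRACTION OF PROCESSORS TURNED ON UNDER THE IF STATEMENT.
      UIF = 0.67
C     TOTAL AVERAGE PROCESSOR DUTY CYCLE, ABOUT 40 PERCENT.
      UT = UA*US*UIF
      WRITE (6,*) UA, US, UIF, UT
      END

Each factor isolates one source of inefficiency, so their product is the fraction of total processor-time spent doing useful work.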
4.4 Sweeping Indices

As described in Section 4.2.2, when an instruction node can be swept by more than one index, there are two options for the user to define the sweeping index. One is to specify it in the parameter M_PARAM.SWEEP_INDX. The other is to let the RRG choose the best index for each individual node.

When there are other indices having upper limits that depend on the sweeping index, we need to modify the C and D matrices before the sweeping is allowed. In other words, an instruction node can be swept on an index I_i if and only if, in the D matrix, the i-th column has all zeroes. For example, if the instruction node looks like:

      DO I = 0,N-1
      DO J = 0,I+k
      A(I,J) = 0

then

   C = [1 0 0]        D = [0 0 N-1]
       [0 1 0]            [1 0  k ]

To sweep on index I, we need to expand the node into two nodes:

      DO J = 0,k-1
      DO I = 0,N-1
      A(I,J) = 0
and
      DO J = 0,N-1
      DO I = 0,N-J-1
      A(I+J,J+k) = 0

Now there are two sets of C and D matrices. They are:

   C_1 = [1 0 0]      D_1 = [0 0 N-1]
         [0 1 0]            [0 0 k-1]

   C_2 = [1 1 0]      D_2 = [0 -1 N-1]
         [0 1 k]            [0  0 N-1]

Note that after this transformation, the first columns of D_1 and D_2 are all zeroes. Hence the two transformed nodes can be swept on index I.

In general, given a node

      DO I=0,N-1
      DO J=0,hI+k
      A(I,J)=0

for which we want to sweep on index I, the original loop will have to be transformed into:

      DO J=0,hN-h+k
      DO I=0,f(J)
      A(I,J)=0

The first problem here is what the equation for f(J) should be in general. If h is not equal to 1, f(J) can contain many modulo functions, which are nonlinear and cannot be represented easily in our linear D matrices. Another problem is that if h is large, many of the vectors A(*,J) will be small vectors, which can seriously degrade the efficiency of a parallel system. So the solution we picked is to do this kind of transformation only if h=1.
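A small sketch of the sweepability test (our own illustration with hypothetical names, not the RRG's actual code): a node can be swept on index I_i exactly when column i of its D matrix is all zeros.

      LOGICAL FUNCTION CANSWP(D, N, I)
C     .TRUE. IF THE NODE CAN BE SWEPT ON INDEX I, I.E. IF NO
C     UPPER LOOP LIMIT DEPENDS ON I: COLUMN I OF THE N BY
C     (N+1) LIMIT MATRIX D MUST BE ALL ZEROS.
      INTEGER N, I, J
      INTEGER D(N, N+1)
      CANSWP = .TRUE.
      DO 10 J = 1, N
         IF (D(J,I) .NE. 0) CANSWP = .FALSE.
   10 CONTINUE
      RETURN
      END

For the example above, CANSWP applied to the original D gives .FALSE. (column 1 contains the 1 coming from U_J = I+k), while after the expansion it gives .TRUE. for both D_1 and D_2.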
Consider a more general case than the one shown above:

      DO I=0,N-1
      DO J=0,I+k
      A(pI+qJ+r,xI+yJ+z)=0

i.e.

   C = [p q r]        D = [0 0 N-1]
       [x y z]            [1 0  k ]

We can split the node into two parts:

      DO I=0,N-1
      DO J=0,k-1
      A(pI+qJ+r,xI+yJ+z)=0
and
      DO I=0,N-1
      DO J=0,I
      A(pI+qJ+r+qk,xI+yJ+z+yk)=0

The first part thus has:

   C_1 = [p q r]      D_1 = [0 0 N-1]
         [x y z]            [0 0 k-1]

The second part is equivalent to:

      DO J=0,N-1
      DO I=J,N-1
      A(pI+qJ+r+qk,xI+yJ+z+yk)=0

and is also equivalent to:

      DO J=0,N-1
      DO I=0,N-J-1
      A(pI+(p+q)J+r+qk,xI+(x+y)J+z+yk)=0

or

   C_2 = [p p+q r+qk]      D_2 = [0 -1 N-1]
         [x x+y z+yk]            [0  0 N-1]

Hence C_2 = C_1 X, where

   X = [1 1 0]
       [0 1 k]
       [0 0 1]

Therefore, V_1 = [m_1 m_2 ... m_k] C_1 and V_2 = [m_1 m_2 ... m_k] C_2 will form a pair of order vectors that can determine if the matrix can be swept by the index I. In general, if we reduce U_h from I_j-dependent to I_j-independent, X will be an (n+1) x (n+1) matrix with ones on the diagonal, another 1 at position (j,h), and a k at position (h,n+1).

4.5 Array Slicing

When an instruction node represents a larger operation than the processor system can handle, the array operands in the node have to be sliced. Let us define the required number of slices as S. The slicing utilization, U_S, discussed in Section 4.3, is defined as the percentage of the amount of a resource that is being utilized. These are the two most important quantities to be discussed in this section.

When the upper index bounds are all independent (i.e. the first n columns of D are all 0's), it is easy to find S and U_S:

   S = ⌈N/p⌉ * ∏ (D(i,n+1)+1),  the product taken over i=1,...,n, i≠s

   U_S = N/(⌈N/p⌉*p) * 100%

where I_s is the sweeping index, N = D(s,n+1)+1, and p is the number of processors. After transforming the loop as discussed in Section 4.4, no upper index bound will be dependent on I_s. If, however, the upper bound of I_s depends on an index h, we have to calculate S and U_S differently. If we have this kind of instruction node:

      DO h=0,N-1
      DO I=0,ah+b
      instruction

a rough estimate of S can be calculated as follows: the average upper bound of I_s is aN/2+b. Hence

   S = ⌈(aN/2+b)/p⌉ * N

and

   U_S = N*(aN/2+b)/(S*p) * 100%

If a=1, we have an upper triangular system and we can find more accurate values for S and U_S. Let us first consider S for a purely triangular system (b=0). Breaking it into ⌈N/p⌉ sets of columns, the first set will contribute N to S, the second set will contribute (N-p) to S, and so on. So

   S = Σ (N-pi), summed over i=0,...,⌈N/p⌉-1
     = ⌈N/p⌉ * (N - p(⌈N/p⌉-1)/2)

Now let us return to the original triangular system, with

   M = N+b-⌈b/p⌉*p = N-(p-b) mod p.

The first half contributes N*⌈b/p⌉ to S. The second half is a purely triangular system of size M x M, and thus contributes ⌈M/p⌉*(M - p(⌈M/p⌉-1)/2) to S. Therefore

   S = N*⌈b/p⌉ + ⌈M/p⌉*(M - p(⌈M/p⌉-1)/2).

The total number of elements is N(N+1+2b)/2. Hence

   U_S = N(N+1+2b)/(2Sp) * 100%.
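The closed forms above can be packaged as follows (our own Fortran sketch with hypothetical names, for the a=1 node DO h=0,N-1 / DO I=0,h+b; it illustrates the formulas of this section, not the simulator's code):

      SUBROUTINE TRISLC(N, B, P, S, US)
C     NUMBER OF SLICES S AND SLICING UTILIZATION US (PERCENT)
C     FOR A TRIANGULAR NODE, PER THE FORMULAS OF SECTION 4.5.
      INTEGER N, B, P, S, MT, KB, KM
      REAL US
C     KB = CEILING(B/P); MT IS THE SIZE OF THE PURELY
C     TRIANGULAR SECOND HALF, MT = N+B-KB*P.
      KB = (B + P - 1)/P
      MT = N + B - KB*P
      KM = (MT + P - 1)/P
      S  = N*KB + KM*MT - P*(KM*(KM-1))/2
      US = REAL(N)*REAL(N+1+2*B)/(2.0*REAL(S)*REAL(P))*100.0
      RETURN
      END

For b=0 this reduces to the purely triangular count S = ⌈N/p⌉(N - p(⌈N/p⌉-1)/2).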
When S is greater than one, we will have to replicate the resource requests S times. However, in order to save simulation time, we devise the following procedure. We first observe that, in general, each slice will follow the same general pattern: a fetch, an alignment, a processor cycle, then another alignment and finally a store. When there are more slices, they will be duplicates of this sequence, but with slight time displacements:

   F A P A S

The middle part of the operation will be the concurrent operation of these F-A-P-A-S sequences, each working on a different slice; it is repeated (S-4) times. To expedite simulation, we put an implied DO loop around this middle part. This DO loop will be simulated repeatedly until no other request node is using any system resource. Then one more iteration will be done to figure out the iteration time. This time is then multiplied by the number of remaining iterations to find the total time. This method will reduce the amount of parallelism slightly; however, it reduces the simulation time greatly.

Only one DO loop can be active at any time for any one level specified, in order for the above simulation method to work. This can be achieved by generating a resource request (for that level) at the DO node. The resource will be released at the END node, when the required iterations have been finished. This effectively locks out any other independent DO loop activity which would interfere with timing the "last" iteration. After the resource is released, another DO can be activated by being granted that resource.

4.6 Resource Time Calculation

After we have found S and the various utilizations, we proceed to find the resource times of the various resources needed by that particular instruction node. Scalar memory and scalar processor times are simple to calculate and we will not elaborate on them. However, recurrence and ordinary vector operations need further explanation. Shift networks present a different set of calculations and will be treated in a separate section.

4.6.1 Recurrence Handling

For recurrence nodes, we have analyzed in Chapter 3 the conditions under which certain recurrence solving algorithms should be used. We have also found the corresponding resource times for each algorithm. Hence, we can save a lot of simulation time by simply putting in the corresponding resource times when a recurrence node is encountered. This way, we assume that once a recurrence node is encountered, we preempt the machine to do just the recurrence. Usually there is overlap between the various resource times, i.e., the sum of all the resource times will be greater than the total execution time. To account for this effect, we set the total execution time to a constant parameter, OVERLAP, multiplied by the sum of the resource times. This OVERLAP can be found by first writing the recurrence solving algorithm in Fortran and then running it through the Analyzer/Simulator. The average OVERLAP is calculated over various array sizes and machine configurations. This method will not give us the true value for the execution time of each recurrence calculation. However, it will give us a moderately reliable estimate.

4.6.2 Vector Operations (using crossbar or omega network)

For a vector operation node using crossbar or omega networks, the resource times are easy to calculate once the order vectors (described in Section 4.2.1) for the operands are calculated. For memory accesses, if the order for a particular operand on a particular sweep is p, then consecutive elements are found p memory modules apart. Hence we need g = gcd(p,M) memory fetches before we can fetch the entire slice. In general, the number of memory cycles required to access a p-ordered N-vector (defined in Section 4.2.1) stored in M memories is ⌈N*g/M⌉.

After each memory cycle, the time required for aligning a p-ordered vector using a crossbar, and for aligning before storing into a p-ordered vector using an omega network, is equal to 1 network cycle. Nevertheless, to fetch a p-ordered vector slice using the omega network, we also need g network cycles. Hence the corresponding total numbers of network cycles are:

1) crossbar -- ⌈N*g/M⌉.
2) omega -- g*⌈N*g/M⌉ for fetching, ⌈N*g/M⌉ for storing.
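These counts can be sketched as follows (our own illustration; the names are hypothetical, and g = gcd(p,M) is computed by the Euclidean algorithm):

      SUBROUTINE CYCLES(N, P, M, NMEM, NXBAR, NOMF, NOMS)
C     CYCLE COUNTS FOR A P-ORDERED N-VECTOR STORED IN M
C     MEMORIES.  NMEM = MEMORY CYCLES = CEILING(N*G/M).
C     NETWORK CYCLES: NXBAR FOR A CROSSBAR, NOMF FOR AN
C     OMEGA FETCH, NOMS FOR AN OMEGA STORE.
      INTEGER N, P, M, NMEM, NXBAR, NOMF, NOMS
      INTEGER G, IA, IB, IT
      IA = P
      IB = M
   10 IF (IB .EQ. 0) GO TO 20
      IT = MOD(IA, IB)
      IA = IB
      IB = IT
      GO TO 10
   20 G = IA
      NMEM  = (N*G + M - 1)/M
      NXBAR = NMEM
      NOMF  = G*NMEM
      NOMS  = NMEM
      RETURN
      END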
4.6.3 Illiac Type Shift Networks

For processing systems with N processors, if the interprocessor connections are ±√N-shifts and ±1-shifts, we call the alignment network the Illiac type shift network. For this type of network, when a uniform shift permutation is requested, the resource time is easy to calculate. Let s be the shift distance required. We first set ss = min(s, N-s), where N is the number of processors. Also let n = √N. The shift time required is

   ⌊ss/n⌋ + (ss mod n)            for (ss mod n) ≤ n/2
   ⌊ss/n⌋ + 1 + n - (ss mod n)    for (ss mod n) > n/2      (4.1)

where the second case takes one extra n-shift and then backs up with 1-shifts from the other end.

When a triangular type array is accessed and the shift distance for each slice is different, we have to find the average shift time for the array. Let A(x) be the average shift time for shift distances 1,2,...,x. If x = Nk+w, where N is the number of processors, then

   A(x) = (kN*D(N) + w*D(w)) / (Nk+w)                       (4.2)

where D(w) is the average shift time for shift distances 1,2,...,w (w ≤ N). To find wD(w), write w = zn+y, so that the distances 1,...,w span z complete n-partitions plus y positions of the next one. Within a single n-partition,

   yD(y) = y(y+1)/2                   for 0 < y ≤ n/2        (4.3)
   yD(y) = ny - y(y-1)/2 - N/4        for n/2 < y ≤ n        (4.4)

where the second case accounts for the extra shift needed to reach the other end of the n-partition. The times required to get to the respective n-partitions are 0, 1, ..., (n/2)-1, (n/2)-1, (n/2)-2, ..., 2, 1. Let wf be the total number of n-shifts we have to do for all distances in the first z n-partitions; it is obtained by summing this sequence, giving

   wf = nz(z-1)/2                     for 0 < z ≤ n/2        (4.5)

with a symmetric expression for z > n/2. Combining the pieces,

   wD(w) = z*nD(n) + yD(y) + wf - q                      (4.6, 4.7)

where q = 0 for w ≤ N/2 and q = n/2 for w > N/2, since for w > N/2 the shift time at N/2 has been overcounted. Now we want to find D(N). For w = N we have z = n and y = 0, and the expression collapses to

   ND(N) = nN/2 - N/4 - n/2                                  (4.8)

Now A(x) in (4.2) can be found easily once ND(N) and wD(w) are known.

If a random shift is required, then an average time of n/2 will be used. If a broadcast function is needed, then we will use the worst case result of n. When the permutation is other than a shift or broadcast, we will apply Orcutt's result of 8(√N - 1) for omega passable permutations [27].
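Equation (4.1) can be sketched as follows (our own Fortran, not the simulator's code; NR stands for n = √N):

      INTEGER FUNCTION ISHTIM(S, N, NR)
C     SHIFT TIME OF EQUATION (4.1) FOR AN ILLIAC TYPE NETWORK
C     OF N PROCESSORS WITH +-1 AND +-NR SHIFTS, NR = SQRT(N).
      INTEGER S, N, NR, SS, R
      SS = MIN(S, N-S)
      R = MOD(SS, NR)
      IF (R .LE. NR/2) THEN
         ISHTIM = SS/NR + R
      ELSE
         ISHTIM = SS/NR + 1 + (NR - R)
      END IF
      RETURN
      END

For example, with N=64 and NR=8, a shift of 13 costs one 8-shift plus five 1-shifts going forward, but only two 8-shifts and three 1-shifts going through the next partition, so the function returns 5.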
4.7 Experimental Results

Our initial experiments deal with the effects of the following architectural parameters:

1) The number of array processors, and the speed of the processors relative to the array memory system. Initially, the processors will be restricted to a single group of processors operating from a single instruction stream (SIMD).

2) The presence or absence of an independent scalar processor and/or memory. The absence of a scalar processor forces scalar operations to be performed by the array processors.

3) The memory system, including the array storage scheme (1-skew, etc.), and the number of memories (power of two or prime).

4) The type of alignment network: a) crossbar, b) omega network, c) ±1, ±√p shifter (Illiac IV), d) ±1 shifter.

These parameters will be studied for a large variety of application programs, and in addition the size of the application programs (i.e. the array sizes) will be varied in order to produce families of performance figures.

The tables below present some preliminary results of experiments on three programs. We would like to stress at this point that these results are preliminary. The three programs can hardly be construed as representative of any large population of applications. The first program, ADVV, is a 4-point relaxation scheme. ADVV was chosen because of its highly parallel nature. The second program, ELMBAK, forms the eigenvectors of a real matrix by back transforming those of the corresponding upper Hessenberg matrix. ELMBAK is reasonably complicated, but has no recurrences. The third program, SLEQ1, is a Gauss-Jordan reduction program. SLEQ1 was chosen because it contains a recurrence relation (an R<18,1> system). We present the results of these three programs only as an indication of the types of results we expect from our experiments, and as an illustration of how to interpret the results. The complete tables of experimental results for these three programs are shown in Appendix B. Some of the more interesting figures are grouped together in Tables 4.1 through 4.4.

Table 4.1 shows the speed factor, Fp = T1/Tp, and the processor utilization U_T using 16 processors, 17 memories, a crossbar alignment network, skewed storage, and a separate scalar processor and memory. The results are presented as a function of N, the data array sizes.

[Table 4.1: speed factor Fp and processor utilization U_T versus array size N for ADVV, ELMBAK and SLEQ1; the entries are illegible in this copy.]
As we can see, the crossbar and omega networks performed egually well. The Illiac network perforired somewhat better, at least for ADVV and ELKBAK, This is due to two facts. First, the Illiac network was set to operate four times faster than the other networks. This reflects the difference in the complexity of the networks. Second, we were able to "compile" the programs using very simple alignm.ent requirements which could be easily handled by all three networks. The lack of difference between staight storage (0,1) and skewed storage (1,1) is also a reflection of this second foint. He were able to compile the program.s so they only needed access to rows, and thus they do not benefit from skewed storage. However, we do not believe this result will held for larger. r! r •"I ,ti^t I. » n 140 Illiac IV Straight Skewed 1384 (9.9) 1282 (2.8) -K 1384 (9.9) 1261 (2.8) ^C Omega Straight Skewed 1416 (9.7) 1760 (2.0) 2644 (4.5) 1416 (9.7) 1760 (2.0) 2644 (4.5) ssbar Skewed 1416 (9.7) 1760 (2.0) 2644 (4.5) Crc Straight 1416 (9.7) o o v^ . r-, CM iH -^^ 2644 (4.5) e CO 00 o 1 < X W rH o- w CO 1 .^.S c >-l o o a 3 »-. 4-1 0) 0) 4-1 c c •H iJ c » c 4-1 00 •H U tH CO to •H iH t/5 rH D M O •H r- ^ 5 to > -a di 00 c •H to to i-t 3 60 O x— ^ ^-1 CL til 4-1 v_/ OJ ^^ o 4-1 4-1 O u c to QJ > T3 CO CU x: OJ a dj en *> • X) ^ C o CO •H ^ «t /— • 3 a. vO H rH en II dJ •* O u OJ »^^ C 6 OJ •H l-l 4-1 u • 3 c tn u o -l 4-1 tu 3 x: w CJ u c 0) w •H ^ (0 w 00 4J c c • CM •H O U) • 3 u c (U o .i«i rH -H 0) tn O- 4J .H W U XI "O J ~ 8%, - -, 9% 8%, 9% Table 4,4 The effect on execution time, Tp, of a scalar memory and scalar processor. The percentage figures are scalar memory utilization and scalar processor utilization respectively. i:!J •fl ^3 144 such things as subscript calculation. The above discussion is based Only on limited amount of experimental results. It can only be regarded as an illustration of how to interpret the results. According to which "benchmark" programs a user is interested in, he can conduct experiments using that set of programs. In the end, he will then be able to determine on the kind of machine configurations that is most suitable for him. 145 5. CONCLUSION This thesis concerns the utilization and effectiveness of interprocessor connection networks for parallel (SLID) type computers. The problems concerning interconnection networks can be divided into three areas: capabilities, exploitation, and effectiveness. Capabilities include network properties and network control methods. One of the networks that we have examined closely is the omeqa network. The omeqa network is one of the more attractive multistage networks, it is moderate in network coirplexity and quite powerful in its permutation capabilities. If we concentrate on only seme of the more common permutations, we can further reduce the complexity of its control algorithms. Three different control methods are shown in this thesis. He have discussed a significant new property of the omega n<=*twork, the partitioning property. We showed that a large size omega network can be regarded as a conglomeration of irany smaller size omega networks, each passing a different smaller omega-passable connection function. This partitioning property of the omega network proves to be vital tor the efficient handling of many computation algorithms. We also discussed another important property of the omega network, the broadcastirg ability. 
He showed the conditions under which a 2-dimensional data array can be broadcast to a J-dimensional data array using the IrtpJT ; J Co is It 146 omega network. This data transfer ability is necessary for example in certain matrix multiplication and recurrence solving algorithms, in Chapter 2, we were also able to extend the capabilities of the omega network further using the concept of linear permutations. The shuffle connection is the basis of many interconnection networks, like the omega network, the Batcher network, and the binary Eenes network. Because of this similarity, we can apply some of the properties of one network to the others, and hence increase further Understanding of such networks. Some such extensions are shown in Section 2.3 and 2.U. Because of the great simplicity in gate counts, the one stage perfect shuffle exchange network is also carefully examined. Algorithms were presented in Section 2.5 to show how such network can be used for performing' permutations. with the old and new knowledge we have acguired on network capabilities, we would like to apply them to some common computations. Recurrence solving algorithms and matrix multiplication algorithms are two examples that we used in Chapter 3. The efficient handling of recurrence operations is essential in parallel processing systems because the parallel system will be degraded to a serial machine otherwise. With careful planning, we were able to simplify the alignment requirements of various recurrence 147 solving algorithms. So, instead of a full crossbar* wg can now use a simple alignment network, such as the omega network. In Section 3.3, we show how a comrron computation algorithm (such as matrix multiplication), if detected, can be adapted onto a parallel processing systeir equipped with only a one stage network. Hence Chapter 3 has been dedicated to techniques for exploiting various interconnection networks. To evaluate the tru© effectiveness of a particular interconnection network, we have to determine the effectiveness on real programs of a parallel processing system equipped with such a network. This can be achieved with the help of the Analyzer/Simulator project currently being developed. The program analyzer first generates a highly parallelized version of the program. Then the RRG will compile it into suitable pseudo machine code from the information about the parallel processing systems that the user defines. This pseudo compilation is dore based on the capabilities of the architecture to be studied, including the type of interconnection network used. From the results of the simulator, we can determine how well does an interconnection network work- mfV' ■-'!', *• ,■(?" c > . IK' :s :» «2 '-5 •a with the methodology described in this thesis, the true effectiveness of an inte rprocessor connection can then be determined. 148 We conclude this thesis by giving the following topics that are worthwhile for further research: 1) IS there a set of basic permutation patterns for the omega ROM control method from which other useful permutation patterns can be generated by doing logical o^'prations on some members of the set? For example, how can we generate the control pattern of a k-shifted p-ordered spread permutation from that of a k-shift permutation and that of a p-ordered permutation? 2) Finding the analytic bounds and averages for the time required to pass a permutation using a one stage perfect shuffle network is a worthwhile project. 
3) We are able to adapt recurrence solving algorithms to a parallel processing system equipped with a (log n)-stage network. A highly significant result would be to show how recurrence algorithms could be handled by a one-stage network (perhaps coupled shuffle and shift connections). This would lead to a very cheap and effective parallel architecture.

4) Control unit times should be carefully added to the simulator. Subscript calculations and register and scalar usages should be accounted for more accurately.

LIST OF REFERENCES

[1] K. J. Thurber, "Interconnection Networks -- A Survey and Assessment," AFIPS Conference Proceedings, vol. 43, pp. 909-919, May 1974.

[2] K. E. Batcher, "Sorting Networks and Their Applications," Proceedings of the 1968 SJCC, pp. 307-314.

[3] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York, 1965.

[4] D. H. Lawrie, "Access and Alignment of Data in an Array Processor," IEEE Transactions on Computers, pp. 1145-1155, December 1975.

[5] T. Feng, "Data Manipulating Functions in Parallel Processors and Their Implementations," IEEE Transactions on Computers, pp. 309-318, March 1974.

[6] G. H. Barnes, et al., "The Illiac IV Computer," IEEE Transactions on Computers, pp. 746-757, August 1968.

[7] R. C. Swanson, "Interconnections for Parallel Memories to Unscramble p-ordered Vectors," IEEE Transactions on Computers, pp. 1105-1115, November 1974.

[8] H. S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Transactions on Computers, pp. 153-161, February 1971.

[9] S. W. Golomb, "Permutations by Cutting and Shuffling," SIAM Review, vol. 3, no. 4, pp. 293-297, October 1961.

[10] M. C. Pease, "An Adaptation of the Fast Fourier Transform for Parallel Processing," JACM, pp. 252-264, April 1968.

[11] L. R. Goke and G. J. Lipovski, "Banyan Networks for Partitioning Multiprocessor Systems," 1st Annual Computer Architecture Conference, Gainesville, Florida, December 1973, pp. 21-28.

[12] K. E. Batcher, "The Multi-Dimensional Access Memory in STARAN," submitted to IEEE Transactions on Computers.

[13] K. Y. Wen and D. H. Lawrie, "Omega Network Control Plane Implementation," Dept. of Computer Science, Univ. of Illinois, Urbana-Champaign, unpublished memo, Sept. 1976.

[14] D. H. Lawrie, "Memory-Processor Connection Networks," Univ. of Illinois, Urbana-Champaign, Computer Science Report 557, Feb. 1973.

[15] T. Lang and H. S. Stone, "A Shuffle-Exchange Network with Simplified Control," IEEE Transactions on Computers, pp. 55-65, January 1976.

[16] M. C. Pease, "The Indirect Binary n-Cube Microprocessor Array," submitted for publication.

[17] D. C. Opferman and N. T. Tsao-Wu, "On a Class of Rearrangeable Switching Networks," Bell System Technical Journal, vol. 50, pp. 1579-1618, May-June 1971.

[18] T. Lang, "Interconnections Between Processors and Memory Modules Using the Shuffle-Exchange Network," IEEE Transactions on Computers, pp. 496-503, May 1976.

[19] P. Budnik and D. J. Kuck, "The Organization and Use of Parallel Memories," IEEE Transactions on Computers, pp. 1566-1569, December 1971.

[20] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations," IEEE Transactions on Computers, pp. 786-792, August 1973.

[21] D. Heller, "On the Efficient Computation of Recurrence Relations," Inst. Comput. Appl. Sci. Eng. (ICASE), June 1974.

[22] S. C. Chen and D. J. Kuck, "Time and Parallel Processor Bounds for Linear Recurrence Systems," IEEE Transactions on Computers, pp. 701-717, July 1975.
[23] S. C. Chen, D. J. Kuck, and A. H. Sameh, "Practical Parallel Triangular System Solvers," in preparation.

[24] A. H. Sameh and R. P. Brent, "Solving Triangular Systems on a Parallel Computer," Univ. of Illinois, Urbana-Champaign, Computer Science Report 766, November 1975.

[25] S. C. Chen, "Speedup of Iterative Programs in Multiprocessing Systems," Univ. of Illinois, Urbana-Champaign, Computer Science Report 694, Jan. 1975.

[26] B. R. Leasure, "Compiling Serial Languages for Parallel Machines," M.S. Thesis, University of Illinois, 1976.

[27] S. E. Orcutt, "Implementation of Permutation Functions in ILLIAC IV-Type Computers," IEEE Transactions on Computers, pp. 929-936, September 1976.

APPENDIX A

We will first show a multiplication algorithm that takes N processors and 2N memories. The skewing scheme is √N+1-skew 2-skip. Memories 2i and (2i+N+√N+1) mod (2N) are connected to processor i. All even RA's refer to memory 2i and all odd RA's refer to memory (2i+N+√N+1) mod (2N). An illustration of the memory map for N=4 is shown in Figure A.1.

      P0       P1       P2       P3
    M0  M7   M2  M1   M4  M3   M6  M5
    (0,0)    (0,1)    (0,2)    (0,3)
    (1,2)    (1,3)    (1,0)    (1,1)
    (2,1)    (2,2)    (2,3)    (2,0)
    (3,3)    (3,0)    (3,1)    (3,2)

   Figure A.1  √N+1-Skew 2-Skip Scheme

Algorithm:
A. Repeat for IC=0 to N-1;
   1. Fetch A, RA=IC, into R1.
   2. Set k=[(√N+1)*IC] mod (2N).
   3. If k is odd then k=(k-N-√N-1) mod (2N).
   4. k=k/2.
   5. IR=(PPN-k) mod N.
   6. Repeat for IT=0 to N-1;
      a. fetch B, RA=IR, into R2.
      b. multiply R1 and R2 into R3.
      c. G-permute IR.
      d. J=(PPN-⌊(√N+1)IR/2⌋) mod N.
      e. if IR is odd then J=(J+N/2) mod N.
      f. store R3 into T, RA=J.
      g. G-permute R1.
   7. Set R1=0.
   8. Repeat for IT=0 to N-1;
      a. fetch T, RA=IR, into R2.
      b. add R1 and R2 into R1.
      c. G-permute R1.
      d. G-permute IR.
   9. Store R1 into C, RA=IC.
B. Done.

Another algorithm also uses 2N memories, except that now it uses a 1-skew 2-skip storage scheme. For this scheme, memories 2i and (2i+N+1) mod (2N) are connected to processor i. The memory scheme for N=4 is illustrated in Figure A.2.

      P0       P1       P2       P3
    M0  M5   M2  M7   M4  M1   M6  M3
    (0,0)    (0,1)    (0,2)    (0,3)
    (1,2)    (1,3)    (1,0)    (1,1)
    (2,3)    (2,0)    (2,1)    (2,2)
    (3,1)    (3,2)    (3,3)    (3,0)

   Figure A.2  1-Skew 2-Skip Scheme

Algorithm:
A. Repeat for IC=0 to N-1;
   1. Fetch A, RA=IC, into R1.
   2. Set k=IC; if IC is odd then k=(IC-N-1) mod (2N).
   3. k=k/2.
   4. IR=(PPN-k) mod N.
   5. Repeat for IT=0 to N-1;
      a. fetch B, RA=IR, into R2.
      b. multiply R1 and R2 into R3.
      c. G-permute IR.
      d. J=(PPN-⌊IR/2⌋) mod N.
      e. if IR is odd then J=(J+N/2) mod N.
      f. store R3 into T, RA=J.
      g. G-permute R1.
   6. Set R1=0.
   7. Repeat for IT=0 to N-1;
      a. fetch T, RA=IR, into R2.
      b. add R1 and R2 into R1.
      c. G-permute R1.
      d. G-permute IR.
   8. Store R1 into C, RA=IC.
B. Done.

APPENDIX B

The following twelve tables are the experimental results of three Fortran programs: ADVV, ELMBAK, and SLEQ1. The column AN shows which alignment network is chosen. The column (M=) indicates whether the number of memories is prime or not. The column SKEW shows the skewing scheme being chosen. Finally, the column SPEED (P/A/M) gives the relative speeds of the processor, alignment network and memory. As for the headings, the number of processors is given as P=xx. SP/SM indicates whether the switches SP and SM (c.f. Section 4.2.2) are turned on or not.
For the table on Execution Time (Tp) and Speed Factor (Fp), the main number in each entry is Tp and the number in parentheses is Fp. The remaining tables show utilization measures of the various system resources. They are given in the order: array memory, input alignment network, output alignment network, vector processor system, scalar memory and scalar processor. When the entries in some row j are the same as those of row i, they are marked "same as row i".

[The twelve tables of Appendix B are illegible in this copy.]

VITA

Kuo Yen Wen was born in Shanghai, China in November 1949. He received his B.S. degree in Electrical Engineering and Computer Science in 1971, and his M.S. degree in Computer Science in 1974, both from the University of Illinois, Urbana-Champaign. From 1971 to 1976, he was a research assistant in the Department of Computer Science. He was associated with the Illiac III project from 1971 to 1973 and with the Machine and Software Organization project since 1973. He was the coauthor, with Prof. Duncan H. Lawrie, of a paper, "Effectiveness of Various Processor/Memory Interconnections," presented at the 1976 International Conference on Parallel Processing. He is a member of the Institute of Electrical and Electronic Engineers.

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-76-830
Title and Subtitle: Interprocessor Connections -- Capabilities, Exploitation and Effectiveness
Author: Kuo Yen Wen
Report Date: October 1976
Performing Organization: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois 61801
Contract/Grant No.: US NSF MCS73-07980 A03
Sponsoring Organization: National Science Foundation, Washington, D.C.
Type of Report: Doctoral Dissertation
Abstract: Recently, some research interest has centered around interprocessor connections for SIMD type parallel machines. However, we still lack a methodology for evaluating various networks. In this paper, we first present some new results on network properties. Then we show how to exploit various networks in ordinary computations. Finally we describe how we can apply the theoretical results to predict the performance of some networks in a real program environment, which is the true measure of network effectiveness.
Key Words: Array Slicing; Computation Adaptation; Network Control; Permutation Capabilities; SIMD Type Machine; Simulation
Availability Statement: Release Unlimited
Security Class: UNCLASSIFIED
No. of Pages: 176
00 00 00 CO CO 00 00 00 CO CO 00 CO ' "^ ^.^ ^ ^ _ _ _ 3 ^- r— r— f— ^— ^— LU A 9t A «k M •« :^ 1 r— o , — , — o CO "" ^ " ^^"^ ' " — ' "■ — ' O) II E J^ -^ Js (U o S- o to (U to O •r- s- ro O c o +J N ID a> c •r- U 00 LD r— CM CO to 00 CTi r— CM 1^ P = 64 SP/SM=(1, 1— CO ^1- CO r— ^ .— CO .— ^ .— ^ 1 CO 1 *3- 1 CO 1 «:»■ O CM LO CM PN. 00 r— CM (v. 00 .8, 7, 15, 8, 12,10, 11.12, 8, 7. 15, 8. 12. 0. 11,12, O 00 CO LO CM r— COLO CO LO CO CO CM CO CM ^ «^ CO CM «d- ^ ^ 32.29. 3 58, 8, 4 47,21. - 44,12. 3 32.14. - 59, 8. 4 1 CO O CM 1 ^ O CM O CM ^1— CM 1— CO CO ^ ^ 30,27 58, 8, CO CM cn CO ^1— CM ^O CO CM 00 "id- ^ CO LO 32, 0, 59, 8, o CO CO CM CO CM ^ CO CO CO ^ CM «:*• 1 CO 1 ^ 1 CO 1 «* tx> M r^ 1 C3^ I CO r— O C3^ r— O ** CO ^ ^ 29,28 56, -, o 1 II i s: •r- CM J»4 CM CM < X oi 1 — r 161 Q i. «2 I— CM CO LO CO CTl O I— CM CO • •o U3r- O an •• ^-s 1— •% «:f ,^ , — * - — » , — » > — » ,. s , — » ^ — ., • O 1— o ^ en en O LO o r^ r^ ,— II lO o in ^ CM 1— 00 r>. CM en r^ r«» r- cn II z: r^ • <^ • o . CM • r^ • CM . r^ • oo r— CM I— CM r— CM I— CM !>-. CM r— CM r^ CM o. r— o^ O^ CO en CM LO O LO «;!- CM r- 00 r*. «* 00 LO CO co en ^— r^ • ^ . o . CM • r<- • CM • LO . •>_^ 1— CsJ 1— CM 1— CM r^ en ^- ^- CO r- CM i^ cvj f— CM r^ CM ^_^ o "!:r 1— CM CO O cn CM ^ cn .— 00 CM «t LT) en LD CM in 1— S S 2 s 2 cn Ln <^ r-. r^. Ln CO 00 r— 00 . LO • o . o o o o o CO • o . CO • cn . ^--«' ^— r— .— CM •— CM s- s- s- S- s- r— CM CO CM 1— CM 1^ CM ■ — ^ — ' * — ' "~^ — ' VD ' — II Cl. ^^ ^.^ ^-^ ,— ^ ,-^ ^-^ ,_^ ^-^ 1 — O r- o •* CO en CvJ LO en r- O •^ «« VO O LD «:;*• CM r— l/l CO 00 00 l/) 00 rv. "s:!- CO LO 00 CO cn o r-s . ^ • o . fO (O - 1— . * I— CM 1— CM V) 1/1 LO I/) LO 1— CM CO CM f— CM r^ CM o* •* o , — , , — , , — , , — » < — , * — » * s '!t O r- O viD CO CM Ln CO Ln Ln Ln r^ Ln cn II O VO O 00 CO LO CO O en o en en CM cn II s: CM . cn • CO • r> • o • r^ • 1^ • 00 C\J r— f-^ f-^ r— r— r— CM I— CM r— r— r^ f— o. to ^^ ^ O T~ 00 00 •^ CO «* 00 CO >* CO •vi- CO •^ Ul — . ■^^ -~^ ■■^^ -^ — ^ — . •*^ UJ < 00 •^ •* 00 •vf CO CO ^ CM 1 — CM f— o_ ^^^ *^^ ■*«v^ ^ ^"v^ "•»N^ "^*^ ^N^ ^*v^ *** — ' — ^ o-> Q. CO 00 CO CO 00 CO 00 00 CO CO 00 CO ^ ^ ^ ^ ^ ^ ^ ^^ ^ ^ 3 r — 1 — 1 — f^ I^ ^— UJ » *k •k «H A 0k iii^ 1— , — o 1— ^- o 00 ^■■"^ '^■^ ^~^ ' ^ ^ OJ II E -ili Ji^ .:x^ ^~ Q. CM CM CM z CO G "^ <: X 1— ( r— CM CO «* Ln LO I^ 00 cn o , CM 162 CO t- cn o Cl. s- o o o <0 a. CO -o c 0) o +-> o X CD a. vo II CO C\J ^ LT) O r- C\J ,— 1 — 1 — C\J CM «« *k A •» •* f^ 0 f— n- r— CM r— en CT> 1— CM VO ';r Lf) r— ^ n- CO CVJ ^ Ln O r- OJ r— r— t— CM C\J #1 «i A •* •* •* CT» en I— CM VO ' ^ Ln 1— ^ 1— ^ OvJ "5r «>r 1— O CM r— 1^ t— CM CM «i •« «% A •» « «^ 1 O 1 O 1 ^ CO C\J #1 •% O CO a\ en 1— r— VO IT) UD 1— LD 1— CO CM ':^ I o CM VO CM VO LO CM 1— CM t— r- I— CM I— ^ 1 vo 1 vn 1 II ^ CM CO II ^ 9% n •* A •I A 00 CM CM CO r^ r-v I— Q. Q. to r>> CO 00 CO in IT) 00 Q_ OO 00 o » CM r^ I— r^ I— ko CM I vo I r^ <— CM CM I cn I CM I— I I I I CM I— CM I— r^ o lo CM vo «!:a- I I 00 CM 00 I— rN. I — vo CM I I I I cn p^ vo ^ 00 o vo o vo CO LD LO I I CM I CM cn I 00 CO 00 00 rs. CM vo ^ CO I o I 1 — T— CM LO cn o^d- r>. vo 00 CM CO 00 CM I 00 I CM I— 00 cn o CO 00 CO r>. vo 00 CM 00 00 CM tz- CM LO ^O 163 1 r^ 1 CO CM 1 r«. 1 00 CM CM 1— Q ^ CM I— cn rs. vo «3- r^ CO vo 1— LO r— LO CM CO CO vo 1— vo CM in CM CO E s- O i~ a. o to > o 00 E (d S- cn o i~ cl. titeM 3 :c "2 CO cn o f— CM , — ^ o , o A ^ r- ^— «« «!r •^^ «o lO II (^ o s 1 — II oo Q. 
o o VD I— ^— ^ r— •» , • o «k «d- O n— CVJ I— O CO o o ^ r— VjD O IT) >£) o O o O 1 r^ t n" •* •* «« o «o CT> O CO o r— r~~ A «% A •« LO WD LO VO 164 o o CO f— to o CM I— CVJ C^i CM VO o o o o 1 1— 1 r— «t 0i «o *o ^ o r^ o CO 1— CM 1— •« X O^ •t •» CM CO CM CO CM U3 CM V£> II o. in o r— o r^ o r— o 1 o o 1 o o 2 o o CVi 1 2 o o 5 o S- o s- o « CO 1 5 o CO •« OJ 1 o J- r^ CO f— U3 cr> CO en CO CT) CO r— KO I/) (T3 «r> I • o ^ o CM r— O 1— CO CO ir> I cr> i E p- fO r> I— I— CO to O) II CL. o o II tn D- 00 c/i a. en CO CO 1— iO «5f U3 UJ CO CO LU .. Ol O r-l o CO CO 00 CO CO E •r- S- O- CQ X VO O (/) CM f— fO CM r— CM CO .— I E CM (d •< *> to en r— I— CO to rt3 to n3 lO IT3 O) E fO to E m to (U to cr> LO CM LT) O r— CM CM a I I 00 I CM LO LD VD 'd- CO ^ 00 CO «::f CO «* CO «* CO CO ^ CM 00 CO CO CO 00 00 CO CM to I I «;r o CO I— CM I— CM CO I I O I E CO n3 •> •> to cr> r— I— CO 00 CM I— fO CM r— CM CO I I O I E CO fO « » to CTl (— r- CO I I CM I LT) C\J LD LD to 00 ^ CM •— 00 CO CQ re CT) O L- O- i- o 0) o s- o i- Z3 CD CO to r^ 00 cn CM vo — - II II :s: — r- C\ CVJ CVJ I— LD r— r- vo II o CVJ C\J CO OJ c\j c*- ^ CM CVJ LO ro 1— >— 1— c\ *X) CM ^ LO LO 1— I — CM CM CM r— I— ro CM CM ro ^ CM OO ,— 1— CM r^ CM vo p— o vD CM r^ en c ^ CM CM <;r oo o r— I— CM *£> I ^ I lO O VO CM r^ CT> o II a. o o II CM CO t VD CM ^ ID LTJ 1 — O r-» CM CO (T> ^ I CM I CO I VO I ^ I— VO CJ> o ^ CM *;}■ O •— CM 1^ I VO I i— VO o o •^ I 'S' I r^ CM VO I— ** I ^ I VO I r>. I t^ OVO CMOO I r— I LD I Or— «;d- «:J- ^ n CO CM n CM CM CO CO 00 00 00 <;!• 00 CO (U E •r- S- CO I r- I CM cn 2 O O) r^ I— o CO CO CM CO CO 00 ^ 00 '^J- 00 00 2 o CO 00 00 CM CO X r— CM LO o to (U O) E E rtJ ro lO to , 2. - ,17,17 1 00 CM CM ^ 1 CM 2, - 14.28 «;*• CM CO CO 'd- CM CO CO , 8. - ,17,17 , 7, - ,14,28 LO 4. - 14.28 LO 00 CM CO LO 00 CsJ «d- 1 LD 1 r>. CM . VO 1 r^ CM CO 1 r^ 1 LO "S- 1 LO r«^ CO CO LO 00 CO ro 00 f— 1 1 LO 1 1 1 LO cyi CM LO r— n— LO cr> CM lO 165 II II 00 I r^ I LO 00 CO ^ ^ I CO I I I I LO I «;3- I LO CT> CO LO I I CM I cy> i r^r^s. r>>cM lo vo vor— ^CM CO^ <::1- CM CO^ 00 'd- 00 "sa- 00 00 CM cs 00 ^ CM I— 00 CO 00 CM 00 00 OQ E i- o> o i~ S- o to O) u 3 o CO O $- O N ■o O r- CM ^O - — II II ^ 00 Q-, -^ O- 00 ^ CO 00 • CM CTt CO . 00 • CM en o Ln ^ in o — vo "^ tn ^ wo ^ CM ' CM ' 1— ^— ^ • CM «X) c— O • V£3 ta- ^ • CM >X> r— O . II O- CO CO CM ^ 00 (X) • CO • CM • C\J >-^ CM — ■■ I— •>-- s CO CO CTl • CM • r«. ^ UD CO CM 1—^ — - O KD • CM ^ LD O UD 00 <— ^ • . V^ • VD "^ LD ^ LO ^ CM ^-^ CM — ' I II o o co o 2 o o to "d- CM LO 166 ^ ir> CO r— ^ . (JD • l£5 'd- LO 'd- CM ^— ' ■ ' CO CO CM ^*- CO cr> (£) . CO • CM • r^ ^ r^ ^ U3 CO CM ^-> CM — -' ■ — ' Q UJ O- oo r>. CM LO CM LO^> LO CM I — O O • CO • •^ CM 1 — CM LO^> CO'^^ to to o s- CM ^ CO • 1^ ^ CM — ' in 2 O 5 o to O VO as • LO ^ CM ' CO CO CO CT> QJ wo • CM r^ «^ wo CO (Tj CM ,— ^~> 10 CM 1— o r^ • CO • LO CM 1 — CM LO^^ CO to ro to ro UJ 00 (O S- cn o i~ o. i- o o 4-> u ro t3 (U (U Q- 00 -o c ro CM "^ O) 0) 0) CO • E E fc= r^ *:!■ ro ro rO CM to to to LU 00 c UJ -J CO .. ol t3 r-l O Qi •• a. s: O) CL. CO X CO ■* 00 '=^ CO CO CO CO CO CO CM 00 00 00 LO CM O • ^ CM LO — ■> 00 «* CO CO 0) E o •r— u 00 00 CO si- 00 0) zn •r- U- CM a CO LO to 00 CTl 1 — CM I OJ O .— .— CO CO ^B^ I— |COC\J .— OJ r— C^0 c\j p^ c\j r-» CO o a. 00 cr> 00 cvj r>>. CO ^ r^ ^00 CO en en o CvJ I— CM •— c\j 00 ■— r^ >X3 ID CT» O LO O CM «— r— I— ^ ID CM I— CM 00 CM 00 I— rv. vo 00 in CM 00 CT> CO 00 CT> CM «;r CT> CT» CO ^ cn <0 CO "Si- r— CM (D CO LT) LO r>. 
r»» r^ r — cm O^ CO (T> CO CO VD CM 00 ^ CO CT» CO r^ li) CTl O ^ 1— CM t— CM CM 00 r— P*, CM 00 cn CO VO CO r^ i£> 00 I— CM CT> CO CO <^ en I CM CVJ CM CO CM 00 LO CO CO pv. o ^ o II 11 Q- CO a. 00 CM 00 CT» ^ 00 I CO *i- CO «;*• I P% 00 CM LO I LO CO LO I LT) LO LO Ps LO 1—00 cTi *;r cn ^ CO p> CT> 1 CM CM CM CO LO CO Pn LO C\J 00 P-s f— CT> ^ P^ 00 00 I C^O CM LO I LO LO I LD LO LO 1—00 CT> «::}• 00 r^ "^ I CO I CM I— CM I CM CO I LO CO t CO LO I LO O CO CO LO O p> cn ^ cy> «::j- 00 Ps. Ui < UJ 00 3i U' 4 00 00 00 CO «* CO 00 OJ CO X CM CM I CM CO I LO LO I LO O CO O P> cT> ^ 00 r«>. CO «* 00 "^^r 00 00 tA O 2 o to O) E 0) 00 CO CO 00 CM o (/1 (TJ dJ (A 167 LO O CM 00 CO ^ cn en CO LO o CM 00 CO «;f cn en CO ':d- cn CO r^ p^ CT> CO LO CM 00 CO ^ CO cn ^ «:3- I CO P> LO cn <;f CO I CO I CO 0% A CO LO cn "d- LO o LO 2 o 2 o Q) CC (/) 3 O •r- S- o > ■M CM E cn o O- .jbdOa. ;8 CM c; I— CM CO LO LO 00 cn o I— CM ,_ o A CO o «^ r— 1 — ^_^ •< « II ro o II s: CM o 00 •t n^ a. >^ i^~ •k D- f— CM oo O ro O ^ — s CM r— f-^ «« •» «k r- O r— LT) O 23, 36,1 WD II O O K£3 I— CO o CO I— 00 CM o o o I— o <:1- o CM 168 OJ o CM O (^ •« *^ I I— «£) o CM ro $_ CO I CM •< «o I— o IT) I— 2 o o o <:a- O o o CM I— CM 1— ^ «« ^ •« CM O <* o LT) O in o m> r— •* r^ ^ " CM •• CM VO ^ o CM O CM O r^ •> (T> •> "^ 1 ^ 1 9\ «l S- A t\ j£ CM WO o O WD O CM CO s- CM CO S- <^ 1 WD 1 CM " CM •> «o «o CM O n5 N •r— U 00 CO 0) cn I 3^ n\ gpi ^_^ r— M <— o 1 1 — 1 CO CO o .— vo CO O CO to r- O ^ r- ro CO CO f — r— I — , — VO ^-^ •> •> « #k «t «> 0k 0k 0k 0k 0k 0k A 0k •« A II LD r^ CO r>» CO o CO 00 cr> r^ to 00 1— p>. CM CO II z: r— ^— r-~ CO r— CO CO 00 A ^ A «t A m 0k 0k •^ •» 0k 0k 0k M 0k 0k cu -^ m CT> uo cr> CO r— r^ ^ vo r^ to «:a- in r^ r^ ^ o. r^ oo vo o CO o Ln vo r>. o vo ^o CO O to to ^ o ^"^ r-^ r— ^— r— r— r— ^— ^— r— A » A #1 «« M •» •* «s «« 0k 0k 0k 0k 0k 0k •t CO 00 VO 00 U3 P«» CO 00 r>» r^ «?^ 00 CO r^ r«. CO n— CO I — CO CO CO CO CO ^-^ M •« A A A •« #t A 0k 0k #k 0k 0k 0t 0k 0k f— «;!■ r- ^ r-^ CO CO ^ CO CO o ^ to CO CO «;1- CM r— CO 1— r— CO CO 1— 1— CO CO r- I— CO «^ CO t— LO to r^ V£> CTl CO cn un vo r^ en to to r^ CTl to to CO CD , (•'^^ ^mm ^.^ ^^ o •% A «« «t 0k 0k 0k 0k A A A 0k A A A « A r— 1 in 1 <■£> 1 1 — 1 r^ 1 CO 1 CO 1 to 1 ^- CO f— CO CO CO CO CO v.- ^ •« 0^ M «« 0k 0k 0k 0k W\ #k •> M 0k 0k S M •> s 5 5 o CO O CO t^ CO 1— CO CO CO O CO r^ CO o CO CO o o o VO CO 1— CO t— I— CO CO 1— I— CO CO 1— 1— CO s- CM r- S- $- s_ II Q. U3 1 CO I ID 1 r^ 1 to 1 CO 1 to 1 ^ 1 -"-^H r— " •> 0k 0k «t 0k 0k 0k A A •1 A 0k 0k •* •» A CsJ 00 <£) 00 vo r-o, CO CO r^ rv. ^ 00 CO r>. to 00 o CO r— cu CO CO CO CO *•— > 0\ 0* 0i 0t •k 0k 0k 0k •* •* 0k 0k 0k 0k 0k 0k rrr ^ r- ^ r>N «:J- CO «d- o^^ o ^ r-N CO to CO Lf> m to 10 CO I— CO t— f— CO CO I— r— CO CO 1— 1— CO <0 CO 1— fO rxj «o ^O 1 CO 1 LO I rs. 1 to I r-N 1 to 1 CO 1 ^""^ O •» •« A ffh •> 0k 0k 0k «l «t 0k 0k 0k 0k 0k 0k •« r— 1 LO 1 to I 1 — 1 r>N 1 CO I CO 1 to 1 o CO 1 — CO CO CO CO CO SI) CO (O •» 0k CO «;1- E E re CO 1 — CO 1— r— CO CO 1— I— CO CO f— 1— CO (/) CM 1— lO to 10 ^ o" M r- 1 vo ■ O 1 ^ 1 CO 1 ^ 1 CO 1 r^ 1 ^ O r*~ ^_ » — ' •I 0i #1 «k •* A 0k 0k M A 0k 0k 0k 0k 0k 0k 11 11 ^— 1 ^— 1 »^ 1 KO 1 r— 1 r^ 1 CO 1 ^ 1 21 ^ CO CO v 00 CO ^ CO r% r^ CO CO ^ 1^ O 00 (Tt CO a. CO CO CO CO CO CO ^ ^ CO «* <:!• ^ CO «* ^ ^ CO _ U4 ^.^ CO CO ^ 00 ^ 00 "* CO CO '^ CO ^ UJ < 00 ^ •^ CO ^ 00 ^ CO ^ ^ 00 ^ 00 Q- CO CO 00 00 CO 00 CO CO 00 00 00 00 ^^ ^.^ ^.^ _ LJ ' — ■ r— F— ^— , — •« 0k A 0k 0k ^ r— ,— o f— o CO "" "—^ CU II E -iiJ .^ z: si Q. 
CO CO < CO X cs r— CJ CO ^ ID to r^ 00 (y> o _ CO 169 I oo E o i- o o I. o (/) o: (/) 3 O •r- S- (O O (O N -o o C_) 0) C7» ^ al a •i .J '■'T- 3 ;i *) '.iT^^'^a. C3 1 e3 r. 3» It Sm 170 VITA Kuo Yon Wen wis born in Shanghai, China on November m, 19U9. He received his B.S. degree in Electrical Engineering and Coirputer Science in 1971, and his M.S. degree in Computer Science in 197u, both from University of Illinois, nrbara-Chairpaign. From 1971 to 1576, he was a research assistant in the Department cf Computer Science. He was associated with the Illiac III pro-ject from 1971 to 1973 and with the Machine and Software Organization project since 1973. He was the coauthor with Prof. Duncan H. Lawrie of a paper, "Effectiveness of Various Processor/Memory Interconnections," presented at the 197f International Tonference on Parallel Processing. He is a member of the Institute of Electrical and Electronic Engineers. I IIIOCRAPHICDATA 1. Report No eport No. UIUCDCS-R-76-830 lie and Subtitle Interprocessor Connections-- Capabmties, Exploitation and Effectiveness y[hor(s) JO Yen Wen Irforming Organization Name and Address diversity of Illinois at Urbana-Champaign Ijpartment of Computer Science itana, Illinois 61801 t, )ori soring Organization Name and Address litional Science Foundation lishington, D. C. 3. Recipient'a Acceaaion No. S- Report Date October, 1976 4. t> Performing Organization Rept. ''°- UIUCDCS-R-76-830 10. Proi«ct/Taak/Work Unit No. 11. Contract /Grant No. US NSF MCS73-07980 A03 13. Type of Report 8t Period Covered Doctoral Dissertation 14. i. jpplementary Notes I. bstracts licently, some research interests has centered around interprocessor connections for ::MD type parallel machines. However, we still lack a methodology for evaluating irious networks. In this paper, we first present some new results on network I'operties. Then we show how to exploit various networks in ordinary computations. Inally we describe how we can apply the theoretical results to predict the lirformance of some network in a real program environment, which is the true measure r network effectiveness. '• ey Words and Document Analysis. 17a. Descriptors i /,Tay Slicing ('inputation Adaptation ftwork Control Irinutation Capabilities !MD Type Machine Itmulation 'fc|dentifiers/Open-Ended Te 'e,:OSATl Field/Group ••jailability Statement fjlease Unlimited '^ NTIS-3B (10-70) 19. Security Class (This Report) 5I£J U:iC:LASSIFIED cunty Class (Thi; 20. Security Class ( 1 his Page UNCLASSIFIED 21- No. of Pages 176 22. Price USCOMM-OC 40329-P7I ■if 'Si -'M 3 «2 «3 I ms^ ■irtCTP Ik ..-ran e 3 3 5» 'S -a ->' m' m^ t ■M' ..my. TJ^^ Ml Mk JAN ^ 9 197b UNIVERSITY OF ILLINOIS-URBANA 510e4IL6Rno C002 no 830835(1976 Implementation ol the language CLEOPATRA 3 0112 088403073