Digitized by the Internet Archive in 2013: http://archive.org/details/simulationoftree503swan

Report No. UIUCDCS-R-72-503

SIMULATION OF A TREE PROCESSOR

by

Larry Allen Swanson

January, 1972

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

This work was supported in part by US NSF GJ 2744 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, January, 1972.

TABLE OF CONTENTS

1. INTRODUCTION ....................... 1
2. MOTIVATION FOR A TREE PROCESSOR ...... 2
3. OVERALL DESCRIPTION OF SIMULATOR ..... 6
   3.1 Simulator Input ................. 6
   3.2 Simulator Scheduler ............. 7
   3.3 Processor Simulator ............. 13
   3.4 Machine Configurations .......... 16
4. SIMULATION OF ROUTERS ............... 23
   4.1 General Design .................. 23
   4.2 Illiac IV and Semmelhaack Routers 26
   4.3 Crossbar and Batcher Networks ... 27
   4.4 Log Router ..................... 27
5. EXPERIMENTS AND RESULTS ............. 28
   5.1 Discussion of Experiments ....... 28
   5.2 Discussion of Results ........... 31
LIST OF REFERENCES .................... 40

ACKNOWLEDGEMENT

The author wishes to express appreciation to Professor D. J. Kuck for valuable assistance and utmost patience during the development and writing of this thesis. The author acknowledges Paul Budnik and Yoichi Muraoka, who designed and developed parts of the simulator. The simulator's input was produced by a program written by Joseph Han. Funds from National Science Foundation Grant USNSF-GJ-2744, administered by Professor Kuck, were used for computer time to complete part of the simulator.
Finally, the author appreciates the many hours spent by Diana Mercer and the author's wife, Judith, typing the manuscript.

1. INTRODUCTION

To increase the throughput of data in computers, the organizations of several computing machines have been designed to take advantage of parallelism in computing. Examples of such machines are ILLIAC IV, the Burroughs 5500, the IBM 360/91 and the Control Data Star. A design which seems to exploit the natural structure of assignment statements has been proposed by Professor David Kuck. This paper will discuss part of the motivation for such a machine and the overall organization of Kuck's proposal, and describe a timing simulator which was written to simulate a class of such machines. In addition, the experiments which were run on the simulator and their results will be discussed.

2. MOTIVATION FOR A TREE PROCESSOR

Consider the assignment statement R = B*C + D*E, which maps onto the binary tree shown in Figure 1.

[Figure 1: binary tree for R = B*C + D*E, with leaves B, C, D, E]

As proposed by Kuck, a natural connection of processors in a parallel processor would be a tree configuration, as shown in Figure 2 for the example assignment statement.

[Figure 2: three-processor tree; PE2 and PE3 feed PE1]

Thus, PE2 and PE3 would calculate the products B*C and D*E simultaneously. These intermediate results would then be routed to PE1, where the addition could take place. The statement would be processed in two steps, whereas three steps are required by a single processor. Obviously, another similar assignment statement could be started in processors PE2 and PE3 while PE1 is processing the addition of the intermediate results. Thus, if four assignment statements similar to the given example were to be processed, twelve steps would be needed by a single processor while only five steps would be required in the tree processor. For ideally distributed operands and operators, one statement of N operands requires N-1 processing steps in a single processor but only log2 N steps for a tree processor with N-1 processors.
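The ideal step counts just described can be sketched numerically. The function names below are illustrative only and are not part of the simulator.

```python
import math

def single_processor_steps(n_operands, s_statements=1):
    # A single processor performs N-1 operations per statement of N operands.
    return s_statements * (n_operands - 1)

def tree_processor_steps(n_operands, s_statements=1):
    # A tree of N-1 processors finishes the first statement after log2(N)
    # steps and, kept full, completes one more statement every step after.
    return s_statements + int(math.log2(n_operands)) - 1

# Four statements like R = B*C + D*E (four operands each):
print(single_processor_steps(4, 4))  # 12 steps on a single processor
print(tree_processor_steps(4, 4))    # 5 steps on a three-processor tree
```

For one statement the two formulas reduce to N-1 versus log2 N steps, as in the text.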
More significant is the fact that if S such statements of N operands are to be processed, only S + log2 N - 1 steps are required for execution in an N-1 processor tree, whereas S(N-1) steps are needed by a single processor. Obviously, these expressions are for ideal situations, and several problems of this design present themselves immediately. One problem is that for real programs the distribution of operands and operators is much less than ideal. In addition, few assignment statement trees would fit onto any fixed size of tree processor perfectly; many assignment statement trees would be too small and a few undoubtedly would be too large. Hence, buffers with a tag algorithm similar to that implemented in the IBM 360/91 would be required to drive the processors in the tree structure. Another problem is supplying the tree processor with a sufficient amount of data so that it can be kept busy. Certainly at least one memory module for each processor in the bottom level of the tree processor would be necessary. However, only in some very rare cases would the memory module associated with a processor in the bottom row contain the data required by the processor. Thus, a router or alignment network would be necessary to distribute data from the memory modules to the processor buffers. In addition, a router would be required to route data from the outputs of the processors to memory, and also from the outputs of processors back to the bottom level of processors in the tree processor. Data is required to be routed from the outputs of processors to the inputs of bottom level processors whenever a result of one expression is required by a following expression. This condition will be referred to as "feedback". The general machine organization discussed above is shown in Figure 3.
(For simplicity, all descriptive diagrams of the tree processor will include only seven processors.) Since the performance of such a machine organization on real programs is extremely difficult to determine theoretically, it was decided that a timing simulator be written. The simulator's purpose is to help determine what type of routing network would be required in each position such that no bottlenecks occur in the overall system. In addition, the simulator will also help determine the buffer sizes necessary between processors and in the router networks such that the system contains no bottlenecks. Thus, throughput and buffer sizes will be measured for various machine configurations.

[Figure 3: seven-processor tree with Router 1 connecting processor outputs back to the base processors and Router 2 connecting the memory modules to the base processors]

3. OVERALL DESCRIPTION OF SIMULATOR

3.1 Simulator Input

In an attempt to provide a parallel machine design which does not suffer from any serious bottlenecks, Kuck has proposed that the series of elements in this parallel processor be pipelined, driven by buffers, and that each element have an equal pipeline step size. Thus, the instruction decoder, tree processor and all routers are assumed to be pipelined with a common pipeline interval. This concept about the overall design should be kept in mind while reading the remainder of the paper. Input to the simulator consists of height reduced assignment statement trees produced by a program written by Han [2] which implements some tree height reducing algorithms proposed by Muraoka [3]. A simple example is given here to give the flavor of the work done by Han's program. Consider the following FORTRAN assignment statement:

R = A * (B+C*D) + E    (1)

As shown in Figure 4, a four level, fifteen processor tree would be required to process the statement as written, without using a temporary result and routing the result back to another processor as input.
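The four-level claim for statement (1) can be checked with a small sketch. The nested-tuple encoding of trees is ours, not the simulator's, and the particular distributed form shown is one assumed rearrangement consistent with the seven-processor tree described in the text.

```python
# Expression trees as nested tuples (op, left, right); leaves are variables.
def levels(tree):
    # Number of processor levels needed: one per interior node on the
    # deepest path from the root down to a leaf.
    if isinstance(tree, str):
        return 0
    _, left, right = tree
    return 1 + max(levels(left), levels(right))

# Statement (1) exactly as written: R = A*(B+C*D) + E.
as_written = ("+", ("*", "A", ("+", "B", ("*", "C", "D"))), "E")
# One distributed form fitting a three-level tree: R = (A*B + E) + (C*D)*A.
reduced = ("+", ("+", ("*", "A", "B"), "E"), ("*", ("*", "C", "D"), "A"))
print(levels(as_written), levels(reduced))  # 4 3
```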
Note also that only four of the fifteen nodes in the tree processor would be utilized. (If other instructions were present in the machine, the unscheduled processors might be able to do other useful work.)

[Figure 4: statement (1) mapped onto a four-level, fifteen-processor tree]

However, with a distribution and a proper permutation of the input variables, the expression can be evaluated with a three level tree; only a seven processor tree would be required to calculate the expression without feedback, as shown in Figure 5.

[Figure 5: the reduced expression on a three-level, seven-processor tree with inputs A, B, E, C, D, A]

3.2 Simulator Scheduler

A scheduler designed and programmed by Paul Budnik drives the entire tree machine using the tree height reduced expressions generated by Han. The scheduler uses an instruction look ahead unit which can be varied in length for different experiments. For any given program with a given number of processors in the processor tree, some instructions would not fill the tree while others would be too large to fit on the tree. Thus, the scheduler is designed to schedule more than one expression on the processor tree for each clock if one expression does not fill the entire processor tree during a particular simulation. The statements (2) and (3),

R1 = B*C+D+E    (2)
R2 = G+H*I+J    (3)

would then be scheduled to be input into a seven processor tree as shown in Figure 6. Note that the results R1 and R2 will emerge from processors PE2 and PE3 respectively.

[Figure 6: statements (2) and (3) scheduled together on the seven-processor tree]

The scheduler also cuts trees that are too large to fit on the processor tree during a given simulation by assigning temporary variables to intermediate results and setting the proper tags to enable R1 to route the temporary result to the correct destinations. An example of such a case is shown on the two trees in Figure 7 for statement (4).
Note that a portion of another assignment statement tree, or a complete assignment statement, can be scheduled on processors PE6, PE7, and PE3 while the second part of the instruction

R = A*B*C + D*E*F + G*H + I    (4)

is being processed. Notice also that PE5 and PE7 in the first tree only "pass along" their operands to the next level.

[Figure 7: statement (4) cut into two trees; a "-" entry indicates either no operation or no input data]

Budnik's scheduler also handles dependency such as that in instructions (5) and (6). The result of (5) is required as input data in (6).

R1 = A * B + C    (5)
R2 = R1 + B * D    (6)

Thus, the scheduler was designed to have the result of (5) routed back directly to the proper buffer in the bottom level of the processor tree instead of delaying instruction (6) until R1 arrived in memory. This is implemented by assigning a number, in an increasing unary sequence starting from zero, to each different variable which is encountered in the instruction stream. This number is the index of an array named MEML which indicates the location in the machine of the instruction producing the variable as a result. The number assigned to each variable as it is encountered in the instruction stream also indicates the memory module in which the variable will be assumed to be stored. The memory module number is the variable number modulo the number of memories being simulated. Thus, no fancy storage scheme was used for the simulator. The value of MEML indicates whether the variable is in memory, being processed as a left hand side in the machine before R1, or being processed as a left hand side in the machine inside R1. Then, if a variable is being produced as a result in the machine anywhere before R1 and the variable is required as input data to another instruction, as in (5) and (6) above, another element is added to the linked list of destinations for the instruction producing the result.
Thus, in our example another element would be added to the linked list of result destinations for instruction (5) so that R1 could be input to the proper processor buffer for instruction (6). INSTRORES, one of several arrays associated with the instruction format abbreviated by INSTR, is an array indexed by the instruction number which points to the head of a linked list of destinations to which the result of the instruction is to be routed. One of the elements in the linked list, INSTRFLG, indicates whether the result is to be routed to memory or back as input to a bottom level processor. Thus, if (5) and (6) were the only two instructions in the instruction stream, INSTRORES for (5) would point to a linked list containing two entries. One entry would indicate a destination to the memory location of R1 and the other entry would indicate a processor destination for instruction (6). The destinations for the results going to memory or to a processor as input are stored in the array INSTRDATA. If the INSTRFLG entry in the linked list of parallel arrays indicates that the result should be routed as input to a processor, another parallel array, INSTRST, indicates the target instruction for the feedback result. Feedback results occur when assignment statement trees are cut by the scheduler, producing temporary results, or when dependency occurs as in the above example. The parallel arrays INSTRDATA, INSTRFLG and INSTRST are linked together by another equally dimensioned array named INSTRLINK. Another array, parallel to INSTRORES and of the length of the look ahead, is INSTRTYP. This array differentiates between assignment statements of the type variable = variable and those of the type variable = expression.
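The parallel-array linked list just described might look like the following sketch. The array names follow the thesis, but the array sizes, the flag encoding, and the helper-function names are our assumptions.

```python
MAXI, MAXD = 64, 256
NIL = 0                       # list terminator (slot 0 is left unused)
INSTRORES = [NIL] * MAXI      # instruction number -> head of destination list
INSTRDATA = [0] * MAXD        # destination: memory location or buffer number
INSTRFLG  = [0] * MAXD        # 0 = route to memory, 1 = feed back to a processor
INSTRST   = [0] * MAXD        # target instruction for a feedback result
INSTRLINK = [NIL] * MAXD      # link to the next destination; NIL ends the list
free = 1                      # next free slot in the parallel arrays

def add_destination(instr, dest, to_processor=False, target_instr=0):
    global free
    slot = free
    free += 1
    INSTRDATA[slot] = dest
    INSTRFLG[slot] = 1 if to_processor else 0
    INSTRST[slot] = target_instr
    INSTRLINK[slot] = INSTRORES[instr]   # push onto the head of the list
    INSTRORES[instr] = slot

def destinations(instr):
    slot, out = INSTRORES[instr], []
    while slot != NIL:
        out.append((INSTRDATA[slot], INSTRFLG[slot], INSTRST[slot]))
        slot = INSTRLINK[slot]
    return out

# Instruction (5): R1 goes both to memory (location 17, assumed) and back
# as input to a bottom level processor buffer (number 3, assumed) for (6).
add_destination(5, dest=17, to_processor=False)
add_destination(5, dest=3, to_processor=True, target_instr=6)
print(destinations(5))  # [(3, 1, 6), (17, 0, 0)]
```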
Since it is obviously inefficient for the tree processor to handle any statement of the type variable = variable, INSTRTYP is tagged to indicate that the instruction is of this type, and the instruction's linked list of results (pointed at by INSTRORES) indicates to which destinations the left hand side is to be routed. The interfacing between the scheduler and R2 then insures that an instruction of this type does not reach the processing tree, but that each of the elements in the linked list of result destinations is placed into the proper routers. Summarizing, it would be well to point out again that INSTRORES and INSTRTYP are arrays associated with the look ahead unit and are of the same length. INSTRDATA, INSTRST, INSTRLINK and INSTRFLG are four parallel arrays which provide the proper information to the simulator regarding what is to be done with the left hand side of an instruction when it emerges from the processor tree, or when it is available from memory and is tagged of the type variable = variable.

3.3 Processor Simulator

The design and programming of the tree processor itself was done mostly by Yoichi Muraoka. Its general structure is shown in Figure 8. The buffers between the processing elements are shown explicitly in the diagram. The length of these processor buffers can be varied between simulation runs. Each of the processors contains an add/subtract unit as well as a multiply/divide unit, as shown. Both units are assumed to be pipelined, and the length of the pipeline in either the adder or multiplier can be varied between each simulation. The number of operands that each processor can accept and the number of outputs from each processor per clock are also parameters to the simulator. Since both an adder and multiplier are present and are independent in each processor, the system can be allowed to permit four operands to arrive at each processor each clock - two for the adder and two for the multiplier.
If four operands are input to each processor tree node, two results must be permitted to leave the processor during a particular clock. Another option provides that an operand can be "routed around" a processor to the next level of the tree if no operation is to be performed on the data. An example of this case has been shown in Figure 7. As mentioned previously, the need for the processor buffers arises from the fact that very few assignment trees found in real programs are both full and balanced. Here the term balanced indicates that each subtree of a particular processor takes the same amount of time to execute, and full indicates that all nodes of the processor tree would be scheduled.

[Figure 8: the processor tree with its processor buffers; a switch on each processor output determines whether the output exits to the next level in the processor tree or to the result router. 1 indicates inputs to the result router (R1); 2 indicates inputs to the bottom level processors from the router between memory and the processors (R2).]

The following assignment statement can be scheduled on a full balanced tree as shown in Figure 9.

R = (A+B) * (C+D) + (E*F) + (G*H)    (7)

[Figure 9: statement (7) on a full, balanced seven-processor tree]

However, in the schedule shown in Figure 6, the result from PE4 would not be calculated as soon as the result of the addition from PE5, since the multiply would take more processing time. Hence, PE2 could not proceed until the product of B and C had been executed. Especially when sparse trees are being processed, it is not difficult to construct reasonable examples in which one whole instruction which originally was later in the instruction stream would emerge from the processor tree before a prior instruction. The inclusion of buffers before each processor in the processor tree certainly seems reasonable from an efficiency standpoint.
Of course, the necessary length of these buffers is one question which the simulator should help answer. As mentioned previously in the discussion of the tree processor, the number of operands arriving at each processor per clock is variable. The delivery of a varying number of operands at each processor per clock can be achieved in two ways. Suppose one desires to have four operands arrive at each processor in each processor clock instead of two operands. First, the increase in the data rate can be achieved by increasing the speed of the memory such that each memory module can produce twice as much data in a given amount of time. Perhaps a more practical alternative is to provide twice as many memory modules to feed the same number of processors. Either scheme can be implemented in the simulator. The simulator will simulate situations in which 1, 2 or 4 memory modules are provided per processor in the bottom level. These rates correspond to 1/2, 1 or 2 arithmetic expressions arriving at the bottom row of the processors at each clock.

3.4 Machine Configurations

The simulator allows three general router configurations to be simulated. Figure 10 shows one of these configurations. For the discussion of the various machine configurations, the separate add/subtract unit and multiply/divide unit in the processor will not be shown, in an attempt to keep the diagrams fairly simple. In addition, dotted lines indicate the presence of more than one connection.

[Figure 10: the first two-router configuration, with Router 1 feeding the bottom level processors and Router 2 connecting to the memory buffers and memory modules]

[Figure 11: the second two-router configuration, with a switch S1 on each processor output]

In the configuration shown in Figure 11, each result emerging from a processor is input to another switch, represented by one block in the diagram labeled S1.
The switches S1 determine whether the results from the processors are to be placed into R1, and subsequently routed to one of the input buffers in the bottom level of processors, or placed into R2, and subsequently routed to one of the memory buffers. Notice the difference in the function of the switches labeled S1 in Figure 10 and those labeled S1 in Figure 11, and that the number of switches labeled S1 required in Figure 11 is equal to the number of processors in the processor tree. The final general machine configuration uses three routers instead of two as in the former two configurations. Figure 12 shows that R1 routes data output from the processors to input buffers for the bottom level processors. R2 routes data only from the memory buffers to the input buffers for the bottom level processors. R3 routes data from the output of the processors to the memory buffers of the proper memory module for the data. Thus, each router performs a single specific task. Figure 12 closely resembles Figure 11 except that one of the outputs of switch box S1 ends at R3 instead of R2. Notice that in all three configurations the switches S are required. Any of the following five types of routers can be placed in each or all of the routing network positions (R1, R2 and R3) during any simulation run:
[Figure 12: the three-router configuration; R1 feeds results back to the bottom level processors, R2 routes data from the memory buffers to the bottom level processors, and R3 routes processor results to the memory buffers]

Log Router
Illiac IV Router
Semmelhaack Router
Batcher Network
Crossbar Switch

In addition, the size of the processor tree (and therefore the size of the routers), memory buffer size, processor buffer size, number of memory modules, length of instruction look ahead, and number of memory modules per processor in the bottom level are variable between simulations. The maximum number of following instructions to which the left hand side of an instruction can be linked is determined by the length of the instruction look ahead.

4. SIMULATION OF ROUTERS

4.1 General Design

Since the Illiac IV and Semmelhaack routers have a similar structure, essentially the same program was used to simulate both routers. The sense in which the Illiac IV and Semmelhaack routers are similar is that both networks shift all data currently being routed in the network a certain fixed distance each routing cycle. The sequence and distances of the shifts are determined by some control hardware. Thus, a different subroutine is called from the same program to simulate the control of the Illiac IV or Semmelhaack router. Indeed, any router which has the characteristic of shifting all data a uniform distance during a certain routing cycle could be simulated with this program. One would only be required to build the proper subroutine to simulate the control network for the particular router. Since conflicts cause the queue lengths and the time through an Illiac IV type router to vary depending on the input data, input frequency and control network simulated, the structure of this type of router is simulated. No hardware structure was simulated in either the crossbar switch or the Batcher network. As a result, the same program is used to simulate either the crossbar switch or a Batcher network. The simulation of the hardware structure of these two routers was not necessary.
When a set of data to be permuted is introduced into either of these routers, the design of the hardware assures that each data element will reach its destination after a fixed amount of time depending on the size of the router. Since our simulator is only concerned with a timing study, the program to simulate the crossbar switch and Batcher network merely delays a data element the proper amount of time before the element is placed in its destination buffer. The third simulation program for routers is for the log router. Suppose the log router is required to accept N data elements during a given routing cycle; in other words, the router width is N. Then each of the data elements can be shifted distances of 0, 1, 2, 4, ..., N/2 during any particular routing cycle. Notice the difference between the structure of the log router and the Illiac IV type. In the Illiac IV type router, all data being routed during a particular cycle must be moved a fixed distance; however, the log router design permits each of the data elements being routed to be shifted any of the allowed distances during any routing cycle, provided the destination buffer is not full. The routing pattern in the log router is simply a binary number set to the distance to be routed. A set bit in this binary number then indicates that a distance equal to the bit's place value in the binary number should be included in the routing pattern. Since in this algorithm more than one element of data may arrive at a certain destination during a routing cycle, queues may form depending on the routing pattern of the input data and the frequency at which a new set of data is introduced into the router. Thus, it was necessary to simulate the structure of the log router. The general flow of each of the programs which simulate routers is shown in Figure 13.
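The binary-number routing pattern can be sketched directly. The function below is an illustration in the spirit of the log router description, assuming data moves toward higher-numbered positions with wraparound; that directional convention is our assumption, not stated in the text.

```python
def log_router_shifts(source, dest, width):
    # The routing pattern is the binary representation of the distance:
    # each set bit contributes a shift equal to that bit's place value.
    # Shifts are taken largest power of two first, as Section 4.4 describes.
    distance = (dest - source) % width
    shifts = []
    power = width // 2
    while power >= 1:
        if distance & power:
            shifts.append(power)
        power //= 2
    return shifts

# Routing from position 1 to position 6 in a width-8 router: distance 5 = 4+1.
print(log_router_shifts(1, 6, 8))  # [4, 1]
```

At most log2(width) shifts are ever produced, one per bit of the distance.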
Since any of the routers can be placed in any of the router positions, a general interface program named D0NE is called during each routing cycle to empty buffers of elements which have reached their destination. D0NE empties routed elements into the proper memory or processor buffer depending on which simulation program calls D0NE and the current machine configuration. This program corresponds to the first block in Figure 13.

[Figure 13: general flow of each router simulation program - remove from the router items which have reached their destination; input items to be routed from the interface buffers into the router buffers; do the routing cycle; gather queue statistics for the router]

Two other interface programs are used. MBFTCH removes data from the memory buffers and places it into the interface buffers of R2. PD0NE is called when a result emerges from a processor. PD0NE then places the data element in the proper interface router buffer depending on the current machine configuration.

4.2 Illiac IV and Semmelhaack Routers

Since queues form in the Illiac IV - Semmelhaack type router network as well as in the log router, some preliminary router simulations were done to determine what discipline for removing items from the router queues would result in the best throughput. It was found that the average time for each data element through a router, the average queue length in the router, and the maximum number in any queue were minimum if the oldest data elements, relative to the time the elements were placed into the router, have the highest priority to be shifted from each queue. Thus, for any router in which queues form, the router simulator removes data elements corresponding to the oldest machine instructions first. Since the control program for the Illiac IV - Semmelhaack router is a subroutine, any sequence of shifts is possible. For the Illiac IV router, shift distances of ±1 and ±⌈√N⌉ were simulated, where N is the width of the router and ⌈a⌉ denotes the smallest integer not less than a.
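The oldest-first queue discipline just described can be sketched with a priority queue keyed on the instruction number, taking a lower number to mean an older instruction; the class and method names are illustrative, not the simulator's.

```python
import heapq

class RouterQueue:
    # Data elements carry the number of the machine instruction that produced
    # them; the element belonging to the oldest instruction leaves first.
    def __init__(self):
        self._heap = []

    def insert(self, instruction_number, element):
        heapq.heappush(self._heap, (instruction_number, element))

    def remove_oldest(self):
        return heapq.heappop(self._heap)[1]

q = RouterQueue()
q.insert(12, "result of instruction 12")
q.insert(7, "result of instruction 7")
q.insert(9, "result of instruction 9")
print(q.remove_oldest())  # result of instruction 7
```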
For the Semmelhaack router, any uniform shift distance can be simulated. The sequence and distance of the shifts used will be given in the results of each experiment in which either or both of the Illiac IV or Semmelhaack routers were used.

4.3 Crossbar and Batcher Networks

As mentioned previously, the program which simulates the crossbar-Batcher network type of router (or fixed time router) only delays each data element the time which the hardware would delay the data. If queues form at the input buffers of the router, the oldest data elements are removed first, as in the Illiac IV-Semmelhaack case.

4.4 Log Router

The sequence of shifts for a given data element in the log router is fixed. The shifts are of distances N/2, N/4, ..., 1, where N is the router width. Each element is shifted the highest possible power of 2 first, the next highest power of 2 next, etc. It is clear that at most log2 N shifts are required for each data element to reach its destination. As in the previous two simulation programs for the other two router types, the data element in the buffer corresponding to the oldest instruction in the machine is removed first.

5. EXPERIMENTS AND RESULTS

5.1 Discussion of Experiments

Unfortunately, time did not permit the performance of as many experiments as originally was hoped. Three different experiments were run on the simulator with the number of processors equal to seven. Then two of these experiments were simulated using fifteen processors. Each of these simulations used a three router configuration with all routers being the log router. In addition, some other simulations were completed to arrive at a relatively firm idea of what would be required in a system for the type of experiments performed in order that bottlenecks would not occur or would be kept at a minimum. For the additional experiments, router buffer statistics were not kept.
The FORTRAN program segments chosen to be simulated included a series of assignment statements taken from ACM FORTRAN algorithms. The other two experiments were back substituted sequences of assignment and conditional statements from ACM FORTRAN algorithms. Let some justification be given for this choice, and the term "back substituted sequences" be clarified. Programs in a high level language can be broken into parts consisting of iterative loops, sequences of assignment statements, and sequences of conditional statements. An array machine like Illiac IV performs excellently on iterative calculations when arrays are being used. However, most of the processors in this type of machine must remain idle when an assignment statement or the arithmetic involved in a conditional statement is being performed. In other words, sequences of assignment and conditional statements are potential bottlenecks in an array machine like Illiac IV. Obviously, the organization suggested here attempts to reduce this problem while keeping the array processing power. The array processing power could be kept by allowing Router 2 to route data to all processors in the tree - not just the processors in the bottom level of the tree. In this way, the straight array calculation power remains. Since bursts of assignment statements and a mixture of assignment statements with conditional statements represent the major source of bottlenecks for a parallel processor, experiments involving these types of sequences were chosen. In almost every nontrivial program, there exists at least one sequence of assignment and conditional statements. Since the tree processor can readily compute large expressions, it is reasonable to make a sequence of FORTRAN expressions into fewer and larger expressions to fill a processor tree. This can be accomplished by "back substituting" the expression for the assignment of a variable for any occurrence of that variable in succeeding statements of the sequence.
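A toy back-substitution pass over a short hypothetical sequence illustrates the operation just defined. Plain textual substitution is enough here because the variables are single letters that do not occur inside each other's names; a real implementation would work on parsed expressions.

```python
def back_substitute(statements):
    # statements is a list of (lhs, rhs) pairs in program order; each rhs has
    # every previously assigned variable replaced by its (parenthesized)
    # defining expression.
    env = {}
    out = []
    for lhs, rhs in statements:
        for var, expr in env.items():
            rhs = rhs.replace(var, "(" + expr + ")")
        env[lhs] = rhs
        out.append((lhs, rhs))
    return out

# A hypothetical sequence in the style discussed above.
seq = [("A", "B+C-D"), ("R", "A+E+F"), ("S", "A-R+W")]
for lhs, rhs in back_substitute(seq):
    print(lhs, "=", rhs)
```

Each resulting right hand side mentions only original inputs, so the statements no longer depend on one another and each can be mapped onto the tree as a single larger expression.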
This substitution can also be made for the arithmetic involved in any following conditional statements. In some cases the calculations involved for all paths through a sequence of statements can be done in as little or less time by a tree processor of a given size than by a single processor taking a single path. An example of this type of back substitution is shown in Figure 14. The FORTRAN sequence is shown in 14a and the back substituted sequence in 14b.

      A = B+C-D                      T1 = B+C-D-B-C+D-E-F
      R = A+E+F                      S1 = B+C-D-B-C+D-E-F+W
      IF (A.GT.R) GO TO 6            S2 = B+C-D+B+C-D+E+F-W
      S = A-R+W
      GO TO 7
    6 S = A+R-W
    7 ...

              a.                              b.

                              Figure 14

T1 represents the value of the arithmetic performed in the conditional statement. A single processor doing an add/subtract in 1 clock would require 7 clocks for one of the paths, while a seven element tree processor would perform the back substituted conditional and assignment arithmetic in 5 clocks. After the paths were evaluated at run time, the effects of only the actual path taken would result, depending on the temporary values assigned for the conditional statements. In the example above, either S1 or S2 would be stored to memory location S depending on the value of T1.

5.2 Discussion of Results

The gathered statistics for each experiment are shown in Tables 1 and 2. Table 1 shows the time statistics for each experiment and Table 2 shows the processor utilization and router statistics (when taken) for the same experiments. Dashes (-) in the tables indicate that the statistic was not collected for the experiment. The description of the data for each experiment is given below. The letters in the descriptions correspond to the letters in the column marked "Data Set" in Tables 1 and 2.

A - Several sequences of assignment statements from ACM FORTRAN algorithms, consisting of one hundred eighteen FORTRAN statements.

B - All paths through a sequence of assignment and conditional statements from the FORTRAN algorithm ADIPZ.
Each path was back substituted, and the data contained seventy-nine FORTRAN expressions.

C - Same as B, except for the FORTRAN algorithm CHSTEP. This experiment had ninety-six statements, of which twenty-seven were of the simple assignment statement type (A=B).

D - The longest path from experiment C. This data was not back substituted and was eighteen FORTRAN instructions in length.

All time statistics are averages except where noted and represent time for the data of an entire instruction, not for individual data elements. Time for a given statistic began elapsing when the first data element of an instruction reached the stage corresponding to that statistic and terminated when the last data element of the instruction was removed from that stage. A description of the statistic represented by each of the columns in Tables 1 and 2 follows:

Data Set: References the experimental data used in the simulator. The code letters correspond to the descriptions of the data sets given above.

Number of PEs: Number of processing elements in the tree for this simulation.

Number of instructions: Number of processor tree instructions; in other words, the number of expressions generated for the simulation by the scheduler.

Instruction memory time: Average time from the clock at which an instruction enters the machine to the clock at which all of its associated data is available in a memory buffer.

Time in memory buffer: Average time that data for an instruction waits in a memory buffer after becoming available.

Time in Router 2: Average time that data for an instruction spends in Router 2.

Time in PE tree: Average time that data for an instruction spends in the processor tree (time for the whole expression to be evaluated).

Time in PE buffer: Average time that the computed result of an expression waits in the output buffer of the processor tree.

Time from PE buffer to memory: Average time that a computed result spends in the router when being routed from an output PE buffer to memory.
Time from PE buffer to PE buffer: Average time that a feedback result spends in Router 1 when being routed from an output PE buffer to an input PE buffer in the bottom level of the processor tree.

Total time of instruction: Average time from the clock at which an instruction enters the machine to the clock at which all processing and routing of its results is completed.

Total clocks: Total number of clocks for the entire experiment.

Max # in PE buffer: Maximum number of data elements at any one clock in a single processor buffer.

PE utilization #1 for adder: (Total number of results leaving all adders) / ((Number of total clocks) * (Number of adders)).

PE utilization #1 for multiplier: Same as the expression for PE utilization #1 for adder, with adder replaced by multiplier.

PE utilization #2 for adder: (Total number of results leaving all adders, counted from the clock at which the number of data elements in the processors first equaled or exceeded the number of adders to the clock at which the last instruction entered the machine) / ((Number of clocks over which the statistic was taken) * (Number of adders)).

PE utilization #2 for multiplier: Same as the expression for PE utilization #2 for adder, with adder replaced by multiplier.

Ave # in R1 buffer: (Total number of elements entering R1 buffers) / ((Number of R1 buffers) * (Total clocks)).

Ave # in R2 buffer: Same as the expression for Ave # in R1 buffer, with R1 replaced by R2.

Ave # in R3 buffer: Same as the expression for Ave # in R1 buffer, with R1 replaced by R3.

Max # in R1 buffer: Maximum number of data elements at any one clock in a single buffer of R1.

Max # in R2 buffer: Maximum number of data elements at any one clock in a single buffer of R2.

Max # in R3 buffer: Maximum number of data elements at any one clock in a single buffer of R3.
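The timing convention and the utilization ratios defined above can be sketched as follows. This is an illustrative sketch only, not the simulator's code; the class and function names are invented here.

```python
class StageTimer:
    """Per-instruction time in one machine stage, as defined above: timing
    starts at the clock the first data element of an instruction reaches
    the stage and ends at the clock its last element is removed."""
    def __init__(self):
        self.first_entry = {}  # instruction id -> clock of first element in
        self.last_exit = {}    # instruction id -> clock of last element out

    def element_enters(self, instr, clock):
        # Only the first element's arrival clock matters.
        self.first_entry.setdefault(instr, clock)

    def element_leaves(self, instr, clock):
        # Each departure overwrites, so the last one is kept.
        self.last_exit[instr] = clock

    def average_time(self):
        times = [self.last_exit[i] - self.first_entry[i]
                 for i in self.first_entry]
        return sum(times) / len(times)

def pe_utilization_1(results_leaving_all_units, total_clocks, num_units):
    """PE utilization #1 for the adders (or multipliers)."""
    return results_leaving_all_units / (total_clocks * num_units)

def ave_in_router_buffer(elements_entering_buffers, num_buffers, total_clocks):
    """Ave # in a router buffer (R1, R2 or R3)."""
    return elements_entering_buffers / (num_buffers * total_clocks)
```

For example, three adders producing 300 results over 100 clocks would give a PE utilization #1 of 1.0, the ideal case of one result per adder per clock.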
[Table 1. Time statistics for each experiment. Columns: Data Set, Number of PEs, Number of instructions, Instruction memory time, Time in memory buffer, Time in Router 2, Time in PE tree, Time in PE buffer, Time from PE buffer to memory, Time from PE buffer to PE buffer, Total time of instruction, Total clocks, Max # in PE buffer.]