Report No. UIUCDCS-R-74-636
NSF-OCA-GJ-36936-000002

MACHINE PARAMETER DEDUCTION BY PROGRAM ANALYSIS*

by

Kuo Yen Wen

August 1974

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the Department of Computer Science and submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, August 1974.

ACKNOWLEDGMENT

The author is very grateful to Professor David J. Kuck and Visiting Research Assistant Professor Duncan H. Lawrie for their guidance and suggestions. Special thanks are also expressed to S. C. Chen and R. Towle for their advice. The diligent efforts of Mrs. Vivian Alsip and Mrs. Cynthia Fossum in typing this document and the excellent illustrations drawn by Stan Zundo are also greatly appreciated.

TABLE OF CONTENTS

1. INTRODUCTION
2. ANALYZER RESULTS
   A. Definitions
   B. Model
   C. A Description of EISPACK
   D. Results
   E. Observations and Interpretations
   F. Typical Path Results
3. MACHINE ORGANIZATION
   A. System Model
   B. Types of Operations
   C. Results
   D. Observations and Interpretations
4. CONCLUSION
APPENDIX
LIST OF REFERENCES
1. INTRODUCTION

With the introduction of LSI (large-scale integration) techniques into the computer world, the cost of computer hardware is dropping sharply. Computer designers can now obtain substantial performance improvement through a multiprocessing system without having to pay the price of a much more expensive system. With LSI, the cost of a system is no longer measured by the number of gates in the system: a much larger number of gates can be put on an LSI chip, and the number of pins available is actually the main limitation. Moreover, the cost of n similar processors wired into a multiprocessor is considerably less than n times the cost of an individual processor wired into a uniprocessor, so cost does not necessarily increase linearly with processing power.

Our motivation for a big, fast machine comes from linear algebra computation. In nearly all areas of engineering, physics and statistics, solutions of large systems of linear equations are fundamental. Although any general-purpose machine can solve any programmable problem given enough time, some linear system problems are so big that they can never get fully debugged on a slow computer because each run takes such a long time. The introduction of faster machines may lead to solutions of such problems. The higher cost-effectiveness due to LSI technologies, together with these problem demands, justifies the emergence of big array machines like ILLIAC IV, STAR, and TI's ASC.

One main area of interest in the development of a parallel processing machine is compiler techniques, such as parallelism detection and scheduling; this area is currently being tackled by a number of independent researchers [1]. Another area of interest is the organization problem: how to organize components into a system that gives the best efficiency on a given set of computations.
It should be noted that different computation and data structures can induce different machine organization parameters. Since solutions of linear equation systems are among the most heavily used computations, it would be desirable to discover some parameters of machine architecture by carefully analyzing existing linear algebra subroutines. This thesis analyzes in depth a set of matrix eigenvalue solution programs, the EISPACK routines.

The first section looks into some Fortran Analyzer results and deals with the following problems:

1. What is the average number of processors required in the machine to achieve the best speedup?
2. What is the upper bound of the average speedup we will get with such a machine?
3. Is this machine efficient enough, i.e., what percent of the time is the machine doing useful work?
4. How would the speedup and efficiency of our machine vary when the number of processors is reduced?

The second section presents some hand-analyzed results that attempt to tackle the following problems:

1. How should the data be accessed in the system?
2. How should the data be routed between the memory system and the processor system?
3. What should the instruction set of this machine contain to process the given set of computations most efficiently?

In the conclusion, an attempt is made to lay down some of the parameters of machine organization inferred from the analysis of the EISPACK programs.

2. ANALYZER RESULTS

A. Definitions

Some definitions have to be spelled out before the analysis results are presented. We define T_p as the time required to perform some computation using p processing elements. The speedup is then defined as S_p = T_1 / T_p, and the efficiency as E_p = S_p / p = T_1 / (p * T_p). Let O_p be the number of operations executed in performing the same computation using p processing elements. Then we define the operation redundancy as R_p = O_p / O_1.
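As a quick illustration, the three measures just defined follow directly from T_1, T_p, p, and the operation counts. The helper below is an illustrative Python sketch written for this text, not part of the Fortran Analyzer:

```python
# Hypothetical helper collecting the metric definitions above.
# t1: serial time T_1; tp: time T_p with p processors;
# o1, op: operation counts O_1 and O_p.

def metrics(t1, tp, p, o1, op):
    s = t1 / tp   # speedup    S_p = T_1 / T_p
    e = s / p     # efficiency E_p = S_p / p = T_1 / (p * T_p)
    r = op / o1   # redundancy R_p = O_p / O_1
    return s, e, r

# Example: a computation taking 100 unit times serially finishes in
# 10 unit times on 20 processors, executing 120 operations in all.
s, e, r = metrics(100, 10, 20, o1=100, op=120)
print(s, e, r)   # 10.0 0.5 1.2
```

Note that redundancy greater than 1 means the parallel version performs extra work (here 20% more operations) in exchange for the shorter completion time.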
Utilization is defined as U_p = O_p / (p * T_p). We define the first order of x, O_1(x), as x + a, and the second order of x, O_2(x), as bx^2 + c, where a, b, c are constants. We also define the speedup-type ratio a = S_p / log_2 T_1.

B. Model

The EISPACK programs, written in Fortran, are analyzed with the help of the Fortran Analyzer implemented by Kuck and his group. In the Fortran Analyzer, a number of assumptions are made about the machine organization. They are discussed in [2] and can be summarized as follows:

1) I/O operations are ignored.
2) Control unit timing is also ignored; instructions are assumed to be always available for execution and never held up by a control unit.
3) All processors are capable of executing any of the four arithmetic operations in the same amount of time, called unit time.
4) All supplied Fortran functions are evaluated using a fast scheme proposed by DeLugish [3], which allows each function to be evaluated in no more than a few multiply times.
5) A many-way jump processor as proposed by Davis [4] is assumed. This processor can determine in unit time which program statement is the successor to the statement at the top of a tree of IF statements.
6) Memory cycle time is assumed to be unit time, and it is further assumed that there are no memory accessing conflicts.
7) We also assume the existence of an instantaneously operating alignment network which can transmit data from memory to memory, from processor to processor, and between memories and processors.

Since no analyzer can have unlimited capabilities, certain limitations are imposed on the EISPACK programs to be analyzed [5]:

1) No more than 40 statements are allowed in a DO loop.
2) Up to 7 subscripts are allowed for each variable.
3) Up to 10 variables are allowed on the right-hand side of an assignment statement in a DO loop when subscripts and function arguments are not counted.
4) No more than 100 blocks or roughly 100 assignment statements are allowed.
5) No more than 50 label uses and 50 label definitions are allowed.
6) Logical variables and logical assignment statements are not allowed.

Hence some large EISPACK programs have to be broken down into smaller modules to fit within the above limitations. Large loops are analyzed separately with the outermost loops discarded; we call these 'stripped' loops. This presents the danger of neglecting the parallelism due to the outermost loops. However, looking at the programs more carefully reveals that only one of the 15 big loops presented shows a possibility of parallelism in the outermost DO loop. The rest of the loops have one sort or another of recurrence relation embedded in them which cannot be handled efficiently by ordinary algorithms that deal with recurrence relations. Hence, imposing the restriction that large loops are to be computed serially will not distort the analyzer results by a great amount.

The analysis techniques, which can be treated as compiler models for the given machine, are described fully by Kuck, Muraoka and Chen [6]. In short, the Analyzer breaks a Fortran program into blocks of assignment statements (BAS), DO loop blocks, and IF tree blocks. During analysis, T_1, T_p, p, and O_p are computed separately for each block. Then, using IF and GOTO statements, all the traces through the program are found, and the block values of T_1, T_p, and O_p are accumulated over each trace to give the T_1, T_p, and O_p of the trace. The maximum p found in any block of a trace becomes the p of the trace. R_p, E_p, S_p, and U_p are then calculated for each trace. In each BAS, forward substitution is made whenever possible; then, using tree-height reduction techniques, the best speedup result for that BAS is obtained. For DO loop blocks, the horizontal scheme exploits Type 1 parallelism in the DO loop while the vertical scheme extracts Type 0 parallelism. In the analysis of EISPACK programs, both schemes are used.
C. A Description of EISPACK

Most of the original EISPACK routines are written in ALGOL and can be found in [7]. The Fortran EISPACK programs were translated from the corresponding ALGOL procedures by B. S. Garbow and J. M. Boyle of Argonne National Laboratory. It is a set of 34 subroutines. Depending on the type of matrix to be solved and how many eigenvalues or eigenvectors the user needs, different sequences of calls to these subroutines can be made. For example, given a real nonsymmetric matrix, if we want to find all the eigenvalues and eigenvectors, the following sequence of EISPACK routine calls is needed:

1. Call BALANC
2. Call ELMHES
3. Call ELTRAN
4. Call HQR2
5. Call BALBAK

All the EISPACK sequences are fully detailed in Boyle's path chart [8].

D. Results

Table 1 shows the analysis results using as many processors as required. There are a total of 44 program segments (8 original subroutines had to be broken down into two or more program segments). Program segments that are 'stripped' DO loops are marked with 'Loop' next to the program name; there are 15 such segments. In INVIT and TINVIT, the subroutines are essentially big DO loops which even when 'stripped' are too big to be handled by the Analyzer, and they had to be broken further into 3 and 2 segments, respectively. In these results, all matrices are assumed to be 10x10 and all vectors are of dimension 10. Since results for both the horizontal and vertical schemes are available, we pick the better speedup result of the two and put it in Table 2, indicating the choice of scheme for each segment.

[Tables 1 and 2, which list the per-segment analyzer results, are not recoverable from this scan.]
E. Observations and Interpretations

[The opening of this section, including observations 1 and 2, is garbled in the scan; the surviving text resumes mid-discussion of segments whose serial time T_1 exceeds 5000.] If we consider only the program segments with T_1 < 5000 (there are 40 such segments), the average number of processors used is 90, the average speedup is 21, and the average efficiency is 29%. A general rule of multiprocessing is that an efficiency of more than 20% is acceptable. Hence we can claim that the EISPACK programs are suitable for parallel processing.

3. From the a (= S_p / log_2 T_1) column, we can see that there are two programs whose a's are less than 1. There are 31 program segments whose a's are between 1 and 10; they are called the Type 1 programs. The remaining 11 have a values higher than 10 and will be known as Type 2 programs. For further investigation, a few programs are chosen and the DO loop limit is increased to 20, 30, and 40. The result is shown in Table 3. From the table we can see that the two programs with a's less than 1 have relatively constant T_p's. These correspond to the Type 0 definition by Kuck [9]. A multiprocessing system can compute a Type 0 program in the same amount of time regardless of the number of iterations required. The speedup S_p (= T_1 / T_p) will then be of the order of T_1, i.e., O_1(T_1). The a values for Type 1 programs are relatively constant over the different DO loop limits, so the minimum time to compute a Type 1 program is O_1(log_2 T_1).
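The logarithmic minimum time for Type 1 segments comes from fan-in: combining n operands pairwise, with enough processors to do all available pairs at once, takes ceil(log2 n) unit-time steps. The simulation below is an illustrative Python sketch (it is not the Analyzer's tree-height reduction algorithm, only a demonstration of the fan-in bound for a simple sum):

```python
import math

def parallel_sum_steps(values):
    """Simulate a fan-in tree: at each unit-time step, all available
    pairs are combined simultaneously; an odd leftover carries over."""
    steps = 0
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # odd element carries to the next step
            pairs.append(values[-1])
        values = pairs
        steps += 1
    return values[0], steps

total, steps = parallel_sum_steps(list(range(1, 17)))   # 16 terms
print(total, steps)        # 136 4, i.e., ceil(log2 16) = 4 steps
assert steps == math.ceil(math.log2(16))
```

A serial machine needs 15 unit times for the same 16-term sum, so the speedup is 15/4, of the order of T_1 / log_2 T_1.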
By forward substituting each BAS (block of assignment statements) to its full extent, and carrying out normal tree-height reduction techniques, we can see that most program segments can be computed in a time on the order of log_2 T_1, i.e., O_1(log_2 T_1). This is in full accordance with the result we have found in Table 3. The same definition for Type 1 is also given by Kuck [9].

For Type 2 programs, the a values are no longer constant over different DO loop limits. However, we can see that the speedups of these programs are closely related to log T_1. The consistency of this result is shown even more clearly in Figure 1(a) and Figure 1(b). Figure 1(a) plots the 11 Type 2 program segments. Figure 1(b) plots the effect on speedup of varying the DO loop limit for ORTHES and TRED1. It is readily observed that S_p = k_1 log_2 T_1 + k_2, where k_1 and k_2 are constants, or S_p = O_1(log_2 T_1). However, until now there is still no apparent reason for programs to behave like this. The Type 2 definition by Kuck [9] is different and offers a plausible explanation of program behavior. However, his prediction that the a values for Type 2 programs will be roughly constant does not fit well with the EISPACK programs.

[Table 3, which lists T_1, T_p, and p for the selected programs at DO loop limits of 10, 20, 30, and 40, is not recoverable from this scan.]
Figure 1(a). Speedup graph for the Type 2 programs (DO loop limit = 10).
Figure 1(b). Speedup graphs for ORTHES and TRED1 (DO loop limit varying).
[The plots, which show speedup against log_2 T_1, are not recoverable from this scan.]

4. When we are dealing with actual machine design, however, we cannot assume the machine to possess arbitrarily many processors. Hence we have to limit ourselves to a machine with a finite number of processors. Since the previous results show that the average number of processors used is around 99, we hypothesize a machine with 100 processors. With fewer processors, we can expect the speedup to degrade a little; nevertheless, the efficiency will go up as the number of processors decreases. At some point in between, we should be able to find that both the speedup and efficiency are reasonably high. So the next step is to check how much the results change if we fix the number of processors in our hypothetical machine at 100.

For each test program, the analyzer is capable of breaking the jobstep requiring the maximum number of processors into two jobsteps, each requiring half the number of processors as before (see Figure 2). Then the analyzer will try the same breakup procedure on the resulting jobsteps. This breakup procedure continues until the machine is more than 50% utilized. For each program, the analyzer provides the result of each breakup operation on each independent path through the program.
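A minimal sketch of this halving loop is given below. It is an illustrative Python rendering, not the Analyzer's implementation; in particular, it assumes the machine is only as wide as the widest remaining jobstep when computing utilization, which is one plausible reading of the 50% rule:

```python
def breakup(jobsteps, target_util=0.5):
    """Repeatedly split the widest jobstep into two half-width jobsteps
    until machine utilization exceeds target_util.  jobsteps is a list of
    processor counts, one per unit-time jobstep; the machine is assumed
    to be as wide as the widest jobstep.  Sketch only: no guard against
    degenerate inputs that can never reach the target."""
    def utilization(steps):
        return sum(steps) / (max(steps) * len(steps))
    while utilization(jobsteps) <= target_util:
        i = max(range(len(jobsteps)), key=jobsteps.__getitem__)
        half = jobsteps[i] // 2
        jobsteps[i:i + 1] = [half, half]     # one jobstep becomes two
    return jobsteps

# A path with one very wide jobstep: the 200-wide step is halved twice,
# shrinking the required machine from 200 to 50 processors while raising
# utilization from about 29% to about 66%.
print(breakup([200, 10, 10, 10]))            # [50, 50, 50, 50, 10, 10, 10]
```

Each split leaves the total work unchanged but lengthens the schedule by one jobstep, which is exactly the speedup-for-efficiency trade the text describes.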
Assuming speedup and efficiency are equally important in actual machine design, we use the product (speedup x efficiency) as the choice criterion. So, for each path, we choose the breakup result with the greatest (speedup x efficiency) value. However, the efficiency value is calculated with respect to the number of processors used on that particular path, not the number of processors available in the machine, so the efficiency for programs using a small number of processors is actually smaller than that shown in the result. (The only reason this efficiency enters the choice criterion at all is that we still do not know the optimal number of processors for the machine; 100 is only a crude suggestion offered by the previous results, and the choice criterion should allow us to find a better candidate for the number of processors.) To compensate for this, we add an extra criterion: if the number of processors used is less than 50, the breakup result with the highest speedup is chosen. In short,

    Choice criterion = speedup x efficiency,  if 50 <= p <= 100
                     = speedup,               if p < 50

Figure 2 illustrates the breakup procedure. Before the procedure is applied, jobsteps 1 through 7 use 75, 60, 100, 100, 40, 75, and 75 processors. After one application, each 100-processor jobstep has been split into two 50-processor jobsteps, and the resulting nine jobsteps use 75, 50, 50, 60, 50, 50, 40, 75, and 75 processors.

Then the averages of the chosen results over all the paths in a program are computed and tabulated in Table 4. V and H in column 1 correspond to the scheme chosen: the vertical cut scheme or the horizontal cut scheme. A subscript of ∞ implies that the results are equivalent to those of the infinite processor configuration; a subscript of M implies that the breakup procedure has been applied and the results are intermediate values. As can be seen, the new average number of processors used is 54, a saving of nearly 50% relative to the infinite processor configuration.
The speedup decreases from 20.3 to 16.4, a degradation of only 20%. This speedup degradation is compensated by a corresponding 20% increase in efficiency. So we can conclude that using a limited processor configuration does not seriously degrade the performance of the multiprocessing system. Furthermore, from both the speedup and efficiency standpoints, we can also deduce that the optimal number of processors in a multiprocessing system aimed at linear algebra computation probably lies between 55 and 100.

However, it should be noted that all matrices here are assumed to be 10 by 10, while in reality most EISPACK usage deals with much larger matrices. The number of processors required will then be much larger than what is shown here. Due to the limited capabilities of the Fortran Analyzer, we cannot provide similar results for matrices of higher dimensions. From the incomplete results we have in Table 3, the average numbers of processors used for the six programs shown are 82, 317, 706 and 1249 when the dimensions of the matrices are assumed to be 10, 20, 30 and

[Table 4, which lists the chosen per-program results for the 100-processor configuration, is not recoverable from this scan.]
40, respectively. Hence we might extrapolate that the optimal number of processors used is of the order of n^2, i.e., O_2(n^2), where n is the average dimension of the matrices used.

F. Typical Path Results

As pointed out in Section 2.C, whenever a user wants to use the EISPACK programs to calculate eigenvalues or eigenvectors, he usually uses a series of EISPACK routine calls. Depending on what kind of matrix he has and whether he wants all eigenvalues and eigenvectors, only some of the eigenvectors, or only eigenvalues, he will use different paths of subroutine calls. Tables 2 and 4 give speedup results only for individual subroutines. To find out how much speedup results when one particular path of subroutine calls is run on a multiprocessing system, we have to compute the speedup for the path from the existing results. The serial time (T_1) and parallel time (T_p) for each program segment used in the path are added up to form Path T_1 and Path T_p.
The program segment times have to be weighted by 10 if the program segment is marked 'Loop', i.e., if the segment is only a single iteration of a big loop in the subroutine. The average speedup for the path is then

    Path S_p = Path T_1 / Path T_p.

If our multiprocessing system is also a multiprogramming system, then we can define the average number of processors needed for each path, P(ave), as the average over the individual segments. If it is not a multiprogramming system, then we have to define the number of processors needed as the maximum over the individual segments.

The following results are computed for the most commonly used paths, listed in order of popularity. Eigenvalues and eigenvectors of real symmetric matrices are the most commonly needed. Results for both the infinite processor and limited processor configurations are given.

1. Real Symmetric (all eigenvalues and vectors). Program segments used: TRED2A, TRED2B, IMTQ2A, IMTQ2B.

   Infinite: T_1 = 5274 + 657 x 10 + 479 + 929 x 10 = 21613
             T_p = 817 + 106 x 10 + 16 + 25 x 10 = 2083
             T_1 / T_p = 10.4
             P(ave) = (135 + 41 + 200 + 26) / 4 = 100.5

   Limited:  T_1 = 21613
             T_p = 22 + 25 x 10 + 927 + 106 x 10 = 2259
             T_1 / T_p = 9.6
             P(ave) = (90 + 41 + 42 + 26) / 4 = 50

2. Hermitian (all eigenvalues and vectors). Program segments used: HTRID1, IMTQ2A, IMTQ2B, HTRIBK.

   Infinite: T_1 = 1832 x 10 + 479 + 929 x 10 + 10067 = 38156
             T_p = 139 x 10 + 16 + 25 x 10 + 672 = 2328
             T_1 / T_p = 16.4
             P(ave) = (200 + 63 + 135 + 41) / 4 = 110

   Limited:  T_1 = 38156
             T_p = 1203 + 160 x 10 + 22 + 25 x 10 = 3065
             T_1 / T_p = 12.5
             P(ave) = (90 + 41 + 40 + 50) / 4 = 55.3
Real Non-Symmetric (Eigenvalues Only) Program Segments Used: BAIAWC, ELMHES, HQPA, HQRB Infinite: T = 991 + 20&7 + 30^ + 28l x 10 - 6192 T = kk + 110 + 30 + 320 = 50k P T l p P(ave) = (115 + Ikk + 95 + 86) /k = 110 Limited: T = 50 + 218 + kk + 380 = 692 P T l Tj- = 9.0 P P(ave) = (93 + 80 + 26 + kk)/k = 60.8 We can see that the average speedups are around 13 for infinite processor configuration and the average number of processors used is around 100. Using limited number of processors, we have an average speedup of 10, while using about 55 processors on the average. 51 3. MACHINE ORGANIZATION This section attempts to give more understanding about how actual user programs look in detail and hopefully deduces some guide- lines on how machines should be built in order to give better user pro- gramming results. Since EISPACK programs operate on matrices and vectors, the following results are most relevant for a machine built for vector operation purposes. However, the deduction methods used here are quite general. So machines aiming at computing programs that use different data and program structures may profitably use similar analysis techniques. Using this direction of attack, we can hope to produce a better under- standing of what a high level language should contain. All the results here are the results of hand analysis of the EISPACK routines. A. System Model In order to extract more information out of a set of test pro- grams, we first have to specify in general the kind of multiprocessing system we are dealing with. However, since most of the specifications will be dependent on the results we obtain or results that are yet to be uncovered, the initial specifications outlined here should be as general and flexible as possible. We will be dealing with a SIMD (Single Instruction Multiple Data) type of multiprocessing system, an example of which is the ILLIAC TV machine. 
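As a toy illustration of the lock-step SIMD discipline (a sketch, not ILLIAC IV's actual instruction set): one instruction is applied across all processors at once, and a per-processor control bit decides who participates.

```python
# Toy SIMD step: one instruction, many data. A mode (control) bit per
# processor decides whether that processor participates or idles.

def simd_step(op, operands, mode):
    """Apply `op` in lock step; disabled processors keep their first operand."""
    return [op(*args) if enabled else args[0]
            for args, enabled in zip(operands, mode)]

# Four processors add B(i) to A(i) in one step; processor 2 is masked off.
A, B = [1, 2, 3, 4], [10, 20, 30, 40]
mode = [True, False, True, True]
print(simd_step(lambda a, b: a + b, list(zip(A, B)), mode))   # [11, 2, 33, 44]
```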
In general it looks like:

Figure 3. Simple Multiprocessing System

The problem to be tackled is the program execution problem. This includes three main areas: data access, data alignment, and data processing [9]. Most high-speed computers have several random access memory units operating in parallel in order to make the memory bandwidth match the high processing speed. However, multiple memories create the problem of assigning data to the different memory banks without causing any conflicts in accessing. Assume we have m memories and k operands to be accessed at some point of the computation. For vector and array operations, the operands to be accessed may be some rows, columns, diagonals* or square partitions of an array. If m < k, we know for sure that we cannot access all k operands in one memory cycle. When m = k, if m is prime, we can access rows, columns or diagonals in one memory cycle using a simple 1-skew scheme (Figure 4(a)). However, if m is even, then using the 1-skew scheme (Figure 4(b)) we can access rows or columns in one memory cycle, but we need two cycles to access the diagonals. The penalty is not great, especially for programs which access diagonals sparingly. By using redundant memory units, i.e., m > k, it is possible to find storage schemes that allow conflict-free access of rows, columns or diagonals. Some of these schemes are discussed extensively by Budnik and Kuck [10], and Lawrie [11]. However, they all have drawbacks of one sort or another. For example, the m = k+1 scheme presents difficulty in indexing the memories, and the m = 2k scheme uses only one half of the available bandwidth.

* All partitions parallel to the main diagonal are also known as diagonals.

Figure 4(a). 1-Skew Storage Scheme for m = k = 5 (diagonal elements, circled in the original, shown in parentheses)

    (1,1)  1,2    1,3    1,4    1,5
     2,5   2,1   (2,2)   2,3    2,4
     3,4   3,5    3,1    3,2   (3,3)
     4,3  (4,4)   4,5    4,1    4,2
     5,2   5,3    5,4   (5,5)   5,1

Figure 4(b). 1-Skew Storage Scheme for m = k = 4

    (1,1)  1,2    1,3    1,4
     2,4   2,1   (2,2)   2,3
    (3,3)  3,4    3,1    3,2
     4,2   4,3   (4,4)   4,1
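These access properties can be checked mechanically. In a 1-skew scheme, element (i,j) (0-based here) is placed in memory (i + j) mod m, and a partition is accessible in one memory cycle exactly when its elements fall into distinct memories. A small sketch:

```python
# 1-skew storage: a(i,j) lives in memory (i + j) mod m (0-based indices).
# A partition is accessible in one memory cycle iff no two of its
# elements map to the same memory unit.

def memory_of(i, j, m):
    return (i + j) % m

def conflict_free(cells, m):
    banks = [memory_of(i, j, m) for i, j in cells]
    return len(set(banks)) == len(banks)

def check(m):
    row = [(0, j) for j in range(m)]
    col = [(i, 0) for i in range(m)]
    diag = [(i, i) for i in range(m)]
    return conflict_free(row, m), conflict_free(col, m), conflict_free(diag, m)

print(check(5))   # (True, True, True)  -- m prime: rows, columns, diagonals
print(check(4))   # (True, True, False) -- m even: a diagonal needs 2 cycles
```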
Hence the choice of memory storage scheme is still an open question.

After operands are successfully fetched out of the memories, they still have to be routed to the corresponding processors with minimum delay. The same is true for results coming out of the processors which are to be stored in the memories. A crossbar switch is the simplest answer; its total network delay (2 log2 m) is the smallest known. However, the number of gates required is enormous (3m² for each bit of data), especially when the number of processors or memories is big. Benes [12] and Batcher [13] designed relatively simpler alignment networks. One drawback of Batcher's networks is the increased data delay, while the Benes network does not have a fast control algorithm. Lawrie [11] designed an Omega network which has a gate delay of 4 log2 m and a total of 3m log2 m (d + (1/2)(log2 m - 1)) gates, where d is the number of bits to be passed. Although it cannot perform some permutations, it can effect most of the common connections required; one example is the broadcasting of one argument to a certain group of processors. It can be shown that, to transmit a 48-bit word in 200 nsec through an m by m network, the number of gates required by a crossbar switch will be more than that of an Omega network whenever m exceeds 8. However, it should be noted that a crossbar switch can be made from off-the-shelf integrated circuits (multiplexors) while the Omega network elements need new layouts. Hence the breakeven point may be higher than 50.

As for the processor system, there seem to be two choices. One is the pure vector machine concept, in which at any one time each processor is either doing the same operation as the others or is idle. This is fairly easy to implement, assuming some control vector schemes [14] are used. The other choice is to allow each processor to perform different operations.
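The cost difference between the two choices can be sketched with a simplified step-count model (an assumption-laden sketch: one operation per processor per step, alignment costs ignored) applied to the loop considered below, A(I) = B(I) + C(I) + D(I)*E(I):

```python
import math

def pure_vector_steps(levels, iters, procs):
    # levels: list of dependence levels, each a list of op kinds for one
    # iteration. A pure vector machine runs one op kind at a time, across
    # all iterations of the loop.
    return sum(math.ceil(iters / procs) for level in levels for kind in level)

def flexible_steps(levels, iters, procs):
    # Flexible processors mix op kinds freely within a dependence level.
    return sum(math.ceil(len(level) * iters / procs) for level in levels)

# A(I) = B(I) + C(I) + D(I)*E(I): level 1 holds D*E and B+C (independent),
# level 2 holds the final add. 4 iterations, 8 processors.
levels = [['*', '+'], ['+']]
print(pure_vector_steps(levels, 4, 8))   # 3
print(flexible_steps(levels, 4, 8))      # 2
```

Under this model the flexible machine wins (2 steps against 3) exactly because it can run the multiplies and the independent adds side by side in one step.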
To compute the following FORTRAN segment:

      DO 10 I = 1,4
   10 A(I) = B(I) + C(I) + D(I) * E(I)

if we have eight processors, then using the first scheme we can do it in 3 unit times, whereas using the second choice we need only 2 unit times. So the second scheme will produce better speedup. Moreover, in the first case, half of the processors will be idle all the time. However, to implement the second choice, we would require the control vector to contain operation information for each processor. More time would then be needed for setting up the control vectors, and more space would also be needed to store them. Hence, if a cheaper organization is required, the pure vector machine concept will be sufficient. If the best speedup is needed, the processors will have to be able to do different operations.

B. Types of Operations

Array operations can be subdivided into whole matrix operations and one-dimensional vector operations. Matrix operations are those which have a matrix as an operand or a result. The speedup should be of the order of the square of the dimension of the matrix. However, this is only possible if there are as many processors as the square of the dimension of the matrix. For a machine with fewer processors, the operation has to be done in smaller partitions (e.g., square partitions, rows, columns or diagonals). Vector operations are those using only one-dimensional arrays and scalars. The speedup will be of the order of the dimension of the vector. Some of the most common array operations are listed and explained below. The common notations used are shown in Table 5.

1. Whole Matrix Fetch and Store. With a big memory system (m > n²), each element of the matrix can be stored in a separate memory to avoid any access conflict. For smaller memory systems (m < n²), we have to store by small partitions (square blocks, rows or columns), and subsequent operations have to be done in the partition chosen.

2.
Matrix Transfer (M → M) and Matrix Exchange (M ↔ M). The copying of one matrix into another is called matrix transfer. The swapping of two matrices is called matrix exchange. Either of these operations can be done in the alignment network without entering the processing system, if the alignment network has some internal buffers or registers. A typical example of matrix exchange that can be found in Fortran programs is

      DO 10 I = 1, 10
      DO 10 J = 1, 10
      T = A(I,J)
      A(I,J) = B(I,J)
      B(I,J) = T
   10 CONTINUE

Table 5. Notations Used in Array Operations

Arguments:
    S    Scalar
    V    Vector
    M    Matrix
    R    Row of Matrix
    C    Column of Matrix
    D    Diagonal of Matrix
    O    One-Dimensional Vector

One-Dimensional Vector Operators:
    LS   Logsum
    CLS  Logsums of all Columns of a Matrix
    RLS  Logsums of all Rows of a Matrix
    MAX  Maximum of Array Elements
    MIN  Minimum of Array Elements
    →    Transfer
    ↔    Exchange

Others:
    m    Number of Memories or Processors
    n    Dimension of Arrays

3. Broadcasting (S·M → M, V·M → M, V·V → M). Broadcasting is the routing of data from a partition of the memory system to a larger partition of the processor system, each datum in the memory partition being simultaneously required by several processors. An example of broadcasting is shown in the following Fortran DO-loop:

      DO 10 I = 1, 10
      DO 10 J = 1, 10
   10 A(I,J) = A(I,J) * B(J)

This is a row-wise matrix operation and will be denoted as (O·R → R) under the column V·M → M. If the matrix A is stored across the memory system in row major order, then after a direct route of A to the processors, the necessary routing pattern for B is the one shown in Figure 5(a).

Figure 5(a). Duplication Pattern (the whole of B routed in turn to the 1st row, 2nd row, ..., nth row)

This is called duplication [15]. However, if the matrix A is stored in column major order, a different routing pattern is needed:

Figure 5(b). Fanout Pattern (each element B(j) routed across the 1st column, 2nd column, ..., nth column)

This is called fanout [15].
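The two routing patterns have a direct list-level description: for an n-element vector feeding n·n processors, duplication tiles the whole vector once per row, while fanout repeats each element n times (a sketch in plain list terms):

```python
# Broadcast patterns for routing an n-vector to n*n processors.
# duplication: B1..Bn, B1..Bn, ...   (whole vector tiled)    -- Figure 5(a)
# fanout:      B1 n times, B2 n times, ...                   -- Figure 5(b)

def duplication(v, n):
    return v * n                               # whole vector tiled n times

def fanout(v, n):
    return [x for x in v for _ in range(n)]    # each element repeated n times

B = ['b1', 'b2', 'b3']
print(duplication(B, 3))   # whole vector tiled 3 times
print(fanout(B, 3))        # each element repeated 3 times
```

For A(I,J) = A(I,J) * B(J) with A streamed in row major order, the duplication stream lines each B(J) up under the right processor; a column major stream of A needs the fanout pattern instead.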
One common variation of broadcasting is the outer product of two vectors (V·V → M) as in:

      DO 10 I = 1, 10
      DO 10 J = 1, 10
   10 A(I,J) = B(I) * C(J)

Here, if we perform a duplication of B first, followed by a fanout of C, we will finally get the matrix A in column major order. On the other hand, if the fanout of B is done before a duplication of C, we will get the matrix A in row major order. Information about the types of the two vectors is essential in determining whether a fanout or a duplication should be performed first. Another variation of broadcasting is the routing of a scalar to all the elements of a matrix (S·M → M). Either a crossbar switch or an Omega network can perform these kinds of broadcastings.

4. Matrix on Matrix Operations (M·M → M). All corresponding elements of the two matrices are operated on simultaneously in different processors. The speedup will be of the order of n². As before, if m > n², two whole matrices can be operated on. If m < n², some matrix partitioning has to be used. A typical example of this operation in Fortran programs is shown by the following loop:

      DO 10 I = 1, 10
      DO 10 J = 1, 10
      A(I,J) = A(I,J) + B(I,J)
   10 CONTINUE

5. Matrix Reduction (R(M) → V or R(M) → S). Examples of the reduction operators are the sum of some elements (LS) or the maximum (or minimum) of some elements (MAX or MIN). Some common examples of these reduction operators are:

    CLS(M) → V             The column sums of the matrix forming a new vector.
    LS(M) → S              The sum of all matrix elements forming a new scalar.
    MAX(CLS(M) → V) → S    A new scalar formed by taking the maximum of all the column sums of the matrix.

The summation of k elements can be computed by the logsum method in ⌈log2 k⌉ steps, i.e., a speedup of (k-1)/⌈log2 k⌉. The maximum of k elements can also be found in ⌈log2 k⌉ comparisons; the speedup will likewise be (k-1)/⌈log2 k⌉.
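The logsum bound is easy to exhibit: pairing elements and adding the pairs halves the problem at each step, so k elements are summed in ⌈log2 k⌉ steps (a sketch; each step's pair additions would run in separate processors):

```python
# Log-sum reduction: k elements are summed in ceil(log2 k) parallel
# steps, each step adding disjoint pairs (conceptually one pair per
# processor). The serial method needs k-1 additions.

def log_sum(values):
    steps = 0
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # an odd leftover element carries over
            pairs.append(values[-1])
        values = pairs
        steps += 1
    return values[0], steps

total, steps = log_sum(list(range(1, 11)))   # k = 10 elements
print(total, steps)                          # 55 4  (= ceil(log2 10) steps)
print((10 - 1) / steps)                      # 2.25  (speedup (k-1)/ceil(log2 k))
```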
In Fortran programs, a CLS operation will look like this:

      DO 10 J = 1, 10
      S = 0
      DO 20 I = 1, 10
   20 S = S + A(I,J)
      B(J) = B(J)/S
   10 CONTINUE

6. Vector Fetch and Store. As discussed in Section 3.A, the access time of vectors depends on the kind of access schemes available. When m = n and n is even, if the 1-skew scheme is used, the access time for rows and columns will be 1 time cycle while the access time of a diagonal is 2 cycles.

7. Vector Transfer (V → V) and Exchange (V ↔ V). These operations are similar to those for matrices. In both operations, an appropriate alignment is needed to conform the storage pattern of the source vector to that of the destination vector.

8. Broadcasting. The only type of broadcasting operation that results in a vector is the broadcast of a scalar to a vector.

9. Vector on Vector Operations (V·V → V). These operations should further be distinguished as like-vector-operations, where the operands have the same vector type, and mixed-vector-operations, where the operands have different vector types. An example of a like-vector-operation is

    R·R → R

An example of a mixed-vector-operation is

    R·C → R

Like-vector-operations are easy to handle once the access scheme for the machine is defined. Mixed-vector-operations are much trickier. Which operand should be realigned into the other's form (and hence the result's form) is still an open question.

10. Reduction of Vectors (R(V) → S). The vector reduction operations are similar to those for matrices.

C. Results

Each EISPACK program is studied thoroughly and 'decompiled' back into some type of vector notation. Subscript indices are back substituted to the DO loop limits level to investigate any dependency between array elements.
Independent array elements are then configured as array partitions (square partitions, rows, columns, diagonals or one-dimensional vectors) and all the parallel vector operations are recorded. Table 6 keeps counts of the number of occurrences of each operation type in each of the EISPACK subroutines analyzed. The count is a static count, which shows the number of places where a given operation can be found in the program. This is different from a dynamic count, which would show the number of times the given operation is used when the program is actually run. The notations used in this table are explained in Table 5, while the column headings are explained in Section 3.B. Whenever a vector is used in an operation, the vector type is also marked: it will be a C (Column), R (Row), D (Diagonal) or O (One-dimensional vector). Whenever the operation is a reduction, the type of reduction operation is indicated under the Remark column.

[Table 6, which tabulates these per-subroutine operation counts, is not recoverable from this scan.]

For completeness of information about EISPACK, the last two columns are added to include the number of occurrences of loops with recurrence relations, and of loops of scalar operations which can actually be computed in parallel. This information, although not essential in a cheap vector machine, may be of great value in a fancy general multiprocessing system. An example of each of these loops is shown in the Appendix.

D.
Observations and Interpretations

1. Out of the 54 programs analyzed, there are 17 that use matrix operations. Consider the most common case, in which the order of the matrix is much larger than the number of processors in the system (i.e., n >> m). There are two alternatives in partitioning the large matrix. The first is to consider the matrix as a group of one-dimensional vectors (rows or columns). The second is to partition the matrix into submatrices, each of size √m by √m. The first alternative allows the use of already existing vector instructions, so no new instruction is needed, thus simplifying the whole machine design. Furthermore, parallelism detection of matrix level operations in ordinary high level language (e.g., Fortran, PL/I) programs is much more difficult than that of vector level operations. However, from an efficiency point of view, Kuck [16] shows that √m by √m square block processing of large matrices offers an efficiency always higher than that of row or column processing. Moreover, McKellar [17] points out that storage by submatrices induces fewer page faults than storage by rows when only a portion of the matrix can be put in memory at one time. In the EISPACK routines analyzed, there are 155 matrix operations compared with 930 vector operations. If we separate those operations into memory operations (fetch and store) and processor operations, we get the refined table of occurrence frequencies shown in Table 7:

Table 7. Frequency Table of Memory and Processor Operations

                Whole                                        One-
                Matrix      Column      Row       Diagonal   Dimensional   Total
Memory Fetch   50(11.9)   121(28.8)  168(40.0)   16(3.8)     65(15.5)    420(100%)
Memory Store   27(11.8)    72(31.6)   78(34.2)   14(6.2)     37(16.2)    228(100%)
Processor      72(19.6)   116(31.5)  119(32.3)    9(2.5)     52(14.1)    368(100%)

(The parenthesized number is the percentage of the total operations of a given type.)

We can see that only 15% of the operations in EISPACK are matrix operations.
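Returning to observation 1 above, Kuck's square-block efficiency claim [16] can be illustrated with simple step counting (a sketch under stated assumptions: m is a perfect square, the operation is elementwise, and efficiency is useful work divided by processor-slots used):

```python
import math

# Efficiency of elementwise processing of an n x n matrix on m processors:
# sqrt(m) x sqrt(m) square blocks versus one row of the matrix per step.

def block_eff(n, m):
    b = math.isqrt(m)                    # block edge (m assumed square)
    steps = math.ceil(n / b) ** 2        # blocks needed to cover the matrix
    return n * n / (steps * m)

def row_eff(n, m):
    steps = n * math.ceil(n / m)         # each of n rows takes ceil(n/m) steps
    return n * n / (steps * m)

n, m = 10, 9
print(round(block_eff(n, m), 2))   # 0.69
print(round(row_eff(n, m), 2))     # 0.56
```

For a 10 x 10 matrix on 9 processors, block processing keeps about 69% of the processor-slots busy against 56% for row-at-a-time processing, in line with the cited result.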
This is expected, due to the structure of the algorithms. It can readily be seen that in programs such as matrix multiplication subroutines there would be many more matrix operations. From the cost-effectiveness standpoint, it is probably better to use strictly vector operations. However, to fully exploit the speed of a machine, it may be desirable to put in matrix operations.

2. The most commonly used matrix operation type is broadcasting. This implies that for a machine to handle matrix operations, it has to have an alignment network that can handle broadcasting. In 21 out of the 26 occurrences of V·M → M, the matrix being used behaves as a group of columns. Hence if the matrix is stored in column major order, most of the time we will be using duplication. On the other hand, if the matrix is stored in row major order, most of the time we will be using fanout broadcasting. For the V·V → M type of operations, both fanout and duplication are needed for each operation. Both the crossbar switch and the Omega network can perform the broadcasting connection.

3. Cumulative statistics of the occurrences of some of the reduction operations in the EISPACK routines are tabulated in Table 8.

Table 8. Frequency Table of Reduction Operations

    Reduction Operation      No. of Occurrences    % of Occurrences
    MAX(CLS(M) → V) → S              2                   3.5
    LS(M) → S                        2                   3.5
    CLS(M) → V                      11                  18.9
    RLS(M) → V                       3                   5.2
    LS(V) → S                       37                  63.7
    MIN(V) → S                       2                   3.5
    MAX(V) → S                       1                   1.7
    Total                           58                 100.0%

It can readily be observed that summation of vector or matrix elements is very common, and hence some fast schemes (like the log summation method) to compute these are essential for an efficient parallel machine. It would also be nice if some fast schemes were available to compute the maximum and minimum of a group of elements.
An important fact which is not shown in Table 6 is that in reduction, as well as in other array operations, some elements are skipped, i.e., some elements are not supposed to be processed. The easiest way to handle this is to set up a control vector having as many bits as the number of processors in the system. By appropriately setting up the control vector before doing the operation, only the desired elements will be processed.

4. Matrix exchange is not found in the EISPACK programs, and there are only six occurrences of matrix transfer, so these are not dominating factors in machine design. However, there are 30 vector transfers found in the EISPACK routines, and also 32 vector exchanges. This implies that a significant part of the computation of EISPACK programs is memory-to-memory transfer, where the processors are not needed at all. Hence, if the alignment network is powerful enough to route data from some predetermined memory locations to some other predetermined locations without having to pass through the processors, a lot of processor time could be saved. Since most routing networks go one way or the other, we should actually reconfigure Figure 3 as Figure 6. In Figure 6, the solid lines are essential links while the dotted lines are optional. If the processor system is to be bypassed for array transfers or exchanges, we will need the additional link (A) or (B). This arrangement will be logical if the processor time is more valuable than the total cost of the new link and additional multiplexors. One network delay can also be saved with this kind of arrangement.

5. Cumulative statistics for the vector operations are tabulated in Table 9(a) and Table 9(b). Rows and columns are the most dominant vector types. Among the three types of partitioning for a matrix (row, column and diagonal), diagonals are used less than 10% of the time.

Figure 6.
Reconfigured Computer System

Table 9(a). Like-Vector-Operations

                   C           R          D          O          Total
Vector Fetch   121(13.0)*  168(18.1)   16(1.7)    65(7.0)    370(39.8)
Vector Store    72(7.8)     78(8.4)    14(1.5)    37(4.0)    201(21.7)
S·V → V         59(6.4)     58(6.3)     9(1.0)    16(1.7)    142(15.4)
V → V            4(0.4)      3(0.3)     -         12(1.3)     19(2.0)
V ↔ V           14(1.5)     18(1.9)     -          -          32(3.4)
V·V → V         43(4.6)     39(4.2)     -          1(0.1)     83(8.8)
Total          313(33.7)   364(39.2)   39(4.2)   131(14.1)

Table 9(b). Mixed-Vector-Operations

    Operation        Frequency
    R·C → V          16(1.7)
    R·O → V           9(1.0)
    C·O → V           7(0.8)
    C·D → V           3(0.3)
    [unreadable]      8(0.9)

* The first number in each entry is the number of occurrences, and the parenthesized number is the percentage of the total number of vector operations (930 in total).

Referring back to the discussion of memory accessing in Section 3.A, we can see that as long as diagonals are not used often, it is more cost effective to use an m = n type of memory system with a 1-skew storage scheme. Using this 1-skew scheme, diagonals can be accessed in two memory cycles. So if we start out with 100 vector operations in a given EISPACK computation, fewer than ten of them will be using diagonals. Hence the maximum access time will be (90 + 10 x 2) cycles = 110 cycles, and the speed will be off by less than 10%. The simplicity of the design and the nearly full memory bandwidth more than compensate for the slight increase in access time. Once again, this discussion is only valid for an EISPACK machine. Other computations might use diagonals more often; in that case, it might pay to use more exotic storage schemes to cut down the access time for diagonals.

6. Referring back to Figure 6, the alignment network delay can be avoided if we are able to choose the direct paths (C) and (D). These paths can be used in most S·V → V types of operations. They can also be used in like-vector-operations where the vector operands begin in the same memory unit.
However, to justify any of these paths, more information on the indexing patterns is needed, i.e., we need to know the differences in skew and skip distances of the vector operands. By the same token, if a large number of intermediate results need to be realigned before being operated on again, one network delay and one memory cycle time can be saved by adding the direct links (E) or (F) between the alignment networks and the processors.

7. In operations where two vector operands are used (i.e., the V·V → V type of operations), 83 are like-vector-operations and 32 are mixed-vector-operations. This implies that nearly 30% of the time we are working on mixed vectors in vector-on-vector operations. However, this is the area where only a little knowledge is available. Further understanding of the mixed-vector mode of operations would thus be very helpful.

4. CONCLUSION

This thesis attempts to show how program analysis can lead to some parameters concerning the design of a better multiprocessing system suited to a given set of computations. In this thesis, the EISPACK programs are the target programs and the system in mind is a SIMD (Single-Instruction-Stream-Multiple-Data-Stream) type of multiprocessing complex; however, similar techniques can be applied to most other program types and machine types. The analysis results are blended with the most recent ideas on multiprocessing systems to show how the results can help us find the more cost effective design. In Section 2, we find the maximum speedup for programs and the number of processors required to achieve such speedup, assuming we use the fastest known algorithms. The work is made easier with the help of the Fortran Analyzer, which has all those algorithms embedded in it. For EISPACK, the optimal number of processors appears to be of the order of n², where n is the average dimension of the matrices used.
The speedup obtained for 10 x 10 matrices is around 20, which is above average for a multiprocessing system. The overall efficiency is a tolerable 30%. So the EISPACK programs appear very suitable for a multiprocessing machine. It is further noticed that most programs tend to fall into three different types according to their speedup behavior. A factor which determines to which type a program belongs is the value of α, where α = Tp/log T1. The cutting points of the α value for the different types are not definite; however, 1 and 10 seem to produce consistent results.

In Section 3, we attempt to use operation type statistics to infer parts of the machine structure. It should be noted that uncovering array operations from programs written for a serial machine is non-trivial, but not impossible; hand analyzed results are used here. Deciding what to look for in a program is another difficult aspect of the problem. The information retrieved from EISPACK is by no means complete, yet some precious parameters of machine organization are revealed by the analysis. Furthermore, the statistics do not reflect the true operation counts, for the statistics given in Section 3 are only static counts of the number of occurrences of the operations in the program. A dynamic count of the operations used during execution would be more precise. However, this would imply the writing of another analyzer program, one which simulates and counts the operations in the user program. The work in this thesis is a preliminary step toward such an analyzer program.

In Section 3, we saw that it may be more cost-effective to use strictly vector operations. However, in systems where full speed is required, matrix operations are desirable. In such systems, the broadcasting operation is essential, implying that a crossbar switch or an Omega type network is necessary for the alignment network.
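The Omega network's self-routing behavior, and the fact that it cannot perform some permutations (Section 3.A), can be shown with a small simulation (a simplified sketch: destination-tag control, no broadcast, and blocking reported rather than resolved):

```python
# Self-routing through an Omega network on m = 2^k ports: each of the
# k stages performs a perfect shuffle (rotate the port address left),
# then each 2x2 switch sets the low port bit from the next bit of the
# packet's destination tag. Two packets wanting the same switch output
# in the same stage block the pass.

def omega_route(dests):
    m = len(dests)
    k = m.bit_length() - 1
    slots = {src: src for src in range(m)}        # packet -> current port
    for stage in range(k):
        nxt = {}
        for src, pos in slots.items():
            pos = ((pos << 1) | (pos >> (k - 1))) & (m - 1)   # perfect shuffle
            bit = (dests[src] >> (k - 1 - stage)) & 1
            pos = (pos & ~1) | bit                            # exchange setting
            if pos in nxt.values():
                return None                                   # blocked
            nxt[src] = pos
        slots = nxt
    return slots   # now slots[src] == dests[src] for every packet

print(omega_route([1, 0, 3, 2, 5, 4, 7, 6]))   # pairwise exchange: delivered
print(omega_route([0, 4, 2, 6, 1, 5, 3, 7]))   # None (bit-reversal blocks)
```

The pairwise-exchange permutation routes in a single pass, while the bit-reversal permutation collides in the first stage, an instance of the permutations such a network cannot perform in one pass.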
In general, some fast schemes should be incorporated into the machine to do summation of array elements, since a large number of such summations are found in the programs analyzed. The high number of memory-to-memory operations (68 in the EISPACK programs) also justifies the establishment of additional links to bypass the processor system in the architecture. Routes bypassing the two alignment networks can be considered similarly if more information on the indexing patterns of the vectors used in the operations is available. By comparing the frequencies of usage of the three major matrix partitions, we can decide on the storage scheme to be used. Here, for the EISPACK programs, the m = n type of memory system with a 1-skew storage scheme seems to be most cost-effective.

One by-product of the analysis is the evolution of some desirable vector instructions; array summation and vector exchange are two good examples. The regular usage of these operations makes it feasible to add them to the instruction repertoires of most vector machines and most high-level vector languages.

APPENDIX

1. Two examples of loops with recurrence relations:

A) Loop 200 of program IMTQL1:

      DO 200 II = 1, MML
      I = M - II
      F = S*E(I)
      B = C*E(I)
      IF (DABS(F).LT.DABS(G)) GO TO 150
      C = G/F
      R = DSQRT(C*C+1.0D0)
      E(I+1) = F*R
      S = 1.0D0/R
      C = C*S
      GO TO 160
  150 S = F/G
      R = DSQRT(S*S+1.0D0)
      E(I+1) = G*R
      C = 1.0D0/R
      S = S*C
  160 G = D(I+1) - P
      R = (D(I)-G)*S + 2.0D0*C*B
      P = S*R
      D(I+1) = G + P
      G = C*R - B
  200 CONTINUE

B) Loop 620 of program TINVIT:

      DO 620 II = P,Q
      I = P + Q - II
      RV6(I) = (RV6(I)-U*RV2(I)-V*RV3(I))/RV1(I)
      V = U
      U = RV6(I)
  620 CONTINUE

2.
An example of a parallel loop of scalar operations is Loop 140 of program HQR:

      DO 140 MM = L,ENM2
      M = ENM2 + L - MM
      ZZ = H(M,M)
      R = X - ZZ
      S = Y - ZZ
      P = (R*S-W)/H(M+1,M) + H(M,M+1)
      Q = H(M+1,M+1) - ZZ - R - S
      R = H(M+2,M+1)
      S = DABS(P) + DABS(Q) + DABS(R)
      P = P/S
      Q = Q/S
      R = R/S
      IF (M.EQ.L) GO TO 150
      IF (DABS(H(M,M-1))*(DABS(Q)+DABS(R)).LE.MACHEP
     X   *DABS(P)*(DABS(H(M-1,M-1))+DABS(ZZ)+DABS(H(M+1,
     X   M+1)))) GO TO 150
  140 CONTINUE

LIST OF REFERENCES

[1] Baer, J. L., "A Survey of Some Theoretical Aspects of Multiprocessing," Computing Surveys, Vol. 5, No. 1, pp. 31-80, March 1973.

[2] Kuck, D. J., et al., "Measurements of Parallelism in Ordinary FORTRAN Programs," IEEE Computer, pp. 37-46, January 1974.

[3] DeLugish, B., "A Class of Algorithms for Automatic Evaluation of Certain Elementary Functions in a Binary Computer," (Ph.D. thesis) University of Illinois at Urbana-Champaign, Department of Computer Science Report No. 399, June 1970.

[4] Davis, E. W., Jr., "A Multiprocessor for Simulation Applications," (Ph.D. thesis) University of Illinois at Urbana-Champaign, Department of Computer Science Report No. 527, June 1972.

[5] Towle, R., "FORTRAN Analyzer User's Guide," Unpublished Memo, December 1972.

[6] Kuck, D. J., Y. Muraoka and S. C. Chen, "On the Number of Operations Executable in FORTRAN-Like Programs and Their Resulting Speedup," IEEE Transactions on Computers, Vol. C-21, pp. 1293-1310, December 1972.

[7] Wilkinson, J. H. and C. Reinsch, "Handbook for Automatic Computation, Vol. II, Linear Algebra," Berlin, New York, Springer-Verlag, 1971.

[8] Boyle, J. M., et al., "EISPACK: Eigensystem Package Path Chart," Edition 2, Manuscript, June 1972.

[9] Kuck, D. J., "Multioperation Machine Computational Complexity," Proceedings of the Symposium on Complexity of Sequential and Parallel Numerical Algorithms, pp. 17-47, Academic Press, 1973.

[10] Budnik, P. and D. J. Kuck, "The Organization and Use of Parallel Memories," IEEE Transactions on Computers, Vol.
C-20, pp. 1566-1569, December 1971.

[11] Lawrie, D. H., "Memory-Processor Connection Networks," (Ph.D. thesis) University of Illinois at Urbana-Champaign, Department of Computer Science Report No. 557, February 1973.

[12] Benes, V. E., "Mathematical Theory of Connecting Networks and Telephone Traffic," Academic Press, New York, 1965.

[13] Batcher, K. E., "Sorting Networks and Their Applications," Proceedings of the Spring Joint Computer Conference, pp. 307-314, 1968.

[14] Lawrie, D. H., "Vector Instructions," Unpublished Memo, 1973.

[15] Lawrie, D. H., "More Patterns for Square Blocks," Unpublished Memo, 1973.

[16] Kuck, D. J., "Student Memo," Unpublished Memo, 1971.

[17] McKellar, A. C. and E. G. Coffman, Jr., "Organizing Matrices and Matrix Operations for Paged Memory Systems," Communications of the ACM, Vol. 12, No. 3, pp. 153-163, March 1969.

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-74-636
Title and Subtitle: Machine Parameter Deduction by Program Analysis
Report Date: August 1974
Author: Kuo Yen Wen
Performing Organization: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
Contract/Grant No.: US NSF GJ 36936
Sponsoring Organization: National Science Foundation, Washington, D.C.
Type of Report: Master's Thesis

Abstract: In this paper, a generalized procedure is established to analyze Fortran programs in order to deduce some parameters of machine organization. The maximum possible average speedup and the corresponding efficiency for the set of Fortran programs are determined through the use of the Fortran Analyzer. Then a method of using operation type statistics to infer parts of machine structure is shown. In this paper, EISPACK programs are used.
However, the same procedure can be applied to any set of meaningful Fortran programs.

Key Words: Eigenvalue Solution Programs; Storage Schemes; Alignment Networks; Broadcasting; Vector Operations

Availability: Release Unlimited. Security Class: UNCLASSIFIED. No. of Pages: 65