Center for Advanced Computation
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS 61801

CAC Document No. 2

A STATISTICAL SYSTEM FOR ILLIAC IV

by

Stewart A. Schuster

Center for Advanced Computation
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

December 11, 1970

ABSTRACT

The ILLIAC IV Statistical System will be designed to take advantage of the impressive computing power of the ILLIAC IV hardware and at the same time make this power easily available to users outside the Computer Science disciplines. It is designed to exist within the framework of the ILLIAC IV Information Management and Analysis System (CAC Document No. 1) and it obeys the conventions of input language and data handling required by that system. The Statistical System is essentially a set of standard, relatively independent statistical applications programs which are interlinked through intermediate matrix files and a common control language.

ACKNOWLEDGEMENTS

First, I would like to thank Professor Daniel Slotnick for sponsoring and encouraging my work in this area. Many of the ideas expressed within this document were originally presented to the ILLIAC IV Project in February 1970 by Dr. James L. Parker, who is currently at the University of British Columbia. I wish to express my gratitude for his continued counseling and encouragement. Also, I wish to thank Professor Michael Sher, Peter Alsberg, and Thomas Mason, who are members of the Center for Advanced Computation, for their patient reviewing and helpful suggestions given to this document. Finally, my thanks to the entire secretarial staff and especially Mrs. Shirley Brown, Mrs. Coni Allen, and Mrs. Pat Stippes for outstanding clerical assistance.

TABLE OF CONTENTS

1. INTRODUCTION
2. THE OVERALL SYSTEM DESIGN
3. USING THE STATISTICAL SYSTEM VIA THE INFORMATION MANAGEMENT SYSTEM
4. THE STATISTICAL PROGRAMS
   A) Introduction
   B) The Individual Routines
      1. Correlations
      2. Multiple Regression
      3. Principal Axis Factor Analysis
      4. Varimax and Oblimax Rotations
      5. Standard Scores
      6. Matrix Operations
      7. Analysis of Variance and Covariance
      8. T-Test
      9. Autocorrelation
      10. Step-Wise Multiple Regression
      11. Classification
      12. Discriminant Analysis
      13. Frequency Analysis
      14. Transformations
REFERENCES

LIST OF FIGURES

1. Control and information flow of the statistical system
2. Blocking structure superimposed on an 80 by 80 data matrix
3. Mapping of blocks into ILLIAC memory for accessing rows 0-15
4. Mapping of blocks into ILLIAC memory for accessing columns 16-31
5. Storage schemes for 16 x 16 matrices

APPENDIX A.
OUTLINE OF A BLOCKING SCHEME FOR IMPLEMENTING A STATISTICAL SYSTEM
APPENDIX B. MATRIX COMPUTATIONS ON ILLIAC IV

1. INTRODUCTION

The Center for Advanced Computation is proposing a total computational system which will embody statistical computations, information retrieval, linear programming, modeling and simulation techniques, and a differential equation solver system. This system will be tied together by the philosophy of its input control languages and the manner in which data structures are manipulated. The proposal contained in the following pages is for the design and implementation of a Statistical System which would be embedded in the ILLIAC IV Information Management System.

The ILLIAC IV computational capability is particularly applicable in the statistical area because most statistical computations involve a series of calculations carried out on several sets of similar data items. These computations are usually independent of one another, which means that they can be done simultaneously. This principle of simultaneity of computation is the essential design feature of the ILLIAC IV. It is expected that the parallel computational ability of the machine can yield elapsed times for statistical computations on the order of 50 to 100 times faster than the fastest current systems for statistical operations. The techniques themselves are relatively standard and are now fairly well known. These techniques follow the experience of the SOUPAC group at the University of Illinois on the IBM 7094 and the 360.

The system will be designed to be controlled as a sequential flow of computations with intermediate results specified at each step. A given step would be specified by indicating the statistical routine name, the source of its input or inputs, a series of parameters associated with the computation, and an indication of the various outputs and how they are to be named. The set of statistical routines and the control language are to be designed to be broad enough to handle the majority of standard computation techniques which are found in physical, biological, and social science areas. The system will also be designed to be flexible enough so that additional techniques may either be constructed from the parts which are already available in the system or may be added either as temporary or permanent parts of the system.

The Statistical System will be controlled through the Information Management System, which will be oriented toward the retrieval of large data bases. Thus, users will not have to provide for the reintroduction of their data to the machine at each run, and file manipulation will be straightforward. It is a primary goal that the system be usable by researchers outside the computer science area. Thus, the standard operations will be particularly simple, to avoid requiring an unnecessary level of detail to be learned just to use the system.

2. THE OVERALL SYSTEM DESIGN

Internal to the Statistical System will be a variety of subroutines which allow the individual programs to access input matrices, make various statistical calculations, and generate output. The formats of the input and output matrices will be sequential files, each record of which represents a row of a matrix. There will be a name and size label associated with each matrix. There will also be the option of having a vector of names associated with the columns of the matrix. Each row may also begin with a name; the presence or absence of this name will be indicated in the label record.
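As an illustration of this file convention, the sketch below reads such a file: one label record giving the matrix name, its dimensions, and flags for optional column names and per-row names, followed by one record per row. The exact field layout, the delimiters, and the function name are assumptions made only for this example; the real format will be fixed by the Information Management System, and Python is used here purely for exposition rather than the GLYPNIR of the eventual implementation.

    # Illustrative sketch only: the record layout below is an assumption,
    # not the IMS file format.  One label record, then one record per row.

    def read_labeled_matrix(path):
        """Read a matrix file: label record, optional column names, row records."""
        with open(path) as f:
            # Label record: matrix name, row count, column count,
            # flag for a column-name record, flag for per-row names.
            name, nrows, ncols, has_colnames, has_rownames = f.readline().split()
            nrows, ncols = int(nrows), int(ncols)
            colnames = f.readline().split() if has_colnames == "1" else None

            rownames, rows = [], []
            for _ in range(nrows):
                fields = f.readline().split()
                if has_rownames == "1":          # row record may begin with a name
                    rownames.append(fields[0])
                    fields = fields[1:]
                rows.append([float(x) for x in fields])

        return name, colnames, rownames or None, rows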
In the general language, matrix names would be considered as variables. These would be converted at execution time to file names on the disk. The system execution monitor would maintain a symbol table for these matrix names and their locations; the symbol table contains all the input and temporary alphanumeric file names and other relevant data. Any matrices called for which were in the memory hierarchy of ILLIAC IV would be brought in by the Information Management System (IMS). A description of this system is found in reference [3]. If necessary, these files would be reformatted by the IMS for their use in the Statistical Package.

It will be assumed by this system that all external files would be on the ILLIAC IV disk once execution of the problem program began. This would be done by using the Information Management System before calling the Statistical System. In the overall language specifications, there will be no control card which indicates that one is passing into the Statistical System and out of the Information Management System. The reason is that the user need not be aware that a specific routine, say Principal Axis Factor Analysis, is contained in the statistical subsystem.

Also, it should be pointed out that provision is to be made within the control language to output individual variables from each statistical program as well as matrices. These variables could be tested to indicate conditions which occurred in the computation. Within the control language these tests could cause changes in the computation sequence. For example, an F test may control a multiple regression step, or a variance limit may control the number of iterations of an iterative factor analysis.

Each routine will be implemented separately. There will be several standard matrix computation routines provided to the Statistical System designers, as well as to other systems, in the form of a utility package. These utility routines can be called, for example, internally from the statistical routines, so that the designer of a regression routine would not have to be concerned with creating matrix multiplication code. Other functions available to implementors of each routine will be: ROWINPUT, ROWOUTPUT, COLUMNINPUT, COLUMNOUTPUT, and EXIT. EXIT indicates the completion of the computation and returns control to the statistical monitor. Internally, these functions work on 16 by 16 blocks. These are connected either left to right for work on rows or from top to bottom for operating on columns. This method of dynamically allocating disk and PE memory to matrices has already been devised (see Appendix A).

It should be mentioned that with the data blocking scheme in 16 by 16 blocks, it is not necessary for an individual statistical program within the statistical system to be aware of the relation between the size of the matrices involved in a computation and the core size of the ILLIAC IV. If the problem fits in core, this scheme will allow it; otherwise, the partitions of the matrix will be brought in as needed. Figure 1 presents a diagram of the control and information flow of the Statistical System.

[Figure 1. Control and information flow of the statistical system: requests and reports pass between the Information Management System and the Statistical System, whose input language compiler/scanner and execution-time monitor drive the individual statistical routines and share the input, output, and temporary files.]
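To suggest how an individual routine might be written against the row-oriented primitives just described, the sketch below accumulates column sums one row at a time and writes the means back out. ROWINPUT and ROWOUTPUT are modeled by the hypothetical row_input and row_output callables; this interface, like the use of Python rather than GLYPNIR, is an assumption for illustration only.

    # Illustrative only: row_input / row_output stand in for the ROWINPUT and
    # ROWOUTPUT primitives described above; their real interfaces are not
    # fixed by this proposal.

    def column_means(matrix_name, row_input, row_output, n_rows, n_cols):
        """Accumulate column sums one row at a time, then emit the means."""
        sums = [0.0] * n_cols
        for i in range(n_rows):
            row = row_input(matrix_name, i)     # fetch the i-th row of the matrix
            for j in range(n_cols):
                sums[j] += row[j]
        means = [s / n_rows for s in sums]
        row_output("MEANS", 0, means)           # write the result as a one-row matrix
        return means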
3. USING THE STATISTICAL SYSTEM VIA THE INFORMATION MANAGEMENT SYSTEM

There are two points which should be emphasized in this discussion of the general concepts and techniques of the ILLIAC IV Statistical System. The first of these is the ease with which researchers would be able to use this system. The second is to demonstrate the unique capacity of the ILLIAC IV to operate with these particular statistical techniques.

Perhaps the easiest way to illustrate the use of the ILLIAC IV Statistical System would be to describe a hypothetical research problem and indicate the steps needed in order to solve it. Let us assume that a researcher has collected data on 300 variables affecting the social and economic structure of a suburb of a large city. Assume that he is interested in studying the forces which change the nature of this area. Perhaps he would begin his study by generating a correlation matrix and a factor analysis of these variables. First he would have to prepare his data in machine-readable form through the Information Management System (IMS). If the data existed within several files of the IMS, then the data would easily be extracted and sampled, and a new file containing the matrix would be prepared to be compatible with the Statistical System conventions. At this point the matrix would be ready to be operated on by the Statistical System as a data matrix. The IMS allows the researcher to manipulate his matrix, breaking it into subsections either by variables or by observations, by simply giving commands in the Control Language rather than by manipulating card images or by dividing up data images on tape. The researcher would then give the commands in the Control Language to do a correlation. The output of this correlation program would again be a standard matrix form which can be retained by the system. He would then specify that this output matrix be used as input to the factor analysis program. He could have the option of retaining the factor analysis matrix for further computation as well as outputting the results of the factor analysis. At this point the researcher might investigate the printed output he has received from the factor analysis program.

It should be noted that all of these operations could have been done with a single set of instructions to the Statistical System in one pass. With the high computational capacity of the ILLIAC IV, as complex an operation as this could be done with relatively short turn-around time. This allows the research to proceed forward at a rapid rate in an "interactive" mode. The IMS allows for input errors to be easily corrected. This minimizes the long-term delays of weeks or months now experienced by researchers attempting to input observational data.

It is possible that the researcher might decide to take another set of observations at a later time in order to discover any forces of change present in the data. He might also decide to input data on a continuing basis to update his original data matrix. With the IMS he could input these observations on a continuing basis and have them merged with his original data matrix by the machine. This minimizes the data handling by the researcher. It also allows him to specify how he wants his data to be manipulated without ever having to learn the detailed method by which the computer does these operations. It is also possible that he might want to relate his observations with data from a large standard data base, such as the census data.
He could simply specify the variables on which his data were to be matched with the census data, probably location, and which census data were to be incorporated in this data matrix. He could then use this expanded data matrix, which might have been expanded by several hundred variables, in the same computational process that he used before.

4. THE STATISTICAL PROGRAMS

A) Introduction

The following list specifies only those routines to be implemented as a direct result of this initial proposal. As new requirements are generated it will be quite easy to add new analytic techniques by simply obeying the conventions specified for this initial set. This feature will allow future users to broaden the base of tools originally made available.

B) The Individual Routines

The proposed statistical programs are as follows:

1. Correlations: This program generates a product-moment correlation matrix with associated outputs. The input to this program is a standard data matrix with the rows representing observations and the columns representing variables. Effectively these columns are standardized when the matrix is premultiplied by its transpose. The computational algorithm involves appending a column of ones to the data matrix and multiplying each row by its own transpose to form partial sums (a sum of outer products). These are then scaled at completion of the computation. Outputs are the cross-products matrix, the covariance matrix, and the correlation matrix. The output matrix is built up in core as a lower triangular matrix to maximize the size of the in-core computation. The vectors of means and standard deviations are by-product outputs of this process, as is the matrix of simple linear regression coefficients.

2. Multiple Regression: The input matrix is taken from the Correlation Program outputs, arranged so that the dependent variables are on the right side and bottom of the input matrix. The inverse of the independent variables matrix is computed and the coefficients are computed simultaneously for each of the dependent variables. Outputs are the linear coefficients for the prediction, the correlation coefficient matrix, and a variable denoting the row on which singularity occurred, if it occurred during inversion. Due to the matrix blocking method described earlier, the program is not dependent on the size of the input.

3. Principal Axis Factor Analysis: The purpose of this computation is to generate a matrix F which has n rows (where n is the number of variables in the study) by f columns, f being some integer smaller than n, such that FF' = R, where R is the intercorrelation matrix, and to output the factor matrix. The Jacobi method is used. If only eigenvalues and eigenvectors are required, the input matrix may be any real matrix. Single variables output by this program are the number of factors and the per cent of variance accounted for by those factors.

4. Varimax and Oblimax Rotations: These are schemes for rotating factor matrices output from principal axis solutions. The first is an orthogonal method which inputs the factor matrix and outputs the rotated factors and the rotation matrix. The second technique is an oblique rotation of the factor structure which has the same input and outputs and also produces a factor intercorrelation matrix.

5. Standard Scores: This program inputs an observation matrix and outputs the means and standard deviations of each of the columns. It also outputs a scaled data matrix of the observations which has a mean of zero and a standard deviation of one for each of the columns.
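To make the scaling concrete, the following sketch computes the column means, the standard deviations, and the standardized (z-score) matrix; with standardized columns, the product-moment correlation matrix produced by the Correlations program above is simply Z'Z divided by the number of observations. The sketch is illustrative only: it is written in Python for exposition rather than in the implementation language, and the function and variable names are assumptions.

    # Illustrative sketch of column standardization (z-scores).
    # With standardized columns, the correlation matrix is Z'Z / n,
    # which is the relation the Correlations program exploits.

    def standard_scores(rows):
        """rows: list of observations (lists of floats). Returns means, sds, z-rows."""
        n = len(rows)
        p = len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(p)]
        sds = [(sum((r[j] - means[j]) ** 2 for r in rows) / n) ** 0.5 for j in range(p)]
        z = [[(r[j] - means[j]) / sds[j] if sds[j] > 0 else 0.0 for j in range(p)]
             for r in rows]
        return means, sds, z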
The program also outputs a single variable which indicates whether any of the standard deviations are zero, indicating that one of the input variables was a constant.

6. Matrix Operations: This program is a series of operations tied together. Each operation is referenced in the control language as though it were a separate program in the system. The following is a list of the operations and their inputs and outputs. See Appendix B for a discussion of the matrix computation routines that will be provided as a utility for use by this system and other systems.

1) Matrix addition - two inputs, one output, no variables out.
2) Matrix multiplication - two inputs, one output, error if not conformable.
3) Matrix transposition - one input, one output, no variables out.
4) Matrix inversion - one square input, one output, one variable indicating the row on which the matrix was singular, if at all.
5) Column delete - one input, one output.
6) Row delete - one input, one output.
7) Constant - single number input, either full or diagonal matrix output.
8) Diagonal - makes a row vector from the diagonal of the input matrix.
9) Element multiply and element divide - two inputs, one output, divides or multiplies elements of the second matrix into the first matrix.
10) Horizontal and vertical augment - multiple inputs, one output, glues matrices together.
11) Identity - generates an identity matrix of a specified size.
12) Vector - makes a diagonal matrix of a vector.
13) Move - moves one matrix to another.
14) Partition - slices a matrix into arbitrary chunks.
15) Permute - permutes rows or columns of a matrix.
16) Scalar - multiplies a matrix by a constant.
17) Subtract - subtracts the second matrix from the first.

The bulk of these operations are standard or have been found to be useful by the SOUPAC group at the University of Illinois.

7. Analysis of Variance and Covariance: The design and implementation of a completely new and general Analysis of Variance system is a difficult project. Therefore, this report recommends that the BALANOVA 5 program in the SOUPAC system (see Reference [1]) be used as a first model. It is applicable to a wide range of balanced designs and will approximate the least squares solutions in the event that the number of replications in each cell is not proportional. It produces the least squares solutions in proportional or equal replication designs. It also has as broad a coverage of the standard designs as could be delivered in a single program. This system also has the benefit that it has been fairly widely used, resulting in a fair test of its flexibility and a broad base of experience with its form of approach.

8. T-Test: This program inputs an observation matrix and, in some cases, a vector of means and standard deviations from a previously analyzed population. It compares the input variables either in pairs or in all combinations and produces t-tests of deviations from specified means of arbitrary populations or from means of other analyzed populations.

9. Autocorrelation: This program computes autocorrelation coefficients for an arbitrary number of variables on a series of time lags and also computes the power spectrum coefficients to give a harmonic analysis of the variables as a function of time.

10. Step-Wise Multiple Regression: The essential process here is the same as the multiple regression program described above. The difference is that an estimate is made of the contribution of each variable in the independent set to the prediction of the dependent variable.
At each step one variable is added to the independent set. The choice is the independent variable which most improves the least squares curve fit. This yields not only a set of predictors but also ranks them according to their contributions. If, at a later stage in the computation, it is seen that a variable is no longer significant, it will be removed from the computation. This program necessarily operates on only one dependent variable. The outputs are the same as those of the multiple regression program plus a trace of the computation process. The user of the program controls the F level at which variables are to be entered into or ejected from the computation process.

11. Classification: The classification program is designed to determine group membership of an individual on a probabilistic basis. The group structure is output from a previously executed discriminant analysis program (see 12 below). For each individual the chi-square and probability of membership in each group is given. This matrix is output.

12. Discriminant Analysis: The purpose of this program is to give a function which will allocate a set of p variates into k different populations. The strategy is to maximize the ratio of the between-group variance to the within-group variance. The outputs are the classification vectors, group means, and dispersion matrices. The result is an eigenvalue, eigenvector solution. Specific references on the computational technique are found in the SOUPAC manual (Reference [1]).

13. Frequency Analysis: The purpose of this program is to produce generalized frequency tables and chi-squares along requested dimensions. These frequency tables are output for other uses, the first n variables in the row being the n-1 control variables and the column being the last control variable. For each table a set of control variables is specified, and for each control variable a set of values which are to be ignored is specified. There is also a list of variables, values, and boolean conditions which must be met for any observation to be entered into the frequency counting. This last feature is particularly useful for large scale complex data bases such as those handled by the Information Management System. Input is a standard observation matrix. Each row of each table is allocated separately so that rows are allocated in storage only for those elements which actually occur in the data. This means that the limits of the data do not have to be specified by the user and that the maximum number of tables is processed on each pass of the data. The input parameters are in a block structure in the sense that within one set of boolean conditions a wide variety of tables may be indicated. It is possible that on the first pass, while only one block is being processed, the program can range all the variables so that on subsequent passes it can determine the maximum amount of core required for a given set of tables, and thereby do more than one block per pass on multiple passes; otherwise the program would have to make one separate pass simply to range the data. For large data bases an option may be inserted that allows the user to specify the ranges of the data items.

14. Transformations: The transformations program is similar to the matrix program except that its basic element of operation is a data row rather than a whole matrix. A set of codes for the transformation program represents a series of row operations which is repeated for each row of the input matrix.
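The row-by-row character of the transformations program can be illustrated with a small table-driven interpreter. The operation codes, their argument conventions, and the Python dispatch-table mechanism below are assumptions for exposition; they are not the SOUPAC transformation codes, though the dispatch table plays the role of the computed GOTO structure mentioned at the end of this section.

    # Illustrative only: operation codes and their semantics are assumptions.
    # Each code names an operation in a dispatch table; the code list is
    # replayed for every row of the input matrix.

    def add(row, a, b, dest):        row[dest] = row[a] + row[b]
    def subtract(row, a, b, dest):   row[dest] = row[a] - row[b]
    def recode(row, a, table, dest): row[dest] = table.get(row[a], row[a])

    OPS = {"ADD": add, "SUBTRACT": subtract, "RECODE": recode}

    def transform(rows, codes):
        """codes: list of (opname, args...) applied in order to each row."""
        for row in rows:
            for opname, *args in codes:
                OPS[opname](row, *args)   # table-driven dispatch
        return rows

    # Example: new column 2 = column 0 + column 1 for every row.
    # transform([[1.0, 2.0, 0.0], [3.0, 4.0, 0.0]], [("ADD", 0, 1, 2)])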
There is also a set of operations which are executed just once after the total matrix has been passed by the program. At the completion of the basic set of instructions the current row, as modified by the user, is read out and the next row is read in. It is also possible to cause the output of special rows into other matrices during the processing of the basic matrix. There is also a set of row elements which are carried forward and not zeroed out between the processing of rows. The basic transformations are done between elements of a given row to produce new elements or to replace old elements of that row. These instructions also have the capability of branching to symbolic labels based on tests of certain variables within a given row. The types of operations that are common are ADD, SUBTRACT, RECODE, PERMUTE, etc. A complete list of the set used by the SOUPAC system is to be found in the SOUPAC manual (Reference [1]). This program is written as a table-driven structure with each operation as an entry point from a larger computed GOTO (or CASE statement as in ALGOL) so that the number of operations is essentially unlimited once the basic structure is present.

REFERENCES

[1] "SOUPAC Statistically Oriented Users Programming and Consulting", DCS Report No. 370. Urbana, Illinois: Department of Computer Science, University of Illinois at Urbana-Champaign, (December 1969).

[2] Chouinard, P. "Outline of a blocking scheme for implementation in a GLYPNIR written statistical system", Memorandum. Urbana, Illinois: ILLIAC IV Project, University of Illinois at Urbana-Champaign, (January 9, 1970).

[3] Schuster, Stewart A. "An Information Management and Analysis System for ILLIAC IV", CAC Document No. 1. Urbana, Illinois: Center for Advanced Computation, University of Illinois at Urbana-Champaign, (December 11, 1970).

[4] Sameh, A. "On Jacobi and Jacobi-like Algorithms for a Parallel Computer", Journal of Mathematics of Computation, (July 1971) (in press).

[5] Parker, James L. "The ILLIAC IV Statistical System", Proposal submitted to the Graduate College by the ILLIAC IV Project, University of Illinois at Urbana-Champaign, (February 1970).

APPENDIX A
OUTLINE OF A BLOCKING SCHEME FOR IMPLEMENTING A STATISTICAL SYSTEM [2]

For a statistical system of any significance to be written for ILLIAC IV, it is essential that the various programs be able to handle arrays that are not memory-contained. This is not the same insurmountable problem which occurs in the compiler area, since one knows ahead of time how an array is going to be referenced. Also, one has direct control of data overlay by explicit I/O statements, because the programmer is both the user of the system and the designer of the data flow through the system. In this way, he can monitor most of the activities. For most statistical applications, array referencing is by rows and, less frequently, by columns. The purpose of this appendix is to propose methods of blocking an array such that rows and columns are easily accessible for use within a statistical system. This particular presentation is for an arbitrary statistical program written in GLYPNIR which has one n by n matrix to be handled. The extension to several matrices is straightforward.

A solution is to block the array as in Figure 2 into 16 by 16 blocks. Figure 2 considers an 80 by 80 data matrix. For example, the problem program may determine to do a row operation on the i-th row of a data matrix (hereafter referred to as A).
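To make the blocking concrete, the sketch below computes which 16 by 16 blocks must be fetched to obtain a given row of an n by n matrix, and at what offset the row lies within each block. The row-major block numbering used here is an assumption for illustration; the numbering actually used in the memorandum (Figure 2) may differ, and Python again stands in for GLYPNIR.

    # Illustrative only: assumes blocks are numbered row-major across the matrix.

    BLOCK = 16

    def blocks_for_row(i, n):
        """Return (block numbers, row offset) covering row i of an n-by-n matrix."""
        blocks_per_side = (n + BLOCK - 1) // BLOCK       # ceiling division
        block_row = i // BLOCK                           # which band of blocks
        offset = i % BLOCK                               # row position inside each block
        first = block_row * blocks_per_side
        return list(range(first, first + blocks_per_side)), offset

    # For an 80 by 80 matrix (a 5 x 5 grid of blocks), row 17 lies in
    # blocks 5..9 at offset 1:  blocks_for_row(17, 80) -> ([5, 6, 7, 8, 9], 1)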
The problem program would always do arithmetic in a memory block which is at least large enough to contain 16 rows or columns of A. This work buffer shall be referred to as BUFFER1. If the i-th row is in BUFFER1, processing proceeds. If the i-th row is not in BUFFER1, the problem program determines whether or not it needs the current contents of BUFFER1. (The problem of saving work buffers is discussed later.) The problem program can then call a GETROW routine which indicates the following:

1. It needs rows (by default of calling GETROW).
2. Return the i-th one.
3. It is within matrix A (in case there is more than one matrix in the program).
4. The address of BUFFER1 (there may be BUFFER2, BUFFER3, etc.).

The GETROW subroutine looks to see what blocks it has stored in memory. If the required blocks are in memory, the problem program gets the blocks directly (GETROW moves the data from its save area to BUFFER1, see Figure 3). Any additional blocks needed can be secured by sending a request to the Information Management System (IMS). IMS can be given a list of blocks needed and where to put them in ILLIAC memory. Similarly, PUTROW saves the blocks. The use of GETCOL and PUTCOL is also similar to GETROW. The trick is that within BUFFER1 the data is worked on as rows (see Figure 4). It is the job of GETCOL and PUTCOL to get the column into BUFFER1 in row form.

There are two basic solutions to this problem. First, within each 16 by 16 segment, data is stored "straight". See Figure 5 for a diagram of straight storage. For rows, no remapping is done; data is copied directly by GETROW into BUFFER1. For columns, each 16 by 16 block is transposed and the blocks are lined up side by side. It should be possible to transpose four such 16 by 16 blocks at a time. The second solution would be that within each 16 by 16 segment, data is stored "skewed". See Figure 5. For rows, a row is brought up, routed back into "straight" alignment, and then stored. Columns may be "indexed out", routed, and stored into BUFFER1 as rows. Note that routing in both cases implies a 16 PE end-around route, which is not particularly difficult to implement. The trade-off between straight and skewed storage within the 16 by 16 blocks is that with straight storage one has no remapping for rows but a fairly cumbersome transposition remapping for columns, while with skewed storage one has essentially similar, relatively straightforward remappings for both row and column access. The outstanding question is, "What is the ratio of row accesses to column accesses?"

Two more routines would probably be useful in the repertoire of data retrieval routines, namely GETDIAGONAL and PUTDIAGONAL, for handling the main diagonal of an array. It is clear that the skewed method of storage complicates the retrieval of the main diagonal. It should also be noted that the disk is used only if the GET and PUT routines cannot store the data in their own internal save areas.

[Figure 2. Blocking structure superimposed on an 80 by 80 data matrix.]
[Figure 3. Mapping of blocks into ILLIAC memory for accessing rows 0-15.]
[Figure 4. Mapping of blocks into ILLIAC memory for accessing columns 16-31.]
[Figure 5. Storage schemes for 16 x 16 matrices: "straight" storage versus "skewed" storage, in which each row is cyclically shifted across the PEs.]
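The skewed scheme of Figure 5 can be modeled directly. In the sketch below, each row of a 16 by 16 block is cyclically shifted across the processing elements, so that any row or any column of the block can be gathered with a single end-around route. The shift direction and index conventions are assumptions for illustration, and Python is used in place of the GLYPNIR routing primitives.

    # Illustrative model of "skewed" storage for one 16 x 16 block.
    # Row r is cyclically shifted r places across the PEs, so every column
    # of the block has exactly one element per PE and can be "indexed out".

    N = 16

    def skew(block):
        """block[r][c] -> skewed[r][pe], with row r rotated right by r PEs."""
        return [[block[r][(pe - r) % N] for pe in range(N)] for r in range(N)]

    def get_row(skewed, r):
        """Undo the rotation of row r (a 16-PE end-around route)."""
        return [skewed[r][(c + r) % N] for c in range(N)]

    def get_col(skewed, c):
        """Column c lives in PE (c + r) % N of word r; gather it as a row."""
        return [skewed[r][(c + r) % N] for r in range(N)]

    # Round-trip check on a sample block:
    block = [[r * N + c for c in range(N)] for r in range(N)]
    s = skew(block)
    assert get_row(s, 3) == block[3]
    assert get_col(s, 5) == [block[r][5] for r in range(N)]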
In summary:

1. Divide the matrix into 16 by 16 blocks. These are canonical units which map well into disk segments.
2. Provide three GET and three PUT routines which keep track of the data blocks upon command by the problem program. These six routines are all entries in the same subroutine. If that is not possible, it can be made one subroutine with six options.
3. Data is stored in memory in 16 by 16 blocks if possible. After available memory is used, write the rest out on disk.
4. Whether conceptually the data is rows, columns, or diagonals, the ILLIAC IV is a row machine and should be used that way. The GET/PUT routines will perform the mappings into row order.
5. All arithmetic is done in work buffers.
6. Write it in GLYPNIR or COCKROACH.

A few additional points should be made clear. The GET and PUT routines try to save blocks in memory first. If the blocks do not fit, they are written out onto disk. This function is transparent to the statistical programs themselves. It is possible, therefore, to write the statistical programs using temporary GET and PUT routines which only store in memory. After the statistical programs are written, expansion of GET and PUT to handle matrices that are not memory-contained greatly enhances the power of the statistical system, and may be done without changing the original statistical code.

The discussion in the first part of this appendix assumes that a statistical program needs buffer space at least large enough to handle 16 complete rows or columns of the data matrix. This potentially represents some upper bound on the size of matrices the system can handle. The system could perhaps be designed to operate on only four 16 by 16 blocks at a time in a standard 16 by 64 work buffer. This would guarantee the facility of working on problems where it is not always possible to contain 16 complete rows or columns in memory. Implementation of such a scheme increases control-statement overhead as well as flexibility.

APPENDIX B
MATRIX COMPUTATIONS ON ILLIAC IV [4]

Since matrix computations, such as multiplication and inversion, are necessary utilities for several different applications, they should be provided to all users and system designers in the form of a subroutine package. Part of such a package is currently being implemented, and a proposal to complete this work has been submitted to the Advanced Research Projects Agency. However, several other matrix operations are required in the context of a Statistical System and thus would be provided as part of this system. These routines are listed and explained in Section 4.B.6. Also listed, depending on the generality of input forms of the matrices, are some routines that may have the same names as those provided as utilities. This is necessary since a data matrix may be partitioned in any form for the Statistical System. Each partition's data may exist as a separate record, or as parts of several records, in files within the Information Management System. It may be very inefficient to produce the concatenation each time to create one record for the matrix input required for the utility matrix computation routines. Since the partitions may be needed in later computations, it would be disadvantageous to destroy their forms.
It should be understood that the duplicated statistical routines only contain algorithms to find the partitions in the sequence required by the calculation. Once the proper partitions are found, the routine will call the utility routine to execute the calculation. New partitions are then found and the utility routine is called again until the computation is complete. The discussion implies that although a matrix multiplication routine may be provided to all users, we will also need a driver routine which finds the data as needed in the calculation and then calls the provided routine to perform the actual sub-calculations; a sketch of such a driver follows the list of utility routines below.

The routines that would be provided as a utility to the Statistical System designers are listed:

1) Matrix Multiplication, Addition, and Subtraction
2) Square Root
3) Matrix Inversion
4) Eigenvectors
5) Eigenvalues
6) Evaluation of Determinants
7) Matrix Transposition
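The driver idea described above can be sketched for blocked matrix multiplication. The names get_block, put_block, multiply_blocks, add_blocks, and zero_block are hypothetical stand-ins for the partition-finding machinery of Appendix A and for the in-core utility routines listed above; the sketch only illustrates the order in which partitions would be fetched and the utility routines invoked, not the actual ILLIAC IV code.

    # Illustrative driver only.  All routine names are assumed stand-ins.

    BLOCK = 16

    def blocked_matmul(A, B, C, nblocks, get_block, put_block,
                       multiply_blocks, add_blocks, zero_block):
        """C = A * B, where each matrix is stored as nblocks x nblocks blocks."""
        for i in range(nblocks):
            for j in range(nblocks):
                acc = zero_block()                       # 16 x 16 accumulator
                for k in range(nblocks):
                    a = get_block(A, i, k)               # fetch the needed partitions
                    b = get_block(B, k, j)
                    acc = add_blocks(acc, multiply_blocks(a, b))
                put_block(C, i, j, acc)                  # store the finished block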