Report No. 306
COO-1469-0109

A GENERALIZED ASSEMBLER*

by
Christos Georgiou

January 1969

Department of Computer Science
University of Illinois
Urbana, Illinois 61801

* Supported in part by contract U. S. AEC AT(11-1)1469 of the Atomic Energy Commission, and submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, January 1969.

PREFACE

The ASSEMBLER described in the following pages assembles programs written in assembly language for small machines: one- or two-address machines that have indirect addressing but neither index registers nor base-displacement addressing, that have words of up to 48 bits (16 octal digits), and that permit double relocation. Macros are not allowed. The ASSEMBLER assembles according to the description of the particular machine that it reads. In this sense it is a "general" assembler.

The author wishes to thank Professor C. W. Gear for his valued guidance and support during the preparation of this thesis. The technical assistance provided by Professor Gear is especially appreciated. Thanks are also extended to Miss Barbara Hurdle for typing the final manuscript.

TABLE OF CONTENTS

1. INTRODUCTION
2. ASSEMBLY
3. PASS I
4. PASS II
5. NAMES AND EXPRESSIONS
6. PSEUDO ORDERS
7. DESCRIPTION OF THE MACHINE
8. DIAGNOSTIC MESSAGES
BIBLIOGRAPHY

1. INTRODUCTION

The following sections describe the ASSEMBLER, a collection of programs written in assembler language (Assembler F or G) and operating on the IBM System 360/50-75. It assembles programs for the Digital Equipment Corporation's PDP-7, PDP-8, and PDP-9 computers. This collection of programs is general and flexible: it does not depend upon a particular machine but accepts a description of one, according to which programs written in the assembly language of that machine are decoded, and the output produced can be loaded into the machine. After a program has been assembled it may be punched on cards or paper tape or may be saved in a file.

2. ASSEMBLY

An assembler transforms symbolic language source programs into machine language. In the case of the PDP-8, every machine instruction occupies exactly one location in memory. The assembly language program is a sequence of input lines to the assembler which specify these instructions in a symbolic form. The assembler reads these lines, decodes them, and constructs, or assembles, the corresponding binary words for the specific computer.

In most assemblers, symbolic names for memory locations are defined by their appearance at the beginning of an input line. Symbolic names for operation codes appear next, sometimes followed by operands. Comments may or may not follow. The assembler lists a value corresponding to the value of the operator, augmented by the value of the operand. Each such value is associated with an address by means of the program counter (PC). The PC contains a value which is incremented after each word is generated, so normally assembled words are placed in serially ascending locations in memory. Some input lines do not generate words but are instructions to the assembler, for example the pseudos END, ORG, DC, etc.
The symbolic information on each assembly language line is grouped into four fields: the location name or label, the operation code or mnemonic, the address field or operand, and the comment field. These fields are usually delimited by blanks, and the ASSEMBLER assumes blanks as delimiters. If the user wishes to use a different symbol, he has to define it in the tables that are used for the translation of the particular field. The fields are described in the description of the particular machine.

The location name usually starts at character 1 and is terminated by the first blank. We say "usually" because even a machine that has the operation code first, the location next, and the address last, or that uses an asterisk (*) as a delimiter, can be handled provided that the user describes it so to the ASSEMBLER. If the location field is non-empty it may contain a name of up to eight characters, beginning with a letter. Any variable used in the program must be defined by its appearance in the location field. The variables used with some pseudos, e.g. EQU, must be predefined, that is, defined at some point before the pseudo is processed. For instance

      A    EQU   B+1
      B    EQU   10

is illegal, whereas

      B    EQU   10
      A    EQU   B+1

is allowed. As another example,

      A    EQU   B+10
      B    EQU   2*A-30

does not define B as 10 and A as 20.

The operation code field, or mnemonic field, is the expression starting with the first non-blank character after the location field and ending with the next blank. Any variable appearing in the operation code field must be an operation code. In the particular case of the PDP computers, if the operation code is a microinstruction then the address field, or operand field, is empty. Otherwise it starts with the first non-blank character after the mnemonic and ends with the next blank. Any variable appearing in the address field must be a label. The above three fields may total up to 72 characters.
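The field layout just described can be illustrated with a minimal Python sketch. It is not the ASSEMBLER's own code: the function name `split_fields`, the `Fields` tuple, and the assumption that a comment is always introduced by the comment character (the slash of the PDP-8 by default) are hypothetical choices made for illustration. The sketch assumes blank delimiters and a label that starts in column 1.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    label: str
    mnemonic: str
    address: str
    comment: str


def split_fields(line: str, comment_char: str = "/") -> Fields:
    """Split one source line into label, mnemonic, address, and comment."""
    # Only the first 80 characters of a card image are considered.
    code, _, comment = line[:80].partition(comment_char)
    # A label, if present, starts in column 1; a leading blank means no label.
    has_label = bool(code) and not code[0].isspace()
    parts = code.split()
    label = parts.pop(0) if has_label and parts else ""
    mnemonic = parts.pop(0) if parts else ""
    # Whatever remains (e.g. "I PTR" for indirect addressing) is the address.
    address = " ".join(parts)
    return Fields(label, mnemonic, address, comment.strip())
```

For a line such as `LOOP TAD COUNT / add the counter` the sketch yields the label `LOOP`, the mnemonic `TAD`, the address `COUNT`, and the comment text. A machine with a different field order or delimiter would need a different splitter, driven by the machine description.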
The comment field starts at the end of the address field and may extend up to the 80th character. Comments after the address field may or may not be preceded by the particular comment character: the slash (/) in the case of the PDP-8, the asterisk (*) in others. The comment has no effect on the binary output of the assembler; it is only copied onto the assembly listing. It is very useful to the programmer as a record of what he is trying to perform with a particular instruction or group of instructions, as well as to other programmers who might want to use the program.

The end of the program is sensed by the END pseudo. In the case of Load-and-Go assembly, the last instruction executed will be a transfer to the address evaluated from the END pseudo. When the output of the assembler is input to the loader, this address is placed in a suitable position on a transfer card image.

The assembler provides two kinds of output: the binary object "deck" and the assembly listing. The former is the machine program in a form acceptable to the loader of the particular computer, and the latter helps the programmer to find certain programming errors in his program.

The ASSEMBLER is a typical two-pass assembler. The first pass produces a table of all symbolic addresses used and their address values, whereas the second substitutes these values into the original symbolic form to get the binary form.

3. PASS I

To construct the table, the input lines of code are read one at a time. If a name appears in the location field, it is put into the table. The assembler assumes that the first instruction read is placed into location zero, the next into location one, and so on. In order to know what address value to associate with a name in the location field, the location counter keeps an account of the space used; it is incremented after each instruction has been handled.
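The two-pass scheme can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ASSEMBLER's code: lines are taken as already split into hypothetical `(label, mnemonic, operand)` tuples, every line other than END occupies exactly one word, an operand is a single name or decimal number, and a Python dictionary stands in for the name table.

```python
def pass_one(lines):
    """Pass I: build the name table {label: address}."""
    names = {}
    location = 0  # the first instruction is placed into location zero
    for label, mnemonic, _operand in lines:
        if mnemonic == "END":
            break
        if label:
            if label in names:
                raise ValueError(f"NAME DOUBLE-DEFINED: {label}")
            names[label] = location
        location += 1  # one word per line in this simplified sketch
    return names


def pass_two(lines, names):
    """Pass II: substitute address values for the symbolic operands."""
    words = []
    location = 0
    for _label, mnemonic, operand in lines:
        if mnemonic == "END":
            break
        if operand == "":
            value = None
        elif operand.isdigit():
            value = int(operand)
        elif operand in names:
            value = names[operand]
        else:
            raise ValueError(f"NAME UNDEFINED: {operand}")
        words.append((location, mnemonic, value))
        location += 1
    return words
```

Because Pass I sees the whole program before Pass II starts, an instruction may refer to a label that is defined further down, as `TAD A` does in a program whose word `A` follows the instructions.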
Each entry in the table consists of the name, the associated address value, a pointer pointing to the left, one pointing to the right, and information on relocation, where by left and right we mean alphabetically smaller and bigger entries respectively.

Since a check has to be made for double definition of names during Pass I, it is necessary to determine whether the name just read in the location field is already present in the table. This involves some sort of table look-up procedure, and the binary tree method has been chosen for this purpose. The search mechanism consists of comparing the desired name with the entry in a fixed starting position, which plays the role of the middle entry in an ordinary binary search. If the desired name matches, there is nothing further to do; otherwise the search continues with the entry pointed to by the appropriate one of the two pointers (chain addresses). The table built is a tree with two branches at every node, and each node is labeled by a table entry.

Storing the information this way has the advantage that the entry at the start, or root, of the table no longer needs to be the middle entry. Whichever entry is used, it is only necessary that all other entries be to the left or right of it, depending on whether they are smaller or larger than it. It also has the advantage that names can be added to the table at any time. If a new name is to be entered, a search is followed which will finally come to the end of a chain, indicated by a zero link address. If the name is found in the search then it obviously should not be re-entered. When the end of a chain is reached, the new entry can be added at that point and the appropriate link established. This method has the additional advantage that it is easy to print the table in alphabetical order when the time comes. It has the disadvantage that if the names are entered in alphabetical order, the tree is one-sided.
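The look-up and entry procedure described above can be sketched as follows. The class and method names (`Entry`, `NameTable`, `lookup`, `enter`, `in_order`) are hypothetical, Python object references take the place of the chain addresses, `None` plays the part of the zero link, and the relocation information is left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Entry:
    name: str
    value: int
    left: Optional["Entry"] = None   # alphabetically smaller entries
    right: Optional["Entry"] = None  # alphabetically bigger entries


class NameTable:
    def __init__(self) -> None:
        self.root: Optional[Entry] = None

    def lookup(self, name: str) -> Optional[Entry]:
        """Follow the links from the root until the name or a zero link is met."""
        node = self.root
        while node is not None and node.name != name:
            node = node.left if name < node.name else node.right
        return node

    def enter(self, name: str, value: int) -> bool:
        """Add a name at the end of its chain; return False if already present."""
        if self.root is None:
            self.root = Entry(name, value)
            return True
        node = self.root
        while True:
            if name == node.name:
                return False  # double definition: do not re-enter
            side = "left" if name < node.name else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Entry(name, value))
                return True
            node = child

    def in_order(self) -> Iterator[Entry]:
        """Yield the entries in alphabetical order (iterative in-order walk)."""
        stack, node = [], self.root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            yield node
            node = node.right
```

Entering the names A, B, C in that order would hang each new entry off the right link of the previous one, which is the one-sided case.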
A one-sided tree is as slow as a sequential search, but takes up more space because of the pointers.

If the input lines contained only instructions, the simple mechanism of incrementing the location counter by one for each line would be sufficient. However, there has to be one pseudo instruction in any code, the END pseudo, which tells the assembler that the whole deck has been read so that the next pass can begin. Of course there are many other pseudos. In order that Pass I recognize the difference between instructions and pseudo orders, it must examine the mnemonic. The mnemonic is a string of characters similar to a symbolic name, so similar techniques are used to handle mnemonics. In this case, the table of mnemonics has been set up by the user: the mnemonics are the first cards of the description of the particular machine, and they are read and stored using the same binary tree technique as the names. Each entry consists of the mnemonic, the actual code, the left and right pointers, and a branch address which gives the address of the code used to handle the mnemonic or pseudo order.

The typical flow of Pass I is the following. The input line of code is read (if it is a comment no particular action is taken, so we will not mention comments) and the mnemonic is extracted. The mnemonic is looked up in the mnemonic table. If it is not present, then, since macros are not allowed, there is an error; if it is present, a branch is made to the address found in the table. The appropriate section of code then takes care of the rest of the line.

When the input line, which is usually a card image from magnetic tape or a disk file, has been read into the program area of memory and a copy has been produced for later use in Pass II, the assembler must extract various fields from it in order to form names, mnemonics, and addresses. The positions and lengths of these fields are described to the ASSEMBLER by the user. The various fields are then handled as follows.
Location field - This may be a fixed length field. Characters are extracted one at a time. If the first non-blank character is other than an alphabetic character, it is not normally an allowed name. Otherwise subsequent characters must be alphabetic, numeric, or blank. A blank indicates the end of the name, in which case the field should contain no more non-blank characters. In other words, blanks are not allowed in names. If the user wants a delimiter other than the blank, he has to describe it in the table used for the translation and testing of the characters making up the location field.

Mnemonic field - This may also be a fixed length field, except that it is required that it start in the first column of the field. A program similar to the one above is used, except that the first character must be a non-blank character. There is no logical reason to restrict mnemonics to start with an alphabetic character and contain only alphanumeric characters, but they frequently are so restricted, which makes it convenient to use the same reading procedures.

Address field - This field may differ from instruction to instruction, in many cases containing a number of subfields separated by commas. Within each subfield the address can be an expression involving names, numbers, and some of the arithmetic operators such as plus (+), minus (-), and multiply (*). We will come back to names and expressions later. The first step in decomposing such a subfield is to break it into separate elements such as names, operators, numbers, etc. There are many ways to do this. It is, however, faster to perform a lexicographic scan of the field first. By a lexicographic scan we mean an analysis that concerns itself with one character at a time, taking into account only the immediate neighbor of the character. The recognition of the elements of the subfield is performed very simply by scanning from left to right and noting the following.
Names start with a letter and contain letters or digits. Numbers start with a digit and contain only digits.

Starting from the left, the next character is examined. If it is a letter, a name is recognized. A subscanner for name recognition examines consecutive characters until a non-alphanumeric character is read. This signals the end of the name. Since names are restricted to a maximum length, a check is made for excessive characters. After the string of characters representing the name has been scanned, control returns to the basic recognizer. The next character is examined and a branch is made to the recognizer for numbers, names, or operators.

The recognition process for names, numbers, etc. involves more than just checking for the existence of the name or number. Something meaningful has to be done with the address. Although the address fields of instructions need not be translated until Pass II, the address fields of some pseudo orders affecting the location counter have to be converted to numbers in Pass I. This means that the characters in a name have to be packed together in a form suitable for the table look-up process, that decimal digits have to be converted into binary integers, and that the calculation indicated has to be performed between the operands. This particular action for certain pseudo orders is realized by branching to the address that goes with every mnemonic or pseudo. If the expression has to be evaluated in Pass I, it normally has to be well defined, that is, all of the names appearing in the expression must be previously defined in the name table.

4. PASS II

The purpose of Pass II of the assembler is to convert the source language into binary, using the name table constructed in Pass I to convert the addresses and the mnemonic table to convert the instructions.
To do this, a copy of the source program is read, a line at a time, and many of the steps of Pass I are repeated. The location field is ignored because it was completely handled in Pass I. The mnemonic field is examined and a table look-up performed. In this pass, the mnemonic table provides both a branch address for instructions or pseudo orders and, in the case of instructions, the binary code. The address subfields are converted into binary numbers for packing into the instructions or for use in pseudo orders. The code that handles pseudo order addresses in Pass I is re-used for this process. A location counter is maintained in a manner identical to Pass I.

Pass II is also terminated when the END pseudo order is read. The END pseudo can involve an address field which is used to provide a starting address at execution time. In the case of a Load-and-Go assembler, the last instruction executed would be a transfer to the address evaluated from the END pseudo. In our case the output of the ASSEMBLER is input to a loader, and this address is placed in a suitable position on a binary card image.

5. NAMES AND EXPRESSIONS

We have mentioned names and expressions in the process of decomposing the address field. A name is a symbol which stands for a numeric value. It may stand for a self-defining value, called a constant, or it may stand for a value which is defined elsewhere, a variable. A variable may be an operation code or a pseudo order, in which case it is defined by the mnemonic table read and built from the description of the particular machine, or it may be a label, in which case it is defined by its appearance in the location field of some input line. If this line corresponds to a memory location, then the defined value of the label is the address of this location. If the operation field of the line is the pseudo EQU or DC, the defined value of the label is the value of the expression in the operand (address) field. There is a special name which is self-defining.
Its value is the current contents of the location counter. This special name is given in the description. It may be the dot (.), as in the case of the PDP series, or the asterisk (*), as in the case of the IBM 360, for example.

The following EBCDIC characters may be used in the formation of names and expressions.

   Alphabetic:  Upper case letters A-Z
   Numeric:     Digits 0-9
   Operators:   + - * (plus, minus, multiply)
   Delimiters:  Blanks assumed unless otherwise specified.
                Special character for comment field as specified in the description.

Names may be up to eight characters long. Variables may contain alphabetic and numeric characters, but they must start with an alphabetic character. Constants contain only digits. An expression is a sequence of names separated by the operators +, -, and *, and delimited by blanks. In the mnemonic field, all variables must be operation codes or pseudo orders. In the address field (operand field), all variables must be labels. The assembler evaluates the expression from left to right by combining the values of the names according to the operators.

The most general form of an expression in the address field is N*A±B, where N is an integer and A, B are names, absolute or relocatable. The assembler produces relocation bits with each address, which tell the loader whether or not relocation is to be applied to that particular address. In addition to the value of the name, the name table contains an entry which indicates whether or not a name is relocatable. Any name appearing in the location field of an instruction is relocatable, as are names in the location fields of certain pseudo orders. The pseudo order which can define a non-relocatable name is the EQU pseudo.

An absolute (non-relocatable) address can be constructed from any allowable expression involving numbers and absolute valued symbolic addresses. For example, 20*3-4*A+5 is a valid absolute address if A is an absolute name.
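The left-to-right evaluation of an address expression can be sketched in Python as below. This is an illustration only. It assumes that "left to right" means the operators are applied strictly in the order they appear, with no precedence of * over + and -, which the text does not spell out. It also assumes decimal numbers and the dot as the current-location name. The names `tokenize`, `operand_value`, and `evaluate` are hypothetical, and relocation is ignored.

```python
import re

# Names start with a letter; numbers contain only digits; "." is the
# special name for the current contents of the location counter.
TOKEN = re.compile(r"[A-Z][A-Z0-9]*|[0-9]+|[+\-*.]")
OPERATORS = {"+", "-", "*"}


def tokenize(expr: str) -> list:
    """Break an expression into names, numbers, and operators."""
    tokens = TOKEN.findall(expr)
    if "".join(tokens) != expr:
        raise ValueError("INVALID CHARACTER IN FIELD")
    return tokens


def operand_value(token: str, names: dict, location: int) -> int:
    if token in OPERATORS:
        raise ValueError("INVALID CHARACTER IN FIELD")
    if token == ".":
        return location
    if token.isdigit():
        return int(token)
    if len(token) > 8:
        raise ValueError("FIELD EXCEEDS LENGTH")
    if token not in names:
        raise ValueError("NAME UNDEFINED")
    return names[token]


def evaluate(expr: str, names: dict, location: int = 0) -> int:
    """Evaluate operand (operator operand)* strictly from left to right."""
    tokens = tokenize(expr)
    if len(tokens) % 2 == 0:  # empty, or ends with a dangling operator
        raise ValueError("INVALID CHARACTER IN FIELD")
    value = operand_value(tokens[0], names, location)
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        if op not in OPERATORS:
            raise ValueError("INVALID CHARACTER IN FIELD")
        rhs = operand_value(operand, names, location)
        if op == "+":
            value += rhs
        elif op == "-":
            value -= rhs
        else:
            value *= rhs
    return value
```

With B defined as 10, `evaluate("B+1", {"B": 10})` gives 11. An expression that uses a name not yet in the table raises the NAME UNDEFINED diagnostic, which is what makes a forward reference in an EQU illegal during Pass I.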
A relocatable address can be constructed by adding any absolute amount to, or subtracting it from, a relocatable name. For example, if B is a relocatable name then B-4 is also relocatable, as is B+20.

There are cases where an expression like B-A+1 is needed, and we would like to arrange that the difference of two relocatable address expressions is an absolute expression. In this way, the expression A-B+C would be legal unless A and C were relocatable and B absolute. Although the address A+C may be needed by the programmer in some cases where both A and C are relocatable, it is not possible for the loader to handle it with only one relocation bit. With only one bit, the loader can apply either single relocation or none. Even with single relocation it is possible to allow expressions such as 2*A-B where both A and B are relocatable, since the total relocation is still single.

The ASSEMBLER assumes two relocation bits and handles double, single, or no relocation. It restricts the general expression N*A±B to the values in the following table, producing the corresponding relocation.

Table 1. RELOCATION RESULTING FROM THE EXPRESSION N*A±B

   A     B     OP    N         Relocation
   Rel   Rel   +     ≤ 1       Double
   Rel   Abs   +     1, 2      Single or Double
   Abs   Rel   +     "any"     Single
   Abs   Abs   +     "any"     No Relocation
   Rel   Rel   -     1 to 3    No Relocation, Single or Double
   Rel   Abs   -     1, 2      Single or Double
   Abs   Rel   -     "any"     Single
   Abs   Abs   -     "any"     No Relocation

The restrictions on the integer N are imposed by the requirement of at most double relocation. Where N is "any", its value must be such that the limitations of the particular computer are not exceeded.

When an expression is evaluated, a check is made to find whether or not the value of the expression is within the current core memory block, referred to as a "page." If it is, the same-page bit of the assembled instruction is set to one. If this bit is zero, the instruction addresses "page" zero: any location in "page" zero can be addressed directly from any page of core memory.
All other core memory locations can be addressed indirectly by setting the indirect bit. The rest of the bits then specify the location, in the current "page" or in "page" zero, which contains the full absolute address of the operand. Indirect addressing is sensed by the presence of a special character, "I" in the case of the PDP-8, preceded and followed by a blank. This special character is given in the description of the particular computer.

6. PSEUDO ORDERS

Pseudo orders are operation codes which do not represent actual machine instructions, but are simply signals to the assembler to take certain action. The pseudos provided by the ASSEMBLER, together with their effects, are given below.

Data Loading:

DC - Define Constant
Define the optional symbol in the location field to have a value equal to the current contents of the location counter. Then place the value of the expression in the address field into the memory location signified by the current contents of the location counter. It is necessary to determine how many words of storage will be occupied by the data given in the pseudo, so that the location counter can be incremented accordingly during Pass I. In scanning the field, the first character determines the type of field following. It may be preceded by a repetition factor. For some characters, an L may follow with a length specification. Finally the data appears inside quotation marks ('). During Pass I the program determines the boundary alignment of each field in the DC in order to calculate the location counter change. During Pass II the location field is ignored, but the address field is converted into binary. At the same time the location counter is increased once for each word produced.

Location Counter Control:

ORG - Set the location counter to a specific quantity
Sets the location counter to the value specified in the address field, in Pass I and Pass II, so that the next instruction read will be loaded at this address.
If a name appears in the location field, it is put into the name table after the location counter has been changed, and is given a value equal to the new contents of the location counter.

BSS - Block Started by Symbol
Any name in the location field must be entered into the name table before the location counter is incremented.

BTS - Block Terminated by Symbol
Any name in the location field must be equated to the location of the last word of the block, that is, to one less than the contents of the location counter after it has been incremented.

As long as the addresses in the address field are purely numeric, there are no problems. However, if a symbolic address is involved, a value has to be assigned to it; that is, it has to be predefined so that the numeric value of the address can be calculated. In Pass I, only those names that appeared before the line currently being examined are in the name table with numeric values. Therefore, names must be defined before they are used in any pseudo order that affects the location counter in a manner dependent on its address field.

DS - Define Storage
Define the optional symbol in the location field to have a value equal to the current contents of the location counter. Then add the value of the (predefined) expression in the address field to the contents of the location counter. DS is similar to DC and the same piece of code is used in both passes. The only difference is that DS does not actually produce object, whereas DC must have the data specified in the address field.

Name Table Entry:

EQU - Symbolic Equivalence
Define the name in the location field to have a value equal to that of the (predefined) expression in the address field.

Others:

END - End Assembly
Define the optional symbol in the location field to have a value equal to the current contents of the location counter. If the address field is non-empty, its value will be punched on a binary transfer card as the starting address of the program.
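The Pass I effect of these pseudo orders on the location counter and the name table can be summarized in a short Python sketch. It is an illustration under assumptions, not the ASSEMBLER's code: the function name `apply_pseudo` is hypothetical, the address field is taken as already evaluated to an integer, BSS and BTS are assumed to advance the location counter by that value, and DC is omitted because the size of its data depends on the constant's type and length.

```python
def apply_pseudo(op, label, value, location, names):
    """Apply one pseudo order in Pass I and return the new location counter.

    `value` is the already evaluated address field (None if the field is empty);
    `names` is the name table, here a plain dictionary that is updated in place.
    """
    def define(name, address):
        if name:
            if name in names:
                raise ValueError("NAME DOUBLE-DEFINED")
            names[name] = address

    if op == "ORG":
        location = value             # change the counter first,
        define(label, location)      # then enter the name with the new value
    elif op == "BSS":
        define(label, location)      # name gets the first word of the block
        location += value
    elif op == "BTS":
        location += value
        define(label, location - 1)  # name gets the last word of the block
    elif op == "DS":
        define(label, location)
        location += value            # reserve storage, produce no object
    elif op == "EQU":
        define(label, value)         # the location counter is not affected
    elif op == "END":
        define(label, location)
    else:
        raise ValueError("not a pseudo order handled by this sketch")
    return location
```

The contrast between BSS and BTS is only the moment at which the name is entered: before the counter moves for BSS, after it for BTS.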
7. DESCRIPTION OF THE MACHINE

This is the part of the program which makes the ASSEMBLER work for many types of machines. At the end of the ASSEMBLER, and before the END pseudo which signals the end of the process, the user has to include a small deck of cards which define his particular machine to the program. On each card there are three fields, punched starting at columns 1, 10, and 16, respectively.

The first field is the name used for this particular piece of information throughout the program. In order to define another machine, therefore, only this small deck of the description has to be changed, with the names in the program and in the description remaining the same. The second field defines the first field as a constant or as equivalent to the third field. The second field also remains unchanged when another machine is defined. The third field is the actual description and it is the one which changes when the machine changes.

For example, if the slash (/) is used to specify "comment" for one user and the semicolon (;) for a second, then the third field will be C'/' for the first user and C';' for the second, the C standing for character and the actual character following, enclosed in quote marks. Also, if the mnemonic starts at column 6 in the program of one user and at column 10 in the program of a second, the third field will be 6 for the first and 10 for the second. Similarly, if the character I signifies indirect addressing for the one and the asterisk (*) for the other, the third field will be CL2'I' for the first and CL2'*' for the second, with CL2 signifying that there will be two characters in the quote marks, the first specifying the indirect addressing and the second being the character blank.

The following list contains the use of the names in the description and the form in which each definition is given.
Alpha: Character specifying comment, given in the form
      ALPHA    DC    C'/'

Beta: Starting column for mnemonic, given in the form
      BETA     EQU   6

Gamma: Length of mnemonic, given in the form
      GAMMA    EQU   8

Delta: Starting column for location name, given in the form
      DELTA    EQU   1

Epsilon: Length of location name, given in the form
      EPSILON  EQU   9

Eta: Starting column for address, given in the form
      ETA      EQU   10

Theta: Length of address, given in the form
      THETA    EQU   73-ETA

Iota: Character specifying indirect addressing, given in the form
      IOTA     DC    CL2'I'

Kappa: Character specifying current address, given in the form
      KAPPA    DC    C'.'

Lamda: Length of operation code in number of bits, given in the form
      LAMDA    EQU   3

Mi: Length of operation code in number of bits, given in the form
      MI       EQU   12

Pi: Length of operation code in case of microprogramming in number of bits, given in the form
      PI       EQU   12

8. DIAGNOSTIC MESSAGES

When the ASSEMBLER detects an error, or when the user should be notified by means of a warning, it prints diagnostic messages to help the programmer correct the cause of the error. If an error is detected in Pass I, a flag is set so that the assembly will not continue to Pass II. The diagnostic messages and their meanings are listed below.

'MNEMONIC DOUBLE-DEFINED'. This message appears, after the mnemonic has been printed, when a variable is given more than once as input for building the mnemonic table. It is not a critical error unless the user defines two different operation codes with the same name, so the flag is not set and assembly continues. It is simply given as a warning.

'INVALID MNEMONIC'. The mnemonic contains an invalid character. The flag is set.

'NAME UNDEFINED'. During the evaluation of an expression in the address field, a name was encountered which was not defined in the program. Note that names in some pseudo orders must be predefined.

'SHOULD BE MORE ENTRIES IN THE TABLE'.
This message is printed when, during the mnemonic look-up procedure, the pointers are traced from the root of the tree down to the branches and either the smallest entry is found but the search argument is smaller, or the largest entry is found but the search argument is larger. In other words, the mnemonic is not found in the mnemonic table built by the programmer. The flag is set in this case.

'FIELD EXCEEDS LENGTH'. This message is printed when a name exceeds the limits of the location field (more than eight characters), when a number in the address field exceeds the machine limitations, when a name in the address field is too long, or when too large a number is added to the current contents of the location counter. In other words, this same message is printed in all cases of violation of length. The flag is set in all cases except when the scanning of the address field takes place only during Pass II. In that case the assembly will probably continue only for a while, because the overlong names will not be found in the table and the overflow of the location counter will be caught at a later point in the program.

'INVALID CHARACTER IN FIELD'. This message is printed when an invalid character is found in a field, for instance in the location name, as the first character of a name, in a number in the address field, or in a name in the address field. The flag is set if the error is discovered during Pass I; otherwise something similar to the case producing the previous message will happen.

'NAME DOUBLE-DEFINED'. A name in the location field is used more than once. An entry stored twice in the mnemonic table was not a critical error, but in this case the flag is set and assembly will not continue to Pass II.

'OFF-PAGE REFERENCE'. The value of the address field of a memory referencing instruction is neither an address in "page" zero nor an address in the current "page".
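The same-page test behind the 'OFF-PAGE REFERENCE' message can be sketched for the PDP-8 as follows. The constants are the PDP-8's own layout (a 12-bit word with a 3-bit operation code, an indirect bit, a page bit, and a 7-bit address within a 128-word page); in the ASSEMBLER these quantities would come from the machine description rather than being fixed. The function name `memory_reference` is hypothetical, and the sketch is an illustration rather than the thesis's code.

```python
PAGE_SIZE = 0o200     # 128 words per PDP-8 page
PAGE_BIT = 0o200      # same-page bit of a memory reference instruction
INDIRECT_BIT = 0o400  # indirect addressing bit


def memory_reference(opcode: int, target: int, location: int,
                     indirect: bool = False) -> int:
    """Assemble a 12-bit PDP-8 memory reference word.

    `location` is the address of the instruction itself and `target`
    the evaluated address field.
    """
    word = (opcode << 9) | (target % PAGE_SIZE)
    if target // PAGE_SIZE == location // PAGE_SIZE:
        word |= PAGE_BIT                  # operand is in the current page
    elif target >= PAGE_SIZE:
        raise ValueError("OFF-PAGE REFERENCE")
    # otherwise the operand is in page zero and the page bit stays zero
    if indirect:
        word |= INDIRECT_BIT
    return word
```

A reference to a location outside both the current page and page zero has to go through a pointer word that is itself reachable, with the indirect bit set on the instruction.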
At the end of the assembly the name table is printed in alphabetical order together with information for cross-reference, namely length, value, and definition references.

BIBLIOGRAPHY

Gear, C. W. "Machine Language and Systems Programming," CS 201 Class Notes, Department of Computer Science, University of Illinois, Urbana, Illinois, September 1967.

Powers, M. "PDP-8 Assembler," Memorandum 12, University of Michigan, Ann Arbor, Michigan, November 1967.

Small Computer Handbook, Digital Equipment Corporation, Maynard, Massachusetts, 1968.