Digitized by the Internet Archive in 2013
http://archive.org/details/sequencedetermin416schw

Report No. 416

SEQUENCE DETERMINATION FROM FRAGMENT DATA

by

John C. Schwebel

November 16, 1970

COO-1018-1219

Department of Computer Science
University of Illinois
Urbana, Illinois 61801

*Supported by Contract AT(11-1)-1018 with the U.S. Atomic Energy Commission.

ACKNOWLEDGMENT

The author wishes to thank Professor Bruce H. McCormick, who introduced him to the sequence reconstruction problem area, formulated the framework of the specific problem considered herein, and contributed frequent stimulation and many ideas for the preparation of this paper. Thanks are also extended to Mrs. Betty Gunsalus for typing and Mr. Stanley Zundo for preparing the figures in this report.

1. Introduction

The general problem considered here is that of determining or reconstructing a completely ordered linear string (sequence) of terminals, given the results of some fragmentation of the string. The degree of order assumed in the fragmented string, i.e. the properties of the input data, is crucial both in defining the class of problems which can be solved by a given method and in relating the abstract problem to physical applications. Here, we will consider a class of problems for the idealized case of unambiguous fragment identification. An algorithm and program will be given to solve this problem.

The application of sequence reconstruction which motivated our interest in the problem is the area of monomer sequencing in a polymer. The goal here is to determine the linear ordering of the (monomer) subunits in natural polymers, which are chains of monomers configured into complex three-dimensional structures.
Two examples which have received the most attention are amino acid sequencing in proteins and nucleotide sequencing in deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). For an introduction to polymers see Natural High Polymers [1]. Reference [2] attempts to collect all extant material on protein and nucleic acid sequences, and will be published annually. References [3] and [4] contain specific algorithms for special cases of sequence reconstruction and will be discussed later.

Another possible area for the application of sequence reconstruction is coding theory or cryptography. Many copies of a desired message could be fragmented and then transmitted. In this case, errors in the communication channel could produce the ambiguous fragment identification class of problems.

2. Problem Definition

Let S be a string over some finite alphabet, V. That is, S is an ordered string of characters (terminals) which belong to V. Assume that S has been broken up into a set of substrings and that the order of the characters within each substring is not known. An unordered substring will be called a fragment. The general problem considered here is to determine the original sequence of characters in S given some complete sets of fragments from S.

Definitions and Notation

The lowest level elements, i.e. those which are never subdivided, are called terminals. Terminals will be represented by upper case alphanumeric characters.

A sequence is an ordered set of terminals with repetitions allowed. A sequence will be represented as a string of terminals separated by dashes.
Example: T-H-I-S-I-S-A-S-E-Q-U-E-N-C-E

A fragment is an unordered set of terminals with repetitions allowed. A fragment will be represented as a string of concatenated characters.
Example: THISISAFRAGMENT

A chain is an ordered set of fragments. A chain will be represented as a string of fragments separated by dashes.
Example: THIS-ISA-CH-AIN
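The notation above maps naturally onto a small data structure. The following Python sketch is an illustration only and is not part of SEQ1: the function names are hypothetical, and a fragment is modeled as a multiset (a `Counter`) because its internal order is unknown. It also counts how many distinct orderings a chain leaves open.

```python
from collections import Counter
from math import factorial


def parse_chain(text):
    """Split a chain such as 'THIS-ISA-CH-AIN' into its fragments.

    Each fragment becomes a Counter (a multiset of terminals), because
    the order of terminals inside a fragment is unknown.
    """
    return [Counter(part) for part in text.split("-")]


def is_sequence(chain):
    """A chain is a sequence when every fragment holds one terminal."""
    return all(sum(fragment.values()) == 1 for fragment in chain)


def orderings(chain):
    """Count the distinct sequences that the chain could stand for.

    A fragment with n terminals has n! orderings, divided by k! for
    every terminal that is repeated k times within the fragment.
    """
    total = 1
    for fragment in chain:
        count = factorial(sum(fragment.values()))
        for multiplicity in fragment.values():
            count //= factorial(multiplicity)
        total *= count
    return total


chain = parse_chain("THIS-ISA-CH-AIN")
print(is_sequence(chain))                    # False
print(orderings(chain))                      # 24 * 6 * 2 * 6 = 1728
print(is_sequence(parse_chain("T-H-I-S")))   # True
```

Modeling fragments as multisets keeps repeated terminals (as in the fragment LL) from being counted as distinct orderings.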
Note: If each fragment in a chain contains only one terminal, the chain is a sequence.

A copy is a collection of chains, either unordered or partially ordered. An unordered copy will be represented as a string of chains separated by asterisks.
Examples: TH-IS*IS*A*C-O-PY
          ASTER-ISKS*ME-AN*NO-ORDER

A partially ordered copy, in which the order of some chains is fixed, will be represented by delimiting the chains with fixed positions by dashes.
Example: THE*CH-AIN*-CHAI-N1-*IS*FIXED

Any copy represents a finite number of distinct sequences. The copy will be called a valid copy for each of the sequences which it represents. With this terminology, summarized in Table 1, we can now restate the sequence determination problem.

Given a sequence S we obtain a number of copies which are assumed to be valid copies for S. The problem is to determine the original sequence S from the set of copies.

It is apparent that the set of all possible sequences for which the set of copies is valid is, in fact, just the intersection of the sets of sequences for which each individual copy is valid. If this intersection contains more than one element, the original sequence cannot be uniquely determined from the input copies.

The program SEQ1 determines the set of possible sequences given a set of input copies without fixed chains. The result is a set of copies which represent all sequences consistent with the input.

Table 1. Syntax For Sequence Notation

<terminal>    := <alphanumeric>
<sequence>    := <terminal> [-<terminal>]...
<fragment>    := <terminal>...
<chain>       := <fragment> [-<fragment>]...
<fixed chain> := -<chain>-
<copy>        := {<chain>|<fixed chain>} [*{<chain>|<fixed chain>}]...

3. Strategy Used

To help visualize the sequence determination problem under consideration, we can picture the n input copies as being lined up vertically, one to a row, on a rectangular board. Each fragment of a copy can be thought of as a piece which may be moved anywhere within its row. Any ordering of the terminals within a fragment is also possible.
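The statement that the solution set is the intersection of the sequence sets of the individual copies can be checked by brute force on a small case. The sketch below is not the SEQ1 algorithm. It simply enumerates every arrangement of the terminals and keeps those for which all copies are valid. The function names and the three example copies of the string O-U-T-P-U-T are assumptions made for this illustration, and each copy is taken to consist of unordered fragments only.

```python
from itertools import permutations


def parse_copy(text):
    """Split an unordered copy such as 'OU * TPU * T' into fragments."""
    return [part.strip() for part in text.split("*")]


def is_valid(sequence, fragments):
    """True if the fragments, in some order and with some internal
    ordering of each, spell out `sequence` exactly."""
    if not fragments:
        return sequence == ""
    tried = set()
    for i, fragment in enumerate(fragments):
        key = "".join(sorted(fragment))
        if key in tried:
            continue                      # identical fragment already tried here
        tried.add(key)
        head = sequence[:len(fragment)]
        if "".join(sorted(head)) == key:  # same terminals, any order
            rest = fragments[:i] + fragments[i + 1:]
            if is_valid(sequence[len(fragment):], rest):
                return True
    return False


def consistent_sequences(copies):
    """Intersection of the sequence sets represented by each copy."""
    terminals = "".join(copies[0])
    candidates = {"".join(p) for p in permutations(terminals)}
    return {s for s in candidates if all(is_valid(s, c) for c in copies)}


copies = [parse_copy(t) for t in ("OUT * PUT", "OU * TPU * T", "O * UTP * UT")]
result = consistent_sequences(copies)
up_to_reversal = {min(s, s[::-1]) for s in result}
print(len(result), len(up_to_reversal))   # 24 12
```

Every valid sequence is also valid when read backwards, so the 24 sequences found here fall into 12 pairs when a sequence and its reverse are counted once. This enumeration grows factorially with the sequence length and is only practical for very short strings.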
The goal is to position the fragments and reorder terminals within fragments such that all the rows spell out the same sequence. All possible sequences must be found.

The strategy used to solve this problem was chosen with the following criterion in mind. As we increase the number of input copies, and consequently the information available to help solve the problem, the algorithm should not be degraded in any way. Because of this criterion, strategies which attempt to compare all copies simultaneously to obtain a sequence were discarded. These strategies can be degraded because of additional storage requirements as more input copies are used. An example of this type of strategy is one in which pieces in a vertical column in all copies are matched up before all pieces in any one copy are ordered.

To use the symmetry of the problem to eliminate choices, the strategy starts from the left end of a copy and attempts to place pieces sequentially towards the right. If one sequence or copy is retained at any stage, then its reverse should also be retained. Comparisons are made to eliminate saving both a copy and its reverse during processing.

The final strategy adopted compares copies in pairs only. The basic matching algorithm compares two copies, copy I and copy J, and determines a set of copies which represent the intersection of the sequences represented by the two copies. At any stage in the process there exists a set of current candidate copies obtained from processing previous input copies, and a set of new candidate copies being formed by comparing a new input copy with the current candidate copies.

The new input copy, J, is compared with each copy, I, in the set of current candidate copies. The set of consistent copies from each I, J comparison is added to the set of new candidate copies. When I has varied over all current candidate copies, the current candidate copies and the input copy J can be deleted from consideration.
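The control flow of this pairwise strategy can be sketched independently of how two copies are actually compared. In the Python sketch below, `process_copies` and `compare_sets` are hypothetical names, and each copy is modeled directly as the set of sequences it represents. That is a deliberate simplification for illustration and not the copy representation used by SEQ1.

```python
def process_copies(input_copies, compare):
    """Fold the input copies into candidate copies, one pair at a time.

    `compare(copy_i, copy_j)` must return the candidate copies that are
    consistent with both arguments (an empty list if there are none).
    """
    current = [input_copies[0]]          # input copy 1 is the first candidate
    counts = [len(set().union(*current))]
    for copy_j in input_copies[1:]:      # J runs over the remaining inputs
        new_candidates = []
        for copy_i in current:           # I runs over the current candidates
            for candidate in compare(copy_i, copy_j):
                if candidate not in new_candidates:
                    new_candidates.append(candidate)
        current = new_candidates         # old candidates and copy J are dropped
        counts.append(len(set().union(*current)))
    return current, counts


def compare_sets(copy_i, copy_j):
    """Stand-in comparison: a copy is a frozenset of sequences."""
    common = copy_i & copy_j
    return [common] if common else []


copies = [
    frozenset({"ABC", "ACB", "BAC", "CAB"}),
    frozenset({"ABC", "BAC", "CBA"}),
    frozenset({"ABC", "CBA"}),
]
final, counts = process_copies(copies, compare_sets)
print(final)    # [frozenset({'ABC'})]
print(counts)   # [4, 2, 1]
```

Only the current candidates and one input copy are held at a time, so storage does not grow merely because more input copies are supplied, and the number of sequences represented never increases from one step to the next.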
The new candidate copies then become the current candidate copies, the next input copy becomes copy J, and the process continues. After each pairwise comparison the number of sequences represented cannot increase. The candidate copies which remain after all input copies have been used represent all sequences consistent with the input.

In the program SEQ1, the first I and J are input copies 1 and 2, and then J is incremented sequentially. However, the choice of I and J, i.e. the ordering of the copies, can affect the algorithm drastically. Thus, a better strategy than that used in SEQ1 should attempt to order the copies to improve efficiency. For example, we would like to pick an I and J which will generate a small set of new candidate copies. One possibility is to define a measure representing the internal ordering of a copy, and to use this measure in a dynamic ordering procedure of the type used in sequential pattern recognition [5].

The general strategy is outlined below. With each pairwise comparison the number of sequences represented is nonincreasing.

1. Choose an I and J.
2. Compare copy I with copy J to find the set of copies consistent with this pair.
3. Replace copy I and J by the new set of copies.
4. Go to 1 unless stopping criterion is satisfied.

Pairwise Copy Comparison

The comparison between two copies, copy 1 and copy 2, is done exhaustively by trying all possibilities of chains from each copy at the current position of the right end of the partially ordered copy. At the start, any chain is chosen to begin the ordering of chains in copy 1 and all chains from copy 2 are tried to match the order of the chain in copy 1. Moving from left to right at any point in the process, there are some initial matching terminals, some terminals from the overlapping part of one chain in one copy, and the chains in each copy which have not been used. A "sewing process" is carried out for the matching of chains.
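A single step of such a sewing process can be sketched for the simplest situation, in which both the overlap and the piece being tried are unordered fragments. The function `sew` and its return convention are assumptions made for this illustration. The handling of partially ordered chains, and the full set of cases that SEQ1 distinguishes, are not reproduced here.

```python
from collections import Counter


def sew(overlap, fragment):
    """Match an unordered `fragment` against an unordered `overlap`.

    Returns (new_overlap, side) on success, or None when neither
    multiset contains the other. `side` tells which copy sticks out
    after the match: 'same' (the copy that owned the old overlap),
    'other' (the copy that supplied the fragment), or 'none'.
    """
    over, frag = Counter(overlap), Counter(fragment)
    if over == frag:                      # the two pieces line up exactly
        return "", "none"
    if not (frag - over):                 # fragment fits inside the overlap
        rest = over - frag
        return "".join(sorted(rest.elements())), "same"
    if not (over - frag):                 # overlap fits inside the fragment
        rest = frag - over
        return "".join(sorted(rest.elements())), "other"
    return None                           # no match: try another piece


print(sew("SMA", "SM"))    # ('A', 'same')
print(sew("A", "ALL"))     # ('LL', 'other')
print(sew("EST", "TES"))   # ('', 'none')
print(sew("LL", "LTE"))    # None
```

When the result is `None`, an exhaustive search would try the next unused piece and back up once none is left.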
A chain is tried in the underlapping copy to match the overlap in the overlapping copy. If a match is found, the chain is added to its copy, the new overlap is calculated and added as a fragment to the new candidate copy, and the process continues. There are 6 cases of successful matches, which are illustrated in Figure 1. At the beginning of the process and whenever a chain is complete, i.e. when there is no overlap, a new chain is started by picking an unused chain in copy 1 as the initial overlap.

A stack is used to save information on each chain being used. The chain information is pushed into the stack in the same order as the chains are selected in the sewing process. Each stack position corresponds to a node on the tree of all possible valid chain choices. Backup to a previous node is done by popping the stack whenever a match fails. See section 4 for an example of the operation of the stack.

[Figure 1. Fragment Sewing Process. For each of the six successful match cases the figure shows the state (candidate chain and input fragment), the addition, and the next state.]

4. Program Description

A computer program, SEQ1, was written in SNOBOL4 to solve the sequence determination problem according to the exhaustive search strategy described above. The program flowchart and program listing are given in Appendix 1.

Input

One program run is the processing done on a set of input copies assumed to be fragmented from the same sequence. The program will execute any number of runs on different input data. The first data card contains an integer, NRUNS, specifying the number of runs. Then for each run the following cards are needed. The first card contains two integers separated by blanks specifying the number of input copies for this run, NCOPS, and the maximum number of fragments in any copy, MAXFRG. The remaining cards contain the input copies for the run.
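The card layout can be made concrete with a short reader. The Python sketch below is hypothetical and is not a transcription of the SEQ1 input code. It assumes that a continuation card (one beginning with a period in column 1) is joined to the preceding card with a blank, and the example deck is invented for illustration.

```python
def read_runs(lines):
    """Parse card images into runs; a run is a list of copies and a
    copy is a list of fragment strings."""
    cards = [line.rstrip() for line in lines]
    nruns = int(cards[0].split()[0])          # card 1: NRUNS
    pos = 1
    runs = []
    for _ in range(nruns):
        ncops, maxfrg = (int(x) for x in cards[pos].split()[:2])
        pos += 1
        copies = []
        for _ in range(ncops):
            text = cards[pos]
            pos += 1
            # A card starting with '.' continues the current copy.
            while pos < len(cards) and cards[pos].startswith("."):
                text += " " + cards[pos][1:]
                pos += 1
            fragments = text.split()          # fragments are blank-separated
            if len(fragments) > maxfrg:
                raise ValueError("copy has more than MAXFRG fragments")
            copies.append(fragments)
        runs.append(copies)
    return runs


deck = [
    "1",
    "3 4",
    "SMA LLT EST",
    "SM ALL",
    ". TES T",
    "S MAL LTE ST",
]
print(read_runs(deck))
# [[['SMA', 'LLT', 'EST'], ['SM', 'ALL', 'TES', 'T'], ['S', 'MAL', 'LTE', 'ST']]]
```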
Each card contains one copy, except that a copy may be continued on additional cards by starting the continue cards with a period in column 1. A copy is composed of fragments separated by blanks. Table 2 shows the input data format.

Table 2. Input Format

Data format for a copy:
  1st card:        <fragment> <fragment> ...
  continue cards:  . <fragment> ...

Input data cards:
  Card No.   Data on card, beginning in column 1
  1          NRUNS
  2          NCOPS MAXFRG
  3, ...     <copy 1> ... <copy NCOPS>
  The format from card 2 on is repeated for each run (run 1, run 2, ..., run NRUNS).

Output

The input copies are always printed out first. Then each candidate copy and its index in the chain array is printed out for each J. Candidate copies which are not saved because they are equal to another copy already stored, possibly with some chains reversed, are also printed out. The candidate copies for the last J represent the final set of sequences consistent with the input.

Three variables, NOTPRT1, NOTPRT2, and NOTSTKP, are flags which control the amount of optional information printed out for each run. This information is primarily useful for debugging.

NOTPRT1 = 0 specifies that every time a chain match is obtained in the pairwise copy comparison routine, the chain which matched, the overlap OVLAP, and the index of the next copy where a matching chain will be sought, NC, are printed out.

NOTPRT2 = 0 specifies that the complete stack contents will be printed out in the backup program each time the backup is within the same chain.

NOTSTKP = 0 specifies that the complete stack contents will be printed out together with the value of I whenever a new candidate copy is found.

Comparison Routine

The pairwise copy comparison routine is the part of the program which attempts to find an unused chain in copy C to match the overlap (OVLAP). There are 6 cases for the output of this routine when a successful match is made. TFRG and TCHN are strings which hold the fragment and chain being added to the right hand end of copy 2 and copy 1 respectively.
C=1 means TFRG is set equal to the OVLAP in copy 2 and an unused chain is searched for in copy 1 and set equal to TCHN. C=2 means TCHN = OVLAP in copy 1 and TFRG = the first unused chain in copy 2. Table 3 shows the values of variables for each of the 6 cases when a match is made between TFRG and TCHN.

A new chain is being constructed in CHAINS. In general, when a match is found the order of the last fragment may be improved and a new overlap may be added to the chain. Slashes, which are eventually replaced by dashes, are used to delimit the overlap sections of a chain and thus show the new order after a match. If no match is found a new TFRG or TCHN is sought. If none are left which are unused and not tried previously in this position, the backup part of the program is entered.

If the matching fragments line up exactly, cases 2 and 5, a chain within the copy is complete and either another chain can be started, at NEWCHN in the program, or a new candidate copy is complete. In the current program a new candidate copy will be saved if no copy has already been saved which can be obtained by reversing some chains of the new candidate. Let the reverse of a chain, α, be indicated by α̂. For example, α*β*δ*γ will not be saved if {α|α̂} * {β|β̂} * {δ|δ̂} * {γ|γ̂} has already been saved.

[Table 3. Values of variables for each of the six cases of a successful match between TFRG and TCHN. The table body is not legible in the scanned source.]

Data Arrays

A stack of two fields is used to store the indices of the chains which are being used in attempting to construct a candidate copy from COPY<1,> and COPY<2,>. The stack is needed so that when an attempt fails we can back up and try new chains until all possibilities have been tried.
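The rule for saving a new candidate copy, stated above in terms of reversed chains, amounts to comparing copies under a normal form in which each chain and its reverse give the same key. The Python sketch below is one way to express that rule. The function names are hypothetical, and the string form of a copy (chains separated by asterisks, fragments by dashes) follows the notation of section 2.

```python
def canonical_chain(chain):
    """Normalize one chain so that it and its reverse give the same key.

    Terminals inside a fragment are sorted because a fragment is unordered.
    """
    fragments = tuple("".join(sorted(f)) for f in chain.strip().split("-"))
    return min(fragments, fragments[::-1])


def canonical_copy(copy):
    """Key for a copy: its chains in order, each taken up to reversal."""
    return tuple(canonical_chain(chain) for chain in copy.split("*"))


def save_if_new(saved, copy):
    """Store `copy` unless an equivalent copy has already been saved."""
    key = canonical_copy(copy)
    if key in saved:
        return False
    saved[key] = copy
    return True


saved = {}
print(save_if_new(saved, "SM-A-LL-T * EST"))   # True
print(save_if_new(saved, "T-LL-A-SM * EST"))   # False
print(save_if_new(saved, "EST * SM-A-LL-T"))   # True
```

In this sketch the order of the chains within a copy is part of the key, so a copy with the same chains in a different order is treated as new.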
The first field, STK<M,1>, holds the index of the copy, either 1 or 2. The second field, STK<M,2>, holds the index or the negative of the index of the chain within the copy. A negative index means the reverse of the chain is being used. Thus, the chain indicated by one stack location is either COPY<STK<M,1>, STK<M,2>> or the reverse of COPY<STK<M,1>, -STK<M,2>>.

As an example, consider the stack contents printed in Figure 3 for the new candidate copy stored in CHAIN<6,>, which are 11,21,22,23,12,24. Since I=3 and J=3, we see that the candidate resulted from comparing CHAIN<3,> with input copy 3 and that the order of chains which matched is as shown below.

   I:   1,1 = SM-A-LL-T                             1,2 = EST
   J:   2,1 = S     2,2 = MAL     2,3 = LTE                      2,4 = ST

Also note that a negative index appears in the stack when the next candidate is found. Since I=3, the index 1-1 shows that the reverse of the chain SM-A-LL-T was tried.

Table 4 shows other arrays used by the program. Refer to Table 5 for a description of the variables used.

Table 4. Data Arrays for Storing Copies and Chains

input copies:      ITREE<i,1>, ... , ITREE<i,NFRG<i>>   for each input copy i = 1, ..., NCOPS
candidate copies:  CHAIN<i,1>, ... , CHAIN<i,NCHN<i>>   for each candidate copy i
J:                 current input copy
I:                 current candidate copy
copy I:            {COPY<1,1>, ... , COPY<1,NENT<1>>}
copy J:            {COPY<2,1>, ... , COPY<2,NENT<2>>}
copy I used flags: {USEDCHN<1,1>, ... , USEDCHN<1,NENT<1>>}
copy J used flags: {USEDCHN<2,1>, ... , USEDCHN<2,NENT<2>>}
complete chains in partial new candidate: {CHAINS<1>, ... , CHAINS<CK>}

Storage of Copies

Most of the storage used is in the CHAIN array, which holds all the candidate copies. The size of the array is MAXCAN, which is set in each run to NCOPS * (MAXFRG + 1). This may be changed in the assignment statement. If MAXCAN is exceeded, the program exits to CHNOV, prints a message and ends. At any point in the program, however, more storage is available by reusing the low-numbered locations of CHAIN, starting at CHAIN<1,>, which hold old candidate copies that can be discarded.
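The two-field stack entries described under Data Arrays can be decoded mechanically. The Python sketch below is an illustration with hypothetical names. It assumes that each printed entry consists of the copy number as the first character followed by the signed chain index, and that an entry starting with 0 is an unused top-of-stack slot.

```python
def reverse_chain(chain):
    """Reverse the order of the fragments in a chain."""
    return "-".join(reversed(chain.split("-")))


def decode_stack(printout, copies):
    """Turn a printout such as '11,21,1-1' into (copy, chain) pairs.

    `copies` maps the copy number (1 or 2) to its list of chains.
    A negative chain index selects the reverse of the chain.
    """
    steps = []
    for entry in printout.split(","):
        entry = entry.strip()
        if not entry or entry[0] == "0":
            continue                      # blank or unused top-of-stack slot
        copy_no, index = int(entry[0]), int(entry[1:])
        chain = copies[copy_no][abs(index) - 1]
        if index < 0:
            chain = reverse_chain(chain)
        steps.append((copy_no, chain))
    return steps


copies = {1: ["SM-A-LL-T", "EST"], 2: ["S", "MAL", "LTE", "ST"]}
print(decode_stack("11,21,22,23,12,24", copies))
# [(1, 'SM-A-LL-T'), (2, 'S'), (2, 'MAL'), (2, 'LTE'), (1, 'EST'), (2, 'ST')]
print(decode_stack("12,24,23,1-1,22,21,01", copies))
# [(1, 'EST'), (2, 'ST'), (2, 'LTE'), (1, 'T-LL-A-SM'), (2, 'MAL'), (2, 'S')]
```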
That is, the CHAIN array can be thought of as circular storage which need never be larger than the maximum over J of the number of old candidates which are unused plus the number of new candidates for an input copy J.

Table 5. Variables for Program SEQ1

The type abbreviations used are I, S, P, and R, which represent Integer, String, Pattern, and Real respectively.

Name      Type                          Description
C         I                             Index of copy we are trying to add chain to
CH        S                             Temp. location for one character
CHAIN     ARRAY(MAXCAN ',' MAXFRG)      Holds all candidate copies
CHAINS    ARRAY(MAXFRG)                 Holds chains of partially built new candidate
CK        I                             No. of chains in CHAINS
COPY      ARRAY(2 ',' MAXFRG)           Holds copies I and J which are being compared
DIFF      S                             Temp. for characters
FRGPAT    P                             Pattern to get one input fragment
FRGS      S                             Temp. to hold input copy
I         I                             Index of candidate copy being compared
IT        I                             Holds index of current size of CHAIN
ITREE     ARRAY(NCOPS ',' MAXFRG)       Holds all input copies
J         I                             Index of input copy being compared
K         I                             Index
LB        I                             Lower index in CHAIN of candidates for last J
M         I                             Pointer to top of stack (STK)
MAXCAN    I                             Maximum no. of candidate copies allowed in CHAIN
MAXFRG    I                             Maximum no. of fragments in any copy
NC        I                             Index of next copy to try to add on to (next C)
NCHN      ARRAY(MAXCAN)                 Holds no. of chains in each copy in CHAIN
NCOPS     I                             No. of input copies for present run
NENT      ARRAY(2)                      No. of chains in COPY<1,> and COPY<2,>
NEWCAND   I                             No. of new candidates found for present J
NFRG      ARRAY(NCOPS)                  No. of fragments in each input copy
NOTPRT1   I                             Flag for no printout of chain, OVLAP, and NC on success
NOTPRT2   I                             Flag for no stack printout on backup
NOTSTKP   I                             Flag for no stack printout for new candidate
NRUNS     I                             No. of program runs with different data
OUT       I                             Temp. flag
OVLAP     S                             Holds overlap after a fragment or chain match
R         I                             Flag to indicate that reverse chain need not be tried
RF        S                             Temp.
SAVE      S                             Holds next input line during run
SF        S                             Temp. to hold overlapping fragment part after a match
STCHN     S                             Temp. to hold TCHN
STFRG     S                             Temp. to hold TFRG
STK       ARRAY(STKMAX ',' 2)           Stack of 2 fields which holds tree backup point
STKMAX    I                             Maximum depth of STK
TCHN      S                             Holds current chain being compared from copy 1
TFRG      S                             Holds current fragment being compared from copy 2
TITLE     S                             Output variable, format 132A1
TRVS      S                             Temp.
TSG1      S                             Temp.
TSG3      S                             Temp.
TSG5      S                             Temp.
TV        I                             Temp. index
UB        I                             Upper index in CHAIN of candidates for last J
USEDCHN   ARRAY(2 ',' MAXFRG)           Indicates which chains in COPY have been used

Examples

Figure 2 shows the output of the program for the first example, which consists of seven input copies representing fragmentation of a protein which is twenty-one amino acids (terminals) long. No extra debugging output is printed since the variables NOTPRT1, NOTPRT2, and NOTSTKP were all set to 1. After six input copies have been processed unambiguous results are obtained, since the set of candidate copies consists of one completely ordered sequence. The complete program executed 15,935 SNOBOL4 statements for this run.

Figure 3 shows an example with intermediate stack printout obtained by setting NOTPRT2 and NOTSTKP to 0. The initial string was S-M-A-L-L-T-E-S-T. The stack contents are dumped on each backup within the same chain. The stack printout following each candidate printout shows the order of all the chains in copies I and J which was used to obtain the candidate. In this example there are two copies in the final set of candidate copies. The first is contained within the second, so that the two sequences represented by the copy in CHAIN<6,> (and their reverses) are all sequences consistent with the input copies. This program executed 7,^5 SNOBOL4 statements.

Figure 4 illustrates the most complete output possible, with NOTPRT1, NOTPRT2, and NOTSTKP all set to 0. The initial string which was fragmented was O-U-T-P-U-T.
The final set of consistent sequences can all be represented by the copy 0-U-T * PUT in CHAIN<8,> . This gives 2 • 3 I =12 total possible sequences consistent with the input copies . This example illustrates that the program need only print out the copy in CHAIN<8,> if it eliminated all the other candi- dates which were contained in it from consideration. This program executed 5,8l8 SN0B0IA statements. Another example tried was six randomly fragmented copies from a sequence of transfer RNA of length thirty. The sequence was G -G -G -C-G -U-G -U-M-G -C-G -C-G -U-A-G -U- C-G -G -U-A-G -C-G -C-M-C-U. The program terminated due to insufficient storage for candidates after generating fifty-four candidates from the first I-J copy pair, -19- and after execution of 53,582 statements. It is apparent that the exhaustive strategy is too slow for this case. One important factor here was the small size of the terminal alphabet which allows many possible fragment matches. -20- COPY 1 IS COPY -> IS COPY 1 IS COPY 4 IS COPY 5 IS COPY 6 IS COPY 7 IS GIVf? * QCC * ASVCS * LOQ * LEN * CCN GIVEQ * CCJLSV * rsi no ♦ iFNfi * rK| GI * VFOCCA * SVCSLO * QLEN * OCN GIVE * QCCAS * VCSLO * CLENO * CN G * IVEQ * CCASVC * SLOQL * ENOC * N GIV * EQC * CASVCSL * OQLE * NOCN GIVE * OCC * ASVCSLO * QLENOCN GIVE-Q-CC-AVS-CS-LOQ * LEN-O-CN IS NEW CANDIDATE COPY IN CHAIN<2, > NEW CAND. NOT SAVED IS G I VE-Q-CC-AVS-C S-LOQ * CN-Q-LEN MATCHES #2 NEW CAND. NOT SAVED IS L OQ-CS-AVS-CC-Q-G IVE * LEN-O-CN MATCHES #2 NEW CAND. NUT SAVED IS LOQ-C S-AVS-CC-Q-G I VE * CN-O-LEN MATCHES #2 LFN-C-CN * GIVE-Q-CC-AVS-CS-LOQ IS NEW CANDIDATE COPY IN CHAINO, > NEW CAND. NCT SAVED IS LEN-O-CN » I HC-PS- Av/S-rr-Q-G I V F. MATCHFS #3 NFW CAND. NOT SAVED IS CN-O-LEN * GIVE-Q-CC-AVS-CS-LOQ MATCHES #3 NFw CAND. NCT SAVED IS CN-O-LEN * LOC-CS-AVS-CC-Q-GIV E MATCHES #3 FNO OF CANDIDATES FOR J = 2 GI-VF-C-CC-A-VS-CS-LC-C-LEN-O-CN IS NEW CANDIDATE COPY IN CHAIN<4, > NCW CAND. 
NOT SAVED IS -Gl^V£-Q_-il£=A-rVS-£S.-JJl-CrJ-Eii-a-C^ MATCHES #4 END OF CAND I DATES FOR J = 3 GI-VF-Q-CC-A-S-V-CS-LO-Q-LFN-O-CN IS NEW CANDIDATE COPY IN CHAIN<5, > END CF CANDIDATES FOR J = 4 G-I-VE-Q-CC-A-S-V-C-S=:La-a-L-£RrQ-£=fil_LS-Al£M_CANniDA.I£._CDEY. IN CHAIN<6 , > (,-I-VF-Q-r.C-A-S-V-C-S-LC-C-L-N-E-O-CN IS NEW CANCIDATE COPY IN CHAIN<7, > tUD OF CANDIDATES FOR J = 5 G-I-V-fc-Q-C-C-A-S-V-C-S-L-C-C-L-E-N-O-C-N IS NEW CANDIDATE COPY IN CHAIN<8, > FND CF CANDIDATES FOR J = 6 G-I-V-F-Q-C-C-A-S-V-C-S-L-O-Q-L-F-N-O^C-N IS NEW CANCIDATE COPY IN CHAIN<9, > NO OF CANDIDATES FOR J = 7 Figure 2. Protein Sequence Determination -21- COPY 1 I S SMA * LLT * EST COPY 2 IS £rt ♦. . AN * TFS * T . fOPY 3 IS S * MAL * LTE * ST SM-A-LL-T-ES-T IS NEW CANDIDATE COPY IN CHAIN<2, > STK CONTENTS: I I , 2 1 t 2 2 » 1 2 * 2_3 * 1_3 ^ 2_4__* , I = STK CONTrNTS: 11,21,22,12,23,13,24, UK CONTENTS: 11,71,27,17.73.13. STK CONTENTS: I 1 , 7 I , 2 2 , i 2 t • 2 3 , SM-A-LL-T * EST IS NEW CANDIDATE COPY LN CHAIN<3* > STK CONTFNTSJ 11, 21, 22, 12,24, 13,23, , I ■ = 1 STK CONTENTS: 11,21,22,12,24,13,23, STK CONTENTS: 11,21,22,12,24, STK CONTENTS: 1 L , 2J. ,.2.2_ulJ_, STK CONTENTS: 11,21,22, STK CONTENTS: 11,21, NEW CANO. NOT SAVED IS T-LL-A-SM * EST MATCHES #3 STK C INTENTS: 12,24,22,11,21,13,23, , I ' ■ 1 STK CONTENTS: 12,24,22,11,21,13,23, STK CONTCNTS: 12,24*2 2__*_ 1. 1. i. 2. 1 , . ... . . STK CONTENTS: 12,24,22,11, STK C JNTENTS: 12,24,22, STK CONTENTS: 12,24, EST * SM-A-LL-T IS NEW CANDIDATE COPY IN CHAIN<4, > STK CONTFNTS: 13,23,11,21,22,12,24, , I = S1K CJNTENTS: 1 3 , 2 3 ._, L 1 T 7 1 . 2 2, 12,24, UK CONTENTS: 13,23,11,21,22,12, STK CONTENTS: I 3 , 2 3 , 1 1 , 2 1 , 2 2 , STK C INTENTS: 13,23,11,21, NEW CAN?. NOT SAVED IS EST * T-LL-A-SM MATCHES #4 STK CONTENTS: 13,23,12,24,22,11,21, , I = ■ 1 STK CONTENTS: 13,23,12, 2 4 , 2_ 2_* JL. 1 . «_ 2 L , STK CONTENTS: 13,23, 12,24,22, 11, STK CONTENTS: 13,23,17,24,22, STK CONTFNTS: 13,23, 12,24, STK CONTENTS: 13,23, NEW CAND. 
NOT SAVED IS T-ES-T-LL- A-SM MATCHES #2 STK CUNTENT5: 1 3 , 2 4._, 2- 3_*-l_2. »_ 2 ._2_»...i. 1 * 2 1. , , I i = 1 STK CONTENTS: 13,24,23,12,22,11,21, STK CONTENTS: 13,24,23,12,22,11, STK CONTENTS: 13,24,23,12,22, STK CONTENTS: 13,24,23,12, STK CONTFNTS: 13,24,23, STK r ONTENTS: 13,24, FND OF CANDIDATFS EO° J = 2 s _m_ a _ l _ l _t-e-S-T IS NFW CANDIDATE COPY IN CHAIN<5, > STK CONTENTS: 11,21,22,23,24,01, 1= = 2 STK CONTENTS: 11,21,22,23,24, STK CONTENTS :._1 l , 2_ 1 * _2 „2-«_.2 _3_.*. ... STK CONTENTS: 11,21,22, STK CONTENTS: 11,21, S-"-A-l -L-T-F-ST IS NEW CANDIDATE COPY IN CHAIN<6, > STK CONTENTS: 11,21,22,23,12,24,01, I = 3 STK CONTFNTS: 11,21,22,23,12,24, STK CONTENTS: LI, 21.* 22, 23, 12, STK CONTENTS: 11,21,22,23, STK CONTENTS: 11,21,22, STK CONTENTS: 11,21, Figure 3. Program Example with Intermediate Stack Output -22- - Sheet 1 of 2 STK CJNTFNTS: 12,21 r 2 3 , STk CONTF-NTS : 12,21 Ml a CAND. NOT SAVED IS ST-E-T-L-L-A-M-S MAJC.HES #6 st k CONTENTS: 12,24 ,23,1-1,22 ,21,01 STK C'lNTENTS: 12,24 ,23, 1-1,22 ,21, STK CONTFNTS: 12,24 , 2 3 , 1 -I , 2 ? , STK CJNTENTS: 12,24, ,23,1-1, siK C 1NTFNTS: 12,24 , 2 3 , STK CJNTENTS: 12,24 i STK CJNTENTS: 11*21 ,23, STK CJNTCNTS: 11,21 NEW CANH. NOT SAVED IS ST-E-T-L-L-A-M-S MATCHFS #6 STK CONTENTS: 11,24, 2 3,1-2,22 ,21,01 STK CONTENTS: 11,24, ,23,1-2,22 ,21, STK CONTENTS: 11, 2 4, 2 3,1-2,22 ., — STK CONTENTS: I 1 t ?. 4 , 2^, I -2 , STK C STENTS: 11,24, ,23, SI k CONTENTS: 11,24, NE •' CAN'T. NU1 " SAVED IS S-M-A-L-L-T-F-ST MATCHES #6 STK C INTENTS: 12,21, ,22,23,11, 2 4,01 STK CONTENTS: 12,21, 2 2,23,11, 2 4 , STK CONTENTS: 12,21, ,22,23,11, STK (. INTENTS: 12,21, 2 2,23, STk c INTENTS : 12,21, 2 2 , STK C INTENTS: 12,21, FNT ^ r C WHDATFS FOP J = 3 I = 3 I = 4 I = 4 Figure 3. 
Sheet 2 of 2 -23- rnpy i is COPY 2 IS COPY 3 IS (JUT * PUT QU * TPU * T * UTP * UT C0PY<2,1> = COPY<2,2> = rnPY = CCPY<2t3> = OU-T-OU-T STK CONTENTS: STK CONTENTS: STK CONTENTS: 1 STK CONTENTS: I COPY<2,3> = T IS Cf;PY<2,2> = too I OU-T * uiJT IS N STK C INTENTS: 1 STK CONTENTS: 1 STK CONTENTS: 1 STK CONTENTS: 1 COPY = T IS C0PY^?,1> = JU IS CUPY<2,2> = TPU I NEW CAND, NUT S STK CONTENTS: 1 STK CUNTENISJ _1_ STK CONTENTS: 1 STK CONTENTS: 1 Cf>PY = TPU I COPY<2,l> = OU IS COPY<2,3> = T IS PUT * OU-T IS N STK CONTENTS: 1 STK r r :MTEJTS: 1 STK CONTENTS: I COPY = T IS r.r'PY = OH IS NEW CAND. NUT S OU IS TPU I PUT I I IS. IS NEW CANDIDATE COPY IN CHAIN<2, > I 1 USED HERE OVLAP = T NC = 2 S USED HERE OVLAP = PU NC = X S USED HERE OVLAP = T NC = 2 USED HEP. E QVLAP = NC = 2 2,23 2,23 2 , JL--Q 1,21,22, 1,21 ,22, 1,21,22, 1,21,22, USED HERE _QVLAP_= NC = 2 S USED HERE OVLAP = NC = 2 EW CANDIDATE COPY IN CHAIN<3, > 1,21,23,12,22,0 1 = 1 I = 1 I 2 2 2 1 , 2 1,23, 1,21,23, 1,21, USED HERE OVLAP = OU NC = 2 USED HERE OVLAP = NC = 2 S USED HERE OVLAP = NC = 2 WED IS T-QU * PUT MATCHES #3 1,23,21,12,22,0 , 1 .23.21. 12. 2 _2__, I = I STK CONTENTS: 1 STK CONTENTS: 1 STK CONTENTS: I STK CONTENTS: 1 C0PY<2,3> = T IS CQPY<2,2> = TPU 1 C0PY<1, 1> = OUT I C0PY<2,1> = OU IS NEW CAND. NOT S STK CONTENTS: 1 STK CONTENTS: 1 STbL rnNTFNTS: l STK CONTENTS: STK CONTENTS: 1,23,21, 1,23, S USFD HERE OVLAP = NC = 2 USED HERE OVLAP = T NC = 2 USED HERF OVLAP = NC = 2 EW CANDIDATE COPY UN_ _£HA-LNj£^ ± . 2,22,11,21,23,0 , 1=1 2,22,11,21,23, 2,22,11,21, USED HERE OVLAP = OU NC = 2 USED HERE OVLAP = NC = 2 AVED IS PUT * T-OU MATCHES #4 2 3,21,0 , 1 = 1 2,22,11, 2,22, 11,23, 2,22, 11,23, 2,22, USED HERF OVLAP i USED HERE_QV_LAP = S USFD HFRF OVLAP = USED HERE UVLAP = AVED IS T-PU-T-OU 2,23,22, 11 2,23,22, 11 2- . 2 3 . 
2 2 , 1 1 2 I PU NC = 2 = I._NC .= _!_ OU NC = 2 NC = 2 MATCHES #2 ,21,0 ,21, I = I 12,23,22, 12,23, END OP CANDIDATES FOR J = 2 U-T-PU-T NC = = U-T NC = 2 CHPY<2,1> = n IS USFD HERE OVLAP = CL'PY<2,2> = UTP IS USED HERETO VL A P C0PY<2,3> = UT IS USED HERE OVLAP = NC = 2 0-U-T-p-U-T IS NEW CANDIDATE COPY IN CHAIN<5, > STK CONTENTS: 11,21,22,23,01, 1= Figure k. Program Example with Complete Output -2k- - Sheet 1 of "3 STK CONTENTS: 11,71,27,23, STK CONTFNTS: 1 1 , 2 I , 2 7 , COPY<2,3> = UT IS LLSLD HERE QVLA£ _= Pli-I NC = 2 COPY = UTP IS USEO HERE OVLAP = NC = 2 Ci-U-T-PU-T IS NEW CANDIDATE COPY IN CHAIN<6, > STK CONTENTS: 11,71,73,72,01, 1=2 STK CONTENTS: 11,21,23,22, STK CONTENTS: 11,21,23, STK CONTENTS: i 1 , 2 L , . r 'PY = J IS USED HFRE OVLAP = U-T NC = 7 CO°Y<2,2> = UTP IS USED HERE OVLAP = P NC = 1 C0PY<1,?> = PUT IS USED HFRE OVLAP = UT NC = 2 CUPY<2,3> = UT I S USED HERE OVLAP = NC = 2 >J-T-P-or IS NEW CANDIDATE COPY IN CHAIN<7, > STK CONTENTS: 11.21,22.12.23,0 . 1=3 STK CONTENTS: 11,2 1,22,17,23, STK CONTENTS: 11,21,22,12, STK CONTENTS: 11,21,22, COPY<2,3> = UT IS USED HERE OVLAP = NC = 2 CnPY<7,?> = UTn IS USED HFRE OVLAP = NC = 2 >U-T * PUT IS NEW CANDIDATE COPY IN CHAINO, > STK CONTENTS: 11,21,23,12,22,0 , [=3 STK CONTFNTS: 11,21,23,12,22, STn CONTFNTS: 1 1,71,73, STK C INTENTS: 11,21, C r PY = UT" IS USEO HFRE OVLAP = NC = 2 C0PY<2,1> = IS USED HERE OVLAP = U-T NC = 2 r»PY = OT IS USFO HERE OVLAP = NC - 2 U 1JT * i-J-T IS NEW CANDIDATE COPY IN CHAIN<9, > STK r .rr jts: 12,72,11,2 1,23,0 , [ = 3 STK CONTFNTS: 17,77,11,21,23 STK CONTFNTS: I 7,77,11 ,21, STn CONTENTS: 17,22, C' = UT IS USED HERE OVLAP = P NC = 2 C0PY = UTP IS USED HFRF OVLAP = UT NC = 1 Cf?PV = Qii-T IS USED HFRF OVLAP = NC = 2 Cf:PY<7,l> = ) IS USED HERE OVLAP = NC = 2 \F*' CANO. 
NOT SAVED IS UT-P-T-U-0 MATCHES «7 STk CONTFNTS: 12,23,22,1-1,21,0 , 1=3 STK CONTFNTS: 17,73,27,1-1,71, STK CONTFNTS: 17,73,22,1-1 STK CONTFNTS: 1 7,73,77, STK CONTENTS: 1 2 , 2 3 , COPY - UTP IS USFO H-RE OVLAP = NC = 2 CGPY<2,1> = IS USED HERE OVLAP = U-T NC = 2 COPY<7,3> = UT IS USED H^RF OVLAP = NC = 2 NEW CAND. NOT SAVED IS PUT * O-U-T MATCHES #9 STK CONTENTS: 11,22,17,71,23,0 , 1=4 STK CONTENTS: 11,22,17,71,23, STK CONTFNTS: 11,72,12,21, STK CONTENTS: 11,22, COPY<2,3> = UT IS USED HERE OVLAP = P NC = 7 C0PY<7,2> = DTP IS USED HERE OVLAP = UT NC = 1 CQPY<1,2> = OU-T IS USED HER^ OVLAP = NC = 2 CPPY<2,1> = IS USED HERE OVLAP = NC = 2 NFw CAN). NOT SAVED IS UT-P-T-U-0 MATCHFS ¥7 STK CONTENTS: 11,23,22,1-2,21,0 , 1=4 STK CUNTFNTS: 11,73,22,1-2,71, STK CONTENTS: 11,23,22,1-7 STk CONTFNTS: 11,73,77, , , , Figure k. - Sheet 2 of 3 -25- STK CONTENTS: 11,23, _. C0PY<2,1> = IS USED HERE QVLAP = U-T NC = 2 CC?Y<2,2> = UTP IS USED HTRP n vi AP = P NP = 1 CDPY<1,1> = PUT IS USED HERE OVLAP » UT NC = 2 COPY<2,3> = UT IS USED HERE QVLAP * NC . »_.2_ Nt-W CANO. NOT SAVED IS O-U-T-P-UT MATCHES #7 STK CONTENTS: 12,21, 22 , 1 1 ,_ 2_3 .*. _» I = 4 S1K CONTENTS: 12,21,22,11,23, STK CONTENTS; 1,2 -^- 2 1.22,11, STK CONTENTS: 12,21,22* C0PY<2,3> = UT IS USED HERE OVLAP = NC_.= 2 COPY = UTP IS USED HERE OVLAP = NC = 2 NFW CAND. NOT SAVED IS O-U-T * PUT MATCHES #8 STK CONTENTS* 12,21,23,11,22,0 , 1=4 STK CONTENTS: 1 2 , ?1.?3 T 11 T ?7 P STK CONTFNTS: 12,21,23, STK CONTENTS: 12,21, END OF CANDIDATES FOP J = 3 Figure k. - Sheet 3 of 3 -26- References 1. Greenwood, C.T. and Milne, E.A. , Natural High Polymers , Con- temporary Science Paperbacks, Oliver & Boyd, London, 1968. 2. Dayhoff, Margaret 0., Atlas of Protein Sequence and Structure, Vol. U, 1969, The National Biomedical Research Foundation, Silver Spring, Maryland. 3. Dayhoff, Margaret 0. and Ledley, Robert S., Progress Report on Sequences of Amino Acids in Proteins by Computer Aids ,' NBRF Report #08710-681115 Part II, 1968. h. 
Shapiro, Marvin B., An Algorithm for Reconstructing Protein and RNA Sequences, Journal of the ACM, Vol. 14, No. 4, Oct. 1967, 720-731.
5. Slagle, James R. and Lee, Richard C.T., Application of Game Tree Searching Techniques to Sequential Pattern Recognition, will appear in the Communications of the ACM in 1971.

Appendix 1. SEQ1 Program Listing and Flowcharts

PROGRAM SEQ1
        &STLIMIT = 100000
        NOTSTKP = 1
        NOTPRT1 = 1
        NOTPRT2 = 1
        &MAXLNGTH = 160
        &DUMP = 1
        OUTPUT('TITLE',6,'(132A1)')
*  INPUT PART - P1T
*  INPUT COPIES GO TO ITREE ARRAY
*  INPUT CARD 1 CONTAINS NO. OF RUNS ONLY
        NRUNS = TRIM(INPUT)                     :F(END)
*  GET NO. OF COPIES AND MAX. NO. OF FRAGMENTS IN ANY COPY
*  FOR THIS RUN FROM NEXT CARD
        TSG1 = TRIM(INPUT) ' '                  :F(DATAERR)
RERUN   &ANCHOR = 1
        TSG1 TITLE BREAK(' ') . NCOPS SPAN(' ') :F(DATAERR) BREAK(' ') . MAXFRG :F(DATAERR)
        ITREE = ARRAY(NCOPS ',' MAXFRG)
        NFRG = ARRAY(NCOPS)
        I = 1
        FRGPAT = BREAK(' ') . TSG3 SPAN(' ')
*  BUILD UP FRAGMENTS FROM INPUT AS FRAGMENTS SEP. BY BLANKS
        TSG1 = TRIM(INPUT) ' '                  :F(DATAERR)
NXTCOPY FRGS = TSG1
        J = 1
CONTCP  TSG1 = TRIM(INPUT) ' '
*  DATA CONTINUATION CARD HAS . IN COL. 1
        cfl ,, FRGS ^ERULISGI 4^^ 1
L2      FRGS FRGPAT =                           :F(L3)
        ITREE<I,J> = TSG3
        J = J + 1                               :(L2)
L3      NFRG<I> = J - 1                         .cfwv-rrnPY)
        I = LT(I,NCOPS) I + 1                   :S(NXTCOPY)
*  NOW ITREE<I,J> CONTAINS FRAGMENT J OF COPY I,
*  FOR I = 1 TO NCOPS, J = 1 TO NFRG<I>.
*  ALL INPUT FOR THIS RUN SHOULD BE IN ITREE
*  SAVE CONTAINS THE NEXT INPUT LINE
        SAVE = TSG1
*  NOW PRINT OUT ITREE
        J J =_J ,
P1T1    J = 1
        TSG1 = ITREE<I,J>                       :(P1T11)
P1T2    TSG1 = TSG1 ' * ' ITREE<I,J>
P1T11   J = LT(J,NFRG<I>) J + 1                 :S(P1T2)
        OUTPUT = ' COPY ' I ' IS ' TSG1
        . clolTll
        OUTPUT =
*  DATA DEFINITION PART - P2T
*  INITIALIZE MAX. NO. OF ALLOWED CANDIDATE COPIES
        MAXCAN = NCOPS + (MAXFRG + 1)
*  DEFINE ARRAYS FOR CANDIDATE COPIES AND NO. OF CHAINS IN EACH
        CHAIN = ARRAY(MAXCAN ',' MAXFRG)
        NCHN = ARRAY(MAXCAN)
*  PART P4T
*  INITIALIZE FOR NEW J.
*  COPY J BECOMES COPY 2
*  DEFINE ARRAYS FOR CURRENT COPIES BEING COMPARED AND NO. OF
*  ENTRIES IN EACH, AND CHAINS USED FOR EACH
        COPY = ARRAY(2 ',' MAXFRG)
        USEDCHN = ARRAY(2 ',' MAXFRG)
        NENT = ARRAY(2)
*  ewiS ""Y^r^m^ CURRENT CANDIDATE BEING BUILT
*  DEFINE STACK FOR SAVING TREE BACKUP POINTS
        STKMAX = (2 * MAXFRG)
        STK = ARRAY(STKMAX ',' 2)
*  INITIALIZATION PART P3T
*  INPUT COPY 1 BECOMES CANDIDATE COPY 1
        LB = 1
        UB = 1
        NCHN<1> = NFRG<1>
        K = 1
P3T1    CHAIN<1,K> = ITREE<1,K>
        K = LT(K,NCHN<1>) K + 1                 :S(P3T1)
        'foT/^,^!^^ J = I""
        K = 1
P4T1    USEDCHN<2,K> = 0
        COPY<2,K> = ITREE<J,K>
        K = LT(K,NENT<2>) K + 1                 :S(P4T1)
*  FIRST I BECOMES FIRST CANDIDATE
        I = LB
*  NEWCAND WILL COUNT ALL NEW CANDIDATES FOR A FIXED J
*  PART P5T
*  COPY I BECOMES COMPARISON COPY 1
        NENT<1> = NCHN<I>
        K = 1
P5T1    USEDCHN<1,K> = 0
        COPY<1,K> = CHAIN<I,K>
        K = LT(K,NENT<1>) K + 1                 :S(P5T1)
*  tJ JZlI ALL NEW CANDIDATES CONSISTENT WITH INPUT COPY J AND
*  " 2 AND 1 RESPECTIVELY, WILL BE FOUND.
*  PART P6T
*  CLEAR STACK
        M = 1
*  RESET CHAIN COUNTER FOR CURRENT COPY
        CK = 1
*  START WITH FIRST FRAGMENT OF COMP. COPY 1
        K = 1
*  PART P7T
*  TRY TO BUILD A NEW CHAIN IN THE NEW COPY
*  u^u^f R]L - k:LLB - f - nMP - mPY ' ' rHrtT " *- m.HN USFDr.HNkT 1 - K\ = i
NEWCHN  USEDCHN<1,K> = 1
*  PUSH STACK
        STK<M,1> = 1
        STK<M,2> = K
        M = LT(M,STKMAX) M + 1                  :F(STKOV)
*  CLEAR NEW TOP OF STACK POSITION C
        STK<M,1> = 0
*  TRY TO ADD NEXT CHAIN TO COPY 2
*  OVLAP IS OVERLAP WHICH MUST MATCH WITH NEXT CHAIN
        OVLAP = COPY<1,K>
*  MUST ADD OVLAP TO CURRENT NEW CHAIN
        CHAINS = '/' OVLAP
*  NOW ENTER GENERAL SEGMENT TO BUILD ONTO THE NEW COPY BY
*  ADDING A CHAIN FROM COPY C WHICH MATCHES WITH OVLAP.
*  PART P8T
*  IT IS THE INDEX TO THE CURRENT CHAIN BEING TRIED IN COPY C
*  R = 1 IF THE REVERSE CHAIN HAS BEEN TRIED OR DOES
*  NOT NEED TO BE TRIED
MIDCHN  IT = 0
NXTIT   IT = LT(IT,NENT<C>) IT + 1              :F(BACKUP)
*  GO TO BACKUP IF ALL CHAINS HAVE BEEN TRIED
        EQ(USEDCHN<C,IT>,0)                     IfIPBTSI
TRYFWD  EQ(C,1)                                 :S(P8T3)
*  FIND IF THE REVERSE FRAGMENT CHAIN EQUALS THE FORWARD
*  FRAGMENT CHAIN OF COPY. IF IT DOES WE DO NOT
*  HAVE TO TRY THE REVERSE, SO WE SET R = 1
*  FIRST REVERSE THE WHOLE CHAIN, PUT IT IN TCHN.
*  NEED &ANCHOR = 0 HERE
        &ANCHOR = 0
*  T oo G a CHECK TO GET OUT FAST IF IT IS ONLY 1 FRAGMENT.
        _ j TSG1 BREAiLLL^yJ- ^ 8T3) ~
        R = 1 I
I4T7    TCHN =
        ,ri iatoi I {„, ISG, BREAK('-') . SF '-' =   ;«!♦{«•
        TCHN = SF '-' TCHN
I4T2    TCHN = TSG1 '-' TCHN
        , TCHN RJID-S-IJ.J - ""* ~
*  NOW SEE IF TCHN = TSG1 FRAGMENT WISE.
        OUT =
        TRVS = TCHN
        TSG1 = COPY<1,IT>                       . F (I4T5)
I4T3    TSG1 BREAK('-') . SF '-' =              :F(I4T5)
        TRVS BREAK('-') . RF '-' =
*  CHECK IF SF = RF. BOTH ARE FRAGMENTS.
I4T4    SF LEN(1) . CH - • S( I ATA) F ( P 8T4)
        RF CH
I4T5    SF = TSG1
        RF = TRVS
        QUI ^-J _L4.IAJ_                        :F(P8T4)
I4T6    IDENT(RF)                               :F(I4T3)
        EQ(OUT,1)
*  REVERSE IS EQUAL, SO SET R = 1
        R = 1                                   IP8TA)
P8T3    TCHN = COPY<1,IT>
        # P8TA -IE&G = HV I AP •
P8T5    R = 1
        TCHN = OVLAP
        TFRG = COPY<2,IT>
*  PART P9T
*  TRY TO MATCH TCHN WITH TFRG. EITHER IT FAILS OR IT IS
*  SUCCESSFUL AND 3 CASES ARE INDICATED BY:
*  NC = 1, NC = 2, OR OVLAP = NULL.
        &ANCHOR = 0
        DIFF
        OUT = 0
        OUT = 1
        STCHN = TCHN
        STFRG = TFRG
*  GET THE LEFTMOST FRAG. FROM TCHN.
        TCHN BREAK('-') . SF '-' =              :S :F(P9T1)
*  TFRG IS LONGER AND MATCH SUCCEEDED.
        OVLAP = TFRG
        NC = 1                                  :(CEX1)
*  MATCH FAILED
        EQ(R,0)                                 :F(NXTIT)
        R = 1                                   :(TRYFWD)
*  MATCH SUCCEEDED
*  WE HAVE FOUND A MATCH ON ENTRANCE TO THIS PORTION OF CODE
*  THROUGH CEX1 OR CEX2. WE HAVE OVLAP AND NC CALCULATED.
*  WE NOW ALTER CHAINS AS APPROPRIATE FOR ONE OF 6 CASES
*  OF SUCCESS AND THEN WE GO TO SC.
        EQ(C,2)                                 :S(CM5)
        CHAINS STFRG RPOS(0) =
        TSG1 =                                  :(CM4)
        CHAINS = CHAINS '/' OVLAP               :(SC)
        IDENT(OVLAP)                            :F(CM6)
        EQ(C,2)                                 :S(SC)
        CHAINS STFRG RPOS(0) =
        CHAINS = CHAINS STCHN                   :(SC)
        EQ .STFRG RPflStn) =_ :(f.Mft> ,u
        CHAINS STCHN RPOS(0) =
        STCHN TCHN RPOS(0)
        STCHN '-' RPOS(0) =
        1< * C y2 TSG1 BREAK('-') '-' " If, STrHN tsgi R pnsiOl h , F(CM4)
CM3     SF LEN(1) . CH =                        JLCM31
        CHAINS = CHAINS STCHN TSG1 '/' OVLAP
SC      STK<M,1> = C                            :S(NOT1)
NOT1    USEDCHN<C,IT> = 1
        1T = FQ -IT 1=
        STK<M,2> = IT
*  MUST CLEAR OUT TOP OF STACK POSITION C SINCE THIS IS USED
*  IN BACKUP.
        1 STK = ^ L C ;^ C -.F(MIDCHN) l
*  RETURN TO TRY FOR A NEW ADDITION TO COPY C,
*  ELSE A CHAIN IS COMPLETE SINCE OVLAP IS EMPTY
*  SEE IF WHOLE NEW COPY IS COMPLETE
P9T8    K = 1                                   .ctPlflTl
P9T9    E_01US£-DCHN.. K + 1 =S(P9T 9 .
        : \?£e%fl™M_K^™V»^
        NEWCAND = NEWCAND + 1
        IT = UB + NEWCAND
*  ERROR EXIT TO CHNOV IF CHAIN STORAGE IS USED UP.
*  ffife^"2££&3"&^
*  BELOW CHAIN CAN BE USED NOW.
        GT(IT,MAXCAN)                           :S(CHNOV)
*  STORE THE CHAINS OF THE NEW COPY AT CHAIN<IT,K> AND PUT THEM
*  IN TSG1 FOR LATER PRINTOUT.
        TSG1 =
        K = 1
        &ANCHOR =
P10T2   CHAIN = CHAINb                          :S(P10T3)
P10T3   CHAIN '/' =                             :S(P10T4)
P10T4   CHAIN CHAINCLUJa PnSI Ten - TSG1 ' * ' CHAIN
        K = LT(K,CK) K + 1 ,M ^
*  IF WE HAVE A-B-C WE DELETE A-B REVERSE-C REVERSE.
*  BUT WE DO NOT DETECT C-B-A AS DELETABLE.
*  DON'T CHECK IT AGAINST ITSELF
        EQ(NEWCAND,1)                           :S(P10T15)
        TV = UB + 1
P10T10  EQ(NCHN<TV>,CK)                         :F(P10T14)
        K = 1
P10T11  TSG3 = CHAIN
        TSG5 = CHAIN
        EQ(SIZE(TSG3),SIZE(TSG5))               :F(P10T14)
        IDENT(TSG3,TSG5)                        :S(P10T13)
*  CHECK THE REVERSE CHAIN.
        OUT =
P10T16  TSG3 BREAK('-') . SF '-' =              :S(P10T17)
        SF = TSG3
        OUT = 1
P10T17  TSG5 LEN(1) . CH RPOS(0) =
        SF CH =                                 :F(P10T14)
        IDENT(SF)                               :F(P10T17)
        EQ(OUT,1)                               :S(P10T13)
        TSG5 '-' RPOS(0) =                      :S(P10T16)F(P10T14)
P10T13  K = LT(K,CK) K + 1                      :S(P10T11)
*  DELETE THE NEW CANDIDATE
        OUTPUT = ' NEW CAND.
        NOT SAVED IS ' TSG1 ' MATCHES #' TV
        NEWCAND = NEWCAND - 1                   :(STKP1)
P10T14  TV = LT(TV,UB + NEWCAND - 1) TV + 1     :S(P10T10)
P10T15  OUTPUT = TSG1 ' IS NEW CANDIDATE COPY IN CHAIN<' IT ', >'
STKP1   EQ(NOTSTKP,1)                           :S(BACKUP)
        TV = 1
        TSG5 =
SLBL1   TSG5 = ISG5 .STK ' t ' _. .
        TV = LT(TV,M) TV + 1                    :S(SLBL1)
        OUTPUT = ' STK CONTENTS: ' TSG5 ' I = ' I
*  NOW BACKUP TO GET OTHER ALTERNATIVES
*  BACKUP PART: POP INDICATOR OF LAST ADDED CHAIN TO
*  TOP OF STACK
BACKUP  M = M - 1
        C = STK<M,1>
        IT = STK<M,2>
        LT(IT,0)                                :F(P11T3)
        R = 1
        IT = -IT                                :(P11TA1)
P11T3   R = 0
P11TA1  USEDCHN<C,IT> =
*  DELETE LAST /OVLAP FROM CHAINS.
        &ANCHOR =
        TSG1 = CHAINS
        IDENT(OVLAP)                            :S(P11TA2)
        &ANCHOR = 0
        TSG1 '/' OVLAP RPOS(0) =                :F(ERRP)
P11TA2  IDENT(TSG1)                             :S(P11T4)
        TSG1 = EQ(STK,C) TSG1 OVLAP
        EQ(NOTPRT2,1)                           :S(NOT2)
        TV = 1
        TSG5 =
SLBL    TSG5 = TSG5 STK • • STK • , •
        TV = LT(TV,M) TV + 1                    :S(SLBL)
        OUTPUT = ' STK CONTENTS: ' TSG5
*  N C -.,*.*. ,c EMPTY TRANSFER TO BACKUP TO EARLIER CHAIN
*  OTHERWISE GET "^^^CHAIN FOR PREVIOUS MATCH
        ^ ml 2 CHAlli££CKi h I-SGJ
P11TA   &ANCHOR = 1                             :S(P11T2)
P11T2   TSG1 BREAK('/') '/' -
        OVLAP = TSG1                            :S(NXTIT)F(TRYFWD)
        nniSu? 1 - ' ERROR IN MATCH AT PUT1 '   MENOI
        put, i? = 1 L IU T Af NT) IT * if 5mi
        EQ(USEDCHN<1,IT>,0)                     :(NEWCHN)
*  OTHERWISE BACK UP TO AN EARLIER CHAIN if"^.
P11T5   CK = MEICKtll CK                        ..RAfXUP>
        OVLAP - -5- " ' •• saaa &:sh™=^ : st sjseSo ■•«« »«'"»»■ '"""'is,,...,
        EQ(NEWCAND,0)
        LB = UB + 1
        UB = UB + NEWCAND
        OUTPUT = ' END OF CANDIDATES FOR J = ' J
*  INCREMENT J IF POSSIBLE, ELSE RERUN IF ANY MORE RUNS
        J = LT(J,NCOPS) J + 1                   * M
        ■JSSis" s "= cnSRUNsTins-iisi "^7",'" ~7?<«r ,F,TNln
ERRH    OUTPUT = ' ERROR: NO NEW CANDIDATES '   «0
STKOV   OUTPUT = ' ERROR: STK IS FULL'
        OUTPUT = ' MAXCAN EXCEEDED AT ' MAXCAN
END

NO ERRORS DETECTED DURING COMPILATION

GET NRUNS
INITIALIZE NCOPS, MAXFRG, ITREE, NFRG, FOR NEW INPUT.
&ANCHOR = 1
READ INPUT COPIES INTO ITREE. PRINT ITREE.
DEFINE DATA ARRAYS AND STACK.
SET CANDIDATE COPY 1 = INPUT COPY 1:
    LB = UB = 1
    NCHN<1> = NFRG<1>
    CHAIN<1,K> = ITREE<1,K> FOR K = 1,2,...,NCHN<1>
INITIALIZE FOR NEW J:
    NENT<2> = NFRG<J>
    FOR K = 1,2,...,NENT<2>: USEDCHN<2,K> = 0, COPY<2,K> = ITREE<J,K>
(F2)
INITIALIZE FOR NEW I:
    NENT<1> = NCHN<I>
    FOR K = 1,2,...,NENT<1>: USEDCHN<1,K> = 0, COPY<1,K> = CHAIN<I,K>
INITIALIZE STACK AND CHAINS:
    M = 1, CK = 1, K = 1
USEDCHN<1,K> = 1
PUSH STACK
CLEAR TOP OF STACK
C = 2
OVLAP = COPY<1,K>
CHAINS = '/' OVLAP
(MIDCHN)  (TRYFWD)
IT = 0, R = 0
(NXTIT)  YES
IT = IT + 1
YES  YES  YES
TCHN = COPY<1,IT>
R = 1, TCHN = OVLAP, TFRG = COPY<2,IT>
&ANCHOR = 0, TCHN = REVERSE(COPY<1,IT>), TFRG = OVLAP
YES
R = 1
(PAGE 2)

SEQ1 Flowchart - Sheet 1 of 5

(PAGE 2)
&ANCHOR = 0, DIFF = , OUT = 0
DELETE LEFTMOST FRAGMENT FROM TCHN. PUT IT IN SF.
TCHN EMPTY: YES / NO
OUT = 1
YES  YES
*MATCH SUCCEEDS: TFRG IS LONGER. OVLAP = TFRG, NC = 1
YES
DELETE STFRG. ADD STCHN '/' OVLAP TO CHAINS  *CASE 6
YES
*MATCH FAILS
R = 1: YES / NO  (NXTIT)
ADD '/' OVLAP TO CHAINS  *CASE 3
REPLACE STFRG WITH STCHN IN CHAINS  *CASE 5
R = 1  (TRYFWD)
*MATCH SUCCEEDS: TCHN IS LONGER OR =
SF = DIFF SF, OVLAP = SF '-' TCHN, NC = 2
YES  YES
CH = RIGHTMOST CHARACTER IN SF. DELETE IT. DIFF = DIFF CH
YES
DELETE STCHN FROM CHAINS  *CASE 1
DELETE STFRG FROM CHAINS  *CASE 4
PUT ALL COMPLETE MATCHING FRAGMENTS FROM LEFT END INTO STCHN.
ADD STCHN TSG1 '/' OVLAP TO CHAINS
YES
REVERSE OF CHAIN: NO
DELETE CANDIDATE COPY. NEWCAND = NEWCAND - 1. PRINT OUT DELETED CANDIDATE
PRINT OUT NEW CANDIDATE

SEQ1 Flowchart - Sheet 3 of 5

*POP STACK: M = M - 1, C = STK<M,1>, IT = STK<M,2>
R = 1, IT = -IT
YES
&ANCHOR = 0. DELETE '/' OVLAP FROM END OF CHAINS
(NXTIT)
YES
CHAINS = CHAINS OVLAP
YES
(NEWCHN)
K = IT
PRINT OUT STK IF NOTPRT2 ≠ 1
&ANCHOR = 1. OVLAP = STRING IN CHAIN AFTER '/'
*GET PREVIOUS CHAIN: CK = CK - 1, OVLAP =
FOR K = LB, LB+1, ..., UB; IT = 1,2,...,NCHN.
OUTPUT = 'END OF CANDIDATES FOR J = ' J
*GET NEXT INPUT COPY.
J = J + 1
YES
*NEW DATA RUN: NRUNS = NRUNS - 1
OUTPUT = 'ERROR: STK IS FULL'  (RERUN)
OUTPUT = 'MAX CAN EXCEEDED AT ' MAXCAN
(DATAERR)  OUTPUT = 'ERROR IN INPUT DATA'
(END)

SEQ1 Flowchart - Sheet 5 of 5

Form AEC-427 (6/68)
U.S. ATOMIC ENERGY COMMISSION
UNIVERSITY-TYPE CONTRACTOR'S RECOMMENDATION FOR DISPOSITION OF SCIENTIFIC AND TECHNICAL DOCUMENT
(See Instructions on Reverse Side)

1. AEC REPORT NO.: COO-1018-1219
2. TITLE: SEQUENCE DETERMINATION FROM FRAGMENT DATA
3. TYPE OF DOCUMENT (Check one):
   a. Scientific and technical report
   b. Conference paper not to be published in a journal: Title of conference; Date of conference; Exact location of conference; Sponsoring organization
   c. Other (Specify)
4. RECOMMENDED ANNOUNCEMENT AND DISTRIBUTION (Check one):
   a. AEC's normal announcement and distribution procedures may be followed.
   b. Make available only within AEC and to AEC contractors and other U.S. Government agencies and their contractors.
   c. Make no announcement or distribution.
5. REASON FOR RECOMMENDED RESTRICTIONS:
6. SUBMITTED BY: NAME AND POSITION (Please print or type)
   John C. Schwebel, Research Assistant
   Organization: Digital Computer Laboratory, University of Illinois, Urbana, Illinois 61801
   Signature:
   Date: November 16, 1970

FOR AEC USE ONLY
7. AEC CONTRACT ADMINISTRATOR'S COMMENTS, IF ANY, ON ABOVE ANNOUNCEMENT AND DISTRIBUTION RECOMMENDATION:
8. PATENT CLEARANCE:
   a. AEC patent clearance has been granted by responsible AEC patent group.
   b. Report has been sent to responsible AEC patent group for clearance.
   c. Patent clearance not required.
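
Editorial sketch: duplicate-candidate test. The SEQ1 listing in Appendix 1 discards a new candidate copy when it matches one already stored. Its comments say a chain is also accepted as a match when it equals the stored chain read in reverse, fragment by fragment, but a candidate whose chains appear in a different order is not detected. The Python below is a minimal illustration of that rule under stated assumptions; it is not a transcription of the SNOBOL4 code. The function names (parse_candidate, same_fragment, same_chain, chains_match, candidates_match) are hypothetical, and candidates are assumed to be written as in the Figure 4 trace, with ' * ' between chains and '-' between fragments. One simplification: fragments are compared as multisets in both directions, whereas the listing first tests the forward chains for identical strings.

```python
from collections import Counter


def parse_candidate(text):
    """Split a trace-style candidate such as 'O-U-T * PUT' into chains,
    each chain being a list of fragment strings."""
    return [chain.strip().split("-") for chain in text.split("*")]


def same_fragment(a, b):
    # A fragment is unordered, so compare the terminals as multisets.
    return Counter(a) == Counter(b)


def same_chain(a, b):
    # Two chains agree when they have the same number of fragments and
    # corresponding fragments hold the same terminals.
    return len(a) == len(b) and all(same_fragment(x, y) for x, y in zip(a, b))


def chains_match(a, b):
    # A chain read right to left carries the same ordering information,
    # so the reversed chain counts as a match too.
    return same_chain(a, b) or same_chain(a, b[::-1])


def candidates_match(new, old):
    # Chains are compared position by position. A candidate whose chains
    # are listed in a different order is not treated as a duplicate.
    return len(new) == len(old) and all(
        chains_match(x, y) for x, y in zip(new, old)
    )


if __name__ == "__main__":
    a = parse_candidate("UT-P-T-U-O")
    b = parse_candidate("O-U-T-P-UT")
    print(candidates_match(a, b))
    c = parse_candidate("PUT * O-U-T")
    d = parse_candidate("O-U-T * PUT")
    print(candidates_match(c, d))
```

The first pair is the kind of case reported in the trace as "NEW CAND. NOT SAVED", since one chain is the reverse of the other. The second pair differs only in chain order, which this position-by-position comparison keeps as two distinct candidates.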