UNIVERSITY OF 
 
 ILLINOIS LIBRARY 
 
 AT URBANA-CHAMPAIGN 
 
 
Report No. UIUCDCS-R-74-659

DATA COMPRESSION FOR CHARACTER STRINGS

by

Alfred C. Weaver

July, 1974
 
 DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 
 
Report No. UIUCDCS-R-74-659 
 
DATA COMPRESSION FOR CHARACTER STRINGS

by

Alfred C. Weaver

July, 1974

Department of Computer Science
University of Illinois at Urbana-Champaign
 Urbana, Illinois 61801 
 
TABLE OF CONTENTS

1. DATA COMPRESSION FOR CHARACTER STRINGS
   1.1. The Necessity of Compression
   1.2. A Software Solution

2. DATA COMPRESSION
   2.1. The Transmitted Message
   2.2. A Solution Utilizing Duplicate Character Compression
      2.2.1. The Encoding Algorithm
      2.2.2. The Decoding Algorithm
   2.3. A Solution Utilizing Common Phrase Detection and Replacement
      2.3.1. Common Phrase Detection
      2.3.2. A Conjecture
      2.3.3. An Example
      2.3.4. Common Phrase Replacement

SUMMARY
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/datacompressionf659weav 
 
1. DATA COMPRESSION FOR CHARACTER STRINGS 
 
 1.1. The Necessity of Compression 
 
 In the field of data communications, one recurring problem is how 
 best to transmit the most information in the least time and at the least 
 cost. Within the scope of a centralized computer system, the problem is 
 often solved by "hardwiring" the devices which wish to communicate. As 
 the number and speed of the peripherals increase (by design changes or 
 equipment upgrading), previously adequate data paths begin to show signs of 
 saturation and new techniques must be employed. Three popular techniques 
are: (1) increasing the bandwidth of the data path to achieve a greater
parallelism in the architecture; (2) switching the mode of operation of data
path controllers from bit-serial to bit-parallel, with an appropriate increase
in hardware; and (3) multiplexing a single high-speed line among many slower
speed devices. While these techniques work adequately well, they are somewhat
dependent upon a controlled environment in which device speeds are
well-matched, data paths are relatively short, error rates are small, and
the equipment involved is well-understood. But as one leaves the central
system and begins to explore the world of remote computing and remote
conversational access in general, a new series of problems and woes arises.
Primary among these is the basic inadequacy of the common or "voice-grade"
telephone line for carrying information at high speed with acceptably low
error rates.
 
 The rate at which a state-of-the-art minicomputer can produce 
 information now far outstrips the capabilities of the voice-grade line. Yet 
the alternative, a "conditioned line," which is capable of high speed
operation, is not only expensive to install and costly to use, but denies
 the basic advantage of telephone circuitry as a data path: the ability to 
 communicate from one point to another point, however unlikely, provided 
 each end has access to a common household telephone. 
 
 The problem gets completely out of hand when one considers more 
 ambitious projects such as remote graphics. A machine with the size, speed, 
 and capacity of an IBM 360/75, talking to, say, a Calcomp plotter, generates 
 literally millions of instructions to control the plotting of even simple 
graphs. Drawing a 10" straight line at a 45° angle requires some 2000
 commands, all of which must be transmitted over some data link. The expected 
 result is that remote graphics is unbearably slow, and when in use tends to 
 completely saturate the data path, thus effectively blocking use of the line 
for any other device.
 
 1.2. A Software Solution 
 
 It is this impasse which prompted research on the subject in question: 
 if line speed is constrained to some constant value, how can the density 
 of information transmitted best be increased? The answer appears to involve 
 manipulation of the transmitted data itself, and in some cases the addition 
of pre-processors and post-processors to encode/decode the transmitted
information. While this solution also requires additional hardware, the "intelligence"
required is small and well within the range of minicomputers and even
microcomputers (processor-on-a-chip technology). Such hardware is not expensive:
one popular 8-bit microprocessor sells for $60 in quantity. In fact, the
utilization of microcomputers can be shown to represent good economy when
compared with the "conditioned line" alternative. Thus, the hardware
 capabilities for data compression/decompression exist, and what remains is to 
 determine what software methodology should be applied. 
 
 
2. DATA COMPRESSION 
 
 2.1. The Transmitted Message 
 
The purpose of this project was to determine, for a specific
class of "messages" which exist initially as print files on disk, how such a
file should be pre-processed locally, transmitted across telephone circuitry,
and post-processed at a remote site so as to minimize the number of
characters actually transmitted (and, hence, telephone charges). The boundary
 conditions of the problem were: 
 
 1. All information to be transmitted was basically of 
 the same type: relay ladders for process control 
 logic (i.e., a graph). The following pages show 
 the sequence of equations, object code, and graph 
 which are to be transmitted. Only the graph is of 
 sufficient length to justify condensation. 
 
 2. The graphs were stored as line images, 1 to 80 
 characters/line, and varied in length from 30 to 
 1500 lines. 
 
3. There was similarity, and indeed duplication, of the
 basic "building blocks" within each line and further 
 repetition of these building blocks among lines of 
 the graph. 
 
It seemed reasonable to detect the common phrases of the graph and to reduce
each from the physical character string itself to either a character and a
repetition factor or a pointer into a "dictionary" of common phrases. The
anticipated result was a significant decrease in the total length of
transmitted text.

2.2. A Solution Utilizing Duplicate Character Compression
 2.2.1. The Encoding Algorithm 
 
If two or more contiguous characters in the message
are identical, replace the string with a two-character phrase where the first
 
************************* EQUATIONS *************************

CR1  = (PB1+CR1)*PB2
MX1  = CR1
CR3  = MX1 & PB4-NC*TR5-XX0
CR4  = MX1 & (PB4-NO+TR5-00X)*CR5
CR5  = MX1 & (CR3*/CR4+CR5)*(CR3+/CR4)*/CR7
CR6  = MX1 & /CR5+CR6*CR7
TR1  = MX1 & CR6
CR7  = MX1 & (LS2+CR7*CR8)*CR6
CR8  = MX1 & CR6*TR1-00X*LS3*TR4-XX0
CR10 = MX1 & LS6*/CR13*/CR12*/CR5
CR11 = MX1 & (PS1+CR11)*/CR12*CR10*/CR5
SOLA = MX1 & /PS1*/CR11*LS1
CR12 = MX1 & (PS1+CR11)*FAULT+CR12*PB5
TR3  = MX1 & WC
WC   = MX1 & CR11*LS2
CR13 = MX1 & (CR13+TR3-00X)*/CR5
TR4  = MX1 & CR13
CR2  = MX1 & (CR2*/CR7+TR4-00X*CR5)*SS1
TR5  = MX1 & CR2
 
************************** PROGRAM **************************

   1  LDA
   2  AUX    776
   3  AUX    775
   4  AUX    774
   5  STO
   6  LDA      1  PB1
   7  OR     177  CR1
  10  AND      2  PB2
  11  STO    177  CR1
  12  STO    162  MX1
  13  LDA
  14  AUX    776
  15  LDA    162  MX1
  16  AUX    776
  17  LDA         PB4-NC
  20  ANDC   124  TR5-XX0
  21  STO    166  CR3
  22  LDA      6  PB4-NO
  23  OR     124  TR5-00X
  24  AND    176  CR5
  25  STO    165  CR4
  26  LDA    166  CR3
  27  ANDC   165  CR4
  30  OR     176  CR5
  31  STO    160  TEMP 160
  32  LDA    166  CR3
  33  ORC    165  CR4
  34  AND    160  TEMP 160
  35  ANDC   163  CR7
  36  STO    176  CR5
  37  LDA    173  CR6
  40  AND    163  CR7
  41  ORC    176  CR5
  42  STO    173  CR6
  43  STO    120  TR1
  44  LDA    163  CR7
  45  AND    172  CR8
  46  OR       7  LS2
  47  AND    173  CR6
  50  STO    163  CR7
  51  LDA    173  CR6
  52  AND    120  TR1-00X
  53  AND     10  LS3
  54  ANDC   123  TR4-XX0
  55  STO    172  CR8
  56  LDA     13  LS6
  57  ANDC   164  CR13
  60  ANDC   175  CR12
  61  ANDC   176  CR5
  62  STO    171  CR10
  63  LDA      4  PS1
  64  OR     170  CR11
  65  ANDC   175  CR12
  66  AND    171  CR10
  67  ANDC   176  CR5
  70  STO    170  CR11
  71  LDAC     4  PS1
  72  ANDC   170  CR11
  73  AND      3  LS1
  74  STO    174  SOLA
  75  LDA      4  PS1
  76  OR     170  CR11
  77  AND     17  FAULT
 100  STO    160  TEMP 160
 101  LDA    175  CR12
 102  AND     14  PB5
 103  OR     160  TEMP 160
 104  STO    175  CR12
 105  LDA    167  WC
 106  STO    122  TR3
 107  LDA    170  CR11
 110  AND      7  LS2
 111  STO    167  WC
 112  LDA    164  CR13
 113  OR     122  TR3-00X
 114  ANDC   176  CR5
 115  STO    164  CR13
 116  STO    123  TR4
 117  LDA    161  CR2
 120  ANDC   163  CR7
 121  STO    160  TEMP 160
 122  LDA    123  TR4-00X
 123  AND    176  CR5
 124  OR     160  TEMP 160
 125  AND     16  SS1
 126  STO    161  CR2
 127  STO    124  TR5
 
*********************** RELAY DIAGRAM ***********************

[Relay ladder diagram for the equations above, drawn on the line printer
from strings such as '--( )--', '--( / )--', ':' and '+', with each contact
labelled by its symbol (PB1, CR1, MX1, ...) and octal address. The character
layout of the diagram is not recoverable in this copy.]

XPR000I*******STEP TIME IS 1.04 SECS
 
character is both a phrase marker and a count of the number of duplicate
characters and the second character of the pair is the normal bit-code of
the repeated character.

Thus the string:    ABBBCBBBBDEEF       length=13   (*)
reduces to:         A(3)BC(4)BD(2)EF    length=10

while the string:   XYYYZXYYYZ          length=10   (**)
reduces to:         X(3)YZX(3)YZ        length=8

Note that in string (*) the replacement of 'EE' with '(2)E'
represents neither a saving nor a loss with regard to the actual number
of characters transmitted; the larger common phrase 'XYYYZ' is not detected
in string (**) because of the contiguous duplicate character requirement.

Nevertheless, the method shows promise and features simplicity
of the encoding/decoding mechanism as illustrated in the following Knuth-style
description of the algorithm.
 
Step 1 (initialization)
    F <- first character of character string MSG
    S <- second character of MSG
    i <- 1
    j <- 2
    L <- total length of MSG

Step 2 (longest contiguous string of Fs)
    while (F = S & j < L) do
        j <- j + 1
        S <- j-th character of MSG

Step 3 (output)
    if (j - i) = 1 then output (F)
    else do
        output (j - i)
        output (F)

Step 4 (halt?)
    if j > L then halt

Step 5 (scan remaining characters)
    i <- j
    j <- j + 1
    F <- S
    S <- j-th character of MSG
    go to step 2
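
The encoding idea can be sketched in Python as follows. This is a minimal illustration, not the report's implementation: it uses itertools.groupby in place of the explicit i/j scan, and it represents the marker-and-count character as a hypothetical (count, character) tuple. The names rle_encode and encoded_length are assumptions introduced for the example.

```python
from itertools import groupby


def rle_encode(msg):
    """Collapse each run of identical characters in msg.

    A run of length 1 is emitted as the bare character. A longer run is
    emitted as a (count, char) tuple, which stands in for the two-character
    phrase described in the text (marker/count, then the repeated character).
    """
    tokens = []
    for ch, run in groupby(msg):
        n = len(list(run))
        tokens.append(ch if n == 1 else (n, ch))
    return tokens


def encoded_length(tokens):
    """Transmitted length: 1 per bare character, 2 per (count, char) pair."""
    return sum(1 if isinstance(t, str) else 2 for t in tokens)


if __name__ == "__main__":
    for text in ("ABBBCBBBBDEEF", "XYYYZXYYYZ"):
        tokens = rle_encode(text)
        print(text, len(text), tokens, encoded_length(tokens))
```

Under this cost model the two example strings (*) and (**) shrink from 13 to 10 and from 10 to 8 characters, matching the lengths quoted in the text.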
 
2.2.2. The Decoding Algorithm

The decoding algorithm is equally simple:

Step 1 (initialization)
    F <- first character of encoded string MSG
    S <- second character of MSG
    i <- 2
    L <- total length of MSG

Step 2 (marker?)
    if F is a marker then do
        output F copies of S
        i <- i + 1
        if i > L then halt
        F <- i-th character of MSG
        i <- i + 1
        if i > L then do
            output (F)
            halt
        S <- i-th character of MSG
        go to step 2
    else do
        output (F)
        i <- i + 1
        if i > L then do
            output (S)
            halt
        F <- S
        S <- i-th character of MSG
        go to step 2
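
A matching decoder can be sketched in Python. As an assumption made only for this illustration, the encoded message is a list in which a bare character stands for itself and a (count, character) tuple stands for the marker/count phrase; rle_decode is a hypothetical name, not one taken from the report.

```python
def rle_decode(tokens):
    """Expand a token list back into the original string.

    Each token is either a single character (copied through unchanged) or a
    (count, char) tuple, which expands to `count` copies of `char`.
    """
    pieces = []
    for token in tokens:
        if isinstance(token, tuple):
            count, ch = token
            pieces.append(ch * count)
        else:
            pieces.append(token)
    return "".join(pieces)


if __name__ == "__main__":
    encoded = ["A", (3, "B"), "C", (4, "B"), "D", (2, "E"), "F"]
    print(rle_decode(encoded))
```

Because the decoder only has to distinguish a marker from an ordinary character, it needs a single pass and no lookahead beyond the character that follows a marker.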
 
Several examples showing the operation of the two algorithms, 
 and calculating the number and percentage of characters saved, are shown 
next. The first program is a PL/I version of the encoding algorithm,
followed by five sample runs using data picked at random from sample graphs,
followed by the program for the decoding algorithm, followed by a sample
run showing the text expansion back into the original string.

It is interesting, and perhaps surprising, to note that this
rather simple-minded procedure produced transmission savings of 67.6%,
68.6%, 56.7%, 70.3%, and 71.7%, respectively, for the five examples chosen
 from this general class of messages. The contiguous duplicate character 
 compression algorithm shows promise based on its reasonable efficiency 
 and simplicity. 
 
2.3. A Solution Utilizing Common Phrase Detection and Replacement
2.3.1. Common Phrase Detection
 
 As noted in a previous example, the requirement of contiguous 
 duplicate characters prevents the recognition of duplicated blocks of 
 non-identical characters. For the class of graphs studied, it is obvious 
 that the strings '--( )--', '--(/)--', ' ', and others are used 
 repeatedly to build the individual lines of the graph, and not much compaction 
 is permitted by the previous algorithm when applied to these strings. This 
 immediately introduces a new question: for some string S, does there exist 
 a set P of common phrases of S whose repeated use in S could be replaced 
 with pointers to a dictionary of phrases P? Clearly, the answer is yes and 
 such an algorithm could be implemented ad hoc, given such a set of phrases P. 
 But a more interesting question is: for some string S, does there exist a 
 set P of common phrases of S whose replacement in S by pointers to P yields 
 a minimal length new string S'? 
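
To make the first question concrete, the following Python sketch replaces occurrences of an ad hoc phrase set P with pointers into that set. It is an illustration under stated assumptions only: the phrase list is chosen by hand, pointers are represented as integers (dictionary indices), matching is greedy from left to right with longer phrases tried first, and the names replace_phrases and expand are hypothetical. It makes no attempt to find the minimal-length string S' that the second question asks about.

```python
def replace_phrases(s, phrases):
    """Replace each occurrence of a dictionary phrase in s with its index.

    Returns a list of tokens: an int is a pointer into `phrases`, a str is
    a literal character that matched no phrase.
    """
    # Try longer phrases first so a short phrase cannot pre-empt a longer one.
    order = sorted(range(len(phrases)), key=lambda k: -len(phrases[k]))
    tokens = []
    i = 0
    while i < len(s):
        for k in order:
            if s.startswith(phrases[k], i):
                tokens.append(k)
                i += len(phrases[k])
                break
        else:
            tokens.append(s[i])
            i += 1
    return tokens


def expand(tokens, phrases):
    """Inverse of replace_phrases: substitute each pointer by its phrase."""
    return "".join(phrases[t] if isinstance(t, int) else t for t in tokens)


if __name__ == "__main__":
    P = ["--( )--", "--(/)--"]
    line = "+--( )--+--(/)--+"
    tokens = replace_phrases(line, P)
    print(tokens, len(line), "->", len(tokens))
    print(expand(tokens, P) == line)
```

With a single token per pointer, the 17-character sample line reduces to 5 tokens; how much is really saved depends on how pointers are encoded on the line and on the cost of transmitting the dictionary itself.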
 
 
/* TEXT COMPRESSION BY DUPLICATE CHARACTER COMPACTION */
/* ENCODING ALGORITHM */
/* COMPUTER SCIENCE 389 PROJECT */
/* ALFRED C. WEAVER */

/* AN ALGORITHM TO REDUCE THE LENGTH OF A TEXT FILE BY COMPACTING */
/* SUCCESSIVE DUPLICATE CHARACTERS INTO A SIGNAL BYTE, FOLLOWED */
/* BY A COUNT BYTE, FOLLOWED BY THE CHARACTER ITSELF */

COMP: PROC OPTIONS(MAIN);

/* 'MSG' IS THE INPUT TEXT STRING TO BE REDUCED */
DCL MSG CHAR(82) VAR, (FIRST, SECOND) CHAR(1), CH CHAR(3);

/* WHEN INPUT IS EXHAUSTED, PRINT THE STATISTICS FOR */
/* THIS PARTICULAR MESSAGE */
ON ENDFILE(SYSIN) BEGIN;
   PUT SKIP(2) EDIT ('STATISTICS:', 'ORIGINAL MESSAGE LENGTH:',
       LO, 'REDUCED MESSAGE LENGTH:', LR, 'SAVING:', LO-LR,
       ' (=', (LO-LR)*100/LO, ' %)')
       (A, 3 (SKIP, A, F(10)), A, F(5,1), A);
   PUT PAGE;
   STOP;
END;

/* 'LO' IS THE ORIGINAL LENGTH OF THE MESSAGE */
/* 'LR' IS THE LENGTH OF THE REDUCED MESSAGE */
LO,LR=0;

DO WHILE('1'B);

   /* READ, PRINT, AND COUNT THE INPUT MESSAGE */
   GET LIST (MSG);
   PUT SKIP(2) EDIT ('ORIGINAL MESSAGE:', '''', MSG, '''')
       (A, COL(30), 3 A);
   LO=LO + LENGTH(MSG);
   PUT SKIP EDIT ('REDUCED MESSAGE:', '''') (A, COL(30), A);

   /* 'FIRST' IS THE FIRST CHARACTER OF THE INPUT STRING */
   /* 'SECOND' IS THE SECOND CHARACTER OF THE INPUT STRING */
   MSG=MSG||' ';
   I=1; FIRST=SUBSTR(MSG,1,1);
   J=2; SECOND=SUBSTR(MSG,2,1);

   /* REPEAT UNTIL THE ENTIRE PHRASE HAS BEEN EXAMINED */
   DO WHILE (SUBSTR(MSG,I) ¬= '');

      /* FIND THE LONGEST STRING OF CHARACTER 'FIRST' */
      DO WHILE (FIRST=SECOND & J<=80);
         J=J+1;
         SECOND=SUBSTR(MSG,J,1);
      END;

      /* IF J-I=1, THERE WAS NO DUPLICATE CHARACTER, SO OUTPUT IT ALONE */
      IF J-I=1 THEN DO;
         PUT EDIT (FIRST) (A(1));
         LR=LR+1;
      END;

      /* OTHERWISE PUT OUT A REPETITION MARKER (="%"), FOLLOWED BY */
      /* THE COUNT FIELD (REPETITION FACTOR), FOLLOWED BY */
      /* THE REPEATED CHARACTER ITSELF */
      ELSE DO;
         PUT STRING(CH) EDIT (J-I) (F(3));
         CH=TRANSLATE(CH, '0', ' ');
         PUT EDIT ('%', CH, FIRST) (3 A);
         LR=LR+2;
      END;
      I=J;
      J=J+1;
      FIRST=SECOND;
      SECOND=SUBSTR(MSG,J,1);
   END;
   PUT EDIT ('''') (A);
END;

END COMP;
 
[Sample run 1: paired ORIGINAL MESSAGE / REDUCED MESSAGE printout for
successive lines of the relay diagram. The individual lines are not legible
in this copy; only the summary statistics are reproduced.]

STATISTICS:
ORIGINAL MESSAGE LENGTH:       992
REDUCED MESSAGE LENGTH:        320
SAVING:                        672 (= 67.6 %)
 
[Sample run 2: paired ORIGINAL MESSAGE / REDUCED MESSAGE printout for
successive lines of the relay diagram. The individual lines are not legible
in this copy; only the summary statistics are reproduced.]

STATISTICS:
ORIGINAL MESSAGE LENGTH:      1040
REDUCED MESSAGE LENGTH:        328
SAVING:                        720 (= 68.6 %)
 
[Sample run 3: paired ORIGINAL MESSAGE / REDUCED MESSAGE printout for
successive lines of the relay diagram. The individual lines are not legible
in this copy; only the summary statistics are reproduced.]

STATISTICS:
ORIGINAL MESSAGE LENGTH:      1157
REDUCED MESSAGE LENGTH:        500
SAVING:                        657 (= 56.7 %)
 
[Sample run 4: paired ORIGINAL MESSAGE / REDUCED MESSAGE printout for
successive lines of the relay diagram. The individual lines are not legible
in this copy; only the summary statistics are reproduced.]

STATISTICS:
ORIGINAL MESSAGE LENGTH:       949
REDUCED MESSAGE LENGTH:        280
SAVING:                        669 (= 70.3 %)
 
[Sample run 5: paired ORIGINAL MESSAGE / REDUCED MESSAGE printout for
successive lines of the relay diagram. The individual lines are not legible
in this copy; only the summary statistics are reproduced.]

STATISTICS:
ORIGINAL MESSAGE LENGTH:      1207
REDUCED MESSAGE LENGTH:        340
SAVING:                        867 (= 71.7 %)
 
/* TEXT EXPANSION FROM AN ENCODED MESSAGE */
/* ENCODING MECHANISM: CONTIGUOUS DUPLICATE CHARACTER COMPRESSION */
/* COMPUTER SCIENCE 389 PROJECT */
/* ALFRED C. WEAVER */

DECODE: PROC OPTIONS(MAIN);

DCL CODE CHAR(80) VAR INIT('X'), (F,S) CHAR(1);

/* REPEAT FOR EACH ENCODED MESSAGE */
DO WHILE(CODE ¬= ' ');
   GET LIST (CODE);
   PUT SKIP(2) EDIT ('ENCODED MESSAGE IS:', CODE,
       'RECONSTRUCTED MESSAGE IS:', ' ')
       (A, COL(30), A, SKIP, A, COL(29), A);
   I=1;

   /* SCAN ACROSS THE ENTIRE MESSAGE */
   DO WHILE(I <= LENGTH(CODE));
      F=SUBSTR(CODE,I,1);

      /* DETERMINE IF A CHARACTER IS REPEATED */
      IF F='%' THEN DO;
         GET STRING(SUBSTR(CODE,I+1,3)) EDIT (K) (F(3));
         S=SUBSTR(CODE,I+4,1);

         /* REPEAT THE CHARACTER 'K' TIMES */
         DO J=1 TO K;
            PUT EDIT (S) (A);
         END;
         I=I+5;
      END;

      /* OUTPUT A SINGLE CHARACTER */
      ELSE DO;
         PUT EDIT (F) (A);
         I=I+1;
      END;
   END;
END;
END DECODE;
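
The printable format used by the two PL/I programs (a '%' marker, a three-digit zero-filled count, then the repeated character) can be mirrored in a short Python sketch. This is an approximation for illustration, not a translation of the listings: the names encode_wire and decode_wire are hypothetical, the message is assumed to contain no literal '%', and runs are assumed to be shorter than 1000 characters, which holds for the 80-character lines discussed here.

```python
MARKER = "%"


def encode_wire(msg):
    """Encode msg, writing each run of 2+ identical characters as %nnnC."""
    out = []
    i = 0
    while i < len(msg):
        j = i + 1
        while j < len(msg) and msg[j] == msg[i]:
            j += 1                      # extend the run of msg[i]
        run = j - i
        if run == 1:
            out.append(msg[i])          # lone character goes out unchanged
        else:
            out.append(f"{MARKER}{run:03d}{msg[i]}")
        i = j
    return "".join(out)


def decode_wire(code):
    """Expand every %nnnC group back into nnn copies of C."""
    out = []
    i = 0
    while i < len(code):
        if code[i] == MARKER:
            count = int(code[i + 1:i + 4])
            out.append(code[i + 4] * count)
            i += 5                      # skip marker, 3 digits, character
        else:
            out.append(code[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for text in ("ABCXXXXXABC", "+--(   )-----(   )--+"):
        packed = encode_wire(text)
        print(repr(packed), decode_wire(packed) == text)
```

Note that this printable form spends five characters on every run, whereas the encoding program's statistics count each run as two characters (LR=LR+2), corresponding to the intended one-character marker/count followed by the repeated character.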
 
ENCODED MESSAGE I?: 
 RECONSTRUCTED MESSAGE IS: 
 
 ABC*105XA&C 
 mPCXXXXXABC 
 
 ENCODED MESSAGE T S: 
 PECGNSTF'JCTFD "ESSAGE IS: 
 
 *0O2A3SOC5e*OlDC*02ODX 
 AABbBBBCCCCCCCCCCDDDQDDDDDDDOODDDDDDDX 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGF IS: 
 
 *010A*010;i 
 AAAAAAAAAABBPBR8RBBB 
 
 ENCGDFD MFSSAGE I S: 
 RECONSTRUCTED MESSAGE IS: 
 
 t 8CDEFGHI JKLMNCPQ&STUVWXYZ 
 ABCDEFGHI JKLMNOPQPSTUVWXY? 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 XOG-i- CP 7*007 CRB 
 C°7 CPR 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 +*G02~<*003 )*00?-<*003 1*002-+ 
 ♦ — ( ) ( ) — ♦ 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 :*003 163*007 172*003 
 : 163 172 : 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 :*019 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 :*003 LS2S013 :?O03 CR6*037 CR7 
 : LS2 : CR6 
 
 CR7 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 +*002-<*003 )*012-+*002-<*003 ) *035- ( *002O50)-+ 
 ♦__ ( , + — , , 
 
 (0050)-* 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 X003 *C0207*017 173*038 163 
 007 1 73 
 
 163 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 :*003 CR6*005 tr l-*O02OX*005 LS3*00? TR4-*002 XC*025 CR8 
 CPc TRl-Cnx LS3 TRA-XXO 
 
 CP8 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 +*002-( J003 >*005-<*003 )*005-(*003 )*005-< / ) *025- { *0020*002 c )-♦ 
 
 «.__( ) , » ( , < / ) <0055)-+ 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 :*003 173X007 120*007 123*028 172 
 173 120 123 
 
 172 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 :*003 LS6*007 CR13*006 CR12*006 CR5*027 CR10 
 : LS6 CR13 CR12 C P 5 
 
 + *002-<*003 )*005-< / )*005-( / l*005-( / ) *025- < *002062 »-♦ 
 «.__( ) ( / , { / ) ( / ) 
 
 CR10 
 (00621- + 
 
 ENCODED MESSAGE IS: 
 RECONSTRUCTED MESSAGE IS: 
 
 *003 013*007 164*007 175*007 176*028 171 
 013 164 1/5 i 76 
 
 
 171 
 
Again it is clear that a set of phrases of S can be generated
by enumeration. Now, given a set P of common phrases in S, what do we
know about them?

(1) If S is of length N, then no phrase p ∈ P can be of
length > ⌊N/2⌋, else it could not be repeated and, hence,
could not be "common";

(2) A string of length N will have (counting each starting
position separately):

N phrases of length 1,
(N-1) phrases of length 2,
(N-2) phrases of length 3,
...
and (N - ⌊N/2⌋ + 1) phrases of length ⌊N/2⌋.
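
The enumeration in (1) and (2) can be checked with a few lines of Python. This is only a sketch for illustration; the function name `phrases_by_length` is hypothetical and does not appear in the report.

```python
def phrases_by_length(s: str) -> dict[int, list[str]]:
    """Map each phrase length L in 1..floor(N/2) to the N-L+1 substrings of that length."""
    n = len(s)
    return {
        length: [s[i:i + length] for i in range(n - length + 1)]
        for length in range(1, n // 2 + 1)
    }


# For N = 6 the lengths run from 1 to 3, with 6, 5 and 4 phrases respectively.
for length, found in phrases_by_length("ABCABC").items():
    print(length, len(found), found)
```

Phrases are listed once per starting position, so a repeated phrase such as 'AB' appears more than once in its list.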
 
Would the replacement of all phrases p with pointers to P yield
a minimal length S'? Not unless there is an ordering to P, for replacement
of p_i='AB' and p_j='ABC', applied in the order (i,j) to S='ABCABC', yields
S'=(p_i)C(p_i)C, with length=4, while the order (j,i) yields S'=(p_j)(p_j)
with length=2.
 
Suppose P were to be ordered by length of the p_i such that replacements
always removed the longest phrases first. This is not sufficient since
p_i='ABCD' and p_j='CDEAB', applied in the order (j,i) since |p_j|>|p_i|, to
S='ABCDEABCD', yields S'='AB(p_j)CD' with length=5, while the order (i,j)
yields S'='(p_i)E(p_i)' with length=3.
 
So length of common phrases alone does not establish an order on P.
Does frequency of usage affect the order of P? In addition to all phrases
p_i ∈ P, also keep a count f_i of how many times that phrase appears in the
message. Now order P by decreasing f_i, i.e., largest number of uses first.
This method also fails since for

p_i = 'ABC'      f_i = 3
p_j = 'XABCY'    f_j = 1

applied to S='ABCXABCYABC' in the order (i,j) because f_i > f_j yields
S'=(p_i)X(p_i)Y(p_i) with length=5, while the ordering (j,i) yields
S'=(p_i)(p_j)(p_i) with length=3.
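
The three counterexamples in this discussion can be checked mechanically. The sketch below is illustrative only: `reduced_length` is a hypothetical helper, and each phrase pointer is represented by a single private-use Unicode character so that it counts as one unit of length.

```python
def reduced_length(s: str, ordered_phrases: list[str]) -> int:
    """Replace phrases in the given order and return the length of the result.

    The k-th phrase is replaced by the one-character stand-in chr(0xE000 + k),
    so every pointer counts as a single symbol.
    """
    for k, phrase in enumerate(ordered_phrases):
        s = s.replace(phrase, chr(0xE000 + k))
    return len(s)


# Order matters even for two phrases.
print(reduced_length("ABCABC", ["AB", "ABC"]), reduced_length("ABCABC", ["ABC", "AB"]))
# Longest-first is not always best.
print(reduced_length("ABCDEABCD", ["CDEAB", "ABCD"]), reduced_length("ABCDEABCD", ["ABCD", "CDEAB"]))
# Most-frequent-first is not always best.
print(reduced_length("ABCXABCYABC", ["ABC", "XABCY"]), reduced_length("ABCXABCYABC", ["XABCY", "ABC"]))
```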
 
 But now let us combine these two methods to account for the effects 
 of both phrase length and frequency in suggesting a possible solution to 
 the problem of how to pick and order a set P. 
 
2.3.2. A Conjecture

Given a string of characters S of length N, let P be the set of
phrases of S such that each element of P is a phrase p_j, where p_j is the
j-th phrase, of length |p_j|, and p_j has frequency of occurrence f_j ≥ 2 in S.
Then for each phrase p_j define a reduction factor

r_j = f_j (|p_j| - 1)

which represents the number of characters saved when |p_j| characters of
text are replaced with one one-character phrase identifier (pointer) for each
of the f_j occurrences of p_j.

Construct the set of phrases P', which are the p_j sorted into
descending order by r_j. Within groups of phrases with equal r value, sort
again by descending length of phrases. Now replace, in order, every occurrence
of p'_j in S with the appropriate reference pointer until all p'_j have been
examined, thus yielding a new string of characters and/or phrase pointers, S'.
Then S' is the minimal length text string which can be created from S.
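
The procedure stated in the conjecture can be sketched in Python as follows. This is a sketch of the procedure only; it does not establish that the conjecture holds. The function name `reduce_by_conjecture` is hypothetical, occurrences are counted with `str.count` (non-overlapping, left to right), which is an assumption about how f is meant to be measured, and pointers are again represented by single private-use characters.

```python
def reduce_by_conjecture(s: str) -> tuple[str, list[str]]:
    """Apply the conjectured ordering; return (reduced string, phrases in order used)."""
    n = len(s)
    reduction = {}  # phrase -> r = f * (|p| - 1)
    for length in range(n // 2, 1, -1):          # phrases of length 2..floor(N/2)
        for i in range(n - length + 1):
            phrase = s[i:i + length]
            if phrase in reduction:
                continue
            f = s.count(phrase)                  # non-overlapping occurrences
            if f >= 2:
                reduction[phrase] = f * (length - 1)
    # Descending r, ties broken by descending phrase length.
    ordered = sorted(reduction, key=lambda p: (-reduction[p], -len(p)))
    for k, phrase in enumerate(ordered):
        s = s.replace(phrase, chr(0xE000 + k))   # one-character pointer
    return s, ordered


reduced, ordered = reduce_by_conjecture("ABCXABCYABCZXABCY")
print(ordered[:2], len(reduced))
```

For the message 'ABCXABCYABCZXABCY' the two phrases with the largest reduction factor are 'XABCY' and 'ABC', and the tie-break on length puts 'XABCY' first.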
 
 
2.3.3. An Example

Consider the (arbitrary) message 'ABCXABCYABCZXABCY'. The
set of unique phrases appearing at least twice includes:

phrase #    phrase      f     r
    1       'XABCY'     2     8
    2       'XABC'      2     6
    3       'ABCY'      2     6
    4       'ABC'       4     8
    5       'XAB'       2     4
    6       'BCY'       2     4
    7       'AB'        4     4
    8       'BC'        4     4
    9       'XA'        2     2
   10       'CY'        2     2

The ordering of replacement of the phrases with r=8 is not
arbitrary; the longest such phrase (XABCY) must be applied first. Replacing
'XABCY' first we derive:

ABCXABCYABCZXABCY         length=17
ABC(1)ABCZ(1)             length=9
(4)(1)(4)Z(1)             length=5

while replacing 'ABC' first derives:

ABCXABCYABCZXABCY         length=17
(4)X(4)Y(4)ZX(4)Y         length=9

The algorithm described has been programmed and commented to accept
a message string, determine all possible phrases which occur repeatedly,
sort them by descending r values, and replace common phrases with pointers
into phrase table P.
 
 
/* OPTIMAL PHRASE EXTRACTION FROM TEXT STRINGS */
/* COMPUTER SCIENCE 389 PROJECT */
/* ALFRED C. WEAVER */

/* AN ALGORITHM TO EXTRACT SUBPHRASES FROM A MESSAGE 'S' */
/* SUCH THAT REPLACEMENT OF SUBPHRASES WITH "PHRASE POINTERS" */
/* YIELDS A MINIMUM LENGTH MESSAGE */

PHRASES: PROC OPTIONS(MAIN);

   /* 'S' IS THE ORIGINAL MESSAGE (CHARACTER STRING) */
   /* 'P' IS THE ARRAY OF SUBPHRASES OF S */
   /* 'F' IS THE ARRAY OF FREQUENCY COUNTS */
   /* 'R' IS THE ARRAY OF REDUCTION FACTORS */
   DCL S CHAR(100) VAR, (P(500), PHRASE) CHAR(50) VAR, CH CHAR(3),
       (F(500), R(500), A, I, J, K, L, M, N) FIXED BIN(31);

   /* REPEAT UNTIL INPUT STREAM IS EXHAUSTED */
   ON ENDFILE(SYSIN) STOP;
   FOREVER: DO WHILE('1'B);
      GET LIST(S);
      N = LENGTH(S);
      K = 0;
      F = 0;
      R = 0;
      PUT SKIP(2) EDIT('ORIGINAL STRING:', '''', S, '''', 'SUBPHRASE',
                       'FREQUENCY', 'REDUCTION FACTOR')
                      (A, COL(30), 3 A, SKIP, A, COL(20), A, COL(30), A);

      /* SEARCHING FOR THE LONGEST SUBPHRASE FIRST IS */
      /* ESSENTIAL TO PREVENT A RESORT LATER ON PHRASE LENGTH */
      DO L = FLOOR(N/2) TO 2 BY -1;
         /* EXTRACT EACH OF THE SUBPHRASES OF LENGTH 'L' IN 'S' */
         DO I = 1 TO N-L+1;
            PHRASE = SUBSTR(S,I,L);
            /* THIS LOOP DISCARDS COMMON SUBPHRASES */
            /* THIS STEP IS NOT ESSENTIAL, BUT SAVES TIME */
            /* IN THE REPLACEMENT STEP */
            DO A = 1 TO K;
               IF PHRASE=P(A) THEN GOTO X;
            END;
            /* 'M' POINTS TO THE NEXT POSSIBLE OCCURRENCE OF 'PHRASE' */
            M = I+L;
            IF M>N THEN GOTO X;
            J = INDEX(SUBSTR(S,M), PHRASE);
            /* DETERMINE WHETHER 'PHRASE' OCCURS AT LEAST TWICE */
            IF J>0 /* FREQUENCY > 1 */ THEN DO;
               K = K+1;
               P(K) = PHRASE;
               F(K) = 2;
               M = M+J+L-1;
               /* FIND ALL OCCURRENCES OF 'PHRASE' IN 'S' */
               DO WHILE (J>0);
                  /* THE SCAN STOPS WHEN FEWER THAN L+1 CHARACTERS REMAIN */
                  IF M+L > N THEN GOTO X;
                  J = INDEX(SUBSTR(S,M), PHRASE);
                  IF J>0 THEN DO;
                     M = M+J+L-1;
                     F(K) = F(K)+1;
                  END;
               END;
            END;
   X:    END;
      END;

      /* ESTABLISH THE REDUCTION FACTOR FOR EACH SUBPHRASE */
      DO A = 1 TO K;
         R(A) = F(A) * (LENGTH(P(A)) - 1);
         PUT SKIP EDIT ('''', P(A), '''', F(A), R(A))
                       (3 A, COL(20), F(4), COL(30), F(8));
      END;

      /* NOW SORT THE PHRASES INTO DESCENDING ORDER BY R(I) */
      /* A SIMPLE JUMP-DOWN SORT WILL DO */
      DO I=1 TO K-1;
         DO J=I+1 TO K;
            IF R(I)<R(J) THEN /* INTERCHANGE */ DO;
               PHRASE=P(I); L=F(I); M=R(I);
               P(I)=P(J); F(I)=F(J); R(I)=R(J);
               P(J)=PHRASE; F(J)=L; R(J)=M;
            END;
         END;
      END;

      /* NOW LABEL AND PRINT ALL SUBPHRASES IN THEIR OPTIMAL ORDER */
      PUT SKIP(2) EDIT ('SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT',
                        'PHRASE#   SUBPHRASE', 'FREQUENCY', 'REDUCTION FACTOR')
                       (A, SKIP, A, COL(20), A, COL(30), A);
      DO A = 1 TO K;
         PUT SKIP EDIT (A, '''', P(A), '''', F(A), R(A))
                       (F(5), COL(11), 3 A, COL(20), F(4), COL(30), F(8));
      END;

      /* REPLACE ALL SUBPHRASES WITH PHRASE REFERENCES */
      DO I=1 TO K;
         DO WHILE(INDEX(S,P(I))>0);
            J = INDEX(S,P(I));
            PUT STRING(CH) EDIT (I) (F(3));
            CH = TRANSLATE(CH,'0',' ');
            (NOSTRINGRANGE): S = SUBSTR(S,1,J-1) || '%' || CH ||
                                 SUBSTR(S,J+LENGTH(P(I)));
         END;
      END;

      PUT SKIP(2) EDIT ('THE MINIMAL STRING IS:', S)
                       (A, COL(30), A);

   END FOREVER;
END PHRASES;
 
 
ORIGINAL STRING:             'ABCXABCYABCZXABCY'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XABCY'               2             8
'XABC'                2             6
'ABCY'                2             6
'ABC'                 4             8
'XAB'                 2             4
'BCY'                 2             4
'AB'                  4             4
'BC'                  4             4
'XA'                  2             2
'CY'                  2             2

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'XABCY'     2             8
    2     'ABC'       4             8
    3     'ABCY'      2             6
    4     'XABC'      2             6
    5     'XAB'       2             4
    6     'BCY'       2             4
    7     'AB'        4             4
    8     'BC'        4             4
    9     'XA'        2             2
   10     'CY'        2             2

THE MINIMAL STRING IS:       %002%001%002Z%001
 
ORIGINAL STRING:             'XXXXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXXX'               2             8
'XXXX'                2             6
'XXX'                 3             6
'XX'                  4             4

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'XXXXX'     2             8
    2     'XXXX'      2             6
    3     'XXX'       3             6
    4     'XX'        4             4

THE MINIMAL STRING IS:       %001%001
 
ORIGINAL STRING:             'ABCDABCDABCDEF'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'ABCD'                3             9
'BCDA'                2             6
'CDAB'                2             6
'DABC'                2             6
'ABC'                 3             6
'BCD'                 3             6
'CDA'                 2             4
'DAB'                 2             4
'AB'                  3             3
'BC'                  3             3
'CD'                  3             3
'DA'                  2             2

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'ABCD'      3             9
    2     'BCDA'      2             6
    3     'CDAB'      2             6
    4     'DABC'      2             6
    5     'ABC'       3             6
    6     'BCD'       3             6
    7     'CDA'       2             4
    8     'DAB'       2             4
    9     'AB'        3             3
   10     'BC'        3             3
   11     'CD'        3             3
   12     'DA'        2             2

THE MINIMAL STRING IS:       %001%001%001EF
 
ORIGINAL STRING:             'ABCDEFABCDEFRSTRSTABCRST'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'ABCDEF'              2            10
'ABCDE'               2             8
'BCDEF'               2             8
'ABCD'                2             6
'BCDE'                2             6
'CDEF'                2             6
'ABC'                 3             6
'BCD'                 2             4
'CDE'                 2             4
'DEF'                 2             4
'RST'                 3             6
'AB'                  3             3
'BC'                  3             3
'CD'                  2             2
'DE'                  2             2
'EF'                  2             2
'RS'                  3             3
'ST'                  3             3

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'ABCDEF'    2            10
    2     'ABCDE'     2             8
    3     'BCDEF'     2             8
    4     'ABCD'      2             6
    5     'BCDE'      2             6
    6     'CDEF'      2             6
    7     'ABC'       3             6
    8     'RST'       3             6
    9     'CDE'       2             4
   10     'DEF'       2             4
   11     'BCD'       2             4
   12     'AB'        3             3
   13     'BC'        3             3
   14     'RS'        3             3
   15     'ST'        3             3
   16     'EF'        2             2
   17     'CD'        2             2
   18     'DE'        2             2

THE MINIMAL STRING IS:       %001%001%008%008%007%008
 
 
ORIGINAL STRING:             'XXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XX'                  2             2

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'XX'        2             2

THE MINIMAL STRING IS:       %001%001X

ORIGINAL STRING:             'XXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXX'                 2             4
'XX'                  2             2

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'XXX'       2             4
    2     'XX'        2             2

THE MINIMAL STRING IS:       %001%001

ORIGINAL STRING:             'XXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXX'                 2             4
'XX'                  3             3

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'XXX'       2             4
    2     'XX'        3             3

THE MINIMAL STRING IS:       %001%001X

ORIGINAL STRING:             'XXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXX'                2             6
'XXX'                 2             4
'XX'                  3             3

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'XXXX'      2             6
    2     'XXX'       2             4
    3     'XX'        3             3

THE MINIMAL STRING IS:       %001%001

ORIGINAL STRING:             'XXXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXX'                2             6
'XXX'                 2             4
'XX'                  4             4

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'XXXX'      2             6
    2     'XXX'       2             4
    3     'XX'        4             4

THE MINIMAL STRING IS:       %001%001X
 
ORIGINAL STRING:             'XXXXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXXX'               2             8
'XXXX'                2             6
'XXX'                 3             6
'XX'                  4             4

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'XXXXX'     2             8
    2     'XXXX'      2             6
    3     'XXX'       3             6
    4     'XX'        4             4

THE MINIMAL STRING IS:       %001%001

ORIGINAL STRING:             'XXXXXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXXX'               2             8
'XXXX'                2             6
'XXX'                 3             6
'XX'                  5             5

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'XXXXX'     2             8
    2     'XXXX'      2             6
    3     'XXX'       3             6
    4     'XX'        5             5

THE MINIMAL STRING IS:       %001%001X

ORIGINAL STRING:             'XXXXXXXXXXXX'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'XXXXXX'              2            10
'XXXXX'               2             8
'XXXX'                2             6
'XXX'                 3             6
'XX'                  5             5

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     'XXXXXX'    2            10
    2     'XXXXX'     2             8
    3     'XXXX'      2             6
    4     'XXX'       3             6
    5     'XX'        5             5

THE MINIMAL STRING IS:       %001%001
 
 
ORIGINAL STRING:             [a diagram line built from '+', '-', '(', ')' and blank characters; not legible in the scanned copy]

[The SUBPHRASE / FREQUENCY / REDUCTION FACTOR listing for this string
contains 78 subphrases, most with frequency 2, with reduction factors
ranging from 26 down to 2. The subphrases consist largely of blanks and
dashes and cannot be recovered from the scanned copy.]

THE MINIMAL STRING IS:       [not legible in the scanned copy]
 
ORIGINAL STRING:             '---------------'
SUBPHRASE          FREQUENCY REDUCTION FACTOR
'-------'             2            12
'------'              2            10
'-----'               2             8
'----'                3             9
'---'                 4             8
'--'                  7             7

SUBPHRASES IN OPTIMAL ORDER FOR REPLACEMENT
PHRASE#   SUBPHRASE FREQUENCY REDUCTION FACTOR
    1     '-------'   2            12
    2     '------'    2            10
    3     '----'      3             9
    4     '-----'     2             8
    5     '---'       4             8
    6     '--'        7             7

THE MINIMAL STRING IS:       %001%001-

ORIGINAL STRING:
 
 
 • *•_( / ) 
 
 ( / ) 4-1 
 
 
2.3.4. Common Phrase Replacement

Assuming now that we have selected by hand, or generated
automatically using the method described, a particular set of phrases
P to be used for replacement in S, we can now attack the question of
how best to recognize and replace a common phrase in a general string.
Clearly, the optimal way is not the linear search which was used for
convenience in the previous program. Let us attack the problem somewhat
mathematically.
 
Assume that a set of common phrases is given. Only those phrases
may be referenced (replaced) within messages. The problem then is to
discover for each message that "parse" into nonoverlapping phrases which
minimizes the new message's length. Let a general character string (message)
to be transmitted be described by the following context-free grammar:

<message>          ::= <phrase reference> <message> |
                       <character string> <message> |
                       <end marker>
<phrase reference> ::= P <number>
<end marker>       ::= E
<character string> ::= C <number> <string>
<string>           ::= (any string of 1 to 256 printable characters)
<number>           ::= (any integer in the range 0 to 255)

The <number> occurring in a <phrase reference> indicates which of
256 phrases (maximum) is intended for substitution (<number> is the index
into phrase table P); the <number> in <character string> is a count of the
characters in the <string>, minus 1.
 
 Assuming that no string exceeds 256 characters in length, the space 
 
 requirements for each component of the message are: 
 
 phrase reference - 2 units 
 
 end marker - 1 unit 
 
 character string - 2+L, where L is the number of characters 
 
 in the string. 
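
The message format and its space accounting can be illustrated with a short Python sketch. The names `encode_message` and `decode_message`, the token tuples, and the use of one byte per unit are assumptions made for this illustration; the report itself gives only the grammar.

```python
def encode_message(tokens: list[tuple[str, object]]) -> bytes:
    """Serialize ('P', index) and ('C', bytes) tokens, then append the end marker E."""
    out = bytearray()
    for kind, value in tokens:
        if kind == "P":                      # phrase reference: 2 units
            assert 0 <= value <= 255
            out += b"P" + bytes([value])
        else:                                # character string: 2 + L units
            assert 1 <= len(value) <= 256
            out += b"C" + bytes([len(value) - 1]) + value
    return bytes(out + b"E")                 # end marker: 1 unit


def decode_message(data: bytes, table: list[bytes]) -> bytes:
    """Rebuild the original text using phrase table `table`."""
    out = bytearray()
    i = 0
    while data[i:i + 1] != b"E":
        if data[i:i + 1] == b"P":
            out += table[data[i + 1]]
            i += 2
        else:                                # 'C': copy the string as one block
            length = data[i + 1] + 1
            out += data[i + 2:i + 2 + length]
            i += 2 + length
    return bytes(out)


table = [b"ABC", b"XABCY"]
packed = encode_message([("P", 0), ("C", b"Z"), ("P", 1)])
print(len(packed), decode_message(packed, table))
```

Because the decoder reads the count after each C and skips the string as a block, a literal 'P' or 'E' inside a character string is never mistaken for a marker.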
 
 
This encoding scheme is not space optimal; in fact, the storage
requirement for a character string of length L could be reduced to L, but at
the cost of having the decoding mechanism examine every character in the string
looking for a phrase reference (P) or end marker (E). Since this algorithm
is intended for practical application, that appeared to be unacceptably
slow; hence the addition of two extra characters [C <number>], allowing the
direct application of the IBM 360 "Move Characters" instruction to move the
string all at once without examining individual characters.

An efficient algorithm for producing space-optimal parses has been
developed using the following strategy.
 
Consider one message as a simple character string. Number the
character positions from 1 through N. Suppose that one can compute the
function

f(j) = least space necessary to store characters
       j, j+1, ..., N of the given message, for 1 ≤ j ≤ N.

Then f(1) will be the space-optimal parse length for the entire message.

Let P be the set of all phrases. For each p ∈ P let |p| = length
of p. Let ST(j,p) be a predicate which is true when phrase p matches
character positions j, j+1, ..., j+|p|-1 of the given message string.
ST(j,p) is false when p is not a phrase or when p does not match the string
beginning at position j in the message.

To define f(I), let

P(I) = {p | ST(I,p)}

F(I) = min { F(I+|p|) + 2 for each p ∈ P(I), F(I+1) + 1 }   for 1 ≤ I ≤ N.
 
Assume by induction that f(j) = F(j) for I < j ≤ N. Assume that phrase
p ∈ P(I) is used in the parse at I; it will match characters I, I+1, ...,
I+|p|-1, and that storage space can be reduced to two characters
(the phrase marker and the phrase reference number). Then the remainder
of the message, characters I+|p|, I+|p|+1, ..., N, will require
f(I+|p|) characters for storage. But f(I+|p|) = F(I+|p|) by the
induction hypothesis. Now assume that no phrase could be used at I.
Then the one-character string at I can be stored followed by the optimal
parse of characters I+1, I+2, ..., N. Since a one-character string
requires one character of storage, the cost is f(I+1) + 1 = F(I+1) + 1. Now
simply minimize over all alternatives at each I and set f(I) = F(I).
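
The recurrence can be evaluated from the end of the message toward the front. The Python sketch below follows the recurrence as stated: each phrase reference costs 2 and each unmatched character costs 1. Two points are assumptions of this sketch rather than statements from the report: the base case F(N+1) = 0 for the empty remainder, and the omission of the end marker from the total. The function name `parse_costs` is hypothetical.

```python
def parse_costs(message: str, phrases: list[str]) -> list[int]:
    """Return F, indexed 1..N+1, where F[i] is the least cost of storing message[i..N]."""
    n = len(message)
    F = [0] * (n + 2)                        # F[n + 1] = 0 (assumed base case)
    for i in range(n, 0, -1):                # positions are 1-based, as in the text
        best = F[i + 1] + 1                  # store character i by itself
        for p in phrases:
            if message.startswith(p, i - 1): # ST(i, p) holds
                best = min(best, F[i + len(p)] + 2)
        F[i] = best
    return F


F = parse_costs("ABCXABCYABCZXABCY", ["XABCY", "ABC"])
print(F[1])
```

For this message the minimum corresponds to the parse ABC, XABCY, ABC, Z, XABCY: four phrase references at 2 units each plus one literal character. The inner loop tries every phrase at every position, which is the linear search that a hash table is meant to avoid.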
 
Finally, the search for phrases in P is accelerated by using
a "hash table" technique. Since this algorithm will incur a cost of two
characters overhead for each <phrase reference>, only phrases with |p| ≥ 3
are considered for replacement. The characters I, I+1, and I+2 of the
message are then hashed to accelerate the search for phrases beginning
with that three-character segment.
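
One way to realize such a lookup is to index the phrase table by the first three characters of each phrase. The sketch below uses a Python dictionary as the hash table; the names `build_prefix_index` and `phrases_at` are hypothetical, positions are 0-based here, and the report does not specify its hash function.

```python
from collections import defaultdict


def build_prefix_index(phrases: list[str]) -> dict[str, list[int]]:
    """Group phrase numbers by their first three characters; shorter phrases are skipped."""
    index = defaultdict(list)
    for number, phrase in enumerate(phrases):
        if len(phrase) >= 3:
            index[phrase[:3]].append(number)
    return index


def phrases_at(message: str, i: int, phrases: list[str], index) -> list[int]:
    """Return the numbers of all indexed phrases that match the message at 0-based position i."""
    candidates = index.get(message[i:i + 3], [])
    return [k for k in candidates if message.startswith(phrases[k], i)]


phrases = ["ABC", "ABCD", "XY", "XABCY"]
index = build_prefix_index(phrases)
print(phrases_at("ZXABCY", 1, phrases, index))
```

Only the phrases sharing the three-character prefix at position i are compared in full, instead of every phrase in the table.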
 
 
3. SUMMARY

This study has illustrated the necessity of data compression
in a telecommunications environment. Two techniques have been presented
which accomplish data compression by very different methods: duplicate
character compression and common phrase replacement. For the type of
data under consideration, both work quite well.

The success of the duplicate character compression method is due
in large part to the specific type of data being transmitted, which did,
in fact, have many occurrences of contiguous duplicate characters. The
common phrase detection and replacement method is more general and will
apply to a large number of situations, but incurs a much larger overhead.
Thus, the method selected is, predictably, a strong function of the
transmitted message.
 
 
BIBLIOGRAPHIC DATA SHEET

1. Report No.: UIUCDCS-R-74-659

3. Recipient's Accession No.:

4. Title and Subtitle: Data Compression for Character Strings

5. Report Date: July, 1974

7. Author(s): Alfred C. Weaver

8. Performing Organization Rept. No.: UIUCDCS-R-74-659

9. Performing Organization Name and Address:
   University of Illinois at Urbana-Champaign
   Department of Computer Science
   Urbana, Illinois 61801

10. Project/Task/Work Unit No.:

11. Contract/Grant No.:

12. Sponsoring Organization Name and Address:
   University of Illinois at Urbana-Champaign
   Department of Computer Science
   Urbana, Illinois 61801

13. Type of Report & Period Covered:

14.

15. Supplementary Notes:

16. Abstracts:
   Two approaches to data compression in a telecommunications environment are
   examined: contiguous duplicate character compression and common phrase
   detection/replacement. Algorithms for each method are presented. Each method
   is shown to be useful for a given class of transmitted messages.

17. Key Words and Document Analysis. 17a. Descriptors:
   Data Compression
   Telecommunications
   Common Phrase Detection

17b. Identifiers/Open-Ended Terms:

17c. COSATI Field/Group:

18. Availability Statement: Release Unlimited

19. Security Class (This Report): UNCLASSIFIED

20. Security Class (This Page): UNCLASSIFIED

21. No. of Pages: 44

22. Price:

FORM NTIS-35 (10-70)    USCOMM-DC 40329-P71
 