LI B HAHY 
 
 OF THE. 
 
 UNIVERSITY 
 
 Of ILLINOIS 
 
 5I0.B4- 
 I46r 
 no.355-3G0 
 
 cop.2* 
 
The person charging this material is re- 
 sponsible for its return on or before the 
 Latest Date stamped below. 
 
 Theft, mutilation, and underlining of books 
 are reasons for disciplinary action and may 
 result in dismissal from the University. 
 
 UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN 
 
 NOV 1 1971 
 
 OCT 1 3 
 
 Re Co 
 
 L161— O-1096 
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/probabilisticlan355elli 
 
, ^,V Report No. 355 
 
 tu.sss' 
 
 C00-11469-01U9 
 
 PROBABILISTIC LANGUAGES AND AUTOMATA 
 
 by 
 Clarence Arthur Ellis, Ph.D. 
 
 October 1969 
 
 (* 
 
 N K # 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 
 
 W LIBRARY OF THE 
 
 i 
 
Report No. 355 
 PROBABILISTIC LANGUAGES AND AUTOMATA* 
 
 by 
 Clarence Arthur Ellis 
 
 October 1969 
 
 Department of Computer Science 
 University of Illinois 
 Urbana, Illinois 6l801 
 
 This report was supported in part by grant U. S. AEC AT( 11-1)1^69 and 
 by the Department of Computer Science, and was submitted as a Doctoral 
 thesis to the Graduate College of the University of Illinois by the 
 Department of Computer Science, October 1969. 
 

 Report No. 355 
 
 PROBABILISTIC LANGUAGES AND AUTOMATA* 
 
 by 
 Clarence Arthur Ellis 
 
 October 1969 
 
 Department of Computer Science 
 University of Illinois 
 Urbana, Illinois 6l801 
 
 * This report was supported in part by grant U. S. AEC AT(ll-l)lU69 and 
 by the Department of Computer Science, and was submitted as a Doctoral 
 thesis to the Graduate College of the University of Illinois by the 
 Department of Computer Science, October 1969. 
 
ACKNOWLEDGMENT 
 
 Profuse thanks are due to Professor D. E. Muller for his 
 advice and encouragement during the preparation of this thesis. The 
 author is also indebted to the Department of Computer Science, University 
 of Illinois, for its support and to Miss Barbara Hurdle for her typing 
 of this thesis. 
 
 Finally, the author extends his appreciation to his wife, 
 Anna, whose patience and understanding assistance were of great value. 
 
PROBABILISTIC LANGUAGES AND AUTOMATA 
 
 Clarence Arthur Ellis, Ph.D. 
 
 Department of Computer Science 
 
 University of Illinois, 1969 
 
 The concept of a probabilistic language is defined and 
 investigated. The motivation for the definition stems from the hope 
 of using this tool to investigate programming languages and their 
 translators. A probabilistic language over a vocabulary T is defined 
 as a class C of words formed from T together with a probability measure 
 on C. The classes T* of finite strings, T^ of infinite strings, T of 
 
 CO 
 
 finite trees, and T of infinite trees are considered. Context Free 
 Probabilistic Languages are characterized in terms of (l) Probabilistic 
 Grammars, (2) Probabilistic Tree Automata. 
 
TABLE OF CONTENTS 
 
 1. INTRODUCTION 1 
 
 2. BASIC DEFINITIONS AND NOTATION 2 
 
 3. PROBABILISTIC GRAMMARS AND LANGUAGES k 
 
 h. PROBABILISTIC AUTOMATA 9 
 
 5. CONTEXT FREE GRAMMARS l6 
 
 6. PROBABILISTIC TREE AUTOMATA 38 
 
 T. SUMMARY AND CONCLUSIONS 59 
 
 LIST OF REFERENCES 6l 
 
 APPENDIX 
 
 A. APPROXIMATION OF PROBABILISTIC TURING AUTOMATA 
 
 BY PROBABILISTIC PUSHDOWN AUTOMATA 63 
 
 B. EXAMPLES OF REGULAR TREE EXPRESSIONS 72 
 
 VITA fk 
 
1 
 
 1 . INTRODUCTION 
 
 In recent years, much work has been done on extensions of the 
 theory of finite automata to obtain models of acceptors and translators 
 of programming languages . Examples are pushdown store automata , 
 stack automata , minimax automata , and balloon automata . There 
 are many others. The purpose of this thesis is not to simply introduce 
 another type of automaton, but to describe a general concept which can be 
 adapted to any of the automata present in the literature. 
 
 It is quite natural to assign probabilities (or frequencies) 
 to the strings of a language to try to get some quantitative measure of 
 "efficiency" of grammars and translators. The model obtained by doing 
 
 this is called a probabilistic language, which may be considered a 
 
 [27] 
 fuzzy set , containing all valid sentences of the language together 
 
 with a grade-of-membership function for these sentences . Acceptors and 
 
 generators for these probabilistic languages are defined as Probabilistic 
 
 Automata and Probabilistic Grammars, respectively. Specifically, Context 
 
 Free Probabilistic Languages are explored in depth in this thesis. 
 
 This investigation does not consider how one would find the 
 
 "best" grammar or automaton for a language, or how to improve a given 
 
 grammar. Indeed, the meaning of "best" is open to many interpretations. 
 
 The related idea of finding good approximation grammars for languages is 
 
 also unexplored. It is hoped that the tools developed here will lead to 
 
 quantitative analysis in these and other areas . 
 
2 
 
 2. BASIC DEFINITIONS AND NOTATION 
 This section presents notation and concepts which have been 
 previously defined in the literature and are heavily used in this paper. 
 Then these definitions are altered to form probabilistic analogues. A 
 language over a set T of terminal symbols is a subset of the set T* 
 of all strings over T. A phrase structure grammar over a set T is a 
 system (N, P, S) in which N is a finite set of nonterminal symbols, P 
 is a set of rules (called productions) of the form (¥-»-£) where Z, is 
 any string of symbols of T (J N (denoted z, e(T (J N)*) and *f is any non- 
 empty string of symbols of T U N, (denoted f e(T UN)). * is called the 
 generatrix of the production, and z, is called the replacement string. 
 S e N is the initial nonterminal. 
 
 Notation: Hereafter, when discussing languages and grammars, 
 A, B, and C will always denote elements of N, while X, Y, and Z denote 
 strings over N. Similarly, a, b, c e T; x, y, z e T*; a , 6 » y e TUN; 
 
 Y, x> S e(TUN)*. If x = a a. ..a then the length of x is i(x) = n. 
 
 A 
 
 The null string is denoted by \(z{\) = 0) and the empty set is <f> . L 
 
 A A 
 
 always denotes a language, G is a grammar, and A is an automaton. I 
 denotes the set of positive integers, II the rationals, and R the reals. 
 Let G = (N, P, S) be a grammar. If x B 1 V ^ 2 and (¥ -> Z,)e P, then we 
 
 write x ■*■ *-, £ V?' If ^ strings r, £ . ..£ such that 5. -*■ z, . , then 
 
 we write z,^ => z, and we say there is a derivation of Z, from t with 
 On n 
 
 A A a _ 
 
 respect to G. The language L generated by a grammar G is 
 
 A A 
 
 L(G) = {x S => x, x e T*} . If L is generated by a grammar with all 
 
A 
 
 productions (Y ■+ c)» V £ N, then L is a context free language. If 
 
 further, £ is of the form aB for some a e T, Be N U {A} in all 
 
 a A 
 
 productions of G, then L is a regular language. 
 
k 
 
 3. PROBABILISTIC GRAMMARS AND LANGUAGES 
 Definition: A Probabilistic Language (P language) over T is a system 
 
 A 
 
 L = (L, u) where L is a class of words formed from T and 
 y is a measure on the set L. If y is a probability 
 
 -A 
 
 measure, then L is a Normalized Probabilistic Language 
 
 (NP language) . 
 
 Definition: A Probabilistic Grammar (P grammar) over T is a system 
 
 G = (N, P, a) where N is the finite set of nonterminals, 
 
 A,, A^ , . . . A , A is an n-dimensional vector, (6_...6 ) with 
 1' 2' n In 
 
 5. being the probability that A. is chosen as the initial 
 
 nonterminal, and P is a finite set of probabilistic 
 
 p i 1 + + 
 
 productions, ¥. — ±*»s . , with V.e(N U Tj , z, . e(N U T) , and 
 
 p. e R (p. J 0). If A is stochastic, if < p <_ 1, and if 
 
 -'-J ■*■ J 
 
 E p. . = 1 for every generatrix ¥. contained in productions 
 
 A 
 
 of P, then G is a Normalized Probabilistic Grammar 
 (NP grammar) . 
 If all productions of G are of the form A-2* aB or A-*»- a, 
 
 A 
 
 AeN, BeN, a e T, then G is called a left linear P grammar . The 
 probability of a derivation of C from t is defined as 
 
 k k. 
 pr(c * £ ) = £ J_[ P. where k is the number of derivations of C 
 
 P ii 
 from £_, k. is the number of derivation steps, C. — "■ &. . used in. 
 
 1 !» J~J- - 1 - » J 
 
 the i-th derivation, and p. is the probability associated with the 
 j-th step of the i-th derivation. The derived probability of a terminal 
 
+ A 
 
 string x e T with respect to a left linear grammar G is y(x) = 
 
 n 
 
 I (6. pr(A. =>x)) where N = {A n , A., . . .A } , A = U 6 ...6 ). 
 . , v i i 1 2' n ' 12 n 
 
 The P language generated by G is L = (T , y) where y(x) = the derived 
 
 r o I 
 probability of x. An admissible P grammar (see Greibach ) is a 
 
 grammar in which there exists a derivation of some x e T from each 
 
 A e N. A generalized admissible P grammar is one in which there exists 
 
 a production with A in the generatrix for each A e N. 
 
 Theorem 1: 
 
 Proof: 
 
 Every normalized left linear admissible P grammar G 
 
 generates a normalized P language. 
 
 Define an (n+l) x (n+l) matrix U = [u ] as follows: 
 
 u. . = 
 
 u. 
 
 Ij 
 
 I pr(A. ^aA.),i£n,j<_n 
 
 a e T 1 J 
 
 (A.-^aA )eP 
 J 
 
 E pr(A +b), i<n, J s n + l 
 
 b e T 
 (A.—»-b)eP 
 
 u. . = 0. i = n + l,j<n 
 
 Li ' — 
 
 u 
 
 "n+l* n+l 
 
 = 1 
 
 ii 
 
 I 'n+l 
 
 is by definition the total probability of a derivation from A. of 
 
 a terminal string of length 1. Considering powers of the matrix U, 
 
 u. , gives the total probability of derivation from A. of a string of 
 
 length <_ k. If U is pre -multiplied by the row vector A augmented by 
 
 zero, A' = (6. 5_...6 ,0)then the (n+l)-st element in the resulting vector 
 1 d n 
 
 represents the sum of the derived probabilities of all x e T > i(x) ^ k. 
 
6 
 
 Finally, £ y(x) = lim T. y(x) = lim (A'»u) . Since 
 
 5 x e T 
 Jl(x) < k 
 
 x e T" k -► °° x e T" k -*- °° 
 
 G is normalized, U is a stochastic matrix; and since G is admissible, 
 
 — k 
 
 j k e I > u. + - | > for i = 1, 2...n+l. Thus, using the theory of 
 
 *x 
 
 [6" X k k 
 
 Markov Chains , U =| t J where each row vector t. approaches a 
 
 t k 
 n+1 
 
 steady state vector t as k approaches infinity, t = (0 O...Ol)V k e I 
 
 implies t = (0 0...0 l) and lim (A'-IT^) = A' lim (Jp) = A'-jtr). 
 
 ]^ -»• oo k ~>" °° 
 
 /t\ n+1 
 
 the (n+l)-st element is (A'jt ) = Z (6.) = 1. QED. 
 
 w 1=1 1 
 
 A P language which is generated by a left linear P grammar is called a 
 regular P language. 
 
 Theorem 2: There exists a regular language L with a probability 
 y(x) assigned to each x e L such that no left linear 
 P grammar generates (L, u). 
 
 Proof: The proof will consist simply of exhibiting such a 
 language . 
 
 (1) Let T = {a}; then T is the set of strings {a |n e I}. 
 
 (2) Assign probabilities to these strings y(a ) = , n > 0, 
 
 'T 
 
 n 
 
 whe 
 
 2i 
 
 re t = h t t. = smallest prime 3 t. > max (t , 2 ) for i > 1. 
 
(3) Assign y(a) = 1 - E — t-" - • This guarantees that £ y(a ) = 1. 
 
 i=l / n n=l 
 
 Next we show that no left linear P grammar generates the language 
 
 (T + , y). 
 
 (1) Suppose the grammar G = (N, P, A) is alleged to generate (T , y). 
 Then all y(a ) are in the field of numbers generated by the 
 rationals with field extensions p. where p. is the probability- 
 associated with the i-th production of P if < i <_ |p| , and p. 
 is the probability 6. in the vector A if i = |p| + j. This field 
 is denoted "fl (p .. .p) , where k = |p| + |n|. 
 
 1 K. 
 
 (2) If all p. are in the field II or are algebraic extensions of it, 
 
 then the total extension is of finite degree. Consider the 
 extension ( — e=—. — z=— » . . • ) • This may be written as a union of 
 
 fields each of which is a finite extension of degree 2 of the 
 
 previous field. Thus, U H ( — ■==- , — /==->. ••> — T == ~^ is a fi e l d 
 
 n=l Al rz /n 
 
 whose degree must be infinite. Thus all of these irrationals cannot 
 
 be within the finite degree algebraic field extension 1(p . ..p ). 
 
 X K 
 A 
 
 Since all derived probabilities under the grammar G of finite strings 
 are expressible as finite sums of products, these derived probabilities 
 must be within "" (p. . . ,p ) . Thus (T , y) cannot be derived using G. 
 
 (3) If some of the p. are transcendental extensions, then 1f(p n ...p, ) 
 
 l 1 k 
 
 can be obtained by a pure transcendental extension ^(p., • . .p^) = Q 
 
followed by an algebraic extension of finite degree Q(p ...p ). 
 In this case, — 4 ""(p^ +1 - • -P k ) implies .— 4 Q(p £+1 . . .p fc ) 
 
 2 1 
 
 by the following argument. Let the polynomial f(x) = x - — 
 
 be irreducible over Tf-. = 1f(p_ ...p.) but reducible over 
 
 l ^1 l 
 
 ^-.i = MPijJi where p.,, is transcendental (i < i) . Then 
 l+l i ^l+l *i+l — 
 
 f(x) = (x - a)(x + a), a e IF., (p. ). a is expressible as 
 
 g(P i+1 ) K g 2 1 
 
 —, r where ** is in reduced form and not in 1T (*■) = — e R. 
 
 h(p i+1 ) h i- h t 
 
 2 1, 2 
 g - ^h =0. But this equation implies that p is algebraic 
 
 over 1i. which is a contradiction => <=. Thus if f(x) is 
 
 irreducible over "L , then it is irreducible over A. , , This 
 
 i l+l 
 
 can be applied not 1 but H times to yield 
 r 4 *" s» ; " t V(p , . .p. ) . Using the previous part (2) of 
 
 this proof for the algebraic elements p„ .....p, . we get 
 
 ft+1 k 
 
 3- x 3 -7^- 4 «n(p 1 ...p k ). QED. 
 
k. PROBABILISTIC AUTOMATA 
 
 The idea of the probabilistic finite automaton was originally 
 
 [17] 
 conceived by Rabin. Basically, if an automaton is in some state q, 
 
 and receives an input a, then it can move into any state, and the 
 
 probability of moving into state q f is p(q, a, q'). Rabin requires that 
 
 E p(q, a, q' ) = 1 (called type 1 normalization in this paper) for 
 
 all q in the set of states Q, and for all a e T. Practical motivation 
 
 for this requirement is that these automata can model sequential circuits 
 
 which are intended to be deterministic, but which exhibit stochastic 
 
 behavior because of random malfunctioning of components. Thus p(q, a, q' ) 
 
 is interpreted as the conditional probability of q' given q and a, 
 
 pr(q'|q,a), so by the theorem of total probability, Z pr(q'|q,a) = 1. 
 
 q' e Q 
 
 Other interpretations may give rise to other normalizations. For example, 
 
 in performing the state identification experiment with a probabilistic 
 
 automaton, one might interpret p(q, a, q' ) as pr(q, q'|a). This implies 
 
 a normalization by summing over all possible q, q' values. 
 
 E p(q, a, q' ) = 1. In fact, eight different types of 
 q £ Q q' e Q 
 
 probabilistic automata can be defined by the various interpretations 
 listed in the following table. 
 
10 
 
 Normalizations for Probabilistic Finite Automata 
 
 a, p 
 
 © 
 
 TYPE INTERPRETATION NORMALIZATION 
 
 1 pr(q'|q, a) I p(q, a, q' ) = 1 v q e Q, a e T 
 
 q' e Q 
 
 2 pr(q|a, q' ) E p(q, a, q' ) = 1 v q' , v a 
 
 q e Q 
 
 3 ~pr(a|q, q» ) I p(q, a, q' ) = 1 y q, V q' 
 
 a e T 
 
 k pr(q', a|q) I E p(q, a, q' ) = 1 y V. 
 
 a e T q' e Q 
 
 5 pr(q, a|q') E E p(q, a, q' ) = 1 v q» 
 
 a e T q e Q 
 
 6 pr(q, q'|a) I T, p(q, a, q' ) = 1 y a 
 
 q £ Q q' e Q 
 
 T pr(q, a, q'|) ZEE p(q, a, q* ) = 1 
 
 q a q" 
 
 pr(|q, a, q*) p(q, a, q' ) = 1 v q, a, q' 
 
 One of the important theorems concerning finite automata, which 
 was first proved by Kleene in 1956 states that for every left linear 
 grammar, there exists an automaton which accepts all and only the strings 
 generated by the left linear grammar and conversely, there is a left 
 linear grammar which generates all and only the strings accepted by any 
 
 finite automaton. Surprisingly, an identical theorem was proved by 
 
 [3] 
 Chomsky and Schutzenberger in 1963 concerning context free languages 
 
 and pushdown store automata. The analogous problems for probabilistic 
 
 automata are attacked in this paper. If the symbols a e T are interpreted 
 
 as outputs instead of inputs, then the automaton becomes a generator similar 
 
 to a grammar. In this case, type k normalization must be chosen so that an 
 
 NP grammar will correspond to an NP automaton . 
 
11 
 
 Definition: A Probabilistic Automaton (P automaton) over T is a 
 
 a 
 
 system A = (Q, M, S, H) where Q is a finite set of states, 
 
 S is a finite set of storage tape symbols, - is an initial 
 state vector and M is a function, called a probabilistic 
 transition function, which has associated with it a second 
 function p. The specific nature of these functions determines 
 the type of P automaton defined. If 5 is a stochastic vector 
 
 A A 
 
 and if A is constrained to some normalization type, then A is 
 a Normalized Probabilistic Automaton (NP automaton). Cases in 
 
 A 
 
 which S = <j> will be simplified to A = (Q, M, E). Particular 
 classes of automata are obtained by attaching constraints to 
 the general definition. The following table lists some of the 
 automata definable, and their range (i?(M)) and domain (£>(M)) 
 constraints on the mapping M, and their normalization constraints 
 
 Types of Automata 
 
 1. Deterministic Finite Automaton 
 
 Norm Constraints: Type 1 
 
 RD Constraints: D(M) = Q x T, i?(M) C Q 
 
 2. Nondeterministic Finite Automaton 
 
 Norm: Type 
 
 RD: D(u) = Q x T, 2?(M) C P(q) 
 
 3. Probabilistic Rabin Automaton 
 
 Norm: Type 1 
 
 RD: £(M) = Q x T, i?(M) C P(Q) 
 
 k. Probabilistic Ellis Automaton ■ 
 Norm: Type k 
 
 RD: D(U) = Q x T, R(U) C?(q) \j {X} 
 
12 
 
 5. Probabilistic Tree Automaton 
 
 Norm: TyP e ^ 
 
 RD: Z?(M) = Q x T, i?(M) CP(q«) 
 
 6. Probabilistic Pushdown Store Automaton 
 
 Norm: Type k 
 
 RD: Z?(M) = Q x S x T, 2?(M) CP(Q x S) U {Xj 
 
 Note': P(E) for any set E denotes the power set of E. 
 
 In the case of a type k normalized finite P automaton, 
 
 M(q, a) = A is used to designate termination. Any 
 
 q' e Q » M(q' , a) = X is called a terminable state. If 
 
 for all q e Q, there exists a terminable state q' accessible 
 
 from q, (i.e., } a sequence of states 
 
 q = V q l J " q m 9 q i+l £ M(q i' a i } ' and X £ M(q m' ^ 
 
 for some sequence of inputs a^ a n a_...a e T ) then 
 
 12m 
 
 A 
 
 A is a terminatable P automaton . A transition is a change from 
 some state q. e Q under an input a e T to some state q. e Q 
 such that q. e M(q.,a), and will be written (q. ,a) -> q.,-,- 
 X e M(q. ,a) will be written (q. ,a) -> halt. Associated with each 
 transition is a probability; the product of these transition 
 probabilities is the probability p of the sequence q q . ..q . 
 
 A mapping M(q, a) = cj> has probability zero associated with 
 it, and designates that a transition out of the state q 
 under input a is disallowed. The probability of acceptance 
 
 m 
 
 of a string x = a, ...a is Z ^(q^) p. p . where m is the 
 In . , i ni 
 
 i=l 
 
 number of sequences q q,...q such that q. e M(q* , a.), 
 
 .1 = 1, 2...n-l, and X e M(q , a ), C(q) is a function whose 
 
13 
 
 value is the probability of starting in state q, and 
 
 p . is the probability of the terminating transition from 
 
 the last state in the i-th sequence. The P language 
 
 A a . + 
 accepted by a P automaton A is L= (T , p) where 
 
 y(x) = the probability of acceptance of x. 
 
 Theorem 3: 
 
 Proof: 
 
 Every finite P automaton accepts a P language which is 
 
 generated by some left linear P grammar and conversely, 
 
 every left linear P grammar generates a P language which 
 
 is accepted by some finite P automaton. 
 
 (a) Consider any left linear P grammar G = (N, P, A) over 
 
 T. The equivalent automaton is constructed as follows: 
 
 A = (Q, M, 5) where Q = N, H = A, and for each (A. ■*■ aA )e P 
 
 J- J 
 
 we define q. e M(q. , a) where q. = A. , q 1 = A - For each 
 
 (A. ■*■ b)e P, we define X e M(q. , b). The probability of 
 
 each of these transitions is defined as the probability 
 
 associated with the corresponding production, 
 
 P i 
 A. *-£.. All other transitions are of the form M(q, a) = <j> 
 
 and have probability zero. For each derivation of x with 
 
 A 
 
 respect to G, there is a set of transitions which accepts 
 
 A 
 
 x using A, and by construction, probabilities are the same. 
 
 Also, each 6. e A is equivalent to £(q.), so the derived 
 
 n m 
 
 probability of x = E 6. pr(A. •> x) = E £(q. )p. p . = 
 
 i=l i=l 
 
Ik 
 
 the probability of acceptance of x. Thus, the P language 
 
 A A 
 
 generated by G and the P language accepted by A are the same, 
 (b) The construction of a P grammar from an automaton is as 
 follows: If A = (Q, M, S), then construct G = (N, P, A) 
 where N = Q, A = 5, and for each q. e M(q. , a) add a 
 
 P i1 
 production A. ** aA to P; for each A e M(q. , a) add a 
 
 P i 
 production A. » a, where p. . and p. are respectively the 
 
 probabilities associated with the corresponding transitions 
 (q. , a) ■* q. and (q. , a) ■* halt. By the argument used in 
 
 _L J X 
 
 part (a) of this proof, the P languages generated and accepted 
 must be the same. 
 Corollary 3.1: 
 
 A 
 
 Every finite normalized P automaton A accepts a P language 
 which is generated by some left linear normalized P grammar 
 
 A AAA 
 
 G and conversely, V normalized G, J normalized A » L(G) = 
 
 A A A A 
 
 L(A) where L(G) means the P language generated by G and L(A) 
 
 A 
 
 means the P language accepted by A. 
 Corollary 3.2: 
 
 A 
 
 Every finite terminating P automaton A accepts a P language 
 which is generated by some left linear admissible P grammar 
 
 A AAA 
 
 G and conversely, V admissible G, 3 terminating A 3- L(G) = 
 L(A). 
 Proof: These corollaries follow immediately from the construction 
 in the proof of Theorem 3. 
 
a, 1/3 
 
 15 
 
 b, 1/3 
 
 Example 1. 
 State Diagram of NP Automaton 
 
 Corresponding NP Grammar 
 A^aA 
 A^aB 
 B^bB 
 
 2/3 
 B lii.bC 
 
 cUl cC 
 
 SZi, 
 
 A = (5 A , 6 B , 6 C .) = (1, 0, 0) 
 
16 
 
 5. CONTEXT FREE GRAMMARS 
 
 A 
 
 If all productions of a P grammar G are of the form 
 (A-^-s), A e N, p £ R, ce(N U T) , then G is called a context free 
 P grammar . The definitions of derivation and derived probability are 
 the same as for left linear grammars except that the replacement string 
 may now consist of more than one nonterminal, so all derivations must 
 
 be performed by operating upon the left-most nonterminal at each step to 
 
 .a. 
 avoid undesirable ambiguities. The definition of the P language L(G) 
 
 A A A 
 
 generated by G is unchanged; if G is a context free P grammar, then L(G) 
 is called a context free P language . 
 
 Theorem h: Every admissible context free NP grammar can be transformed 
 into an equivalent NP grammar in Chomsky Normal Form, which 
 means all productions are of the form A-»-BC or A-*-b, 
 (A, B, C e N, b e T). Equivalent P grammars are ones which 
 generate the same P language. 
 
 Proof: The proof is a constructive one. 
 
 A 
 
 (a) Given an admissible context free NP grammar G = (N, P, A), 
 
 Pit 
 we first eliminate ail production B. — — **B. by constructing 
 
 a matrix u whose rows and columns are labeled by nonterminal 
 
 symbols B. . As the element in the B. row and B column, we 
 1 1 J ' 
 
 P ij 
 take p. . if B. ^B . is a production in G. Otherwise, the 
 
17 
 element is zero. Construct a matrix V whose rows are 
 labeled by nonterminals and whose columns are labeled by 
 the strings C. 4 N which appear as replacement strings in 
 
 J 
 
 productions in P. The element in row B. and column £ is 
 
 -L J 
 
 p! 
 
 ii A 
 
 p' if there is any production B. *-C, in G. Otherwise, 
 
 id i J 
 
 the element is zero. u V is a matrix with q . , in the row 
 
 labeled B. and column labeled z, , where q. . is the 
 
 probability of a derivation B. -> B, -*...-*■ B„ -*■ X, , of 
 
 l Is. I J 
 
 length n + 1. Thus 
 
 ( Z u ) v is the total probability matrix for B. ^> £.. To 
 n=0 x J 
 
 show normalization, we must show that 
 
 ( Z u )v is a stochastic matrix. This is true for the 
 n=0 
 
 ,u V N 
 combined square matrix (--), and so it is true for 
 
 n 
 n 
 
 f UV ) n = £L_ 
 { 1i } ^ 
 
 <i=0 ^ 
 
 ), n = 1, 2,.... Since lim u = 0, 
 
 n -*■ °° 
 
 ( Z u )v exists and must be a stochastic matrix. 
 n=0 
 
s 18 
 
 (b) After all eliminations of type a are completed, all 
 productions whose replacement strings are of length n(> 2) can 
 be reduced to productions with replacement strings of length 
 n-1 by the following procedure: replace A-^-a 6 ¥ by A-^-a D, 
 D— M$ V* where D is a nonterminal not in N of the old grammar 
 and a H is a string of length n, so M has length n-1. By 
 repeating this procedure, the maximum length can be reduced 
 to 2. 
 
 (c) Replace all productions A-^.a a where at least one a. 
 
 is in T by A— =i*.B, B^ where B. = a. if a e N, and B. is a new 
 12 l i i l 
 
 nonterminal if a. e T with the production B. — ►a. inserted into 
 i ii 
 
 the grammar. The same strings of terminals are generated by a 
 
 grammar before and after steps a, b, c, and d, and the derived 
 
 probabilities are unchanged. Thus, the new grammar is 
 
 equivalent to the old because exactly the same strings with 
 
 the same probabilities are generated. 
 
 Theorem 5: Every admissible context free NP grammar can be transformed 
 
 into an equivalent NP grammar in Greibach Normal Form-(GNF), 
 
 which means all productions are of the form 
 
 A^bC, C....C (A,C....C e N, b e T, n > 0). 
 12 n ' 1 n — 
 
19 
 
 A 
 
 Algorithm: Given any admissible context freeNP grammar G = (N, P, A), 
 
 eliminate all productions of the form A-E*B by the technique 
 given in the proof of Theorem k. Then define the set of 
 handles of G as H(G) = {ot|a is the first symbol of some 
 
 replacement string of P} . In this case, a is called a 
 handle. M(G) = {a|a is the generatrix of some production 
 with handle in N} . The requirement for an admissible P 
 
 A 
 
 grammar G to be in Greibach Normal Form is that r, must be a 
 string of nonterminals for each production, A-q^-a^ and 
 (l) T _)H(G) or equivalently (2) M(G) = <j>. Note that if any 
 8 within ^ is a terminal, it can be replaced by a new 
 nonterminal C and a production C-*>3 added to P. Thus 
 our goal will be to obtain productions of the form 
 
 A -*■ aR ft ...8 with each 8. e (NU T). The method 
 
 1 2 n ! 
 
 of proof which is analogous to the method used for 
 
 roi 
 nonprobabilistic grammars by Greibach is to employ 
 
 an iterative technique which generates at each step a 
 
 A 
 
 new grammar G = (N , P , A ) which has one less A e N 
 
 in the set of handles. New nonterminals are created, 
 but the new productions added are such that no new 
 symbols ever appear as a handle. Eventually, all handles 
 become members of T. The construction and proof which 
 follow are illustrated by an example. 
 
20 
 
 T = (a, b), N = {A, C} 
 P = {A H. a , A^Lca 
 
 A = (1, 0) C-HlibA, C i^iACC} 
 
 Example 2. 
 
 For any A e N M(G), the following procedure eliminates A from M(G) 
 and from H(G) : 
 
 (a) First construct a finite directed graph. One node s is labelled 
 by A = A . For each node s. labelled A. , and each production 
 
 A. — isi-A. 3 n B . . .6 where A. , A. e N, 3-. , 3 . . .3 e N U T, (n > 0) 
 l j 1 2 n ij 12 n ^'_ 
 
 create a node s. labelled A. (if one does not already exist) and 
 J J 
 
 create an arc from s. to s labelled 3, 3«...3 • For each 
 
 x j x 2 n 
 
 production A. — i*.c 3-, 3^.. .3 where c z T, create a new node 
 l 1 2 n 
 
 labelled t e I and create an arc from s . to t labelled 
 
 l 
 
 c 3„ 3^...3 . At times, nodes will be denoted by their 
 1 2 n 
 
 labels since no two nodes have the same label. Numbered nodes 
 
 are terminal and are not connected to any other node. Each arc 
 
 is also labelled with the probability p of the corresponding 
 
 production. Repeat this process until all nodes accessible from 
 
 A and their arcs have been created. P is a finite set so this 
 
 process will terminate after a finite number of steps . 
 
21 
 
 Example 2: Taking A = A we construct the following graph. 
 
 a, 1/2 a, 1/2 bA, 2/3 
 
 (b) From this graph, a set of productions is obtainable. Put all 
 
 productions of P into P except those with generatrix A. Put the 
 
 productions of the form A—Vz; and ^B.-^W, which are created below into P ft . 
 
 For each terminal node t, and for each simple path from 
 
 s. to t, call it s. s_ s....s , t = s , one can write a production 
 1 2 n n e 
 
 n 
 
 A "i. r r . . . r , q = II p. where £. and p. are respectively 
 CT^Si Si-1 ^1' ^ . 1 y i l *i 
 
 i=l 
 
 the string of symbols and the probability associated with the arc from 
 
 s. to s., i = 1, 2,..., n. In general, there may be several arcs from 
 
 some s. to s. so s s ...s actually specifies a set of productions, 
 
 one for each possible sequence of arcs connecting the sequence of nodes. 
 Note: For any finite directed graph, one can always find all simple 
 paths from any node A. to any node A. because one need only consider 
 
 all sequences s^ s_...s with s^ = A. such that n is no greater than 
 1 n i 
 
 the number of nodes in the graph. 
 
 Consider a particular simple path from s to some terminal node 
 
 t. Any nodes s. in this path s^ s, s^. . .s , (s = t), such that there is 
 i 012 nn 
 
 a path from s. to s. not containing any s. with j < i is said to fulfill 
 
22 
 
 the loop condition with respect to the given path. Define one new 
 
 nonterminal B. for each such s.. For each possible combination 
 1 i * 
 
 s., s.,..., s of one or more nodes fulfilling the loop condition, 
 1 j k 
 
 and for each sequence of arcs connecting s , s ,..., s , a production 
 
 r 
 must be written of the form A. — ^-C l, n . . . r B, c, . . . c B C . . . 
 
 n n-1 ^k+1 k k j+1 j j 
 
 ^I+I^a C. ••• ^p^i wnere ^n i s "the label of the arc connecting s to 
 
 s n (Z = 1, 2...., n), B^ is allowed, but not B , and r will be defined 
 a n 
 
 later, 
 (c) For each s. fulfilling the loop condition, productions must be created 
 
 to describe all paths from s. to s.. Erase all nodes s^ of s^ s.,...s 
 ^ 11 j 1 n 
 
 such that j < i and all arcs connected to these s.. Then split s. 
 
 i 
 
 1 2 1 
 
 into two nodes, s. and s.. s. has those arcs of s. leading out, and 
 ill l 
 
 2 
 s. has those arcs leading into s. with associated strings and 
 
 probabilities unchanged. Productions with generatrix B ± must be 
 
 1 2 
 
 constructed and put into P_ for each simple path from s. to s. using 
 
 the technique of parts (b) and (c) of this proof with s. = s and 
 
 2 
 s. = s . These productions, which are called B. - loop generating 
 
 productions, are constructed to allow looping,, so at the 
 
23 
 
 right-end of the right-hand side of each, B must occur to allow 
 
 the loop to repeat. Finally, for each of these productions 
 B . » C B. where p is derived from part (b), another production, 
 
 called a B. - loop terminating production, must he added to P n 
 
 1 
 to allow the loop to terminate, B.-£»- C 
 
 1 X " r i 
 
 where p = p( ). r. is defined below. This procedure is 
 
 r. i 
 
 l 
 
 recursive because steps (b) and (c) may need to be repeated many 
 times if the path contains many loops within loops. Furthermore, 
 the whole process is repeated for each simple path from s to a 
 terminal node. The probability of a production (A — *.-£)e P n is 
 
 defined recursively: 
 (l)- If £ = C C _-|***^l» " ttien r = 0. which was previously defined. 
 
 (2) If L, contains new nonterminals B. B....B, , then £ = Z, C ..... 
 
 i j k n n-1 
 
 ? k + i \ \ ^k-r-'Vi B j V- - ? i+ i B i— c i« ma r = 
 
 r. r r, 
 
 q( )( " — ) ( ) where r is the sum of probabilities 
 
 1-r. 1-r. 1 - *\ 
 i J k 
 
 over all B - loop generating productions of P of 
 pr(B^ -> c B^), I = i, j,..., k. 
 
2k 
 
 A 
 
 G_ for example 2: 
 
 A^liLbAa, A ililLbAaB, 
 
 , 1/2 . 1/10 ,, 
 
 A — L -»-a, A— - — »-aB, 
 
 1/6 
 B — £-*.CCaB (B-loop generating production) 
 
 B »»CCa (B-loop terminating production) 
 
 2/3 1/3 
 
 The recursion terminates because simple loops from B. to B. are 
 
 sequences of length at most |u| . A simple loop is a path s s ...s 
 
 such that s_ = s , and s n s_...s is a simple path. At the second 
 On 1 2 n x x 
 
 level of recursion, we consider simple loops emanating from some 
 
 node within sequences from B. to B. . These may be expressed as simple 
 
 (2) (2) 
 
 loops from some B. to B. . Notice that these paths do not have B. 
 11 l 
 
 in them, so they are of length <_ |n| - 1. Similarly at the (m+l)-st 
 
 level of recursion, simple paths are of length _<_ | W | - m because 
 
 nodes B. , B. . . . , B. cannot occur. This is shown by the following 
 
 ill 
 
 argument. If s. f s_ then the algorithm erases the initial node, and the 
 final node, by construction, has no arcs going out of it, so it cannot 
 contribute to any loops ; if s. = s then it was constructed by splitting 
 some node, and so it has no incoming arcs and therefore no loops. At 
 recursion level |n| , simple paths are of maximum length 1. There can be 
 no further loops, so part (l) of this recursive definition applies. 
 
 Lemma 5.1: The P grammar G is normalized. 
 
Proof: 
 
 Case 1: Suppose C e N, C 4 A . Then the productions C-E*-5 of N Q 
 
 A 
 
 are exactly those of N. So "by the normalization of G, 
 the sum of their probabilities must equal one. 
 Case 2: Suppose C e N, C = A . If there are no loops, then a 
 
 proof by induction on the maximum number of nodes in a 
 simple path from A to terminal nodes can be given. 
 
 (a) Let n = 1 be the number of nodes minus one (i.e., 
 maximum path has two .nodes, s_ and s ), then the following 
 
 diagram shows the situation. 
 
 A A *± 
 
 In this case, G and G. have the same productions, A^ *-r . , 
 
 *i 
 
 (b) Let n > 1 be the maximum number of nodes minus one in 
 
 A 
 
 the graph for G, and let the lemma hold for all NP grammars 
 with graphs having no more than n nodes in every simple path 
 and having no loops. There is a one-to-one correspondence 
 
 .A 
 
 between productions A^ -> £ t, . . . . C-, of the P grammar G_ 
 
 On n-1 1 U 
 
 and derivations A. =*> r, c, ,...£. with respect to *G such that, 
 U n n-l 1 
 
 by construction, the probabilities are the same. Thus the sum 
 
26 
 
 A 
 
 of probabilities of productions for G_ with generatrix A 
 
 |N|-1 
 is Z pr(A =»a^) a z p ( £ pr(A .*>&$)) + 
 
 a e T ■ J=l e 
 
 C e(N U T)* 5 e (N \J T) 
 
 * 
 
 pr(A -»■ a) where all derivations are with respect 
 a e T 
 5 e(N t T)* 
 
 to G, and p • = Z P r ( A n "*" A -?)> Since there are no 
 
 J C e(N SJ T)* ° J 
 
 loops, each A . =*• aX, corresponds to a path with no more than 
 J 
 
 n nodes, so by the induction hypothesis, Z pr(A => az, ) = 
 
 a e T y 
 
 C e(N SJ T)* 
 
 for j = 1, 2,..., |n|-1. Then by the normalization criterion, 
 
 the remaining sum equals one: 
 
 |n|-i 
 
 Z pr(A =s>ac) = I Z pr(A -> A c ) + 
 
 a e T j=UE(Ny T)* J 
 
 C e(N U T)* 
 
 Z pr(A ■> a?) = 1. 
 
 a e T 
 C e(N (J T)* 
 
 (c) For the most general case, we use induction on h, the 
 
 maximum number of nodes fulfilling the loop condition. The 
 
 lemma has been proven by (a) and (b) for h = 0. Next assume 
 
 the lemma is valid for all values 1, 2,..., h-1 with h > 0. 
 
 By omitting a node s. fulfilling the loop condition in the path 
 
 with the maximal number of nodes fulfilling the loop condition, 
 and by adjusting the probabilities of the remaining sequences 
 
 A 
 
 of arcs out of s . , the graph now represents a new NP grammar G' 
 
21 
 
 \onder the algorithm of Theorem 5 
 
 Graph for G 1 
 
 a^ 
 
 V is the probability of the j-th path from s to s.. u . 
 
 is the probability of the j-th path from s. to a terminal, 
 
 m 
 
 (t in the picture represents all terminal nodes. In G 1 , Z u. = 1, 
 
 j=l 
 
 m 
 
 A — A 
 
 and in G, Z u. + r = 1 by the normalization of G, and by the 
 0=1 J 
 
 construction of G' taking u 
 
 -_A 
 
 'j 1 - r 
 
 By the induction 
 
 hypothesis, the productions of G' with A as generatrix sum 
 
 to one. These are productions of the following form 
 
 u.V. 
 A — J » 'l f .<[), for paths passing through node s. and their 
 
 J k 1 
 
 m £ I 
 
 probabilities sum to Z Z u V = Z V . The analogous 
 
 J=l k«l J k k=l k 
 
 A 
 
 productions of G are of the form 
 
 UV uV(r~) 
 
 A — a-i-Y 4 <j>. and A^ ^ — ^L+y B 41 a where r is, by definition, 
 o j Y k -j jk r k 
 
 m 
 
 1 - E u.. Summing over these two types of productions yields 
 j=l ° 
 
28 
 
 £ (u V + u V (--£ )\ = I V C I u + rl = E V, , which 
 
 j=l k=l ^ k J k * k=l k j-l J w. k ' 
 
 m £ * m & 
 
 r 
 
 m 
 
 2 u. 
 
 A 
 
 is identical to the sum for G' , so the total probability over 
 
 A 
 
 all productions of G with generatrix A sum to one. This 
 
 technique can be applied to other paths not passing through 
 node s. so that the above proof applies if several paths have 
 
 the maximum number of nodes fulfilling the loop condition. 
 Case 3: Suppose C = B. e N - N. By definition, the B. - loop 
 
 generating productions sum to r. , and by construction the 
 
 1-14 
 
 B. - loop terminating productions sum to r. ( ). The 
 
 i l r . 
 
 l 
 
 two sums together give a total of 1. 
 
 A A 
 
 Lemma 5-2: G and G generate the same P language. 
 
 A A 
 
 Proof: G differs from G only in rules with generatrix A, and new 
 
 rules. The same set of strings is still generated from A 
 and other members of N are not affected. A has zero entries 
 
 for all new nonterminals so that no new generations occur, 
 and the other old nonterminals retain their same value as in 
 A. A is a stochastic vector because A is stochastic. We 
 
29 
 
 need only show that any derivation of a terminal string x 
 
 A A 
 
 has the same probability under G and G . Consider any 
 derivation which corresponds to a path from node A n to a 
 terminal node, i.e., A =* a C . The general case can be 
 
 A A 
 
 inferred from this. We claim that using G or G , the 
 
 probability of this derivation is the same. To use 
 induction on the recursion level of the path (i.e., the 
 number of simple loops within simple loops), first assume 
 path s_ S....S is simple. The probability of this path 
 
 A 
 
 using G is the product of the probabilities of derivation 
 
 P i P n 
 
 steps. If A. _ — ^A. ?. (0 < i < n) , and A , — >- a X, , 
 l-l l l n-1 n 
 
 n 
 
 A 
 
 then the probability is II p.. By the construction of G , 
 
 i=l 
 
 there exists a production A_— ^.a c, c; .,...£, with 
 
 On n-1 1 
 
 n 
 q = II p. , so the probability of the derivation is the same 
 i=l x 
 
 A A 
 
 using G or G . 
 
30 
 
 Consider a derivation with recursion level m, i.e., loops 
 nested up to m times "within other loops. Then, at the 
 
 m-th level of recursion, there are only simple loops 
 
 ( m) 
 emanating from any B. . Suppose in this derivation the 
 
 B. node is traversed k+1 times within a B. ~ loop. 
 i 1 v 
 
 A 
 
 The probability of this using G is 
 
 I m. 
 
 II t J where there may be many simple loops emanating 
 j=l J 
 
 ( m) 
 from B. and I is the number of different simple loops 
 
 traversed, with m as the number of times the j-th loop 
 J 
 
 was traversed, and t . as the probability associated with 
 
 J 
 
 A 
 
 the j-th loop. In G , there is a production 
 
 B (m) — J^ cB ( m ) for the j_ th loop> r^ k _ th B (m) loQp 
 
 is the final one and uses a loop terminating production 
 r. 
 
 fmi t i ( r 1) ri * m i 
 
 B: m; — J 1- % c. Thus the product will be II t . J . 
 
 1— r i 
 
 By the induction hypothesis, the probability is the same 
 
 A A 
 
 using G or G for any path of recursion level m-1, so let 
 q_ be the probability of the path under discussion when all 
 
 B. loops are removed. q_( IT t , J ) is the probability 
 1 J=l J 
 
using G of the path obtained when the k loops of B 
 are added. Under G , the addition of the k loops 
 
 ( m *1 ^ r\ 
 
 means that a production B. —+J, £> , ...C, must be 
 
 i n n-1 1 
 
 1-rn 
 
 31 
 
 (m) 
 
 (*_!) P(i" L ) 
 
 replaced by B^ =».?_•••?, B. C. ...C^. The 
 
 A 
 
 probability using G then becomes 
 
 1-r. r. I m I m 
 
 (q.( ~ )) (t- 1 — n t. J ) = q II t .^ which is identical 
 
 r i X - r i 3-1 J j-l J 
 
 A 
 
 to the probability using G. This result can be used, and 
 
 the above argument repeated to add further B^ loops, and 
 
 J 
 
 A A 
 
 show that the probability using G or G is the same for any 
 path of recursion level m. 
 
 A A 
 
 Several things can be mentioned about G . AX M(G ), i.e., 
 
 ■p A 
 
 all productions A-*-£ have handles in T; N n H(G )e N, i.e., no new 
 
 symbol is a handle. If A is a handle of any productions C— *-A ¥, then the 
 following substitution changes this handle to a set of terminals as handles 
 Suppose the productions with generatrix A in P are 
 
 P l P 2 P n q 
 
 A — »-a, C-, » A — >-a^ ?„»...« A »»a £ . Then C -VA C can be replaced by 
 
 112 2 n n 
 
 P Ql PoQ- PI A 
 
 C— i^ a , I. V, A »a £ V,..., C— ^-*-a c *• Thus A\ H(G n ). 
 11 2 2 n n ^ 
 
 Final G for example 2: 
 
 1/3 
 The P grammar contains C — i ~*-ACC with handle A. Using the 
 
 productions with generatrix A, we replace this production by 
 
32 
 
 1/9 !A5 
 C -±i4*bAaCC, C *bAaBCC 
 
 c iV^ aCCs c^-^aBCC. 
 
 Proof (of Theorem 5): 
 
 Let G = G = (N, P, E). Consider G. . If N n M(G. ) = <J>, 
 
 we stop, if not, the algorithm is used to find 
 
 A A , v , A v M A 
 
 ' = G i+1 = (N i+l' 5 i+l } such that L(G i ) = L(G i + l } ' 
 
 M (G i+1 )n N C M (G.), N. +1 n H(G. +1 )C N, N. +] _ H(g\ +1 ) C 
 
 A A 
 
 M(G ). At each step, one member of M(G. P N is 
 
 eliminated and new members are never added. No new non- 
 terminal symbols ever become handles. Thus, since N is 
 finite, we eventually reach an n such that 
 H(G )n N M. Consider M(G ). If M(G ) = <j> , we are 
 
 A 
 
 finished. Suppose C e M(G ). Then there exists a production 
 
 n 
 
 C-^*»A Q. By construction A cannot be a new symbol, i.e., 
 
 A A a 
 
 A e N. Then A t M(G ) since M(G)n N = 4> and A e H(G ); 
 
 n n n 
 
 A A 
 
 but by construction H(G )fl N c M(G ). This 
 J n — n 
 
 A A 
 
 contradiction implies C i M(G ): thus M(G ) = <f> and 
 
 n n 
 
 A A 
 
 H(G ) C T, so G is the sought after P grammar, 
 n — ' n D 
 
 A 
 
 Theorem 6: There exist context free languages L with probabilities 
 
 A 
 
 y(x) attached to strings of L which cannot be generated by 
 any context free P grammar. 
 
33 
 Proof: This theorem is the context free analogue of Theorem 2. 
 The example and proof given there are still valid if the 
 P grammar is generalized to be any context free P grammar. 
 The context free extension of Theorem 1, however, is false. 
 The following example of a normalized P grammar which 
 generates an un-normalj zed P language was suggested by 
 D. E. Muller. 
 
 A^AA, 
 
 A m a 
 
 Example 3. 
 This NP grammar generates a, 1/3; aa, 2/27; aaa, 8/243; and 
 
 oo 
 
 the total probability is E (a ) = 1/2. 
 
 n=l 
 
 Theorem 7". There exist context free NP grammars over T which do not 
 
 generate NP languages over T*. 
 Proof: First a general criterion for an NP grammar to generate an 
 
 NP language will be developed. Then the proof of the theorem 
 
 is simply to observe that example 3 does not fulfill the 
 
 criterion. 
 
 A 
 
 Let G = (N, P, A) be an admissible context free NP grammar. 
 
 By theorem 4, G can be put into Chomsky normal form (A. -*• A A or 
 
 l j Tc 
 
 A. •*■ a) . For any particular A. define a matrix Z. with entries 
 i * * i i 
 
 p. = pr(A. -*■ A. A, ). Also define a column vector Y(t) with entries y. (t) = 
 
 1 JK 1 J K. 1 
 
 the probability of derivation of a terminal string within t derivation steps 
 
3k 
 
 given a starting nonterminal of A., 
 
 y.(i) = q. = pr(A. ■+ a). The total probability of a derivation 
 
 1 a e T 1 
 
 |i| 
 
 of length < t is a(t) = E 6. y.(t). A derivation A. => x of length 
 — . , 1 x 1 
 
 i=l 
 
 less than or equal to t+1 can be obtained by (l) a production A. ■+ a 
 or (2) A. ■+ A. A^ and both A. and A yield derivations of length at 
 most t. Thus we can write the equation y. (t+l) = 
 
 |n| |n| 
 
 E E r>. y (t) v, (t) + q.. In matrix form, this is 
 j=l k=l x J k J k 
 
 y ± (t+l) = Y T (t)Z i Y(t) + y ± (l). 
 
 Lemma 7.1: y. (t) <_ 1, i = 1, 2,..., |n| implies y. (t+l) <_ 1, 
 
 i = 1, 2 , . . . , j N| . Assume y . (t ) <_ 1 for all i , for some t . 
 
 Then y (t+l) = Y T (t)Z j , Y(t) + y i (l) 
 
 < (1, 1,..., 1) Z. (1, 1,..., 1) T + q. 
 
 = E E p. + y. (l). Thus y. (t+l) <_ 1. 
 J k ljk X X 
 
 Lemma 7-2: y. (t+l) >_y.(t) 
 
 This is easily shown by induction on t. 
 Case t=l: y.(2) = Y T (t) Z ± Y(t) + y ± (.l) >y.(l). 
 
 Case t=m > 1: 
 
 Suppose y (t+l) > y.(t) for t = m-1. Then 
 i — l 
 
35 
 y. (m+1) = I I p. ,. y Am) y (m) + y. (l) 
 
 y.(m) = j j p y^m-l) ^(m-l) + y^l). 
 
 Each component y (m) >_y (m-l), 
 J J 
 
 y (ra) >_y (m-l) by the induction hypothesis, so the sum 
 
 is greater, y.(ra+l) > y.(m). 
 i — i 
 
 Lemmas 7-1 and 7.2 show that y.(t) must approach some fixed point as t 
 
 approaches infinity, (a bounded, monotone increasing sequence converges) 
 
 T 
 and (l, 1,..., l) is a fixed point vector. Suppose lim Y(t) = Y; 
 
 t -*■ °° 
 then Y must be a fixed point vector. This means 
 
 y . = Y T Z. Y + q. » 
 
 |N| |N| |N| 
 
 y i = l E y i P iik y k + Z y i p iii y i 
 
 1 j=l k=l J 1JK k j=l J ljl * 
 
 j^i k^i j?i 
 
 |N| 
 + * y i P iik y k + y i P iii y i + q i 
 
 .K—J_ 
 
 k^i 
 
 2 l N l l N | |N| 
 
 y i (P iii } + y i ( 4 y £ (p iU + P i£i } - 1} + { .l ± k l ± P ijk y j y k + V = ° 
 J#i J*i k^i 
 
 Solving this quadratic for y. yields 
 
 JN 
 1 - 
 
 2,-1 / V- 
 
 l?i / Jl^i 
 
 e y (p.. + p. .) + / e y (p.. + p. .) - lV 
 
 Yi = 
 
 , N I l N l 
 - ^p,,,('z E p, , v y, y v + q,) 
 
 Jfo k^i 
 
 111 1=1 k=l ijk j k 
 
 2p. . . 
 
36 
 
 or if p. . . =0 then a simpler linear equation results. 
 111 
 
 T 
 This system of |N| equations has a solution Y = (l, 1,..., l) . 
 
 |n| |n| 
 
 This solution implies lim a(t)= £ 5. y. = E 6. =1. 
 
 . . -, 1 1 . , 1 
 
 t ">■ °° 1=1 1 = 1 
 
 A 
 
 Criterion: An NP grammar G generates an NP language iff the equation has 
 no solution vector smaller than (l, 1,..., l) where a smaller solution 
 vector Y means yy., <_ y. £ 1 and 3 y. 3 <_ y. < 1 (i = 1, 2,..., |N|). 
 
 1*1 
 In case this Y solution exists, lim o(t)= E 6. y. < 1 is the 
 
 , ... l i 
 
 t -*- oo 1=1 
 
 A 
 
 probability of a terminal string being generated by G since the 
 smallest solution is always the one approached. It can be shown that 
 this is the case by the following argument. 
 
 Since both Z. and Y must have all non-negative components, 
 
 (because G is normalized) y. = Y Z. Y + q. > q. for any solution Y. 
 
 l i l—i 
 
 Replacing the unit vector (l, 1,..., l) by the smallest solution vector 
 
 Y in lemma 7-1 yields a proof that y. (t) <_ y. for all t; so y.(t) t y. . 
 
 2/3 1/3 
 In the example A — L -»-AA, A *a, we have A = A, q = 1/3, 
 
 P lll = 2/3> S ° 
 
 _ ^-Sii q i _ l±/L-U(2/3)(l/3) solutic 
 
 UJ .ons are 1, 1/2. Thus 
 
 2p... 2(2/3) 
 
 in 
 
 |n| 
 
 y( a n ) = Z 6. y. = 1/2. This example, therefore, does not fit the 
 n=l i=l X 1 
 
 criterion. Furthermore, by solving this equation for the more general 
 
37 
 
 _ 1 + /l-U p + Ur> 
 lea p... = p, and q. = q = 1-p, we get y\ = — — '— t 
 
 Solutions are 1, q/p. Tims A-^-AA, A -i-a yields 
 
 00 
 
 E u(a ) = q/p if p >_ q and 1 if p < q. 
 n=l 
 
 The significance of this result is that P grammars may have 
 nonzero probability of an endless cycle of generation. This leads to 
 the concept of nonterminating P grammars which generate infinite 
 strings. Define T as the set of all infinite strings of symbols of T. 
 Definition: A generalized P grammar G is strictly nonterminating iff (l). 
 it is in Greibach normal form; (2) it contains no productions 
 of the form A-=-*-a (these are called terminating productions). 
 
 A strictly nonterminating P grammar must be considered to 
 generate strings of T . By introducing tree automata in the next chapter, 
 an analogue for context free P grammars of theorem 3 will be proven. The 
 following theorems will be an immediate consequence of the theorems for 
 tree automata. 
 
 Theorem 8: Every strictly nonterminating left linear NP grammar generates a 
 P language (T , y) such that y is normalized, (i.e., y is a 
 probability measure). 
 
 Theorem 9: Every strictly nonterminating context free NP grammar 
 
 generates a P language (TT , y) such that y is normalized. 
 It will also be shown that P grammars which generate strings over 
 T U IT can be considered as a sub-class of P grammars over t . 
 
38 
 6. PROBABILISTIC TREE AUTOMATA 
 Definition: A tree is a set D C I* where I* is the set of all finite 
 strings of positive integers and where D satisfies the 
 following three requirements: 
 Let d = (n n . ..n ), n. e I, then dn = (n n...n n), n £ I, 
 
 X £_ K. 1 X d. iC 
 
 1. dn e D, n > 1, => d(n-l)e D 
 
 2. dn e D, n = 1, => d e D 
 
 3. If set {n|dneD,neI}^<f), then 
 
 max (n) < °° 
 dn e D 
 
 Not 
 
 e: (n . ..n ) in the case k = is the empty string A which is 
 
 considered as the root of the tree. Each d e D is a node of the tree 
 
 and if n = max (n) then node d goes down into n other nodes 
 dn e D 
 
 dl, d2,..., dn , from left to right: 
 
 + + 
 
 If {n|dn e D} = <j> , then n is defined as n =0. In this case, d is a 
 
 terminal node of D. 
 
 Definition: A valued tree over a finite alphabet T is a pair (v,D) 
 
 where D is a tree, and v is a function, v:D -*■ T. Valued tree 
 
 will sometimes be abbreviated to tree when there is no 
 
 possibility of confusion with the previous definition of tree, 
 
 Define A(n n . . .n ) = k. The length of a tree is sup £(d). 
 
 d £ D 
 
39 
 
 Define the m-th level of a tree to be the set (d c D|H(d) = m). If 
 
 ( 3 m e I)(V d c D) U(d) < m] , then (v,D) is called a finite valued tree. 
 
 Define a subtree of (v,D) to be (v* , D 1 ) where D' C d is itself a tree, 
 
 and the function v 1 is v on the restricted domain D' , v' = v|D. The 
 
 set of all finite valued trees over T is denoted by T . If d e D, *dl e D, 
 
 then (v,D) is a full infinite valued tree. T denotes the set of all full 
 
 infinite valued trees over T. 
 
 Definition: A Probabilistic Tree Automaton (PTA) over T is a triple 
 
 (Q, M, s) where Q is a finite set of states, 5 is the 
 
 initial state distribution vector, and M is the next state 
 
 function, M: Q x T ■+ P(Q*). Associated with each q e Q, 
 
 a e T is a function p 3 p(p ) = M(q,a), i?(p ) = R. 
 
 *q , a -q,a * *q,a 
 
 (Transition probability). 
 
 An elementary transition x i s a change consistent with M 
 
 K 
 
 from state q under input a to a sequence of st?tes & = q %. * • • 
 
 q . (Consistent with M means e iM(q,a)). Thus a transition out of 
 
 any state may go into many states rather than into a single next state. 
 
 The nonzero probability of this elementary event is p(i ) = p (Qi,)« 
 
 k q } a K 
 
 A probabilistic tree automaton is normalized (NPTA) iff for all states q, 
 
 the total probability of leaving q is one (i.e., £ Z p (On) = 
 
 a e T QL e M(q,a) q,a 
 
 l), < p (0 ) < 1, and 5 is a stochastic vector (i.e., 
 
 q,a k — 
 
 £ C(q) = 1 where ^ is a function: Q ->■ R which assigns to each state q, its 
 
 q. e Q 
 
 proper probability value from the vector E). 
 
Definition: A PTA is complete iff (Vq. e Q)(3 b e T) [M(q,b) 4 0], 
 Thus, complete means there is always a positive 
 probability of existing from any state q. 
 
 Definition: A PTA A is strictly non terminating if 
 
 (.1; A is complete. 
 
 (2) (V q £ Q)(V a e T)[X f. M(q,a)]. 
 Any complete PTA can be altered to form a strictly non- 
 terminating PTA by adding a new state q if there are elementary 
 
 transitions X e M(q,a), and replacing these by elementary transitions 
 q e M(q,a) with probabilities equal to that of the omitted transitions. 
 
 Next add a new blank symbol b to the alphabet and the transition 
 
 M(q t , b) = { q t }, p (q t ) = 1. 
 
 It is useful and convenient to represent these tree automata 
 using a modification of the directed graphs usually used for state 
 diagrams. A context free state diagram consists of a set of vertices, 
 represented by small circles, each of which may be connected to others 
 oy incoming arcs (lined with arrowheads) and outgoing cables (heavy 
 lines). The vertices denote states, and cables denote possible elementary 
 transitions. Each cable is labelled with the input symbol (e T) which 
 produces the elementary transition and the probability associated with 
 the transition. Each cable splits into one or more arcs each of which goes 
 to a state. The initial states (all q 3- £(q) ^ 0) are designated by 
 incoming labelled arcs with no source states, and no inputs. In the case 
 of a terminal transition, X z M(q,a), the automaton literally goes into no 
 
fcl 
 
 state (it stops). In drawing state diagrams, this can be represented 
 by drawing an arc from the vertex q to a dead-end vertex q which has 
 
 no cables out of it and which does not correspond to any state in Q. 
 
 This is a state diagram of the automaton 
 
 Q = <v V' 
 
 5 = UU A ), CU B )) = (1. 0), 
 
 M(q A ,a) = {q A , q A q fi } , M^b) = {q B , q A q^ , 
 
 P q A ,a V = P a' P q A ,a N V = P a' 
 
 P q B ,b V = V P qB ,b N V = P V 
 This automaton over T = {a,b} "accepts" among others the following trees 
 
 Tree 1, 
 
 Tree 2, 
 
 a 
 
 D 1 = {1 |n > 0} 
 
 (d) = a v d e D 
 
 a 
 
 a /X b 
 
 a /X b V 
 
 / \ \ \ 
 
 A \ \ \ 
 
 D 2 = D i U{l11 2 ^l 11 ' m - 0} 
 
 v 2 (d) = a if d e D x 
 
 v (d) = b otherwise 
 
 Example 5 
 
U2 
 
 A . 
 
 Definition: A run of A on a tree, denoted r(.v,D), is a function r: 
 
 D->Q3VdeD, (r(dl),..., r(dn + )) e M(r(d), v(d)) where 
 (a) 5 (r(A)) /0 and (b) if n + = then \ e M(r(d), v(d)). If 
 condition (b) is omitted in the case &(d) = length of the 
 tree (v,D) < », then r(v,D) is a p re-run . The set of all 
 runs on (v,D) is denoted by Rn(v,D). 
 A transition t is defined as the sequence of elementary transitions 
 
 T used in a run to go from some level n of a tree to level n+1. The 
 
 probability of this event is p(t) = n p(0- The sequence 
 
 of transitions used in a run is denoted by t(r). The 
 
 response function of a run (or pre-run) r is 
 
 defined as the product of the probabilities associated with transitions 
 
 used in the run ( or pre-run) and the initial state probability 
 
 rf(r) = g(r(A)) II p(t). 
 
 t e t(r) 
 
 Definition: A k-prefix of a tree (v,D) is the subtree (v, , D. ) where 
 — * IS, K. 
 
 D = {x e d|&(x) £ k} and v = v|D . In defining acceptance 
 
 of trees, it is desirable to accept a prefix as part of a 
 longer tree without requiring the machine to terminate. 
 Thus pre-runs on prefixes are necessary, and likewise the 
 following definitions: An elementary pre-transition t 
 out of state q under input a is the set of all elementary 
 transitions from state q under input a. The probability of 
 
 this event is p(i) I p (Q^.). I f M(q,a) = <j>, 
 
 ^ g M(q,a) q ' a k 
 
then p(x) = 0. The final transition t of a pre-run on 
 
 (v , D ) is the sequence of elementary pre-transitions used 
 
 K. K 
 
 to leave the k-th level. The probability of this event 
 
 is p(t) = IT P^O* 
 
 t e t 
 k 
 
 Definition: The behavior of A is the set of all trees (v,D) over 
 
 T 3 3 at least one run of A on (v,D), B(A) = {(v,D)| 
 
 Rn(v,D) ? <f>}. The k-behavior of A , B (A) , is the set of 
 
 """ K. 
 
 all k-prefixes (v , D ) over T. 
 
 K K 
 
 Definition: The probability of acceptance of a tree^ , y(v,D) is the 
 
 sum of the response functions of all runs on (v,D), 
 
 y(v,D) = I rf(r). 
 
 r e Rn(v,D) 
 
 Define the k-probability of a tree as y, (v,D) = y(v , D ) = 
 
 E rf(r) where RriAV , D ) is the set of all pre-runs on 
 
 r 6 Rn(v k , D k ) k k 
 
 1 k 
 
 (v , D ). Two trees over T are defined to be k-equivalent . (v , D ) = 
 
 K. K 
 
 2 ? 1 ? 12 
 
 (v , D ) iff D, = D. and v = v . 
 
 k k k k 
 
 Theorem 10: = is an equivalence relation. 
 
 Proof: It is easy to verify that k-equi valence is reflexive, 
 
 symmetric, and transitive. 
 
 Theorem 
 
 Proof: 
 
 11: (v 1 , D 1 ) = (v 2 , D 2 ) =>y, (v 1 , D 1 ) = n (v 2 , D 2 ) over T 
 
 k 
 
 12 12 11 
 
 Let D, = D, = A , and v. = y = v, . then y (v , D ) = 
 k k k' k k k' k 
 
Theorem 12: If A is a normalized PTA, then £ u(v. , D ) = 
 
 (v k , D k ) £ B k (A) k * 
 
 1, k - 0, 1, 2,... 
 
 Proof: Let £ rf ( r k ) denote E rf(r) = 
 
 r e Rn(v k , D fc ) 
 
 (v k , D k ) E B k a> 
 
 Z y( v v' D i )• ^ e P ro °f is 'by induction on k. 
 
 <vV e V*> 
 
 (1) Case k=0: For each pre-run, there is only a single transition which 
 is a final transition consisting of one elementary pre-transition 
 because the 0-th level of a tree is only one node, A. Given any 
 
 q e Q, suppose 3m inputs a. e T 3 M(q,a.) ^ <J>. Then 3 m prefixes 
 
 a., , a^,..., a which each have a pre-run of the form r: A ■+ q, 
 12m 
 
 and for the tree a. , the response function is 
 
 £(q) I p (Q.). The probability of these m response 
 
 q, e M(q,a.) q > a i J 
 
 functions is E £(q) E P (Q.) which by 
 
 a. £ T ' Q £ M(q,a.) q ' a i J 
 
 1 J J- 
 
 normalization of A reduces to £(q). Finally, summing over all 
 
 states q, we get E £(q) E E P (Q.) = E ^(Q.) 1 
 
 q e Q a. £ T Q e M(q,a. ) q ' a i J q e Q 
 l j i 
 
 by normalization criteria. 
 
 (2) Case k > 0: Assume E . y(v, , D ) = 1. The 
 
 A 
 
 set of (k-l) - prefixes of trees in B(A) partitions the set of 
 
 k-1 
 k-prefixes via the equivalence relation ==. This is indeed a 
 
 partition because each (v , D ) has some (.Vj., D k _]_) as prefix, so 
 
h5 
 v . includes all prefixes (v . 
 
 X K. K 
 
 U E. includes all prefixes (v,_, D, ) where E 4 is an 
 
 i=l 
 
 equivalence class of k-prefixes all having the same (k-l) prefix. 
 
 k— 1 
 I is the number of equivalence classes created by = . Furthermore, 
 
 E. f| E = <i> because each (v , D, ) has a unique (k-l) - prefix. For 
 1 j Y k' k 
 
 every class E. , all members have the same (k-l) - probability by 
 theorem 10. For a particular pre-run of some (v , D )e E. , the 
 
 K K 1 
 
 transition from level k-l to level k of the tree yields a sequence of 
 states, q q. All possible transitions emanating from this 
 
 set of states to possible k+1 level trees have a sum probability of 
 
 m m m 
 
 Z Z E n p p. ...p. where m. = Z |M(q.,a)| 
 
 k =1 k =1 k =1 k l k 2 K n x a e T X 
 
 12 n 
 
 and each value of p, is associated with one of the transitions 
 
 *i 
 
 out of state q. . Thus summing over all pre-runs of all k-prefixes 
 
 m m n 
 
 the result is I rf(r, ) = E(rf(r )) I . . . 1° II p, ). The inner 
 
 k k " X k =1 k =1 i-1 k i 
 
 1 n 
 
 m m m 
 
 sums can be written Z p, ( Z p, . . . ( Z p )...) = 1 since by 
 
 k =1 1 k =l 2 k =1 n 
 12 n 
 
 m. 
 normalization the total probability of leaving state q. = E p. = 1. 
 
 k.=l l 
 
 i 
 
 Thus Z . u(v, , D ) = Z rf(r, ) = Z rf(r, ) = 
 
 (v k , D k ) . B k (A) 
 
 Z y(v , D ) and by the induction hypothesis 
 
 1 » p(v k 1> D k l' = 1 > so Z a "(v , D ) = 1. 
 
 ( vi- \-i^ \-i tA) <vV«V A > 
 
U6 
 Theorem 13 (Kolmogorov) : 
 
 A 
 
 Let B (A) be probability spaces, k = 0, 1, 2,...; let 
 
 A 
 
 all of these spaces be consistent. Then B(A) forms a 
 probability space consistent with the B ( A). (In other 
 
 K. 
 
 words, a consistent specification of all y determines 
 
 y uniquely.) 
 Lemma 13.1: Given any normalized PCF tree automaton A over T , the 
 
 A 
 
 set B (A) forms a probability space for each k e I. 
 
 Proof: A probability space consists of a set ft, a sigma field 
 
 § of allowable events which can always be chosen as 
 P(ft), the power set of ft, if ft is countable, and a 
 probability measure v. 
 
 Assign (ft , v ) as follows: 
 
 ft k = B k (A) 
 
 t = P(\) 
 
 v (B) = E y(v, , D ) for all B CB(A) 
 
 (v k , D k ) e B 
 
 The events e. of ft are sets of k-prefixes . By the previous theorem 
 
 V, (ft, ) = E . u(v , D, ) = 1. Obviously, v, (B U B' ) = 
 
 kk/p.x^/Vxkk k 
 
 ( V V e B k (A) 
 
 E y(v , D ). Assuming Bfl B' = <j> , we can write 
 
 (v k , D k ) e BU b' k k 
 
 MB U B«) = E y(v , D } + E y(v D ) = v (B)' + 
 
 <vV- B c v D k )eB ' 
 
hi 
 
 v k (B'). Thus v^ is finitely additive. B (A) has only a finite 
 
 number of subsets, so finite additivity is equivalent to countable 
 additivity. These conditions verify that v is indeed a probability 
 
 me as ure . 
 
 A 
 
 Lemma 13.2: The spaces B (A) are consistent in the sense that if 
 k < £,v (B) = v (B») for any B C R (A) where 
 
 K Si — K 
 
 B' C B (A) is the set of all trees in 
 
 A 
 
 B £ ( A ) " 3 (v £ , D ) has a k-prefix (v , D ) in B. 
 Proof (of Lemma 13.2): 
 
 (a) Assume % = k+1. The trees of B can be extended by 
 
 a single transition to trees of length k+1. By theorem 12, 
 
 (v D ) ' B (1) y(Vk+1 ' Dk+l) = 
 
 •Vi' D k+i j c B k+i (A) 
 
 2 a y(v , D ). An analogous argument 
 
 (v k , D k ) e B k ( A ) k >= 
 
 A 
 
 establishes that by replacing the set B (A) by B' , the 
 
 theorem still holds. Thus, v (B) = v (B* ) 
 
 k £ 
 
 (b) In case °° > I > k+1, apply part (a) I - k times. 
 
 Proof (of Theorem 13): 
 
 00 
 
 Choose Q = T . The set of all trees k-equivalent to (v,D) 
 is called the Borel cylinder over (v , D ). Define the 
 
 probability of this Borel cylinder as v { (v 1 , D')|(v', D f ) = 
 (v,D)} = u (v,D). Choose as t the smallest sigma field over 
 
T containing all Borel cylinders. Specification of 
 probabilities of all Borel cylinders for all k completely 
 specifies v on T . Since k-equi valence is an equivalence 
 relation (theorem 9)» the definition does not depend upon 
 the particular (v,D) in the equivalence class as 
 representative. This and consistency of all B (A) assure 
 
 that v is well-defined. It yields values for all Borel 
 
 cylinders and therefore for all measurable subsets of T°°. 
 
 Finally v really is a probability measure because v is 
 
 countably additive and v(fi) = v(B(j£)) = U-jB, (A)) = 1. 
 
 K k 
 
 There is a correspondence between PTA's and Greibach normal 
 
 form P grammars which becomes apparent by considering state diagrams. 
 
 Each vertex q corresponds to a nonterminal A except for q_^ . Each cable 
 
 leaving q corresponds to a production with A as generatrix. The 
 
 various states to which the arcs go, correspond to the nonterminals in 
 the replacement string of the production. The terminal symbol and 
 probability of the production are the labels of the corresponding cable. 
 In the example given of a state diagram, the corresponding P grammar 
 would be G = (N, P, A) where N = (A, B}, A = ( 6 , 5„) = (l, 0), and 
 
 A D 
 
 p p 1 P b V-L ' 
 P = (A — a **aA, A— ^aAB, B — »*bB, B HdAB). By changing the requirement 
 
 of all derivations proceeding from left to right to a requirement that 
 
 derivations from all nonterminals in a string occur simultaneously, the P 
 
 grammar in Greibach normal form becomes a generator of trees where a 
 
production A-i».aB B ...B generates a node with n branches 
 
 (a, p) 
 
 n branches 
 The P language generated is accepted by the PTA given in the above 
 correspondence. This correspondence also shows that tree automata can 
 be viewed as acceptors of strings. If a set B C T is In the behavior 
 
 of A, the corresponding set of strings, x e T , is found by forming 
 
 [2] 
 parenthesized strings from trees (see Brainerd ) and then removing 
 
 the parentheses. Alternatively, the process can be described as 
 
 follows: For each (v,D)e T , add zeroes to the right end of each 
 
 d' e D until it has length £(d' ) = max £(d). Consider these strings 
 
 d e D 
 
 as integers and order them d , d , ..., d so that d < d. for 
 
 1 K 1—1 1 
 
 i = 1, 2,..., k. Then form x = (v(d ), v(d ),..., v(d )) as the 
 
 corresponding string. The following is an example of this process. 
 Example 5: 
 
 Valued Tree Underlying Tree 
 
 b A 
 
 A 
 
 ^b slC ^2 
 
 x x \ 
 
 b' N c 11 12 
 
 b 111 
 
 D = {A, 1, 2, 11, 12, 111} 
 
 v(A) = b, v(l) = c, v(2) = b 
 
 v(ll) = b v(!2) = c, v(lll) = b 
 
50 
 Expanding D yields 000, 100, 200, 110, 120, 111 . 
 Ordering this set yields 000, 100, 110, 111, 120, 200 . 
 
 X 
 
 The corresponding string is ("b 
 
 * i ♦ * ♦, 
 
 c d b c b) 
 
 Thus , the sets of strings accepted by tree automata over 
 are exactly the context free languages. Next we state this 
 correspondence formally. 
 
 . A 
 
 Theorem 14: For every P grammar G in Greibach normal form, there is a 
 
 A A 
 
 PTA A which accepts the P language generated by G and 
 
 A A 
 
 conversely, for every PTA A, there exists a G in GNF which 
 
 A 
 
 generates the P language accepted by A. 
 Proof: The proof is a construction identical to that of theorem 3 
 using 
 
 (1) Q = N 
 
 (2) £ = A 
 
 (3) q B q B ...q e M(q A , a) with p (q q ,..q ) - p 
 
 1 2 n *' 1 2 n 
 
 iff (A-4-aB, B„...B )e P 
 12 n 
 
 (4) A e M(q,a) with p (x) = p iff (A^-a)e P 
 
 q,a 
 
 (5) M(q,a) = <J> iff there are no productions A-^aX in P 
 for all X e N*. 
 
 Corollary 14.1: 
 
 V normalized P grammar G in Greibach Normal Form (GNF), 3 
 normalized PTA A 3 L(.G) = L(A) and conversely, V normalized 
 PTA A, 3 normalized G in GNF 3 L(A) = L(G). 
 
ST 
 Corollary 3.U.2: 
 
 V admissible P grammar G in GNF, 3 terminating PTA A 
 
 , A. . A A 
 
 3 L(G) = L(A) and conversely, V terminating PTA A, 
 3 admissible G in GNF 3 L(A) = L(G) . 
 Corollary lU.3: 
 
 A 
 
 V generalized admissible G in GNF, J complete 
 
 A A . .A . A 
 
 A 3 L(G) = L(AJ and conversely, y complete A, 
 3 G in GNF 3 l(A) = L(G). 
 Proof: These corollaries follow immediately from the construction 
 in the proof of theorem lU. 
 Define the P language accepted by the (strictly nonterminating) 
 
 A 00 
 
 probabilistic tree automaton A a s the probabilistic language (T , y) 
 
 A A 
 
 where A determines y for all k, and these determine y on B(A); define 
 
 it 
 
 .oo A . 
 
 y(T - B(A) = 0. 
 
 Definition: The union with weighting vector w of the P languages 
 L. = (T , y.), i = 1,..., n is L = (T , y) where 
 
 n 
 y = £ w. y . . 
 i=l X X 
 
 Theorem ik: The union with weighting vector w of P languages L , 
 
 A A -A 
 
 L^,..., L forms a P language. Furthermore, if L. is 
 2 n do 1 
 
 normalized, i = 1,..., n, and w is a stochastic vector, 
 
 a n A 
 
 then the union L = U w. L. is normalized. 
 
 i=l X X 
 
52 
 
 Proof: 
 
 (a) Let B , B ,... "be a countable number of disjoint measurable sets of T°°, 
 
 oo n. oo 
 
 y(U B .)=z w -y-(U B .) "by definition of y. 
 
 j=l J i=l X x j=l J 
 
 n oo 
 
 = i w - Z y.(B.) by the countable additivity 
 
 i=l X j=l X J 
 
 of p i . 
 
 oo n 
 
 = Z E w. y.(B.) by algebraic manipulation. 
 
 j=l i=l X X J 
 
 = Z y(B.) by definition. 
 
 j=l J 
 
 oo A 
 
 Thus y is a measure on T . Thus L is a P language. 
 
 (b) y(oJ = i w. y . (pj by definition. 
 
 i=l X X 
 
 n 
 = Z w. since y.(ft) = 1, i = 1, 2,..., n, 
 
 i=l 
 
 = 1 since w is stochastic* 
 
 If (v^Dj) and (v , D ) are trees, then ^((v , D ), (v , Dg)) is 
 
 a tree whose root has value a and with two branches going out to (v , D ) 
 
 and (v p , D ) respectively. 
 
 (v^ D 1 ) (v^D 2 ) 
 The generalization of this operation is formalized. 
 

 Definition: The concatenation under b of P languages L. = (T , y J, 
 
 i = 1,..., n, is L = (T , y), defined as follows: for 
 
 1 1 A n n 
 
 each combination of (v , D )e L ,..., (v , D )e L , 
 
 k k 
 
 define a tree (v,D). D = {A,} U ID, j . . . ynD where kD 
 
 means all strings in D prefixed by k z I; and define 
 
 k k 
 
 v(a) = b, v(kd) = v (d) where d e D , k e I. Define 
 
 u k (v,D) = n y^_ x (v 1 , D 1 ), k = 1, 2,..., 
 i=l 
 
 and 
 
 y Q (v,D) = 1. Also define y k ((v,D) + (v\ D')) = 
 P k (v,D) + U k (v', D'), provided (v,D) ± (v 1 , D' ). This 
 guarantees that y is finitely additive. By simply 
 omitting the definitions for y , concatenation under b 
 can be defined for nonprobabilistic languages. 
 
 A A 7\ 
 
 Theorem 15: The concatenation under b of P languages L , L , . .., L , 
 
 A 
 
 forms a P language for any b e T. Furthermore, if L. is 
 
 Proof: 
 
 normalized, i = 1,... n, then the concatenation 
 
 A A A 
 
 L = bo(L n ,..., L ) is normali ze d . 
 1 n 
 
 The k-probabilities of trees, y (v,D), were defined as 
 
 n 
 
 II y ,(v , D ), so consider the total sum for some k. 
 i=l k_1 
 
 If k = 0, the sum is 1. If k > 0, then y n (ft) = y(ft., ) = 
 
 k k 
 
5h 
 
 y(v, , D ) since y was previously defined 
 
 (v k , D k ) e ^ 
 
 so as to "be finitely additive on the finite set o, . u,(n) = 
 
 ( \-r D k-i } e <£■: 
 
 / n r^n s n 
 
 ( Vi* \-i } £ n k-i 
 
 n 
 
 i=l 
 
 n 
 
 = n 
 
 n 
 
 E . . . = n i = 1 
 
 1=1 ( \-i' D k-i } e n k-i ^ (v Lr D k-i } i=1 
 
 assuming each y is a probability measure. Thus each 
 
 (ft , y, ) forms a probability space. By theorem 13, a 
 k k 
 
 A 
 
 consistent specification of all y determines y. L is 
 therefore an NP language. 
 
 ->. AAA 
 
 Definition: The direct y-sum of the PTA's A. , A , . . . , A , where 
 
 12 n 
 
 A. = (Q. , M. , 5.), is the PTA A = £pU. = (Q, M, S) 
 i ill \^v/ i 
 
 n 
 where Q = U Q. (assuming without loss of generality that 
 i=l x 
 
 all Q. fl Q. = <t> if i ?* .1), M(q,a) = M. (q,a) and p = p 1 
 
 i j i q,a q,a 
 
 for all q e Q. , (p is the probability associated with 
 
 l q,a 
 
 M. ) and c(q. ) = w. S.(q. )• 
 i l ill 
 
 AAA A 
 
 Theorem l6: If A , A A ,..., A are normalized PTA's, then A =(-h") A. 
 1' 2 ' n VIW/ l 
 
 Proof: 
 
 is normalized. 
 
 One only needs to show that 5 is a stochastic vector, 
 
55 
 5 " (w l 6l<*u } ' w i ^(a 12 ),..., w i 5 1 Cq 1 fc ), 
 
 W 2 V^l^*" 4 n (q n k ^ Where q i1 £ \ and 
 
 n J 
 
 k. = |Q.| ,1-1, 2,..., n. 
 
 n n 
 
 Then |s| - e ^(q) - Z S v. 5.(q ) = Z 
 
 q e Q i=l q id e Q. 1J i=l 
 
 n 
 
 w.( z e. (q.- •)) ■ £ w. = i. 
 
 q. . e Q. 1=1 
 
 H iJ i 
 
 Definition: The direct b-product of A , A.,..., A , b e T, where A. is the 
 PTA, L = (Q., M., H.Xis i S (W\ = (Q» M > 5) where 
 
 = u Q t u {q Q } with qLq i Q x » and Q i n Q. = 4>, 
 
 n 
 
 U 
 
 i=l 
 
 i,j=l,..., n,i/j, $(a) = 1, £(q) = if q ^ q Ql 
 
 M(q,a) = M.(q,a) and p a (Q R ) if q e Q i ; M(q Q , a) = <j> if a 7* b; for 
 all possible combinations q q ...q such that q e Q 
 and C i (q i ) > 0, Q.-L <1 2 • • • 0^ e M (q Q > b ) and 
 
 H i=l 
 
 Theorem IT: If A , L..., A are normalized PTA's, then A = A£) \ 
 
 is normalized. 
 Proof: By definition H is a stochastic vector, and since the state 
 
 sets Q. are disjoint, each q. e Q satisfies the normalization 
 
 l ii 
 
56 
 
 criterion because A. is normalized. Finally, 
 
 £ p U . ..q ) = 
 
 q 1 ...q n e M(q b) V* X ^ 
 
 n 
 ^ n £.(q.) = 
 
 q x e Q x q 2 e Q 2 a^ e Q n i= 
 
 =1 
 
 l l 
 
 I 
 
 5 i( qi )( i c 2 U 2 )..-( i 5 n ( % )...) = i 
 
 -L n _ E (3- n e Q 
 
 11 «2 e Q 2 <*n e Si 
 
 since E £.(q. .) = 1 for each Q.. 
 
 Theorem 18: 3 an operator homomorphism h from the set of strictly 
 
 nonterminating PTA's over T into the set of P languages of 
 the form (T , y) under the operations (r4f) 5 v£/ ) and. 
 
 (\y) , |0J ) where \w) means union with weighting vector w 
 
 and lOy means concatenation under b. 
 b 
 
 Proof: 
 
 (.a) Define for every PTA A, h(Aj as the language (T , y) which is 
 accepted by A. The restrictions of complete and strictly non- 
 terminating guarantee that A accepts some set of infinite trees. 
 
57 
 
 these distributions are equivalent on all Borel cylinders, so 
 
 1 2 
 they are equivalent on all measurable subsets of T 00 . p = yi» 
 
 so every PTA accepts exactly one language. 
 
 (b) If A. accepts (T°°, / ) , (T°°, y ) = \v) (T°°, /), i = 1,..., n, then 
 
 n 
 by definition y (v,D) = £ w. y (v,D) for all trees (v,D). The 
 
 i=l 
 probability of acceptance by( 
 
 Z rf V r ) 
 
 r £ RnA(r k> D ) 
 
 A. of an arbitrary Borel cylinder 
 fw/ i 
 
 IS 
 
 n 
 = £ w. £ rf^(r) 
 
 i=l X r e RnA (v fcJ D fc ) 
 i 
 
 n 
 = £ w £ C(r(A)) n P^ (t) 
 
 i=l r e RnA {v^ D fc ) t e t(r) i 
 
 i 
 
 n 
 = £ 
 
 rfA ( r ) 
 
 i=l X r e RnA ( v , d ) A i 
 i 
 
 = Z W, yJ(v,D). 
 
 i=l 
 
 i V 
 
 Since this calculation was carried out for an arbitrary tree (v,D) this 
 implies that the direct sum of P automata,£-.rHA. , accepts the union 
 
 with weighting vector w of the P languages generated by the P automata, 
 (c) Suppose (v , D ) is an arbitrary cylinder which has a prerun on the PTA 
 
 A = I Y i A. ,i=l,...,n. Any prerun has a first transition from 
 
 q_ to q q~...q» q. e Q. The probability of this is 
 
58 
 
 n 
 
 n £.(q.). After this, the other transitions out of state q. 
 i=l 1 x X 
 
 t 
 
 correspond to a prerun on A. , so the prerun r. must accept a prefix 
 
 (v, , , D, _, ) contained in L. . The probability of this prerun on A. 
 k-1 k-1 1 1 
 
 is rfA (r.). The total probability associatiated with the prerun is 
 
 i 
 
 H £.(<!.)» n rf/s (r.). Summing over all preruns yields 
 . , l l . , A. l 
 i=l i=l l 
 
 n 
 y(v , D ) = n y (y , D ). This is, by definition, the probability 
 
 K. K . n K—X K.— 1 
 
 1 = 1 
 
 associated with the prefix (v , D ) contained in L = bo(L , L , ..., L ). 
 
 k. k i c. n 
 
 Thus, (b) and (c) have shown that h( Q><h) A. ) = \wj (h(A. )), and 
 
 h( 
 
 1) = /a) (h(A.)) 
 
 i b i 
 
59 
 
 7. SUMMARY AND CONCLUSIONS 
 General definitions have been given of the concepts of 
 Probabilistic Language, Probabilistic Grammar, and Probabilistic Automaton. 
 The set of context free probabilistic languages has been completely 
 characterized in terms of (l) probabilistic context free grammars, 
 and (2) probabilistic tree automata. The motivation for this 
 work, namely, to take a first step toward developing 
 
 a quantitative tool for analyzing programming languages and their translators, 
 was explained in the Introduction. We would like to conclude by mentioning 
 some applications of and problems related to the present investigations. 
 
 In the area of compiler writing for programming languages, there 
 are various syntax oriented methods available which effectively derive code 
 from the grammar which specifies the language. By keeping a frequency count 
 of the amount of usage of the various "sentences" of the language, where the 
 sentences in this case are programs, a probabilistic language can be defined 
 and it can be determined which of several probabilistic grammars generates a 
 better approximate language. Furthermore, it can be speculated that an 
 adaptive compiler (i.e., a learning compiler) which modifies itself to 
 translate high frequency sentences best may be obtainable by improving 
 (altering) the approximating grammar as more and more frequency data is 
 obtained. 
 
 Probabilistic languages are also particularly well adapted to 
 problems such as library information retrieval, in that a machine based upon 
 the concepts of probabilistic languages will assign a probability to the 
 relevance of any description to any document. This machine can then be 
 asked for documents satisfying certain requirements with the user assigning 
 
6o 
 
 weights to the importance of each requirement . The machine can then give 
 a list of documents as output in order of probability that they will satisfy 
 the desired requirements. Such a machine would save the user much time 
 when the bibliographies were very large. Also, the machine would list the 
 documents most likely to be relevant, even if this likelihood were extremely 
 low. A machine using nonprobabilistic languages would often, in similar 
 cases, list no documents at all. 
 
 One of the interesting areas related to sets of trees concerns 
 decidability questions. The effective procedure which can be given for 
 checking equality of two tree automata implies that it is always decidable 
 whether the sets of trees generated by two context free grammars are the 
 same or not. If the sets of trees generated by two context free grammars 
 are not the same, this does not imply that the sets of strings generated by 
 the grammars are not the same; similarly, if the two sets of strings generated 
 are identical, this does not imply that the sets of trees generated are the 
 same. Indeed, this question is known to be undecidable for context free 
 sets of strings. Thus, we say that the degrees of unsolvability of these 
 two questions are incomparable . 
 
 Finally, let us mention one possible generalization of the present 
 investigation. Can one define a Probabilistic Turing machine in an analogous 
 fashion to previous definitions? What are the characteristics and applications 
 of this machine and the probabilistic language accepted by it, if defined? A 
 first step toward answering these questions appears in the Appendix, but this 
 area is otherwise unexplored. 
 
LIST OF REFERENCES 
 
 Ash, R. B. Math. U68 Classnotes (Probability Theory), University 
 of Illinois, Urbana, Illinois, September 1967. 
 
 Brainerd, W. S. Tree Generating Systems and Tree Automata , 
 Purdue University, Ph.D. Thesis, June 1967. 
 
 Chomsky, N. and Schutzenberger , M. "Algebraic Theory of CF 
 
 Languages," Computer Programming and Formal Systems , North 
 Amsterdam Publishing Company, pp. Il8-l6l, 1963. 
 
 Feller, W. An Introduction to Probability Theory and Its 
 
 Applications , Vol. 1, Wiley and Sons, Inc., New York, New York, 
 1957. 
 
 Ginsburg, S. Algebraic Theory of Context Free Languages , 
 McGraw-Hill^ New York, New York, 1966. 
 
 Ginsburg, Greibach, and Harrison "Stack Automata and Compiling," 
 JACM, lU (1), pp. 172-201, January 1967. 
 
 Ginzburg, A. Algebraic Theory of Automata , Academic Press, New 
 York, New York, 1968. 
 
 Greibach, S. A. "A New Normal Form Theorem for Context-Free 
 
 Phrase Structure Grammars," JACM , 12 (l), pp. 1+2-52, January 1965 
 
 Haines, Leonard H. Generation and Recognition of Formal Languages , 
 M. I. T., Ph.D. Thesis, June 1965 . 
 
 Hopcroft, J. E. and Ullman, J. D. "An Approach to a Unified Theory 
 of Automata," Bell Laboratories Tech. Memo MM 67-1371-3, 
 March 1967. 
 
 Kleene, S. C. "Representation of Events in Nerve Nets and Finite 
 Automata," Automata Studies , Princeton University Press, 
 Princeton, New Jersey, pp. 3-^1, 1956. 
 
 Kuck, D. Computer Science 497, University of Illinois, Urbana, 
 Illinois, Spring Semester 1968. 
 
 McNaughton, R. "Testing and Generating Inf. Sequences by a 
 Finite Automaton," Inf. and Cont. , 9., pp. 521-530, 1966. 
 
 Muller, D. E. "Infinite Sequences and Finite Machines," AIEE 
 
 Proceedings 4th Annual Symposium of Sw. Circuit Theory and 
 Logic Design, pp. 3-l6, 1963. 
 
 Paz, A. "Some Aspects of Probabilistic Automata," Inf. and Cont. , 
 9 (1), pp. 223-246, 1968. 
 
62 
 
 [l6] Paz, A. "Fuzzy Star Functions, Probabilistic Automata, and 'Their 
 Approximation by Nonprobabilistic Automata," Journal on 
 Computer System Sciences , 1 (U), pp. 371-390, 1967- 
 
 [17] Rabin, M. 0. "Probabilistic Automata," Inf . and Cont . , 6_, 
 pp. 230-2^5, 1963. 
 
 [l8] Rabin, M. 0. "Decidability of Second Order Theories and Automata 
 
 on Infinite Trees," IBM Research Report RC-2012, February 1968. 
 
 [19] Robinson, T. T. Math klk Classnotes (Logic), University of 
 Illinois, 1965. 
 
 [20] Rosenberg, A. L. "Real-Time Definable Languages," JACM, Ik (U), 
 pp. 61+5-662, October 1967. 
 
 [21] Rounds, W. "Context Free Grammars on Trees," 1st Annual ACM 
 
 Symposium on Theory of Computing Proceedings, pp. 1^3-1^8, 
 May 1969. 
 
 [22] Salomaa, A. "Two Complete Axiom Systems for the Algebra of 
 
 Regular Events," JACM , 13 (l), pp. 158-169, January 1966. 
 
 [23] Santos, E. S. "Maximin Automata," Inf. and Cont. , 13 (H), 
 pp. 363-377, 1968. 
 
 [2U] Solomonoff, R. J. "A Progress Report on Machines to Learn to 
 Translate Languages and Retrieve Information," Advances 
 in Documentation and Library Science , 3_ ( 2 ) , 1961. 
 
 [25] Turakainen, P. "On Stochastic Languages," Inf. and Cont. , 12 (U) 
 1968. 
 
 [26] Van der Waerden, B. L. Modern Algebra, Vol. 1 , Frederick Ungar 
 Publishing Co., New York, New York, 1953. 
 
 [27] Zadeh, A. "Fuzzy Sets," Inf. and Cont. , _8 (3), pp. 338-353, 
 July 1965. 
 

 APPENDIX A 
 
 APPROXIMATION OF PROBABILISTIC TURING AUTOMATA 
 
 BY PROBABILISTIC PUSHDOWN AUTOMATA 
 
6k 
 
 A Probabilistic Pushdown Storage Automaton (called a pushdown 
 P automaton) consists of a finite state control unit, two one-way 
 infinite tapes (an input tape and a storage tape), a read head for the 
 input tape, and a read-write head for the storage tape. The machine 
 operates by changing states at discrete time intervals. The new state 
 and the string of symbols printed on the storage tape at time t+1 
 depend probabilistically upon the old state and the symbols read on the 
 two tapes at time t. This can be stated formally. 
 
 A 
 
 Definition: A Probabilistic Pushdown S t orage Automaton A over T is a 
 system (Q, M, S, 5) where Q is a finite set of states, 
 {q q_ ... a } „ S is a finite set of storage tape symbols, 
 
 {s n s^...s }, with s = #, H is an n-dimensional initial 
 12m m 
 
 state vector, and M is a probabilistic transition function. 
 
 A 
 
 A situation of A is a triple (q, s, a). M is started in the 
 situation (q n > #» a ) where £(q n ) > 0, and a is the first symbol (not 
 
 including #) of some string x e T which is printed on the input tape. 
 
 A 
 
 Pictorially, A has the following initial configuration. 
 
 ( 
 
 # 
 
 § 
 
 
 Then, using the terminology of Haines , (q' , s 1 ) e M(q, s, a) will be 
 written as (q, s, a)~^(q' , s') where p is the probability associated 
 with the transition. Only three types of instructions are allowed: 
 
65 
 
 (1) (q, s, q ) » ( i ' i s'). This means if the automaton is in the state 
 
 q, and scanning symbols s and a. on the storage and input tapes 
 
 respectively, then with probability p, a transition into state 
 q' will occur, and the storage and input tapes will move one 
 square to the left and then s' e S - {#} will be printed on the 
 storage tape one square to the right of s. 
 
 (2) s' = A implies the storage tape is left unchanged and unmoved, so 
 
 (q', s) e M(q, s, a.). The next situation is (q' , s, a ) 
 
 l 1+1 
 
 assuming 1 <_ i < n where the input string is x = a a ...a . 
 
 (3) s' = o implies the symbol s is erased from the scanned square of 
 
 the storage tape and the tape is moved one square to the right. 
 
 If the string written on the storage tape at time t before the 
 
 transition was s. s n ...s. s, s. £ S, then the string at time t+1 
 1 k i D 
 
 after the transition is s s ...s . (q', s ) e M(q, s, a. J and the 
 new situation is (q', s , a . ,). This is defined only if k >_ 0. 
 
 K 1 + 1 
 
 If a = X in any of these instructions, then the input tape is 
 left unmoved and the transition is independent of the input symbol 
 scanned, (q, s, X)— ^*.(q', s') implies (q' , s') e M(q, s, a) for all 
 a z T and the next situation is (q 1 , s', a). 
 
 A 
 
 A pushdown P automaton A terminates if it is in a situation 
 (q, s, a) such that M(q_, s, a) = <J>. Also if X e M(q, s, a) then A may 
 terminate in acceptance if the read-write head is scanning the first 
 square of the storage tape and if during the last transition the read 
 head has just read a and the tape moved. We introduce the useful 
 
66 
 
 notational symbols q_^ and # to use in instructions, q is the 
 
 fictitious dead-end state discussed in Chapter 6 ((q, s, a) ■*■ (q , s') 
 
 means A e M(q, s, a)), and # is a fictitious "blank symbol written on 
 
 the square to the left of a^ and on all squares to the right of a on 
 
 n 
 
 the input tape. Note # j. T, so a = # is not allowed in any instruction. 
 
 A transition sequence for x e T + is a sequence of situations (q , s , a ), 
 
 (q, , s n . a n ) , . . . , (q , s , a ) where A is started in an initial 
 ill n n n 
 
 configuration with xy written on the input tape for some string 
 
 y e T*, and a sequence of elementary transitions consistent with M 
 
 P- 
 occur, (q. , s., a.) •» (q , s ), where p. + is the probability of 
 
 the transition. The situation after the sequence of transitions must 
 
 be (q, s , a ) where a = y, , the first terminal symbol of y. If 
 Ti n n n 1 
 
 further, y = A and (q , s , a ) is the final situation, (q, , #, #), then 
 
 n n n t 
 
 the sequence is called an accepting transition sequence . The probability 
 
 of a transition sequence is the product of the probabilities of the 
 
 elementary transitions, p = IT p.. The probability of partial acceptance 
 
 k i=l X 
 
 m 
 of x is y*(x) = 2 £(q n )p,, where m is the number of transition sequences 
 k=l ° K 
 
 for x. The probability of acceptance of x is 
 
 a 
 
 m(x) = Z £(q )p where i is the number of accepting transition sequences 
 
 k=l ° k 
 
 for x. The analogue of type h normalization (i.e., the total probability 
 
 of leaving any state must sum to l) is £ Z E E p[(q, s, a) 
 
 seSs' eSaeTq* eQ 
 
 (q', s')] = 1. 
 
61 
 
 l/2j n 
 
 s) 
 
 l/2 Jfl 
 
 a) 
 
 Example ?: A pushdown P automaton for the language {a b |n > 0} 
 This automaton is type h normalized. 
 
 A = (Q, M, S, S) over T = {a,b} 
 
 (1) Q = {q Q , q x , q 2 > 
 
 (2) S = {sj} 
 
 (3) M has instructions: 
 
 (q. » #» a)— ^(q 1> s) 
 
 (q 1 , s, a) 
 (q. lS s, b) 
 
 (q 2 , s, b) ^(q 2 , a) 
 
 (q 2 , s, b) »-(q t » °) 
 
 (1+) S = (1, 0, 0) 
 
 The model of Turing Automaton defined herein is derived from 
 the 1-tape online turing machine . A Probabilistic Turing Automaton 
 
 (called a Turing P automaton) is basically a pushdown P automaton in 
 which the storage tape is allowed to move left and right without erasing. 
 More formally, 
 
 A 
 
 Definition: A Probabilistic Turing Automaton A over T is a system 
 (Q, M, S, 5) where Q is a finite set of states, S is a 
 finite set of storage tape symbols, M is a probabilistic 
 transition function, M: Q x T x sJ*.P(Q x S x J) U M 
 where J = {-1, 0, 1} and H is the initial distribution vector. 
 An instruction of A may be written (q, s, a)— *-(q' , s', j) where 
 
 j=l indicates move the tape one square to the left and then print s', 
 
68 
 j=-l implies print s' and then move to the right, and j=0 implies 
 print s' but do not move the tape. 
 
 All other definitions and restrictions are the same as for 
 pushdown P automata. This version of Turing automaton differs from 
 the usual in that it is probabilistic, it contains a move-bef ore-write 
 type of instruction, its storage tape is only infinite to the right, and 
 it must terminate by scanning the # in the initial square. It is easy 
 to see that the latter three alterations do not change the set of 
 languages accepted by a Turing automaton. 
 Algorithm: Given any Turing P automaton, A, the following procedure 
 
 A 
 
 yields a Pushdown P automaton A' , which accepts an 
 approximation language L(A' ) = (T , y'), to the language 
 accepted by A, L(A) = (T + , u). Let A = (Q, M, S, 5) 
 where we restrict H to be 1 for one state (initial state 
 <lJ and for all others, then A 1 = (Q 1 , M 1 , S 1 , =')• S' 
 
 consists of all of S together with stack symbols ?s for ■ 
 all symbols s in S. M 1 can be described by a set of 
 instructions: 
 
 (a) For each (q, s, a)J^q' , s', +1) in M, put (q, s, a)-2*4q' , s') 
 and (q, ?s , a)-£*.(q' , s* ) into M' . 
 
 (b) For each (q, s, a) J*-(q' , s', 0) in M, put (q, s, a)J*-(q Q , c), 
 (q Q , A, A)— *-(q', s*), and (q, ?s, a)-E^{q_ Qi o) into M 1 , where 
 
 q e Q' is a new state which only appears in these three instructions 
 
(0) For each ( 4l s, .) J*,. , ,r, ^, in M> put (q _ g> a) _p^ ^ 
 
 l V X, X)-^. a), (^ x, A)-^., „.), md (q> ?s> a) ^^ o) 
 into P. Put new symbols q ± and q g into Q' . 
 
 (« For each < q , s, a)^ (q ., x , _ x) in M> put (q> ,_ a) ^ q , ^ g) ^ 
 
 (q. ?s, a)-^(q», CT ) into M> 
 (e) For each (,, .. a ) J^, , A , 0) ln M> ^ ^ ^ &) ^ ^ ^ 
 
 (l, ?s, aJ-JUq.', A) into M. 
 W For each ( q , s, ajJWo." , A , +1) ln M> ^ {q> ^ a) _^ qij ^ 
 into M for all symbols !s defined in S'. 
 
 The six instruction types cover all possible directions, with 
 and without printing, that the Turing P automaton can move. Notice that 
 the pushdown P automaton can simulate all of these transitions except 
 -ving to the right without printing, in which case it prints ,.. ft us , 
 if type (f) instructions are not used in the Turing P automaton program,' 
 then the P language recognized will also be recognized by the pushdown P 
 automaton constructed from the Turing P automaton by the previous 
 algorithm. Furthermore, if the s qU are in which ? s would get printed by 
 the pushdown P automaton is never revisited by the Turing P automaton, 
 then the approximation is again exact. 
 
 Definition: The (^-normalized) Initial Definite P language, ^ of 
 a pushdown ^P automaton «■ which approximates a Turing P 
 automaton a" is the set of all strings x which are maximal 
 initial definite segments of strings z c T* 3 p(z) > o. 
 The probability assigned to x is the probability of partial 
 
TO 
 
 acceptance with respect to A, y (x) = y*(x). All 
 other elements y of T* have y (y) =0. x is maximal 
 
 initial definite if it fulfills 
 
 (1) initial: 3 y e T* 3- xy = z, y(z) > 
 
 (2) definite: -J a transition sequence for x with respect to A' which 
 contains no (a, q, ?s)— ±-*^q' , 3') type instructions (called 
 indefinite instructions ) . 
 
 (3) maximal: -^ a transition sequence for x fulfilling (2) above which 
 can be extended to an accepting transition sequence for z with 
 
 A 
 
 respect to A' such that the first instruction after the initial 
 
 definite segment x is indefinite, (q, ?s, a) ■> (q', s'). 
 
 Define lid = I y (x). Define L state set as the 
 
 x e T + 
 
 set of states of Q' in which the pushdown P automaton can 
 
 reside after strings x 3- y (x) > have been input. 
 
 A A 
 
 Theorem: Let A be a Turing P automaton. Let A' be the approximating 
 pushdown P automaton as described above. If the set of 
 
 A 
 
 states of A' reachable from the L y state set all have type 
 
 k normalization (where q reachable means there is a transition 
 sequence such that q^ = q, and q_ e L state set) then the 
 
 following error bound holds: 
 
 E |y(z) - y'(z)|< lid - I y(.z) 
 
 z e T + z e T + 
 
 Proof: Consider a Markov chain whose states are the states of the 
 
 A 
 
 automaton A' , and whose transition probabilities p. . are just 
 
71 
 the probabilities p. , of a transition from state q. to 
 
 q . Take as initial state any q. e L state set. By 
 
 J 
 
 normalization, E p = 1. 
 
 It is then possible to compare y (x) to y'(z) for all strings 
 
 A 
 
 z e L(A' ) 3 z = xy for some string y. y (x) >_ £ y ' (z) . 
 
 z = xy 
 
 The inequality is not a strict equality because we have not 
 excluded anomalies such as instructions which lead to dead- 
 end, nonaccepting states. Since all accepting transition 
 
 A 
 
 sequences with respect to A' begin with transition sequences 
 
 for some x with u ID (x) > 0, 2 y (x) > E y'(z). 
 
 X Z L ID ID ~ z e T + 
 
 lid>_ I u'lz), E Jy(z)-y'U)|= I y'(z)- [z) 
 
 zeT zeT zeT 
 
 because each string of this is accepted under this approximation 
 
 with probability >_ y(z). 
 
 = E + y ' (z) - I + y(z) 
 zeT zeT 
 
 = lid - Z y(z) 
 
 zeT 
 
 = lid - 1 if Lift 1 ) is an NP language. 
 
 Thus, if error of approximation is measured by 
 
 E J y I z) - y'(z)|, then an error bound is lid - 1. 
 zeT 
 
12 
 
 APPENDIX B 
 EXAMPLES OF REGULAR TREE EXPRESSIONS 
 
73 
 
 Expression conventionally written (a U b)*c is written 
 
 a <to> U b <u)> \j c 
 
 The context free grammar of example 7 is 
 
 A ■*> bAB, A -*■ bB, B -v b 
 
 A + cAC, A -> cC, C ->- c 
 
 The corresponding RTE is 
 
 b <[u)]*[b]> + c <[w]«[c]> + b <[b]> + c <[c]> 
 
 The PTA of example k is 
 
 a,p 
 
 
 The corresponding RTE is 
 
 (l) (0) ■«•— superscript probabilities 
 
 a <[wl]*[b < [wl]»[u2]> + b <[a>l]>]> + a <[wl]> 
 
 « — subscript vectors 
 
 ( p;> K> 
 
 ( v 
 
 <*a> 
 
7* 
 
 VITA 
 
 Clarence Arthur Ellis was born in Chicago, Illinois, on 
 May 11, 19^3. He graduated from Beloit College in 196U with the degree 
 of Bachelor of Arts in Physics and Mathematics. Since that time he has 
 been a research assistant in the Department of Computer Science at the 
 University of Illinois. In 1966 he graduated from the University of 
 Illinois with the degree of Master of Science in Mathematics. 
 
 He has been a member of Sigma Xi , the Association for Computing 
 Machinery, the Mathematical Association of America and the Institute of 
 Electrical and Electronics Engineers. 
 
=ormAEC-427 U.S. ATOMIC ENERGY COMMISSION 
 
 (6/68) UNIVERSITY-TYPE CONTRACTOR'S RECOMMENDATION FOR 
 
 AECM3201 DISPOSITION OF SCIENTIFIC AND TECHNICAL DOCUMENT 
 
 I S»e Instructions on Revmne Sid* ) 
 
 1. AEC REPORT NO. 
 
 COO-1U69-01U9 
 
 2. TITLE 
 
 PROBABILISTIC LANGUAGES AND AUTOMATA 
 
 3. TYPE OF DOCUMENT (Check one): 
 
 Q a. Scientific and technical report 
 
 12 b. Conference paper not to be published in a journal: 
 
 Title of conference 
 
 Date of conference 
 
 Exact location of conference. 
 Sponsoring organization 
 
 (2?]c. Other (Specify) Thesis 
 
 4. RECOMMENDED ANNOUNCEMENT AND DISTRIBUTION (Check one): 
 
 a. AEC's normal announcement and distribution procedures may be followed. 
 
 2 b. Make available only within AEC and to AEC contractors and other U.S. Government agencies and their contractors. 
 2 c. Make no announcement or distrubution. 
 
 5. REASON FOR RECOMMENDED RESTRICTIONS: 
 
 6. SUBMITTED BY: NAME AND POSITION (Please print or type) 
 
 C. W. Gear, Professor 
 
 and Principle Investigator 
 
 Organization 
 
 Department of Computer Science 
 University of Illinois 
 Urbana, Illinois 618OI 
 
 Signature 
 
 ^K&k*i*** 
 
 Date 
 
 October 1969 
 
 FOR AEC USE ONLY 
 
 7. AEC CONTRACT ADMINISTRATOR'S COMMENTS, IF ANY, ON ABOVE ANNOUNCEMENT AND DISTRIBUTION 
 RECOMMENDATION: 
 
 8. PATENT CLEARANCE: 
 
 LJ a. AEC patent clearance has been granted by responsible AEC patent group. 
 LJ b. Report has been sent to responsible AEC patent group for clearance. 
 l_J c. Patent clearance not required.