Inference on the History of a Randomly Growing Tree
Harry Crane and Min Xu
May 18, 2020

Abstract. The spread of infectious disease in a human community or the proliferation of fake news on social media can be modeled as a randomly growing tree-shaped graph. The history of the random growth process is often unobserved but contains important information such as the source of the infection. We consider the problem of statistical inference on aspects of the latent history using only a single snapshot of the final tree. Our approach is to apply random labels to the observed unlabeled tree and analyze the resulting distribution of the growth process, conditional on the final outcome. We show that this conditional distribution is tractable under a shape-exchangeability condition, which we introduce here, and that this condition is satisfied for many popular models for randomly growing trees such as uniform attachment, linear preferential attachment and uniform attachment on a $D$-regular tree. For inference of the root under shape-exchangeability, we propose computationally scalable algorithms for constructing confidence sets with valid frequentist coverage as well as bounds on the expected size of the confidence sets. We also provide efficient sampling algorithms that extend our methods to a wide class of inference problems.

Many growth processes, such as the transmission of disease in a human population, the proliferation of fake news on social media, the spread of computer viruses on computer networks, and the development of social structures among individuals, can be modeled as a growing tree-shaped network. We visualize the process as a growing tree, as in Figure 1(a), with each individual corresponding to a node labeled by its arrival time and associations between individuals represented by an edge between the corresponding nodes. In the examples mentioned above, the edges of the tree may correspond, respectively, to the transmission of a disease, the spread of a rumor, the passage of a computer virus, or the establishment of a friendship between the connected nodes. For concreteness, we frame the following discussion in the context of disease spread, so that the edges of the tree represent person-to-person spread of an infection.

In this setting, we assume there is an initial infected individual, called the root, at time $t = 1$. At each discrete time step $t = 2, 3, \ldots$, a new individual becomes infected by one of the individuals previously infected at times $1, \ldots, t-1$ according to a probability distribution that depends on the past history of the process. By tracking the infection through the growth of the corresponding infection tree (as in Figure 1(a)), the process of infection thus produces a sequence of trees $\mathbf{t} = (t_1, t_2, \ldots)$ which represents the complete history of infection in the population. In particular, $t_n$ is a tree with $n$ nodes and a directed edge $uv$ indicating that node $u$ passed the infection to node $v$. Equivalently, labels $1, \ldots, n$ can be assigned to the nodes of $t_n$ according to the time at which each node arrives, so that any edge $(u, v) \in t_n$ is immediately interpreted as a transmission from the node with the lower label to the one with the higher label.

Figure 1: (a) A realization of a tree growth process with nodes labeled in order of arrival. (b) The shape of the process in (a) with node labels removed. The main objective of the paper is to infer properties of the tree in (a) based only on the observed shape in (b).
In epidemiological applications, the infection history is important for enacting measures which prevent further spread, such as quarantining and testing. But in many cases the complete infection history is not fully observable. In the 2020 outbreak of Covid-19, for example, long incubation periods and asymptomatic spreading lead to incompleteness in the observed infection history (i.e., some or many edges in the infection tree are unobserved) as well as uncertainty about the direction of spread for those edges which are observed, as shown in Figure 1(b). Here we consider the problem of inferring properties of the disease history based on such partial observations of the infection spread. Properties of special importance include inference of the root node (the so-called 'patient-zero') and inference of infection times. We develop our methods in the specific context in which the contact pattern has been observed (i.e., the 'shape' of the infection tree is known but its edge directions are not). We discuss some relaxations of this assumption and limitations of our methods in Section 6.

Suppose that a disease has been transmitted among $n \geq 1$ individuals (as illustrated in Figure 1(a)) but only the contact pattern is observed (as in Figure 1(b)). We propose a methodological framework for answering general inference questions about the infection spread based on the observed contact pattern. In the context of disease spread, important relevant questions include inference of the so-called 'patient-zero' (i.e., the initial source of infection) and of the times at which specific individuals became infected. Our proposed method efficiently computes the conditional probability of the disease transmission history given the observed contact process. These conditional probabilities enable us to construct valid confidence sets for the inference questions of interest. These inference procedures may be applied in social media networks to help identify key spreaders and promoters of fake news, or applied to the infection pattern of an epidemic (often obtained from contact tracing) to localize patient-zero or to reconstruct the chain of infection (Hens et al., 2012; Keeling and Eames, 2005).

Researchers in statistics (Kolaczyk, 2009), computer science (Bollobás et al., 2001), engineering, and physics (Callaway et al.) have studied the probabilistic properties of various random growth processes of networks, including popular models such as the preferential attachment model (Barabási and Albert, 1999). A line of research in the physics and engineering literature explores the problem of full or partial recovery of the network history based on a final snapshot (Cantwell et al.; Timár et al., 2020; Sreedharan et al.; Magner et al.). However, the problem of statistical inference on the history of a network growth process has been studied only recently. In statistics, most existing work focuses on the problem of root estimation (Shah and Zaman, 2011, 2016; Fioriti et al., 2014; Shelke and Attar) and root inference (Bubeck, Devroye and Lugosi, 2017). The latter work by Bubeck, Devroye and Lugosi (2017) shows that one can construct a confidence set for the root whose size does not increase with the size of the network $n$.
These directions are further developed by Bubeck, Eldan, Mossel and Rácz (2017) and Devroye and Reddad (2018), who consider inference on a seed tree; by Shah and Zaman (2016), who analyze situations where consistent root estimation is possible; and by Khim and Loh (2017), who extend the results of Bubeck, Devroye and Lugosi (2017) to the setting of uniform attachment on a $D$-regular tree. These results, although theoretically sound, do not give a practical method for inferring the root of a tree based on an observed contact pattern. For example, the confidence set algorithms in Bubeck, Devroye and Lugosi (2017) and Khim and Loh (2017) (described in detail in Section 2.3) only give asymptotic coverage guarantees that hold under specific model assumptions and tend to be too conservative for practical purposes; see the numerical examples in Section 5.1 for further discussion.

In this paper, we address the above issues by proposing a new approach for inference on the history of a randomly growing tree. Our approach is to randomly relabel the nodes of the observed shape to obtain a random tree $\tilde{T}_n$ whose labels are random but whose shape corresponds to that of the observed contact pattern. Random relabeling thus induces a latent sequence of subtrees $\tilde{T}_1 \subset \tilde{T}_2 \subset \cdots \subset \tilde{T}_{n-1} \subset \tilde{T}_n$ which represents the history of $\tilde{T}_n$. We study the conditional distribution of the history $P(\tilde{T}_1, \ldots, \tilde{T}_{n-1} \mid \tilde{T}_n)$ and show that this conditional distribution is tractable under a shape-exchangeability property, a distributional invariance property which is satisfied by the most common instances of the preferential attachment models, including linear preferential attachment, uniform attachment and $D$-regular uniform attachment. For most of the paper, we focus on the problem of root inference, where our proposed method has a coverage guarantee for all $n$ and produces practical confidence sets even for trees of millions of nodes. However, our approach is applicable to a wide class of other inference problems, such as inference of arrival times and inference of the initial subtree given the final shape.

The paper is organized as follows. In Section 2 we define the problem and review some earlier work. In Section 3 we describe our approach using random labeling and the shape-exchangeability condition. In Section 4 we discuss general inference problems for observed contact patterns. In Section 5 we present simulation studies and illustrate our methods on flu data from a London school. We conclude with a discussion in Section 6.

For $n \in \mathbb{N}$, we write $[n] := \{1, \ldots, n\}$ and $[n]_0 := \{0, 1, \ldots, n\}$. We let $t = (V, E)$ denote a labeled tree, where $V$ is the finite set of nodes (generally taken to be $[n]$, where $n$ is the size of the tree) and $E \subset V \times V$ is the set of edges. For two finite sets $A, B$ of the same size, we write $\Theta(A, B)$ for the set of all bijections between $A$ and $B$. For two labeled trees $t, t'$, we write $\pi t = t'$ for $\pi \in \Theta(V(t), V(t'))$ if $(u, v)$ is an edge in $t$ if and only if $(\pi(u), \pi(v))$ is an edge in $t'$. In this case, we say that $\pi$ is an isomorphism between $t$ and $t'$. We note that for any labeled tree $t$ and any bijection $\pi \in \Theta(V(t), V(t))$, $\pi t$ is always a valid tree that represents the result of relabeling the nodes of $t$ by $\pi$. For a labeled tree $t$ and a subset of nodes $B \subset V(t)$, we write $t \cap B$ for the (possibly disconnected) subgraph of $t$ restricted to the nodes in $B$. Throughout the paper, we use bold upper-case letters such as $\mathbf{T}$ to denote a random tree and bold lower-case letters such as $\mathbf{t}$ to denote a fixed tree.
For any two random objects $X, Y$, we write $X \stackrel{d}{=} Y$ if they are equal in distribution.

Following the terminology used in discrete mathematics, we define a labeled tree $t_n$ with $n$ nodes as a recursive tree if $V(t_n) = [n]$ and if the subtree $t_n \cap [k]$, obtained by removing all nodes except those with labels $\{1, 2, \ldots, k\}$, is connected for every $k \in [n]$. In other words, any path from node 1 to any other node must be increasing in the node labels. Equivalently, we may view a recursive tree $t_n$ as a map $[n] \to [n]_0$ with $t_n(j) = i$ indicating that the parent of $j$ is $i$, for $i < j$, and $t_n(j) = 0$ indicating that node $j$ is the root of $t_n$. Figure 1(a) illustrates a recursive tree with 8 nodes. We write $\mathcal{T}^R_n$ to denote the set of all recursive trees with $n$ nodes.

For recursive trees $t_k, t_{k'}$ with node labels $[k]$ and $[k']$, respectively, for $k < k'$, we write $t_k \subset t_{k'}$ if $t_{k'} \cap [k] = t_k$. That is, $t_k \subset t_{k'}$ indicates that $t_k$ is the subtree of $t_{k'}$ obtained by removing all nodes and edges from $t_{k'}$ except those labeled in $[k]$. For $k < k'$, we call $t_k$ and $t_{k'}$ compatible if $t_k \subset t_{k'}$. A family $(t_n)_{n \geq 1}$ such that $t_k \subset t_{k'}$ for all $1 \leq k < k'$ is called mutually compatible.

A Markovian tree growth process is a mutually compatible family of random recursive trees $\mathbf{T} = (T_n)_{n \geq 1}$ such that $T_n \in \mathcal{T}^R_n$ and
$$P(T_n = t_n \mid T_{n-1} = t_{n-1}) = p_n(t_n \mid t_{n-1})$$
for each $n \geq 1$ and for some family of transition probabilities $(p_n)_{n \geq 1}$. Any such process $\mathbf{T} = (T_1, T_2, \ldots)$ can therefore be constructed by sequentially adding nodes at discrete times $k = 2, 3, \ldots$ according to these transition probabilities and labeling the nodes of $T_n$ by their arrival time.

Example 1. For a straightforward example of a growth process, define
$$p_n(t_n \mid t_{n-1}) = \frac{1}{n-1} \quad \text{for every compatible pair } t_{n-1} \subset t_n,$$
so that new nodes attach to existing nodes uniformly at each time. The resulting process is called the uniform attachment model.

Example 2. For another popular example, let $\deg(w, t_k)$ denote the degree of node $w$ in $t_k$ and, for $t_{k-1} \subset t_k$, let $w^*(t_{k-1}, t_k)$ denote the unique node of $t_{k-1}$ to which the node labeled $k$ connects to form $t_k$. We then define transition probabilities by
$$p_k(t_k \mid t_{k-1}) = \frac{\deg\big(w^*(t_{k-1}, t_k),\, t_{k-1}\big)}{2(k-2)}.$$
The resulting tree process is called the linear preferential attachment model, as nodes with high degree tend to accumulate more connections.

Example 3. To generalize the previous two examples, we define a general class of preferential attachment (PA) processes indexed by a fixed function $\phi : \mathbb{N} \to [0, \infty)$ as follows. For short, we call these $\mathrm{PA}_\phi$ processes. Given such a function $\phi$, we generate $T_1$ as a singleton node with label 1 and, for each $n = 2, 3, \ldots$, generate $T_n$ as a random tree where we

1. choose an existing node $w \in [n-1]$ of $T_{n-1}$ with probability proportional to $\phi(\deg(w, T_{n-1}))$, where $\deg(w, T_{n-1})$ denotes the degree of node $w$ in $T_{n-1}$, and

2. add the edge $(n, w)$ to the tree $T_{n-1}$ to form $T_n$.

The resulting transition probabilities of this process thus satisfy
$$p_n(t_n \mid t_{n-1}) = \frac{\phi\big(\deg(w^*(t_{n-1}, t_n), t_{n-1})\big)}{\sum_{v \in [n-1]} \phi\big(\deg(v, t_{n-1})\big)}.$$
Common examples of the function $\phi$ include (a) uniform attachment, where $\phi(d) = 1$ for all $d \in \mathbb{N}$ (as in Example 1); (b) linear preferential attachment, where $\phi(d) = d$ (as in Example 2); (c) uniform attachment on a $D$-regular tree, where $\phi(d) = \max(0, D - d)$ for some $D \geq 2$; and (d) sublinear preferential attachment, where $\phi(d) = d^\gamma$ for some $\gamma \in (0, 1)$. We also note that Gao et al. (2017) study the estimation of the parameter function $\phi$.

For labeled trees $t, t'$, not necessarily recursive, we write $t \sim t'$ if there exists a $\pi \in \Theta(V(t), V(t'))$ such that $\pi t = t'$. We define the shape of $t$ as the equivalence class of all $t'$ that are equivalent to $t$ up to some relabeling,
$$\mathrm{sh}(t) := \{t' \text{ labeled tree} : t' \sim t\}.$$
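The $\mathrm{PA}_\phi$ processes of Examples 1-3 are straightforward to simulate directly from their transition probabilities. The following Python sketch is ours and purely illustrative; the function name, the parent-map output format, and the uniform tie-breaking at the degenerate first step (where all weights may vanish, as under linear preferential attachment) are our own conventions.

```python
import random

def simulate_pa_tree(n, phi, seed=None):
    """Sample a recursive tree from the PA_phi process of Example 3.

    Returns the parent-map representation of the recursive tree:
    parent[j] = i means node j attached to node i, and parent[1] = 0
    marks the root.  `phi` maps a degree to a nonnegative weight.
    Convention (our assumption, to handle phi(0) = 0 as under linear
    preferential attachment): if all weights vanish, attach uniformly.
    """
    rng = random.Random(seed)
    parent, degree = {1: 0}, {1: 0}
    for k in range(2, n + 1):
        nodes = list(degree)
        weights = [phi(degree[w]) for w in nodes]
        if sum(weights) > 0:
            w = rng.choices(nodes, weights=weights)[0]
        else:
            w = rng.choice(nodes)  # degenerate first step(s)
        parent[k] = w
        degree[w] += 1
        degree[k] = 1
    return parent

# Examples 1-3: uniform, linear preferential, and 8-regular uniform attachment.
ua  = simulate_pa_tree(1000, lambda d: 1.0, seed=0)
lpa = simulate_pa_tree(1000, lambda d: float(d), seed=0)
reg = simulate_pa_tree(1000, lambda d: max(0.0, 8.0 - d), seed=0)
```

Since `rng.choices` rescans the weight list at every step, the sketch runs in $O(n^2)$ time; this is fine for illustration, though a more careful sampler would be preferred at the tree sizes used in Section 5.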
A shape is typically referred to as an "unlabeled tree" or just a "tree". We prefer the term "shape" here to emphasize that $\mathrm{sh}(t)$ does not refer to a single tree but rather to the equivalence class of all trees with a given structure. Figure 1(b) shows the 'shape' of the labeled tree in Figure 1(a). In our analysis, this 'shape' represents the set of all trees that produce the same structure after removing labels.

Since $\mathrm{sh}(t_n)$ is an equivalence class, we need to first define what it means to refer to the nodes of $\mathrm{sh}(t_n)$. To that end, we let $U_n$ be an alphabet of $n$ distinct letters and represent the shape $\mathrm{sh}(t_n)$ by an arbitrary labeled tree $t^*_n \in \mathrm{sh}(t_n)$ with node labels $U_n$. We may now refer to the nodes of $\mathrm{sh}(t_n)$ through its labeled representation $t^*_n$. For convenience, we define $\tilde{\mathcal{T}}_n$ as the set of all labeled trees with $n$ nodes whose labels take values in the alphabet $U_n$. We note that it is necessary to work with a labeled representation of a shape because (i) our inference questions make reference to properties of specific nodes (e.g., a particular node being the root) and (ii) computer programs require as input a labeled tree instead of an equivalence class of labeled trees. Because the choice of the labeled representation is arbitrary, we only consider inference methods that are independent of this choice. We formalize this with the notion of labeling-equivariance in Remark 2.

Figure 2: Both $\rho$ and $\rho'$ are isomorphisms, but $\mathrm{root}_\rho(T_n) = A$ whereas $\mathrm{root}_{\rho'}(T_n) = D$.

Let $\mathbf{T} = (T_k)_{k \geq 1}$ be a Markovian growth process for which we observe only the unlabeled shape $\mathrm{sh}(T_n)$ for some $n \geq 1$. For a labeled representation $T^*_n \in \mathrm{sh}(T_n)$ whose node labels take values in $U_n$, we say that the root of $T_n$ is $v$ if the node labeled $v \in U_n$ in the labeled representation $T^*_n$ corresponds to the root of $T_n$. More precisely, letting $\rho \in \Theta([n], U_n)$ be an unobserved bijection such that $\rho T_n = T^*_n$, we define
$$\mathrm{root}_\rho(T_n) := \rho(1) \in U_n. \qquad (2)$$

Remark 1. It is important to note that the root node depends on the choice of the isomorphism $\rho$; there could exist $\rho' \in \Theta([n], U_n)$ such that $\rho' T_n = T^*_n$ and $\rho'(1) \neq \rho(1)$; see Figure 2. This is because multiple nodes of an unlabeled shape $\mathrm{sh}(T_n)$ can be indistinguishable (see (18) in Section 3.6 for a formal definition of indistinguishable nodes). We show in Remark 2 that this issue does not pose a problem for our inference procedure as long as we only consider labeling-equivariant confidence sets.

Formally, the problem of root node inference is to construct, for a given confidence level $\varepsilon \in (0, 1)$, a confidence set $C_\varepsilon(T^*_n) \subset U_n$ such that
$$P\big(\mathrm{root}_\rho(T_n) \in C_\varepsilon(T^*_n)\big) \geq 1 - \varepsilon. \qquad (3)$$
As a trivial solution is to let $C_\varepsilon$ be the set of all nodes $U_n$, an important aspect of the inference problem is to make the confidence set $C_\varepsilon(\cdot)$ as small as possible while still maintaining the valid coverage (3). We note that the root node cannot be consistently estimated since, for a tree of 2 nodes, it is impossible to distinguish which node is the root.

Remark 2. Since our observation is the unlabeled network $\mathrm{sh}(T_n)$, it is natural to require the confidence set $C_\varepsilon(\cdot)$ to be labeling-equivariant, so that it does not depend on the choice of the labeled representation $T^*_n$. More precisely, for any $\tau \in \Theta(U_n, U_n)$, we require
$$\tau\big(C_\varepsilon(T^*_n)\big) = C_\varepsilon(\tau T^*_n), \qquad (4)$$
where $\tau(C_\varepsilon(T^*_n))$ is the set containing the image of all members of $C_\varepsilon(T^*_n)$ under $\tau$. In particular, if $\tau$ is an automorphism in the sense that $\tau T^*_n = T^*_n$, we require $C_\varepsilon(T^*_n)$ to be invariant with respect to $\tau$.
The definition of the root node (2) relies on a particular isomorphism $\rho$. However, for a labeling-equivariant confidence set, as in (4), the probability of coverage (3) does not depend on this choice of isomorphism. Indeed, if we write $u \in U_n$ for the root node $\mathrm{root}_\rho(T_n)$ with respect to $\rho \in \Theta([n], U_n)$, then, for any $\tau \in \Theta(U_n, U_n)$, the node labeled $\tau(u)$ is the root node under the alternative labeling $\tau \circ \rho$, and we see that $u = \mathrm{root}_\rho(T_n) \in C_\varepsilon(\rho T_n)$ if and only if $\tau(u) = \mathrm{root}_{\tau \circ \rho}(T_n) \in C_\varepsilon((\tau \circ \rho) T_n)$.

An unlabeled shape $s$ may contain indistinguishable nodes that are given different labels in a labeled representation. For example, the nodes labeled $A, C, D$ in the labeled tree in Figure 2 are indistinguishable. However, any set of indistinguishable nodes must be either all included in or all excluded from a labeling-equivariant confidence set, by the fact that such confidence sets are invariant with respect to automorphisms.

More generally, for a labeled representation $T^*_n$ of the observed shape $\mathrm{sh}(T_n)$, where $\rho T_n = T^*_n$ for some $\rho \in \Theta([n], U_n)$, we let $S$ be a discrete set and let $f_\rho : \mathcal{T}^R_n \to S$ be a function. For a confidence level $\varepsilon \in (0, 1)$, our inference problem is to construct a confidence set $C_\varepsilon(T^*_n) \subset S$ such that
$$P\big(f_\rho(T_n) \in C_\varepsilon(T^*_n)\big) \geq 1 - \varepsilon.$$

One example is inference on the arrival time of a node. Let $S = [n]$ and, for a given element $u \in U_n$, define the random arrival time of node $u$ as
$$\mathrm{Arr}^{(u)}_\rho(T_n) := \rho^{-1}(u) \in [n]. \qquad (5)$$
The problem is then to construct, for a given $\varepsilon \in (0, 1)$, a confidence set $C^{(u)}_\varepsilon(T^*_n) \subset [n]$ such that the true arrival time $\mathrm{Arr}^{(u)}_\rho(T_n)$ is contained in $C^{(u)}_\varepsilon$ with probability at least $1 - \varepsilon$. Since $C^{(u)}_\varepsilon$ is a set of possible arrival times of the node $u \in U_n$, we again impose a labeling-equivariance requirement and assume that $C^{(u)}_\varepsilon$ does not depend on the labeled representation $T^*_n$ of the unlabeled shape $\mathrm{sh}(T_n)$, in the sense that $C^{(u)}_\varepsilon(T^*_n) = C^{(\tau(u))}_\varepsilon(\tau T^*_n)$ for any $\tau \in \Theta(U_n, U_n)$, as the node $u$ is relabeled $\tau(u)$ under the $\tau T^*_n$ representation. With this requirement and the fact that $\mathrm{Arr}^{(u)}_\rho(T_n) = \mathrm{Arr}^{(\tau(u))}_{\tau \circ \rho}(T_n)$ for any $\tau \in \Theta(U_n, U_n)$, we see that the coverage probability again does not depend on the choice of the isomorphism $\rho$.

Another example is seed-tree inference, which has been studied by Devroye and Reddad (2018). In this case, for a given $k \in \mathbb{N}$, let $S = \{t_k : |V(t_k)| = k \text{ and } V(t_k) \subset U_n\}$ consist of all trees with $k$ nodes labeled in $U_n$, and let $\mathrm{seed}^{(k)}_\rho(T_n) := \rho(T_n \cap [k])$ denote the subtree formed by the first $k$ arrivals. The confidence set $C_\varepsilon(T^*_n)$ for $\mathrm{seed}^{(k)}_\rho(T_n)$ would be a set of subtrees of size $k$ of $T^*_n$. As before, we require that $\tau(C_\varepsilon(T^*_n)) = C_\varepsilon(\tau T^*_n)$.

We may consider a number of other inference problems, e.g., given a subset of nodes $V \subset U_n$, in what order were they infected? The approach given below can be applied quite generally to most conceivable questions of this kind, all of which may be formalized in a manner identical to the examples above.

For the problem of root node inference, Bubeck, Devroye and Lugosi (2017) consider procedures that assign a centrality score to each node of the observed shape and then take the $K(\varepsilon)$ highest-scoring nodes as the confidence set, where $K(\varepsilon)$ is a size function whose value depends on the underlying distribution. More precisely, let $\mathrm{sh}(T_n)$ be the observed shape with labeled representation $T^*_n$ whose node labels take values in $U_n$. Let $\psi : U_n \times \tilde{\mathcal{T}}_n \to [0, \infty)$ be a scoring function and let $K(\varepsilon)$ be an integer for each $\varepsilon \in (0, 1)$. Assuming that the nodes $u_1, u_2, \ldots, u_n \in U_n$ are sorted so that
$$\psi(u_1, T^*_n) \geq \psi(u_2, T^*_n) \geq \cdots \geq \psi(u_n, T^*_n),$$
the resulting confidence set is $C_{K(\varepsilon), \psi}(T^*_n) := \{u_1, \ldots, u_{K(\varepsilon)}\}$. The $\psi$ function is labeling-equivariant in that, for any $\tau \in \Theta(U_n, U_n)$, we have $\psi(u, \tilde{t}_n) = \psi(\tau(u), \tau \tilde{t}_n)$.
The induced confidence set $C_{K(\varepsilon), \psi}$ is therefore also labeling-equivariant. With these definitions, Bubeck, Devroye and Lugosi (2017, Theorem 5) show that if the random recursive tree $T_n$ has the uniform attachment distribution, then there exists a function $\psi$ such that, whenever $K(\varepsilon)$ is sufficiently large,
$$\limsup_{n \to \infty} P\big(\mathrm{root}_\rho(T_n) \notin C_{K(\varepsilon), \psi}(T^*_n)\big) \leq \varepsilon.$$
In other words, so long as $K(\varepsilon)$ is large enough, $C_{K(\varepsilon), \psi}(T^*_n)$ has asymptotic confidence coverage of $1 - \varepsilon$. If $T_n$ is distributed according to the preferential attachment distribution, then Bubeck, Devroye and Lugosi (2017, Theorem 6) show that there exists $\psi$ such that $C_{K(\varepsilon), \psi}$ has asymptotic coverage when $K(\varepsilon) \geq C \log^2(1/\varepsilon)/\varepsilon^4$ for some universal constant $C > 0$. Lower bounds on the size $K(\varepsilon)$ are also provided. If $T_n$ is distributed according to uniform attachment on $D$-regular trees for some $D \geq 3$, then Khim and Loh (2017, Corollary 1) show that there exists $\psi$ such that $C_{K(\varepsilon), \psi}$ has asymptotic coverage when $K(\varepsilon) \geq C_D/\varepsilon$ for some constant $C_D > 0$ depending only on $D$.

These results are surprising in that, as the size of the tree increases, the size $K(\varepsilon)$ of the confidence set can remain constant; the intuition for this is that the "center" of the growing tree does not move significantly as new nodes arrive (Jog and Loh, 2015). However, these results have a number of shortcomings that make them impractical for real applications. First, the confidence guarantee is asymptotic and says nothing about the finite-$n$ case. Second, the bound on the size $K(\varepsilon)$ is theoretical and too conservative to be useful; as we show in Tables 3 and 4, the bound $K(\varepsilon)$ given in each case tends to be excessively conservative even for relatively large values of $n$. Third, it is necessary to know the model of the random recursive tree $T_n$ in order to choose the correct size $K(\varepsilon)$. In the next section, we present an alternative approach to constructing confidence sets for the root node that addresses all of these shortcomings.

The choice of the scoring function $\psi$ is crucial. The ideal choice is the likelihood, which is complicated to express, so we defer its formal definition to equation (19) in Section 3.6 to avoid interrupting the flow of exposition. Bubeck, Devroye and Lugosi (2017) remark that the likelihood is computationally infeasible and analyze a relaxation instead. One such relaxation is based on taking products of the sizes of the subtrees (termed rumor centrality by Shah and Zaman (2011)), and it plays an important role in our approach; see (16) for a precise definition. Interestingly, one implication of our work is that this product-of-subtree-sizes relaxation in fact induces the same ordering of the nodes as the true likelihood, so that the confidence set constructed from the relaxed likelihood is the same as the confidence set constructed from the true likelihood.

In this section, we describe our approach to root inference through the notions of label randomization and shape exchangeability. In Section 4, we show that the same approach applies to various other inference problems as well. Let $T_n$ be a random recursive tree and let $\mathrm{sh}(T_n)$ be its shape. Given $T_n$ and any labeled representation $T^*_n \in \mathrm{sh}(T_n)$, we may independently generate a random bijection $\Lambda$ uniformly in $\Theta(U_n, U_n)$ and apply it to $T^*_n$ to obtain a randomly labeled tree $\tilde{T}_n := \Lambda T^*_n$. We note here that the resulting object satisfies $\tilde{T}_n \stackrel{d}{=} \Pi T_n$, where $\Pi \in \Theta([n], U_n)$ is another uniform random bijection chosen independently of $T_n$ and $\Lambda$. In particular, the marginal distribution of $\tilde{T}_n$ does not depend on the choice of representative $T^*_n \in \mathrm{sh}(T_n)$.
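Concretely, label randomization simply pushes the observed representation through a uniform random bijection. A minimal sketch (ours; the string alphabet standing in for $U_n$ and the edge-set output are illustrative conventions):

```python
import random

def randomize_labels(parent, seed=None):
    """Apply a uniform random bijection Lambda to the node labels of a tree
    given as a parent map (as in the earlier sketch).  Returns the edge set
    of the relabeled tree on a fresh alphabet standing in for U_n, which is
    all that an observer of the shape sh(T_n) gets to see."""
    rng = random.Random(seed)
    nodes = list(parent)
    alphabet = ["v%d" % i for i in range(len(nodes))]   # stand-in for U_n
    lam = dict(zip(nodes, rng.sample(alphabet, len(nodes))))
    edges = [(lam[i], lam[j]) for j, i in parent.items() if i != 0]
    return edges
```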
To fix notation, from now on we write $\tilde{T}_n$ to denote a random labeled tree generated in this way. For any $k \in [n]$, we also define $\tilde{T}_k := \Pi|_{[k]} T_k$, where $\Pi|_{[k]}$ denotes the restriction of the random bijection $\Pi$ to the domain $[k]$. In this way, $\tilde{T}_1 \subset \tilde{T}_2 \subset \cdots \subset \tilde{T}_n$. Since each of the random subtrees $\tilde{T}_k$ has the same shape as $T_k$, the random shape $\mathrm{sh}(\tilde{T}_k)$ has the same distribution as $\mathrm{sh}(T_k)$. In particular, the label of the singleton $\tilde{T}_1$ is drawn uniformly from $U_n$, and each subsequent node added at time $k$ is drawn uniformly from $U_n \setminus V(\tilde{T}_{k-1})$.

We can interpret label randomization as an augmentation of the probability space. An outcome for a Markov tree growth process is a recursive tree $t_n \in \mathcal{T}^R_n$, whereas an outcome for the label-randomized sequence $\tilde{T}_1 \subset \tilde{T}_2 \subset \cdots \subset \tilde{T}_n$ is a pair $(t_n, \pi)$, where $t_n$ is a recursive tree and $\pi$ is a bijection from $[n]$ to $U_n$. By defining $\tilde{t}_k := \pi|_{[k]} t_k$, we obtain a bijective correspondence between a pair $(t_n, \pi)$ and a sequence of nested subtrees $\tilde{t}_1 \subset \tilde{t}_2 \subset \cdots \subset \tilde{t}_n$ where $V(\tilde{t}_n) = U_n$. We define any such sequence of nested subtrees (equivalently, any pair $(t_n, \pi)$) as a tree history, or simply a history for short. Intuitively, $\pi^{-1} \in \Theta(U_n, [n])$ is the isomorphism that gives the ordering information of the nodes in $\tilde{t}_n$, as the node labels of $\tilde{t}_n$ take values in the alphabet $U_n$. Given the correspondence between the sequence $\tilde{t}_1 \subset \tilde{t}_2 \subset \cdots \subset \tilde{t}_n$ and the pair $(t_n, \pi)$, the probability of a history is given by
$$P(\tilde{T}_1 = \tilde{t}_1, \ldots, \tilde{T}_n = \tilde{t}_n) = \frac{1}{n!}\, P(T_n = t_n). \qquad (6)$$

To give a concrete example, let the random recursive tree $T_n$ be generated from a preferential attachment process $\mathrm{PA}_\phi$. Then the corresponding sequence of random trees $\tilde{T}_1 \subset \tilde{T}_2 \subset \cdots \subset \tilde{T}_n$ has the distribution in which $\tilde{T}_1$ is a singleton node drawn uniformly from $U_n$ and, for $k \in \{2, 3, \ldots, n\}$, we select a node $u \in V(\tilde{T}_{k-1})$ with probability proportional to $\phi(\deg(u, \tilde{T}_{k-1}))$, select a node $v \in U_n \setminus V(\tilde{T}_{k-1})$ uniformly at random, and add the edge $(u, v)$ to $\tilde{T}_{k-1}$ to form $\tilde{T}_k$.

Suppose we have data in the form of a given unlabeled shape $s_n$ with an arbitrarily labeled representation $\tilde{t}_n$ whose node labels take values in $U_n$. We may apply label randomization if necessary to assume, without loss of generality, that $\tilde{t}_n$ is an outcome of the randomly labeled tree $\tilde{T}_n$. Our inference approach is based on the conditional distribution of a history,
$$P(\tilde{T}_1, \tilde{T}_2, \ldots, \tilde{T}_{n-1} \mid \tilde{T}_n = \tilde{t}_n). \qquad (7)$$
For example, for $u \in U_n$ and observed shape $s_n$ with labeled representation $\tilde{t}_n \in s_n$, we may interpret $P(\tilde{T}_1 = \{u\} \mid \tilde{T}_n = \tilde{t}_n)$ as the probability of $u$ being the root node conditional on observing the shape $s_n$.

Label randomization is a data augmentation scheme that simplifies the analysis and the computation. We can define the conditional probability of a node being the root without the use of label randomization (see Remark 4), but we show in Theorem 7 that the alternative definition is equivalent to the conditional root probability with label randomization. The calculation in (6) makes this approach precise. Our proposed approach of computing or approximating the conditional distribution of a history given a randomly labeled final state is especially natural for the purpose of developing a statistical framework which applies to a broad class of inference problems about the tree history.
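For a $\mathrm{PA}_\phi$ process, the factor $P(T_n = t_n)$ in (6) is just the product of the transition probabilities of Example 3 along the arrival order, and is easy to evaluate. A sketch (ours; we work in log-space, and treating a step with total weight zero, such as the forced first attachment under linear preferential attachment, as contributing probability one is our own convention):

```python
import math

def log_prob_recursive_tree(parent, phi):
    """log P(T_n = t_n) under PA_phi for a recursive tree given as a parent
    map: multiply phi(deg(u_k, t_{k-1})) / sum_w phi(deg(w, t_{k-1})) over
    arrivals k = 2, ..., n.  By (6), the log-probability of the
    corresponding labeled history is this value minus log n!."""
    n = len(parent)
    degree = {1: 0}
    logp = 0.0
    for k in range(2, n + 1):
        total = sum(phi(d) for d in degree.values())  # O(n^2) overall; fine for a sketch
        u = parent[k]
        if total > 0:
            w = phi(degree[u])
            if w == 0:
                return float("-inf")  # this arrival order is impossible under phi
            logp += math.log(w) - math.log(total)
        # total == 0 (e.g. the first step under LPA): the step is forced
        degree[u] += 1
        degree[k] = 1
    return logp
```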
This general strategy can also be found in approaches posed independently within other disciplines, such as the similar conditional-probability-based approaches in the physics literature (Cantwell et al.; Timár et al., 2020) and the randomized labeling framework proposed for tree reconstruction algorithms in computer science (Magner et al.).

With the definition of the conditional history distribution (7), a natural approach for inferring the root is to construct a level-$(1-\varepsilon)$ credible set by iteratively adding nodes with the largest conditional root probabilities until the sum of the conditional root probabilities among the nodes in the set exceeds $1 - \varepsilon$. More precisely, let our data be an unlabeled shape $s$ with a labeled representation $\tilde{t}_n$ with node labels in $U_n$. We sort the nodes $u_1, \ldots, u_n \in U_n$ of $\tilde{t}_n$ such that
$$P(\tilde{T}_1 = \{u_1\} \mid \tilde{T}_n = \tilde{t}_n) \geq P(\tilde{T}_1 = \{u_2\} \mid \tilde{T}_n = \tilde{t}_n) \geq \cdots \geq P(\tilde{T}_1 = \{u_n\} \mid \tilde{T}_n = \tilde{t}_n), \qquad (8)$$
and we define
$$K_\varepsilon(\tilde{t}_n) := \min\Big\{k \in [n] : \sum_{i=1}^{k} P(\tilde{T}_1 = \{u_i\} \mid \tilde{T}_n = \tilde{t}_n) \geq 1 - \varepsilon \ \text{ and } \ P(\tilde{T}_1 = \{u_k\} \mid \tilde{T}_n = \tilde{t}_n) > P(\tilde{T}_1 = \{u_{k+1}\} \mid \tilde{T}_n = \tilde{t}_n)\Big\}, \qquad (9)$$
with the convention that $P(\tilde{T}_1 = \{u_{n+1}\} \mid \tilde{T}_n = \tilde{t}_n) := 0$. We then define the $\varepsilon$-credible set as
$$B_\varepsilon(\tilde{t}_n) := \{u_1, \ldots, u_{K_\varepsilon(\tilde{t}_n)}\}. \qquad (10)$$
If there are no ties in the conditional root probabilities (8), then $B_\varepsilon(\tilde{t}_n)$ is the smallest subset of $U_n$ such that
$$\sum_{u \in B_\varepsilon(\tilde{t}_n)} P(\tilde{T}_1 = \{u\} \mid \tilde{T}_n = \tilde{t}_n) \geq 1 - \varepsilon. \qquad (11)$$
When there are ties, however, the second condition in our definition of $K_\varepsilon$ dictates that we resolve ties by including all nodes with equal conditional root probabilities. Breaking ties by inclusion ensures that $B_\varepsilon(\tilde{t}_n)$ is labeling-equivariant.

In general, credible sets do not have valid frequentist confidence coverage. However, our next theorem shows that in our setting the credible set $B_\varepsilon$ is in fact an honest confidence set.

Theorem 1. Let $T_n$ be a random recursive tree, let $T^*_n$ be any arbitrary labeled representation (with labels taking values in $U_n$) of the observed shape $\mathrm{sh}(T_n)$, and let $\rho \in \Theta([n], U_n)$ be any isomorphism such that $\rho T_n = T^*_n$. We have that, for any $\varepsilon \in (0, 1)$,
$$P\big(\mathrm{root}_\rho(T_n) \in B_\varepsilon(T^*_n)\big) \geq 1 - \varepsilon.$$

Proof. We first claim that, for a given shape $s$ with a labeled representation $\tilde{t}_n \in s$, the credible set $B_\varepsilon(\tilde{t}_n)$ is labeling-equivariant (cf. Remark 2) in the sense that, for any $\tau \in \Theta(U_n, U_n)$, we have $\tau B_\varepsilon(\tilde{t}_n) = B_\varepsilon(\tau \tilde{t}_n)$. Indeed, since $(\tilde{T}_1, \tilde{T}_2, \ldots, \tilde{T}_n) \stackrel{d}{=} (\tau \tilde{T}_1, \tau \tilde{T}_2, \ldots, \tau \tilde{T}_n)$, we have that, for any $u \in U_n$,
$$P(\tilde{T}_1 = \{u\} \mid \tilde{T}_n = \tilde{t}_n) = P(\tilde{T}_1 = \{\tau(u)\} \mid \tilde{T}_n = \tau \tilde{t}_n).$$
Therefore, for any $u, v \in U_n$, node $u$ has a larger conditional root probability than node $v$ in $\tilde{t}_n$ if and only if $\tau(u)$ has a larger conditional root probability than $\tau(v)$ in $\tau \tilde{t}_n$. Since $B_\varepsilon$ is constructed by taking the top elements of $U_n$ that maximize the cumulative conditional root probabilities, with ties broken by inclusion, the claim follows.

Now, let $\rho \in \Theta([n], U_n)$ be such that $\rho T_n = T^*_n$ and let $\Pi$ be a random bijection drawn uniformly in $\Theta(U_n, U_n)$. Then,
$$P\big(\mathrm{root}_\rho(T_n) \in B_\varepsilon(T^*_n)\big) = P\big(\Pi(\rho(1)) \in \Pi B_\varepsilon(T^*_n)\big) = P\big(\Pi(\rho(1)) \in B_\varepsilon(\Pi T^*_n)\big) = P\big(\tilde{T}_1 \in B_\varepsilon(\tilde{T}_n)\big) \geq 1 - \varepsilon,$$
where $\tilde{T}_n := \Pi T^*_n$ and $\tilde{T}_1 = \{\Pi(\rho(1))\}$, where the penultimate equality follows from the labeling-equivariance of $B_\varepsilon$, and where the last inequality follows because $P(\tilde{T}_1 \in B_\varepsilon(\tilde{T}_n) \mid \tilde{T}_n = \tilde{t}_n) \geq 1 - \varepsilon$ for every labeled tree $\tilde{t}_n$ (with labels in $U_n$) by the definition of $B_\varepsilon$. □

Theorem 1 shows that we may obtain a valid confidence set by constructing a credible set. The credible set can be efficiently computed for the class of tree growth processes that we describe in the next section. The conditional history distribution (7) can be intractable for a general Markov tree growth process, but it has an elegant characterization when the growth process satisfies a shape-exchangeability condition.

Definition 2. A random recursive tree process $\mathbf{T} = (T_n)_{n \geq 0}$ is shape exchangeable if, for all $n \geq 1$,
$$P(T_n = t_n) = P(T_n = t'_n) \quad \text{for all recursive trees } t_n, t'_n \in \mathcal{T}^R_n \text{ satisfying } \mathrm{sh}(t_n) = \mathrm{sh}(t'_n). \qquad (12)$$

Immediate examples of shape exchangeable processes include the uniform attachment and the linear preferential attachment models from Examples 1 and 2. Theorem 4 shows that these two classes combine to characterize the class of all shape exchangeable processes.
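Before exploring the consequences of shape exchangeability, we record a short sketch of the credible-set construction (8)-(10). Given the conditional root probabilities, $B_\varepsilon$ is a one-pass computation (the sketch is ours; comparing floating-point probabilities for exact ties is a simplification, and exact rational arithmetic can be substituted when ties matter):

```python
def credible_set(root_prob, eps):
    """The eps-credible set B_eps of (8)-(10): sort nodes by conditional
    root probability, take the smallest prefix with cumulative mass at
    least 1 - eps, then absorb any nodes tied with the last one so that
    the set is labeling-equivariant.
    `root_prob` maps each node label u to P(T_1 = {u} | T_n = t_n)."""
    order = sorted(root_prob, key=root_prob.get, reverse=True)
    cum, k = 0.0, 0
    while k < len(order) and cum < 1 - eps:
        cum += root_prob[order[k]]
        k += 1
    while k < len(order) and root_prob[order[k]] == root_prob[order[k - 1]]:
        k += 1  # break ties by inclusion
    return set(order[:k])
```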
In general, if a random recursive tree $T_n$ is shape exchangeable, then the conditional probability (7) of the random sequence of label-randomized trees $\tilde{T}_1, \ldots, \tilde{T}_n$ takes on a simple form, as shown in Proposition 3 below. We define some necessary concepts and notation before stating the result.

Let $\tilde{t}_n$ be a labeled tree with nodes labeled by $U_n$. We define $\mathrm{hist}(\tilde{t}_n)$ as the set of all histories $\tilde{t}_1 \subset \tilde{t}_2 \subset \cdots \subset \tilde{t}_{n-1} \subset \tilde{t}_n$ that result in $\tilde{t}_n$. We may associate each distinct history $\tilde{t}_1 \subset \tilde{t}_2 \subset \cdots \subset \tilde{t}_n$ with a sequence $v_1, v_2, \ldots, v_n \in U_n$ such that $\tilde{t}_k = \tilde{t}_n \cap \{v_1, \ldots, v_k\}$ for every $k \in [n]$. We thus have that $\mathrm{hist}(\tilde{t}_n)$ corresponds to the set of all orderings $v_1, \ldots, v_n$ of the node label set $U_n$ that satisfy the constraint that $\tilde{t}_n$ restricted to $\{v_1, \ldots, v_k\}$ is a connected subtree for every $k \in [n]$. Equivalently, we may define $\mathrm{hist}(\tilde{t}_n)$ as the set of all label bijections $\pi \in \Theta([n], U_n)$ such that $\pi^{-1} \tilde{t}_n$ is a recursive tree. We write $\#\mathrm{hist}(\tilde{t}_n)$ to denote the number of distinct histories of $\tilde{t}_n$. It is clear that $\#\mathrm{hist}(\tilde{t}_n)$ depends only on the shape $\mathrm{sh}(\tilde{t}_n)$. Moreover, for a particular node $v \in U_n$, we define $\mathrm{hist}(\tilde{t}_n, v)$ as the set of all histories rooted at the node $v$, that is,
$$\mathrm{hist}(\tilde{t}_n, v) := \big\{(\tilde{t}_1, \ldots, \tilde{t}_n) \in \mathrm{hist}(\tilde{t}_n) : \tilde{t}_1 = \{v\}\big\}.$$

Proposition 3. Suppose $T_n$ is shape exchangeable and let $\tilde{T}_{1:n} = (\tilde{T}_1, \ldots, \tilde{T}_n)$ be the randomly labeled history. Then, for any history $\tilde{t}_1 \subset \tilde{t}_2 \subset \cdots \subset \tilde{t}_n$ with labels in $U_n$, we have that
$$P(\tilde{T}_1 = \tilde{t}_1, \ldots, \tilde{T}_{n-1} = \tilde{t}_{n-1} \mid \tilde{T}_n = \tilde{t}_n) = \frac{1}{\#\mathrm{hist}(\tilde{t}_n)}. \qquad (13)$$

Proof. Suppose $T_n$ is shape exchangeable. Let $\tilde{t}_n$ be any labeled tree with node labels in $U_n$, and let $t^\bullet_1 \subset \cdots \subset t^\bullet_{n-1} \subset \tilde{t}_n$ and $t^{\prime\bullet}_1 \subset \cdots \subset t^{\prime\bullet}_{n-1} \subset \tilde{t}_n$ be two histories of $\tilde{t}_n$. These histories correspond to pairs $(t_n, \pi)$ and $(t'_n, \pi')$ whose recursive trees have the same shape, so by (6) and the shape exchangeability of $T_n$ we have that
$$P(\tilde{T}_1 = t^\bullet_1, \ldots, \tilde{T}_{n-1} = t^\bullet_{n-1}, \tilde{T}_n = \tilde{t}_n) = \frac{1}{n!} P(T_n = t_n) = \frac{1}{n!} P(T_n = t'_n) = P(\tilde{T}_1 = t^{\prime\bullet}_1, \ldots, \tilde{T}_{n-1} = t^{\prime\bullet}_{n-1}, \tilde{T}_n = \tilde{t}_n)$$
for all histories $(t^\bullet_i)_{1 \leq i \leq n}$ and $(t^{\prime\bullet}_i)_{1 \leq i \leq n}$ corresponding to $\tilde{t}_n$. It follows that the conditional probability is equal for all histories, and thus the conditional distribution of the history given the final state $\tilde{t}_n$ is uniform over $\mathrm{hist}(\tilde{t}_n)$, as was to be proven. □

Theorem 4. A $\mathrm{PA}_\phi$ process is shape exchangeable if and only if $\phi(d) = \max(0, \alpha + \beta d)$ for $\alpha, \beta$ satisfying either

• $\beta < 0$ and $\alpha = -D\beta$ for some integer $D \geq 2$, or

• $\beta \geq 0$ and $\alpha > -\beta$.

Remark 3. Theorem 4 shows that shape exchangeability encompasses three widely studied tree growth processes. Both the linear preferential attachment process, where $\phi(d) = d$, and the uniform attachment process, where $\phi(d) = 1$, are shape exchangeable. In addition, we note that the case where $\beta$ is negative corresponds to uniform attachment on the $D$-regular tree, i.e., $\phi(d) = \max(0, D - d)$ for some $D \geq 2$. However, the sublinear preferential attachment process, where $\phi(d) = d^\gamma$ for some $\gamma \in (0, 1)$, is not shape exchangeable; we discuss inference procedures for non-shape-exchangeable trees in Section 4.2.

Thus, for the uniform attachment, linear preferential attachment, and $D$-regular uniform attachment processes, Proposition 3 and Theorem 4 combine to imply that inference about measurable functions of the unobserved history can be performed without knowing the $\alpha$ and $\beta$ parameters governing the process. In particular, valid confidence sets can be constructed by observing only the shape of the final state. Equivalently, if $T_n$ has the $\mathrm{PA}_\phi$ distribution where $\phi(d) = \max(0, \alpha + \beta d)$, then, by Proposition 3, the shape $\mathrm{sh}(T_n)$ is a sufficient statistic for $\alpha$ and $\beta$, and knowledge of the history is ancillary to the estimation of $\alpha$ and $\beta$. We note that Gao et al. (2017) make a similar informal statement regarding the estimation of the parameter function $\phi$.

Proof of Theorem 4. First suppose that $\phi(d) = \max(0, \alpha + \beta d)$ for $\alpha, \beta$ satisfying the conditions of the theorem. For $n \geq 1$ and an arbitrary recursive tree $t_n$, let $t_k := t_n \cap [k]$ and let $u_k \in [k-1]$ be the parent node of $k$ for every $k \geq 2$.
Then we have
$$P(T_n = t_n) = \prod_{k=2}^{n} \frac{\phi(\deg(u_k, t_{k-1}))}{\sum_{w \in [k-1]} \phi(\deg(w, t_{k-1}))} = \prod_{k=2}^{n} \frac{\phi(\deg(u_k, t_{k-1}))}{(k-1)\alpha + 2(k-2)\beta} \propto \prod_{\substack{v \in [n] \\ \deg(v, t_n) > 1}} \ \prod_{i=1}^{\deg(v, t_n) - 1} \phi(i), \qquad (15)$$
where the penultimate equality follows because $t_{k-1}$ has $k-2$ edges, and where the final step follows because, for every node $v \in [n]$ such that $d := \deg(v, t_n) > 1$, $v$ is attached to a new node at times $k_1, k_2, \ldots, k_{d-1} \subset [n]$, and the degree of $v$ at time $k_i$ is $i$. Because the distribution of $T_n$ depends on $t_n$ only through its degree distribution, which is a measurable function of $\mathrm{sh}(t_n)$, it follows that $T_n$ is shape exchangeable.

For the converse, suppose that $T_n$ is from a $\mathrm{PA}_\phi$ process and is shape exchangeable. For any pair of nodes $u$ and $v$ with degrees $k$ and $k'$ and at least one leaf node each in a realization $t_n$, consider removing a leaf from both nodes to obtain $t^*_{n-2}$ with resulting degree distribution $d^*$. Now consider adding the $(n-1)$st and $n$th nodes to $t^*_{n-2}$ in order to obtain the final state $t_n$. By shape exchangeability, both orderings must result in the same probability. Let $\Phi(d^*) := \sum_{w \neq u, v} \phi(\deg(w, t^*_{n-2}))$ be the total weight of the nodes of $t^*_{n-2}$ other than $u$ and $v$. In the case where the $(n-1)$st node connects to $u$ and the $n$th connects to $v$, the conditional probability is
$$\frac{\phi(k-1)}{\Phi(d^*) + \phi(k-1) + \phi(k'-1)} \cdot \frac{\phi(k'-1)}{\Phi(d^*) + \phi(k) + \phi(k'-1) + \phi(1)},$$
and in the case where the $(n-1)$st node connects to $v$ and the $n$th connects to $u$, the conditional probability is
$$\frac{\phi(k'-1)}{\Phi(d^*) + \phi(k-1) + \phi(k'-1)} \cdot \frac{\phi(k-1)}{\Phi(d^*) + \phi(k-1) + \phi(k') + \phi(1)}.$$
Shape exchangeability forces these two probabilities to be equal, from which it immediately follows that $\phi$ satisfies $\phi(k) + \phi(k'-1) = \phi(k-1) + \phi(k')$, and thus
$$\phi(j) - \phi(j-1) = \phi(i) - \phi(i-1) =: \beta$$
for all $i \geq 1$ and $j \geq 2$ such that $\phi(i) > 0$ and $\phi(j-1) > 0$. It follows that $\phi(i) - \phi(1) = (i-1)\beta$, and therefore $\phi$ must have the form $\phi(d) = \max(0, \alpha + \beta d)$ with $\alpha := \phi(1) - \beta$. Finally, we must have $\phi(d) \geq 0$ for all $d$ to ensure that the function $\phi$ determines a valid probability distribution. □

Shape exchangeable tree processes are naturally suited to the inference questions highlighted in Section 2.2. For example, for the question of root inference, suppose we observe $\mathrm{sh}(T_n) = s$ with an arbitrary labeled representation $\tilde{t}_n$. Then we may compute, for each node $v \in U_n$ of $\tilde{t}_n$,
$$P(\tilde{T}_1 = \{v\} \mid \tilde{T}_n = \tilde{t}_n) = \frac{\#\mathrm{hist}(\tilde{t}_n, v)}{\#\mathrm{hist}(\tilde{t}_n)}. \qquad (16)$$
In fact, the numerator $\#\mathrm{hist}(\tilde{t}_n, v)$ coincides with the notion of rumor centrality defined by Shah and Zaman (2011) for the purpose of estimating the root node. The following proposition summarizes a characterization of $\#\mathrm{hist}(\tilde{t}_n, u)$ given in Section III-A of Shah and Zaman (2011). We note that Knuth (1997) made the same observation in the context of counting the number of ways to linearize a partial ordering.

Proposition 5 (Knuth, 1997; Shah and Zaman, 2011). Let $\tilde{t}_n$ be a labeled tree with $V(\tilde{t}_n) = U_n$ and let $u \in U_n$. Viewing $\tilde{t}_n$ as being rooted at $u$, define, for every node $v \in U_n$, the tree $\tilde{t}^{(u)}_v$ as the subtree of $\tilde{t}_n$ rooted at $v$ and pointing away from $u$, with size $n^{(u)}_v := |V(\tilde{t}^{(u)}_v)|$. We then have that
$$\#\mathrm{hist}(\tilde{t}_n, u) = \frac{(n-1)!}{\prod_{v \in U_n \setminus \{u\}} n^{(u)}_v};$$
writing $v_1, v_2, \ldots, v_L$ for the neighbors of $u$, this follows by induction from the multinomial interleaving of the histories of the subtrees $\tilde{t}^{(u)}_{v_1}, \ldots, \tilde{t}^{(u)}_{v_L}$.

Using the fact that
$$\#\mathrm{hist}(\tilde{t}_n, v) = \#\mathrm{hist}(\tilde{t}_n, \mathrm{pa}(v)) \cdot \frac{n^{(u)}_v}{n - n^{(u)}_v}$$
for any node $v$ and its parent node $\mathrm{pa}(v)$, viewing $\tilde{t}_n$ as being rooted at $u$, Shah and Zaman (2011) derive an $O(n)$ algorithm for counting the number of histories for all possible roots, $\{\#\mathrm{hist}(\tilde{t}_n, u)\}_{u \in U_n}$. We give the details in Algorithm 1 for the reader's convenience, with a Python sketch below. Using Algorithm 1, we conclude that the overall runtime of computing the confidence set $B_\varepsilon(\cdot)$ is $O(n \log n)$, since we also need to perform a sort.

Algorithm 1: Computing $\{\#\mathrm{hist}(\tilde{t}_n, v)\}_{v \in U_n}$ (Shah and Zaman, 2011)
Input: a labeled tree $\tilde{t}_n$. Output: $\#\mathrm{hist}(\tilde{t}_n, v)$ for all nodes $v \in U_n$.
1: Arbitrarily select a root $u \in U_n$ and view $\tilde{t}_n$ as rooted at $u$.
2: For each $v \in U_n$, compute and store the subtree size $n^{(u)}_v$ by a single bottom-up pass.
3: Compute $\#\mathrm{hist}(\tilde{t}_n, u) = (n-1)! / \prod_{v \neq u} n^{(u)}_v$.
4: Traverse the tree top-down from $u$, maintaining a stack $S$ of nodes to visit: remove a node $v$ from $S$, set $\#\mathrm{hist}(\tilde{t}_n, v) = \#\mathrm{hist}(\tilde{t}_n, \mathrm{pa}(v)) \cdot n^{(u)}_v / (n - n^{(u)}_v)$, and add $\mathrm{Children}(v)$ to $S$; repeat while $S$ is nonempty.
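A Python sketch of Algorithm 1 follows (ours; we work in log-space to avoid overflow, since $\#\mathrm{hist}$ grows factorially, and we use an iterative traversal so that deep trees do not hit the recursion limit):

```python
import math
from collections import defaultdict

def root_probabilities(edges):
    """Algorithm 1 (after Shah and Zaman, 2011): compute #hist(t, v) for
    every node v in two passes, then normalize as in (16) to obtain the
    conditional root probabilities.  `edges` is any iterable of undirected
    edges on arbitrary hashable labels."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    n = len(adj)
    root = next(iter(adj))                 # arbitrary reference root u
    # Pass 1: subtree sizes n_v with respect to `root` (iterative DFS).
    parent, order, stack = {root: None}, [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for w in adj[v]:
            if w != parent[v]:
                parent[w] = v
                stack.append(w)
    size = {}
    for v in reversed(order):
        size[v] = 1 + sum(size[w] for w in adj[v] if parent.get(w) == v)
    # log #hist(t, root) = log (n-1)! - sum over v != root of log n_v.
    log_hist = {root: math.lgamma(n) - sum(math.log(size[v]) for v in order[1:])}
    # Pass 2: re-root via #hist(t, v) = #hist(t, pa(v)) * n_v / (n - n_v).
    for v in order[1:]:
        log_hist[v] = log_hist[parent[v]] + math.log(size[v]) - math.log(n - size[v])
    # Normalize as in (16): P(T_1 = {v} | T_n = t_n) = #hist(t, v) / #hist(t).
    m = max(log_hist.values())
    unnorm = {v: math.exp(l - m) for v, l in log_hist.items()}
    total = sum(unnorm.values())
    return {v: p / total for v, p in unnorm.items()}
```

Combined with the `credible_set` sketch of Section 3.2, this yields the full root confidence-set pipeline: `credible_set(root_probabilities(edges), 0.05)` returns a 95% root confidence set.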
In this section, we use the results of Bubeck, Devroye and Lugosi (2017) and Khim and Loh (2017) to provide a theoretical analysis of the size of the confidence set $B_\varepsilon(\cdot)$. We also provide empirical studies of the size in our simulation studies in Section 5.1.

In Section 3.2, we defined $K_\varepsilon(\tilde{t}_n)$ for a fixed labeled tree $\tilde{t}_n$ in (9), which is the size of our confidence set $B_\varepsilon(\tilde{t}_n)$. Here we analyze a slight variation where, assuming that the nodes $u_1, u_2, \ldots, u_n$ are sorted in decreasing order of their conditional root probabilities, we define
$$K_\varepsilon(\tilde{t}_n) := \min\Big\{k \in [n] : \sum_{i=1}^{k} P(\tilde{T}_1 = \{u_i\} \mid \tilde{T}_n = \tilde{t}_n) \geq 1 - \varepsilon \ \text{ and } \ u_{k+1} \notin \mathrm{Eq}(u_k, \tilde{t}_n)\Big\}, \qquad (17)$$
where $\mathrm{Eq}(u, \tilde{t}_n) \subset U_n$ is defined formally in (18) and is, intuitively, the set of nodes equivalent to $u$ in the tree $\tilde{t}_n$. For practical applications we prefer (9); we observe in simulations that the two are almost always equivalent. By again defining $B_\varepsilon(\tilde{t}_n) := \{u_1, \ldots, u_{K_\varepsilon(\tilde{t}_n)}\}$, we have that, for any fixed labeled tree $\tilde{t}_n$, the set $B_\varepsilon(\tilde{t}_n)$ is the smallest labeling-equivariant subset of $U_n$ such that
$$\sum_{u \in B_\varepsilon(\tilde{t}_n)} P(\tilde{T}_1 = \{u\} \mid \tilde{T}_n = \tilde{t}_n) \geq 1 - \varepsilon.$$
We use the results of Bubeck, Devroye and Lugosi (2017) and Khim and Loh (2017) to bound the size of our confidence sets.

Theorem 6. Let $\varepsilon \in (0, 1)$ be arbitrary, let $T_n$ be a random recursive tree, and let $T^*_n$ be an arbitrary labeled representation of $\mathrm{sh}(T_n)$ whose node labels take values in $U_n$. Let $K_\varepsilon(T^*_n)$ be defined as in (17), and let $K_{\mathrm{ua}}(\cdot)$, $K_{\mathrm{pa}}(\cdot)$ and $K_{\mathrm{reg},D}(\cdot)$ denote the size functions of Section 2.3. If $T_n$ is distributed according to the uniform attachment model, we have that, for all $\delta \in (0, 1)$,
$$\limsup_{n \to \infty} P\big(K_\varepsilon(T^*_n) > K_{\mathrm{ua}}(\varepsilon \delta)\big) \leq \delta.$$
If $T_n$ is distributed according to linear preferential attachment, then, for all $\delta \in (0, 1)$,
$$\limsup_{n \to \infty} P\big(K_\varepsilon(T^*_n) > K_{\mathrm{pa}}(\varepsilon \delta)\big) \leq \delta.$$
If, for some integer $D \geq 3$, $T_n$ is distributed according to uniform attachment on $D$-regular trees, then, for all $\delta \in (0, 1)$,
$$\limsup_{n \to \infty} P\big(K_\varepsilon(T^*_n) > K_{\mathrm{reg},D}(\varepsilon \delta)\big) \leq \delta.$$

We defer the proof of Theorem 6 to Section S1 in the appendix. From Theorem 6, we see that in all three cases the random size $K_\varepsilon(T^*_n) = \#B_\varepsilon(T^*_n)$ is $O_p(1)$ as $n \to \infty$, which shows that the size of the confidence set is of a constant order even as the number of nodes tends to infinity. Moreover, the median size is asymptotically at most $K_{\mathrm{ua}}(\varepsilon/2)$, $K_{\mathrm{pa}}(\varepsilon/2)$ and $K_{\mathrm{reg},D}(\varepsilon/2)$, respectively, in the three cases. As we show in our simulation studies (see, e.g., Table 2), these bounds tend to be very conservative. Since the size of our confidence set $K_\varepsilon(T^*_n)$ depends on the observed tree $T^*_n$, it is adaptive to the underlying distribution of $T_n$. In contrast, the sizes of the confidence sets considered in Bubeck, Devroye and Lugosi (2017) and Khim and Loh (2017) depend only on $\varepsilon$ and hence must be chosen with knowledge of the true model.

In this section, we show that $P(\tilde{T}_1 = \{u\} \mid \tilde{T}_n = \tilde{t}_n)$ is proportional to the likelihood of $u$ being the root upon observing $\mathrm{sh}(T_n) = \mathrm{sh}(\tilde{t}_n)$. Thus, any confidence set created by ordering the nodes according to their conditional root probabilities $P(\tilde{T}_1 = \{u\} \mid \tilde{T}_n = \tilde{t}_n)$ also maximizes the likelihood.

We first follow Bubeck, Devroye and Lugosi (2017, Section 3) to derive the likelihood of a node $u$ being the root upon observing the unlabeled shape $\mathrm{sh}(T_n)$. Given any labeled trees $t, t'$, not necessarily recursive, and two nodes $u \in V(t)$, $u' \in V(t')$, we say that $(t, u)$ and $(t', u')$ have equivalent rooted shape (written $(t, u) \sim_0 (t', u')$) if there exists an isomorphism $\tau \in \Theta(V(t), V(t'))$ such that $\tau t = t'$ and $\tau(u) = u'$. We then define the rooted shape of $(t, u)$ as the equivalence class
$$\mathrm{sh}_0(t, u) := \{(t', u') : t' \text{ labeled tree},\ u' \in V(t'),\ (t', u') \sim_0 (t, u)\}.$$
We give examples of rooted shapes with 4 nodes in Figure 3. For any labeled tree $t$, not necessarily recursive, and for any node $u \in V(t)$, we define the set of indistinguishable nodes as
$$\mathrm{Eq}(u, t) := \{v \in V(t) : (t, v) \in \mathrm{sh}_0(t, u)\}, \qquad (18)$$
the set of all nodes $v$ for which rooting $t$ at either $u$ or $v$ yields the same rooted shape. In other words, $\mathrm{Eq}(u, \tilde{t}_n)$ is the set of all the nodes of $\tilde{t}_n$ that are indistinguishable from $u$ once we remove the node labels. We give examples of indistinguishable nodes in Figure 3.
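Although counting recursive trees is hard in general (see below), the indistinguishability classes $\mathrm{Eq}(u, \tilde{t}_n)$ themselves are computable: two nodes are indistinguishable exactly when rooting the tree at either one yields isomorphic rooted trees, which can be tested with a canonical encoding of rooted trees (the classical Aho-Hopcroft-Ullman scheme). The following quadratic sketch is ours and is illustrative rather than optimized:

```python
from collections import defaultdict

def rooted_code(adj, root):
    """Canonical code of the tree rooted at `root`: a node's code is the
    sorted tuple of its children's codes, so two rooted trees receive the
    same code iff they have the same rooted shape.  (Recursive; very deep
    trees would need an iterative variant.)"""
    def code(v, par):
        return tuple(sorted(code(w, v) for w in adj[v] if w != par))
    return code(root, None)

def indistinguishable_classes(edges):
    """Partition the nodes into the classes Eq(u, t) of (18): u and v are
    in the same class iff (t, u) and (t, v) have the same rooted shape.
    Naive O(n^2): one canonical code per candidate root."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    classes = defaultdict(list)
    for u in adj:
        classes[rooted_code(adj, u)].append(u)
    return list(classes.values())
```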
Let $T_n$ be a random recursive tree and suppose we observe that $\mathrm{sh}(T_n) = s_n$, where $s_n$ is a shape with an arbitrary labeled representation $\tilde{t}_n$ whose node labels take values in $U_n$. For any node $u \in U_n$, the likelihood that any node in $\mathrm{Eq}(u, \tilde{t}_n)$ is the root is the sum of the probabilities of all outcomes of the random recursive tree $T_n$ that have the same rooted shape as $(\tilde{t}_n, u)$. Since $\mathrm{Eq}(u, \tilde{t}_n)$ may contain multiple nodes, we then divide by the size of the set $\mathrm{Eq}(u, \tilde{t}_n)$ to obtain the likelihood of node $u$ being the root. To be precise, for a labeled tree $\tilde{t}_n$ and a node $u \in U_n$, we define
$$\mathrm{recur}(\tilde{t}_n, u) := \{t \in \mathcal{T}^R_n : (t, 1) \in \mathrm{sh}_0(\tilde{t}_n, u)\},$$
which is the set of all distinct recursive trees that have the same rooted shape as $\tilde{t}_n$ rooted at $u$. The likelihood of a node $u$ is then
$$L(u, \tilde{t}_n) := \frac{1}{\#\mathrm{Eq}(u, \tilde{t}_n)} \sum_{t \in \mathrm{recur}(\tilde{t}_n, u)} P(T_n = t), \qquad (19)$$
where we divide by $\#\mathrm{Eq}(u, \tilde{t}_n)$ to account for the multiplicity of indistinguishable nodes. If $T_n$ is shape exchangeable, then $P(T_n = t)$ is constant over all recursive trees $t$ of a given shape, and thus
$$L(u, \tilde{t}_n) \propto \frac{\#\mathrm{recur}(\tilde{t}_n, u)}{\#\mathrm{Eq}(u, \tilde{t}_n)} \quad \text{(for shape exchangeable models)}.$$
The number of distinct recursive trees $\#\mathrm{recur}(\tilde{t}_n, u)$ is in general smaller than the number of distinct histories $\#\mathrm{hist}(\tilde{t}_n, u)$; see, for example, Figure 3. Bubeck, Devroye and Lugosi (2017, Proposition 1) derive an exact expression for $\#\mathrm{recur}(\tilde{t}_n, u)$ in terms of isomorphism classes of the subtrees of $\tilde{t}_n$. This cannot be efficiently computed, since counting the number of isomorphic subtrees is a #P-complete problem (Goldberg and Jerrum). Since the likelihood $L(u, \tilde{t}_n)$ is based on counting the recursive trees $\#\mathrm{recur}(\tilde{t}_n, u)$ while the conditional root probability $P(\tilde{T}_1 = \{u\} \mid \tilde{T}_n = \tilde{t}_n)$ is based on counting the histories $\#\mathrm{hist}(\tilde{t}_n, u)$, it is not obvious how to relate the two. The main result of this section is the next theorem, which shows that they in fact induce the same ordering of the nodes of a tree $\tilde{t}_n$. We note that this result strengthens the existing results in the literature; in particular, it implies that Bubeck, Devroye and Lugosi (2017, Theorem 5) applies in fact to the exact MLE instead of an approximate MLE, as previously believed.

Theorem 7. Let $T_n$ be a random recursive tree, not necessarily shape exchangeable, and let $\tilde{T}_1, \ldots, \tilde{T}_n$ be the corresponding random history. For any labeled tree $\tilde{t}_n$ with node labels taking values in $U_n$, we have that, for all $u \in U_n$,
$$P(\tilde{T}_1 = \{u\} \mid \tilde{T}_n = \tilde{t}_n) = \frac{1}{\#\mathrm{Eq}(u, \tilde{t}_n)}\, P\big((T_n, 1) \in \mathrm{sh}_0(\tilde{t}_n, u) \mid T_n \in \mathrm{sh}(\tilde{t}_n)\big).$$
We defer the proof of Theorem 7 to Section S2 in the appendix.

Remark 4. From Theorem 7, we see that, for a labeled tree $\tilde{t}_n$ and a node $u \in V(\tilde{t}_n)$, we can define the conditional probability that $u$ is the root node by the quantity
$$\frac{1}{\#\mathrm{Eq}(u, \tilde{t}_n)}\, P\big((T_n, 1) \in \mathrm{sh}_0(\tilde{t}_n, u) \mid T_n \in \mathrm{sh}(\tilde{t}_n)\big),$$
without the use of the label-randomized tree $\tilde{T}_n$; we divide by $\#\mathrm{Eq}(u, \tilde{t}_n)$ to address the issue that there may be multiple nodes indistinguishable from node $u$. Theorem 7 shows that this is equivalent to the label-randomized conditional probability $P(\tilde{T}_1 = \{u\} \mid \tilde{T}_n = \tilde{t}_n)$. We use the latter expression for its simplicity.

The approach that we take for root inference, and the structural properties that we proved for shape exchangeable processes, may also be used for general questions of interest about unobserved histories. In these cases, we replace an exact calculation of the conditional probability by Monte Carlo approximation. In particular, we derive two different computationally efficient sampling protocols to generate a history of a given tree uniformly at random.
To make the discussion concrete, consider again the arrival time of a given node $u \in U_n$: $\mathrm{Arr}^{(u)}_\rho(T_n) = \rho^{-1}(u) \in [n]$ for $\rho \in \Theta([n], U_n)$ such that $\rho T_n = T^*_n$. In this case, for a given tree $\tilde{t}_n$, we may construct the confidence set $B^{(u)}_\varepsilon(\tilde{t}_n)$ as the smallest subset of $[n]$ such that
$$\sum_{t \in B^{(u)}_\varepsilon(\tilde{t}_n)} P\big(u \in \tilde{T}_t \text{ and } u \notin \tilde{T}_{t-1} \mid \tilde{T}_n = \tilde{t}_n\big) \geq 1 - \varepsilon, \qquad (20)$$
where $\tilde{T}_t = \Pi|_{[t]} T_t$ for a random bijection $\Pi$ distributed uniformly in $\Theta([n], U_n)$ such that $\Pi T_n = \tilde{T}_n$, and where we take $\tilde{T}_0$ to be the empty set. Following Theorem 1, we may again show that $B^{(u)}_\varepsilon(\tilde{t}_n)$ has valid frequentist coverage; we defer the formal statement and proof of this claim to Section S3.

To compute the confidence set $B^{(u)}_\varepsilon(\tilde{t}_n)$, we need to compute, for each $t \in [n]$, the conditional probability $P(u \in \tilde{T}_t \text{ and } u \notin \tilde{T}_{t-1} \mid \tilde{T}_n = \tilde{t}_n)$. We propose a Monte Carlo approximation where we generate independent samples $\{\tilde{T}^{(m)}_1, \ldots, \tilde{T}^{(m)}_{n-1}\}_{m=1}^{M}$ from the conditional history distribution $P(\tilde{T}_1, \ldots, \tilde{T}_{n-1} \mid \tilde{T}_n = \tilde{t}_n)$. For an event $E$, we may then approximate
$$P(E \mid \tilde{T}_n = \tilde{t}_n) \approx \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\big\{(\tilde{T}^{(m)}_1, \ldots, \tilde{T}^{(m)}_{n-1}) \in E\big\}.$$
For example, we may approximate the probability of node $u$ arriving at time $t$ (see (20)) by the proportion of sampled histories in which $u$ arrives at time $t$. In the next section, we assume shape exchangeability and give two exact sampling schemes which allow us to efficiently carry out this approach. In Section 4.2, we devise an importance sampling scheme for computing the conditional probability under a general process, not necessarily shape exchangeable.

If $T_n$ is shape exchangeable, then we may generate a sample $\tilde{T}^{(m)}_1, \ldots, \tilde{T}^{(m)}_{n-1}$ by drawing a single history from the set $\mathrm{hist}(\tilde{t}_n)$ uniformly at random. In general, the total number of histories is large, but uniform sampling can be carried out sequentially by either (i) forward sampling, which builds up a realization from the conditional distribution by sequentially adding nodes according to the scheme below, or (ii) backward sampling, which recreates a history from the observed shape by sequentially removing nodes according to the correct conditional distributions.

Throughout this section, it will be convenient to think of a history $\tilde{T}_1 \subset \tilde{T}_2 \subset \cdots \subset \tilde{T}_{n-1}$ as an ordered sequence $\mathbf{u} = (\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n) \in U_n^n$ where $\tilde{T}_k = \tilde{T}_n \cap \{\mathbf{u}_1, \ldots, \mathbf{u}_k\}$ for every $k \in [n]$.

Conditional on the final shape $\mathrm{sh}(\tilde{T}_n) = s_n$, a uniform random history can be generated by an analog of the Pólya urn process. We generate the first element $\mathbf{u}_1$ (the root) of the history from the distribution $P(\tilde{T}_1 = \cdot \mid \tilde{T}_n = \tilde{t}_n)$ over $U_n$, where the conditional root probability distribution can be computed in $O(n)$ time through Algorithm 1. Once the root of the history is fixed, we then sequentially choose the next node $u$ by size-biased sampling based on the size of the subtree rooted at $u$ and pointing away from the root $\mathbf{u}_1$.

More explicitly, let $\tilde{t}_n$ be a tree labeled in $U_n$ and rooted at $u_1 \in U_n$. For any $v \in U_n$, we write $\tilde{t}^{(u_1)}_v$ to denote the subtree of $\tilde{t}_n$ rooted at node $v$ and pointing away from $u_1$, and $n^{(u_1)}_v := |V(\tilde{t}^{(u_1)}_v)|$ for the size of this subtree. Given the shape of $\tilde{t}_n$, we generate a history $(u_1, \ldots, u_n) \in \mathrm{hist}(\tilde{t}_n)$ by

• sampling $u_1$ from $P(\tilde{T}_1 = \cdot \mid \tilde{T}_n = \tilde{t}_n)$, and

• given $u_1, u_2, \ldots, u_{k-1}$, sampling $u_k$ from among the remaining nodes in $U_n \setminus \{u_1, \ldots, u_{k-1}\}$ according to
$$P(\mathbf{u}_k = u_k \mid \tilde{T}_n = \tilde{t}_n, \mathbf{u}_1 = u_1, \ldots, \mathbf{u}_{k-1} = u_{k-1}) = \begin{cases} \dfrac{n^{(u_1)}_{u_k}}{n - k + 1}, & \{u_1, \ldots, u_k\} \in \mathrm{hist}(\tilde{t}_n, u_1), \\ 0, & \text{otherwise}. \end{cases} \qquad (21)$$

We note that (21) is a well-defined probability distribution because, once we have fixed the first $k-1$ nodes $(u_1, u_2, \ldots, u_{k-1})$ of the history, the probability on the left-hand side of (21) is positive only for a neighbor $v$ of $\{u_1, u_2, \ldots, u_{k-1}\}$, and summing $n^{(u_1)}_v$ over all neighbors $v$ of $\{u_1, u_2, \ldots, u_{k-1}\}$ gives exactly the denominator $n - k + 1$ of (21).
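A direct implementation of this forward scheme is short; the following sketch (ours) pre-computes the subtree sizes $n^{(u_1)}_v$ once the root is drawn and then performs the size-biased draws of (21). It runs in $O(n^2)$ in the worst case, which is exactly the inefficiency that Algorithm 2 below removes:

```python
import random
from collections import defaultdict

def sample_history(edges, root_prob, seed=None):
    """Forward (Polya-urn style) sampler of a uniform history: draw u_1
    from the conditional root distribution (e.g. root_probabilities()
    above), then repeatedly extend the history by a frontier node chosen
    with probability proportional to its subtree size, as in (21)."""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    u1 = rng.choices(list(root_prob), weights=list(root_prob.values()))[0]
    # Subtree sizes n_v^{(u1)} by one bottom-up pass from the sampled root.
    parent, order, stack = {u1: None}, [], [u1]
    while stack:
        v = stack.pop()
        order.append(v)
        for w in adj[v]:
            if w != parent[v]:
                parent[w] = v
                stack.append(w)
    size = {}
    for v in reversed(order):
        size[v] = 1 + sum(size[w] for w in adj[v] if parent.get(w) == v)
    history, frontier = [u1], list(adj[u1])
    while frontier:
        v = rng.choices(frontier, weights=[size[w] for w in frontier])[0]
        frontier.remove(v)
        frontier.extend(w for w in adj[v] if parent.get(w) == v)
        history.append(v)
    return history
```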
The following proposition shows that the result of this process is a valid draw from the conditional distribution of the history given $\mathrm{sh}(\tilde{T}_n)$.

Proposition 8. Let $T_n$ be a shape exchangeable preferential attachment process and let $\tilde{T}_n$ be a randomly labeled element of $\mathrm{sh}(T_n)$. For any sequence of nodes $(u_1, u_2, \ldots, u_n) \in \mathrm{hist}(\tilde{t}_n, u_1)$, the conditional distribution of $\mathbf{u}_k$ given $\{\tilde{T}_n = \tilde{t}_n, \mathbf{u}_1 = u_1, \ldots, \mathbf{u}_{k-1} = u_{k-1}\}$ is the conditional distribution in (21).

Proof. Let the labeled tree $\tilde{t}_n$ be fixed and suppose $(\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n)$ is a random history where $P(\mathbf{u}_1 = u \mid \tilde{T}_n = \tilde{t}_n) = P(\tilde{T}_1 = \{u\} \mid \tilde{T}_n = \tilde{t}_n)$ for all $u \in U_n$ and where the probabilities of $\mathbf{u}_2, \ldots, \mathbf{u}_n$ are specified by (21). By Proposition 5 and (16), we have that, for any fixed history $(u_1, u_2, \ldots, u_n) \in \mathrm{hist}(\tilde{t}_n)$,
$$P(\mathbf{u}_1 = u_1, \ldots, \mathbf{u}_n = u_n \mid \tilde{T}_n = \tilde{t}_n) = \frac{\#\mathrm{hist}(\tilde{t}_n, u_1)}{\#\mathrm{hist}(\tilde{t}_n)} \prod_{k=2}^{n} \frac{n^{(u_1)}_{u_k}}{n - k + 1} = \frac{(n-1)!}{\#\mathrm{hist}(\tilde{t}_n) \prod_{v \neq u_1} n^{(u_1)}_v} \cdot \frac{\prod_{v \neq u_1} n^{(u_1)}_v}{(n-1)!} = \frac{1}{\#\mathrm{hist}(\tilde{t}_n)},$$
as desired. □

Proposition 8 states that, to generate the second node of the history given that the first node is $u_1$, we consider all neighbors $v_1, \ldots, v_{L(u_1)}$ of $u_1$ (where $L(u_1)$ denotes the number of neighbors of $u_1$) and choose $v$ with probability proportional to the size of the subtree $n^{(u_1)}_v$. Continuing in this way, once we have generated the first $k-1$ nodes of the history, we consider all neighbors $v_1, \ldots, v_{L(u_{1:(k-1)})}$ of the subtree $\tilde{t}_n \cap \{u_1, \ldots, u_{k-1}\}$ and again choose a neighbor $v$ with probability proportional to the size of the subtree $n^{(u_1)}_v$.

The existence of such a sampling scheme for shape exchangeable processes is closely related to other sequential constructions for generating samples from exchangeable processes, such as the Chinese restaurant process found throughout the Bayesian nonparametrics and combinatorial probability literature; see, e.g., Crane (2016) for an overview of various constructions for exchangeable partition processes. The size-biased sampling without replacement above can also be related to general Pólya urn schemes, which are known to produce exchangeable sequences.

The computational complexity of computing the sizes of all the subtrees $\{n^{(u_1)}_v\}_{v \in U_n}$ is linear in the total number of nodes $n$, because we can use a bottom-up procedure that makes a single pass through all of the nodes of the tree. Therefore, the computational complexity of generating the first $k$ elements of the history depends on the numbers of neighbors $L(u_1), L(u_{1:2}), \ldots, L(u_{1:(k-1)})$. The number of neighbors could be of order $O(n)$, thus requiring $O(n^2)$ time to generate a full history $(u_1, \ldots, u_n)$. We can improve the runtime of generating a full history to $O(n \log n)$ in the worst case with Algorithm 2. The algorithm proceeds by drawing the first node $u_1$ of the history from the distribution (16); we then view the input labeled tree $\tilde{t}_n$ as being rooted at $u_1$. We generate a random permutation of the node labels $U_n \setminus \{u_1\}$ and modify the random permutation by swapping the position of a node $v$ with that of its parent $\mathrm{pa}(v)$ whenever $v$ appears in the permutation before $\mathrm{pa}(v)$. We continue until the random permutation satisfies the constraints necessary to be a valid history. The modification process can be done efficiently through sorting, so that the overall runtime is at most $O(n \log \mathrm{diam}(\tilde{t}_n))$, where $\mathrm{diam}(\tilde{t}_n)$ is the length of the longest path (diameter) of the tree $\tilde{t}_n$, as shown in the following proposition.
Algorithm 2: Generating a random history uniformly from $\mathrm{hist}(\tilde{t}_n)$.
Input: a labeled tree $\tilde{t}_n$ whose node labels take values in $U_n$, and a root $u \in U_n$ drawn from the conditional root distribution (16); view $\tilde{t}_n$ as rooted at $u$.
Output: a history represented as a sequence $u_1, u_2, \ldots, u_n \in U_n$ where $u_1 = u$.
1: Generate a uniform random bijection $\Gamma : U_n \setminus \{u\} \to \{2, \ldots, n\}$; set $u_1 = u$ and initialize the set of visited nodes $\mathcal{M} = \emptyset$.
2: for $t = 2, 3, \ldots, n$ do
3:   Let $v = \Gamma^{-1}(t)$. If $v \in \mathcal{M}$, continue to the next iteration.
4:   Let $v_1 = v$ and let $(v_1, v_2, \ldots, v_k)$ be the maximal path of unvisited ancestors of $v$, i.e., $v_{i+1} = \mathrm{pa}(v_i) \notin \mathcal{M}$, where $\mathrm{pa}(v)$ denotes the parent node of $v$ with respect to $\tilde{t}_n$ rooted at $u$.
5:   Write $(t_1, \ldots, t_k) = (\Gamma(v_1), \ldots, \Gamma(v_k))$ and order them such that $t_{(1)} \leq t_{(2)} \leq \cdots \leq t_{(k)}$.
6:   Set $u_{t_{(1)}} = v_k$, $u_{t_{(2)}} = v_{k-1}$, $\ldots$, $u_{t_{(k)}} = v_1$, and add $v_1, \ldots, v_k$ to $\mathcal{M}$.
7: end for

Proposition 9. For any labeled tree $\tilde{t}_n$ with node labels $U_n$, Algorithm 2 generates a history $u_1, u_2, \ldots, u_n$ uniformly at random from $\mathrm{hist}(\tilde{t}_n)$. Moreover, Algorithm 2 has a worst-case runtime of $O(n \log \mathrm{diam}(\tilde{t}_n))$, where $\mathrm{diam}(\tilde{t}_n)$ is the diameter, i.e., the length of the longest path, of $\tilde{t}_n$.

Proof. It is clear that the output $u_1, \ldots, u_n$ is a valid history. Suppose $\mathbf{u}_1 = u_1$ for some $u_1 \in U_n$; then $\mathbf{u}_2 = v$ for some neighbor $v$ of $u_1$ if and only if $\Gamma(v') = 2$ for some node $v'$ in the subtree $\tilde{t}^{(u_1)}_v$ rooted at $v$. Since $\Gamma$ is a random permutation, the probability of the event that $\Gamma(v') = 2$ for some $v' \in V(\tilde{t}^{(u_1)}_v)$ is exactly $n^{(u_1)}_v / (n-1)$. Now assume that we have generated the first $k-1$ elements of the history $\mathbf{u}_1 = u_1, \ldots, \mathbf{u}_{k-1} = u_{k-1}$. Again, $\mathbf{u}_k = v$ for some node $v$ neighboring $\{u_1, \ldots, u_{k-1}\}$ if and only if the smallest value of $\Gamma$ among the nodes not yet placed in the history is attained in $V(\tilde{t}^{(u_1)}_v)$, which occurs with probability exactly $n^{(u_1)}_v / (n-k+1)$. By Proposition 8, it thus holds that $(u_1, u_2, \ldots, u_n)$ is a uniform sample from $\mathrm{hist}(\tilde{t}_n)$.

To prove the runtime of Algorithm 2, we note that $\Gamma$ may be generated in $O(n)$ time by the Knuth-Fisher-Yates shuffle algorithm (Fisher and Yates, 1943). For each node $v$, we let $k_v$ denote the number of ancestor nodes, including $v$ itself, not in the set $\mathcal{M}$ when the algorithm is at iteration $t = \Gamma(v)$. Since each node is placed in the set $\mathcal{M}$ exactly once, namely as soon as it is visited by the algorithm, the path lengths processed over all iterations sum to at most $n$. Each iteration that processes a path of $k_v$ nodes costs $O(k_v \log k_v)$ for the sort, and $k_v \leq \mathrm{diam}(\tilde{t}_n)$, so the total runtime is at most
$$\sum_{v} O(k_v \log k_v) = O\big(n \log \mathrm{diam}(\tilde{t}_n)\big),$$
which proves the proposition. □

We may also generate a history $\tilde{T}^{(m)}_1, \ldots, \tilde{T}^{(m)}_{n-1}$, equivalently represented as a sequence of nodes $u_1, u_2, \ldots, u_n$, from the conditional distribution $P(\tilde{T}_1, \ldots, \tilde{T}_{n-1} \mid \tilde{T}_n = \tilde{t}_n)$ by iteratively removing one leaf at a time. If $T_n$ is shape exchangeable, then we have by Proposition 3, for any labeled tree $\tilde{t}_n$ and any leaf node $u$, that
$$P(\tilde{T}_{n-1} = \tilde{t}_n - u \mid \tilde{T}_n = \tilde{t}_n) = \frac{\#\mathrm{hist}(\tilde{t}_n - u)}{\#\mathrm{hist}(\tilde{t}_n)}, \qquad (22)$$
where $\tilde{t}_n - u$ denotes $\tilde{t}_n$ with the leaf $u$ removed. Thus, given an observed shape $s_n$ with a labeled representation $\tilde{t}_n$, we may reconstruct a history leading to that shape by recursively removing leaves according to (22). In general, it may be possible to stop backward sampling before generating the entire tree history. For example, when determining which of two nodes was infected first, the event is determined as soon as one of these nodes is removed during the reversed process.

If $T_n$ is not shape exchangeable, then the conditional distribution (7) may be intractable to compute or sample directly. In the case where the process is not shape exchangeable but the model is known and the underlying probabilities are straightforward to compute, we propose an importance sampling approach where we use the shape exchangeable conditional distribution as the proposal distribution. More precisely, for a given observed shape $s_n$ with a labeled representation $\tilde{t}_n$, we generate samples of the history $\tilde{T}^{(m)}_1, \ldots, \tilde{T}^{(m)}_{n-1}$ from the uniform distribution over $\mathrm{hist}(\tilde{t}_n)$ and reweight them. For a history $\tilde{t}_1, \tilde{t}_2, \ldots, \tilde{t}_{n-1}$ of a labeled tree $\tilde{t}_n$, we write $P_{\mathrm{unif}}(\tilde{T}_1 = \tilde{t}_1, \ldots, \tilde{T}_{n-1} = \tilde{t}_{n-1} \mid \tilde{T}_n = \tilde{t}_n) = \frac{1}{\#\mathrm{hist}(\tilde{t}_n)}$ for the uniform distribution over $\mathrm{hist}(\tilde{t}_n)$ and $P(\tilde{T}_1 = \tilde{t}_1, \ldots, \tilde{T}_{n-1} = \tilde{t}_{n-1} \mid \tilde{T}_n = \tilde{t}_n)$ for the actual probability. The importance weight for a single history $\tilde{t}_1, \ldots, \tilde{t}_{n-1}$ is defined as the ratio of the actual probability to the proposal probability. In our setting, we observe that
$$w(\tilde{t}_1, \ldots, \tilde{t}_{n-1}) = \frac{P(\tilde{T}_1 = \tilde{t}_1, \ldots, \tilde{T}_{n-1} = \tilde{t}_{n-1} \mid \tilde{T}_n = \tilde{t}_n)}{P_{\mathrm{unif}}(\tilde{T}_1 = \tilde{t}_1, \ldots, \tilde{T}_{n-1} = \tilde{t}_{n-1} \mid \tilde{T}_n = \tilde{t}_n)} = P(\tilde{T}_1 = \tilde{t}_1, \ldots, \tilde{T}_{n-1} = \tilde{t}_{n-1}, \tilde{T}_n = \tilde{t}_n) \cdot \frac{\#\mathrm{hist}(\tilde{t}_n)}{P(\tilde{T}_n = \tilde{t}_n)} \propto P(\tilde{T}_1 = \tilde{t}_1, \ldots, \tilde{T}_{n-1} = \tilde{t}_{n-1}, \tilde{T}_n = \tilde{t}_n), \qquad (23)$$
where the last step follows because the $\#\mathrm{hist}(\tilde{t}_n)/P(\tilde{T}_n = \tilde{t}_n)$ term of (23) depends only on the final tree $\tilde{t}_n$ and not on the history $\tilde{t}_1, \ldots, \tilde{t}_{n-1}$. Since the importance weights only need to be specified up to a multiplicative constant, we may use the joint probability $P(\tilde{T}_1 = \tilde{t}_1, \ldots, \tilde{T}_{n-1} = \tilde{t}_{n-1}, \tilde{T}_n = \tilde{t}_n)$ as the weight.

To give a concrete example of importance sampling, we consider the problem of computing the conditional root probability $P(\tilde{T}_1 = \{v\} \mid \tilde{T}_n = \tilde{t}_n)$ for a node $v \in U_n$ when $T_n$ is not shape exchangeable. Our first step is to generate $M$ samples of histories $\{\tilde{t}^{(m)}_1, \ldots, \tilde{t}^{(m)}_{n-1}\}_{m=1}^{M}$ uniformly from $\mathrm{hist}(\tilde{t}_n)$, with the corresponding self-normalized importance weights
$$w_m \propto P\big(\tilde{T}_1 = \tilde{t}^{(m)}_1, \ldots, \tilde{T}_{n-1} = \tilde{t}^{(m)}_{n-1}, \tilde{T}_n = \tilde{t}_n\big), \qquad \sum_{m=1}^{M} w_m = 1, \qquad (24)$$
and then estimate $P(\tilde{T}_1 = \{v\} \mid \tilde{T}_n = \tilde{t}_n) \approx \sum_{m=1}^{M} w_m \mathbb{1}\{\tilde{t}^{(m)}_1 = \{v\}\}$. Since any conditional distribution $P(\tilde{T}_1, \ldots, \tilde{T}_{n-1} \mid \tilde{T}_n = \tilde{t}_n)$ is supported on $\mathrm{hist}(\tilde{t}_n)$ and thus dominated by the uniform distribution, we immediately obtain by the law of large numbers that the importance sampling estimate converges to the true conditional probability as the number of samples $M$ goes to infinity.

Proposition 10. Let $T_n$ be any random recursive tree and let $\tilde{T}_n$ be the corresponding label-randomized tree. For any labeled tree $\tilde{t}_n$ with $V(\tilde{t}_n) = U_n$ and any event $E \subset \mathrm{hist}(\tilde{t}_n)$, we have that
$$\sum_{m=1}^{M} w_m \, \mathbb{1}\big\{(\tilde{t}^{(m)}_1, \ldots, \tilde{t}^{(m)}_{n-1}) \in E\big\} \longrightarrow P(E \mid \tilde{T}_n = \tilde{t}_n) \quad \text{as } M \to \infty,$$
where $w_m$ is defined as in (24).

Remark 5. After completion of this manuscript, some recent independent work on sampling algorithms for random tree histories was brought to the authors' attention. Cantwell et al. (2019) have independently observed the same sampling probability as in Proposition 8 for the forward sampling algorithm, and Young et al. (2019) have independently proposed a similar importance sampling scheme which is also adaptive.
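Putting the pieces together, the importance sampling scheme reweights uniform histories by their joint probability under the model, per (23)-(24). The sketch below (ours) estimates the conditional root probabilities under an arbitrary, possibly non-shape-exchangeable $\mathrm{PA}_\phi$; it reuses `root_probabilities` and `sample_history` from the earlier sketches, and the helper that scores a history under $\mathrm{PA}_\phi$ is inlined:

```python
import math
import random
from collections import defaultdict

def log_prob_history(edges, hist, phi):
    """log of the joint probability of the history hist = (u_1, ..., u_n)
    under PA_phi, up to the constant 1/n! of (6): each arrival attaches to
    its unique already-present neighbor."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    present, degree, logp = {hist[0]}, {hist[0]: 0}, 0.0
    for u in hist[1:]:
        (p,) = adj[u] & present            # parent of u under this ordering
        total = sum(phi(d) for d in degree.values())
        if total > 0:
            w = phi(degree[p])
            if w == 0:
                return float("-inf")
            logp += math.log(w) - math.log(total)
        present.add(u)
        degree[p] += 1
        degree[u] = 1
    return logp

def importance_root_probabilities(edges, phi, n_samples=2000, seed=None):
    """Estimate P(T_1 = {u} | shape) under PA_phi by proposing uniform
    histories and self-normalizing the weights w_m of (24)."""
    rng = random.Random(seed)
    proposal = root_probabilities(edges)   # uniform-history proposal, via (16)
    logw, roots = [], []
    for _ in range(n_samples):
        h = sample_history(edges, proposal, seed=rng.getrandbits(64))
        logw.append(log_prob_history(edges, h, phi))
        roots.append(h[0])
    mx = max(logw)
    w = [math.exp(l - mx) for l in logw]
    total = sum(w)
    est = defaultdict(float)
    for wm, u in zip(w, roots):
        est[u] += wm / total
    return dict(est)
```

For example, `importance_root_probabilities(edges, lambda d: d ** 0.5)` targets the sublinear preferential attachment model of Example 3(d), which Theorem 4 shows is not shape exchangeable.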
As a simple illustration of our root inference procedure, we show in Figure 4a a tree of 1000 nodes generated from the linear preferential attachment model. We construct the 95% confidence set, comprising around 30 nodes colored green and cyan, as well as the 85% confidence set, comprising 6 nodes colored green. The true root node is colored yellow and is captured in the 95% confidence set. In Figure 4b, we show the same except that the tree is generated from the uniform attachment model.

In our first set of simulation studies, we verify that our confidence set for the root node has the frequentist coverage predicted by the theory. We generate trees from the general preferential attachment model $\mathrm{PA}_\phi$, where we take $\phi(d) = d$, $1$, $8 + d$, and $8 - d$, corresponding to the linear preferential attachment, uniform attachment, mixed LPA and UA, and uniform attachment on the 8-regular tree models, respectively. We generate 200 independent trees, calculate our confidence sets, and report the percentage of the trials in which our confidence set captures the true root node. We summarize the results in Table 1. Our findings are in full agreement with our theory, showing that we indeed attain valid coverage.

Table 1: Empirical coverage of our confidence set for the root node. We report the average over 200 trials. Tree size is 10,000 in all cases.

In our second set of simulation studies, we analyze the size of our confidence sets for the root node. First, we generate trees from the linear preferential attachment model, where we vary the tree size from $n = 5{,}000$ to $n = 100{,}000$.
As a simple illustration of our root inference procedure, we show in Figure 4a a tree of 1,000 nodes generated from the linear preferential attachment model. We construct the 95% confidence set, comprising around 30 nodes colored green and cyan, as well as the 85% confidence set, comprising 6 nodes colored green. The true root node is colored yellow and is captured in the 95% confidence set. Figure 4b shows the same except that the tree is generated from the uniform attachment model.

In our first set of simulation studies, we verify that our confidence set for the root node has the frequentist coverage predicted by the theory. We generate trees from the general preferential attachment model $\mathrm{PA}_\varphi$ where we take $\varphi(d) = d$, $\varphi(d) = 1$, $\varphi(d) = 8 + d$, and $\varphi(d) = 8 - d$, which correspond, respectively, to the linear preferential attachment, uniform attachment, mixed LPA and UA, and uniform attachment on an 8-regular tree models. We generate 200 independent trees, calculate our confidence sets, and report the percentage of trials in which the confidence set captures the true root node. We summarize the results in Table 1. The findings are in full agreement with our theory: we indeed attain valid coverage.

Table 1: Empirical coverage of our confidence set for the root node. We report the average over 200 trials. Tree size is 10,000 in all cases.

In our second set of simulation studies, we analyze the size of our confidence sets for the root node. First, we generate trees from the linear preferential attachment model, varying the tree size from $n = 5{,}000$ to $n = 100{,}000$. We then compute the 95% confidence sets and report the average size of the confidence set, along with its standard deviation, over 200 independent trials. We summarize the results in Table 2. We observe that, in accordance with Theorem 6, the size of our confidence sets does not increase with the size of the tree.

Next, we perform the same experiment on linear preferential attachment trees, except that we hold the tree size constant at $n = 10{,}000$ and instead vary the confidence level from 0.90 to 0.95 to 0.99. We report the average size of the confidence set, along with its standard deviation, over 200 independent trials, and summarize the results in Table 3. To put these results in perspective, we also compute the size of the confidence set given by the bound $C \log^2(1/\epsilon)/\epsilon^4$ from Bubeck, Devroye and Lugosi (2017, Theorem 6). The constant $C$ arises from complicated approximations; we use $C = 0.23$ as a conservative lower bound and justify this choice in Section S4 in the appendix. We find that the bound, though theoretically beautiful, yields confidence sets that are far too conservative to be useful.

We then analyze the size of the confidence sets under the uniform attachment model, in the same setting with tree size $n = 10{,}000$; the results are summarized in Table 4. We observe that the size of the confidence set grows much more with the confidence level under the linear preferential attachment model than under the uniform attachment model. This is in accordance with Theorem 6 and the theoretical analysis of Bubeck, Devroye and Lugosi (2017). We also compare the size of our confidence sets with the bound of $2.5 \log(1/\epsilon)/\epsilon$ that arises from Bubeck, Devroye and Lugosi (2017, Theorem 4). We note that Bubeck, Devroye and Lugosi (2017) also give a bound of $a \exp\!\big(b \log(1/\epsilon)/\log\log(1/\epsilon)\big)$, but this is far too large for any conservative values of $a, b$.

Table 2: Size of the 95% confidence set under the linear preferential attachment model as the tree size varies. We report the average and standard deviation over 200 trials.
Number of nodes: 5,000 | 10,000 | 20,000 | 100,000
Size of confidence set: 31.31 ± 11.55 | 34.23 ± 13.6 | 35.85 ± 15 | 36.68 ± 12.5

Table 3: Size of the confidence set under the linear preferential attachment model for a tree of $n = 10{,}000$ nodes. We also give the best bound on the size from Bubeck, Devroye and Lugosi (2017, Theorem 6), $C \log^2(1/\epsilon)/\epsilon^4$ (letting $C = 0.23$ as a conservative bound; see Section S4), for comparison purposes.

Next, we illustrate Algorithm 2 for sampling from the uniform distribution on the set of histories of a labeled tree. We generate a single tree of 300 nodes from the linear preferential attachment model, shown in Figure 5a. We select three nodes, colored red, blue, and green, and draw 500 samples from the conditional distribution of the history to infer the conditional distribution of the arrival times of these three nodes. The true arrival time of the red node is 3, that of the blue node is 50, and that of the green node is 200; the true root node is shown in yellow. The inferred conditional distributions of the arrival times are shown in Figure 5b. We observe that the conditional distributions of the arrival times reflect the "centrality" of the three highlighted nodes.

Table 4: Size of the confidence set under the uniform attachment model for a tree of $n = 10{,}000$ nodes. We also give the best bound on the size from Bubeck, Devroye and Lugosi (2017, Theorem 3), $2.5 \log(1/\epsilon)/\epsilon$, for comparison purposes.
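For orientation on how conservative these bounds are, they can be evaluated directly; the following quick computation (our own script, with $\epsilon$ the miscoverage level) reproduces the quantities quoted alongside Tables 3 and 4:

```python
import math

def lpa_bound(eps, C=0.23):
    """C log^2(1/eps) / eps^4, Bubeck, Devroye and Lugosi (2017, Theorem 6)."""
    return C * math.log(1 / eps) ** 2 / eps ** 4

def ua_bound(eps):
    """2.5 log(1/eps) / eps, the uniform attachment bound."""
    return 2.5 * math.log(1 / eps) / eps

for eps in (0.10, 0.05, 0.01):
    print(f"eps={eps}: LPA bound ~ {lpa_bound(eps):,.0f}, UA bound ~ {ua_bound(eps):,.0f}")
# roughly: eps=0.1  -> 1.2e4 and 58
#          eps=0.05 -> 3.3e5 and 150
#          eps=0.01 -> 4.9e8 and 1,151
```

At $\epsilon = 0.05$, the Theorem 6 bound already exceeds $3 \times 10^5$ nodes, against observed confidence sets of roughly 30 to 40 nodes in Table 2.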
In this section, we run our method on a flu transmission network from Hens et al. (2012). The data set originates from an A(H1N1)v flu outbreak in a London school in April 2009. Patient-zero was a student who returned from travel abroad. After the outbreak, researchers used contact tracing to reconstruct a network of inter-personal contacts between 33 pupils in the same class as patient-zero, depicted in Figure 6a, where patient-zero is colored yellow. Using knowledge of the true patient-zero, the times of symptom onset among the infected students, and epidemiological models, Hens et al. (2012) reconstructed a plausible infection tree, which is shown in Figure 6b.

We first consider the plausible infection tree reconstructed by Hens et al. (2012) and ask whether we can determine patient-zero from the connectivity structure of the tree alone. We assume that the observed tree is shape exchangeable and apply the root inference procedure described in Section 3.2. We construct the 95% confidence set, which comprises the group of 10 nodes colored green (and patient-zero, colored yellow), as well as the 85% confidence set, which comprises the 4 nodes whose conditional root probabilities are labeled in red. The true patient-zero, colored yellow, is the node with the third highest conditional root probability, and it is captured by both confidence sets.

Next, we study the contact network in Figure 6a. The network is highly non-tree-like, so we first reduce it to the tree case by generating a random spanning tree: we attach an independent Gaussian weight to each edge and extract the minimum spanning tree via Kruskal's algorithm (we note that this is not the uniform random spanning tree). We then apply our root inference procedure to the random spanning tree and compute the 95% and 85% confidence sets. We repeat this procedure 200 times (with 200 independent random spanning trees) and report the average sizes of the confidence sets, as well as their coverage, in Table 5. We observe that although the random spanning trees are not necessarily shape exchangeable, our root inference procedure is still able to provide useful output. We believe that the same approach can be used to perform history inference on a randomly growing network that is not necessarily a tree; we defer a detailed study of this approach to future work.

Table 5: Average sizes and coverage of the confidence sets over 200 random spanning trees of the contact network in Figure 6a.

In this paper, we consider the specific setting where the shape of the infection tree is known but the infection ordering is unobserved and must be inferred. In many real-world applications, such as the contact tracing used for infectious disease containment, we do not observe the exact infection tree but rather a network of interactions among a group of individuals. In these cases, our methods may be applied as a heuristic on a spanning tree of the observed network. We defer a careful study of inference on the history of a general network to future work. Another open question is how to incorporate the side information that is often present on the edges. For example, in contact tracing, each edge may be associated with a time stamp of when that edge was formed. The time stamps may be noisy, because interviewed patients may not remember the timing of their interactions perfectly, but this information could still be valuable in providing a more precise inferential result.
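The spanning-tree reduction used in the data analysis above is straightforward to implement; the following is a minimal networkx-based sketch (the graph `G` is a placeholder for the observed contact network, and the function name is ours):

```python
import random
import networkx as nx

def random_spanning_tree(G, rng=random):
    """Attach an independent Gaussian weight to each edge and extract the
    minimum spanning tree via Kruskal's algorithm. This gives a random
    spanning tree, though not the *uniform* random spanning tree."""
    H = G.copy()
    for u, v in H.edges():
        H[u][v]["weight"] = rng.gauss(0.0, 1.0)
    return nx.minimum_spanning_tree(H, algorithm="kruskal", weight="weight")

# Repeat over many independent draws, apply the root inference procedure to
# each spanning tree, and average the resulting confidence sets as in Table 5.
```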
The second author would like to thank Alexandre Bouchard-Côté and Jason Klusowski for very helpful conversations. The authors would also like to thank Jean-Gabriel Young for pointing us to some recent related work in the physics community and Tauhid Zaman for providing some additional references. This work is partially supported by NSF Grant DMS-1454817.

Supplementary material to 'Inference on the History of a Randomly Growing Tree'
Harry Crane and Min Xu

S1 Proof of Theorem 6

Proof. Let $T_n$ be a random recursive tree, let $T_n^*$ be a labeled representation of the observed shape $\mathrm{sh}(T_n)$, and let $\rho \in \Theta([n], U_n)$ be an isomorphism such that $\rho T_n = T_n^*$. Suppose there exist a scoring function $\psi$ and, for every $\epsilon \in (0, 1)$, an integer $K(\epsilon)$ such that $C_{K(\epsilon), \psi}(\cdot)$ (defined in Section 2.3) is an asymptotically valid confidence set for the root node with coverage at least $1 - \epsilon$. More precisely, we suppose $\liminf_{n \to \infty} P(\mathrm{root}_\rho(T_n) \in C_{K(\epsilon), \psi}(T_n^*)) \ge 1 - \epsilon$ for any $\epsilon \in (0, 1)$. Let $\delta \in (0, 1)$. Then there exists a real-valued sequence $\mu_n \to 0$ such that, for any $\rho \in \Theta([n], U_n)$,
$$P(\tilde T_1 \in C_{K(\delta\epsilon), \psi}(\tilde T_n)) = \sum_{\pi \in \Theta([n], U_n)} P(\Pi(1) \in C_{K(\delta\epsilon), \psi}(\Pi T_n) \mid \Pi = \pi) P(\Pi = \pi) = P(\rho(1) \in C_{K(\delta\epsilon), \psi}(\rho T_n)) = P(\mathrm{root}_\rho(T_n) \in C_{K(\delta\epsilon), \psi}(\rho T_n)) \ge 1 - \delta\epsilon + \mu_n. \tag{S1.1}$$
For any labeled tree $\tilde t_n$, we have by definition (17) that $B_\epsilon(\tilde t_n)$ is the smallest labeling-equivariant subset of $U_n$ such that $P(\tilde T_1 \in B_\epsilon(\tilde t_n) \mid \tilde T_n = \tilde t_n) \ge 1 - \epsilon$. Hence, if $K_\epsilon(\tilde t_n) > K(\delta\epsilon)$, then it must be that $P(\tilde T_1 \in C_{K(\delta\epsilon), \psi}(\tilde T_n) \mid \tilde T_n = \tilde t_n) \le 1 - \epsilon$. Therefore, we have from (S1.1) that
$$1 - \delta\epsilon + \mu_n \le P(\tilde T_1 \in C_{K(\delta\epsilon), \psi}(\tilde T_n)) = \sum_{\tilde t_n} P(\tilde T_1 \in C_{K(\delta\epsilon), \psi}(\tilde T_n) \mid \tilde T_n = \tilde t_n) P(\tilde T_n = \tilde t_n) \le P(K_\epsilon(\tilde T_n) \le K(\delta\epsilon)) + (1 - \epsilon) P(K_\epsilon(\tilde T_n) > K(\delta\epsilon)).$$
By simple algebra, we obtain $P(K_\epsilon(\tilde T_n) > K(\delta\epsilon)) \le \delta - \mu_n/\epsilon$. By Bubeck, Devroye and Lugosi (2017, Theorem 5), when $T_n$ has the uniform attachment distribution, there exists a scoring function $\psi$ such that for any $\epsilon \in (0, 1)$, the set $C_{K_{ua}(\epsilon), \psi}(\cdot)$ contains the root with probability at least $1 - \epsilon$ asymptotically. The first part of the theorem thus follows. We obtain the other two claims of the theorem in identical fashion, using Bubeck, Devroye and Lugosi (2017, Theorem 6) and Khim and Loh (2017, Corollary 1).

S2 Proof of Theorem 1

Proof. Before proceeding to the proof, we first establish some helpful notation. For any labeled trees $t, t'$, not necessarily recursive, we define the set of isomorphisms as $I(t, t') := \{\pi \in \Theta(V(t), V(t')) : \pi t = t'\}$. And, for $u \in V(t)$ and $v \in V(t')$, we also define the restricted set of isomorphisms as $I(t, u, t', v) := \{\pi \in \Theta(V(t), V(t')) : \pi t = t', \pi(u) = v\}$. We note that $I(t, t)$ is the set of automorphisms of $t$. We have the following facts:

Fact 1. $I(t, t')$ is non-empty if and only if $t$ and $t'$ have the same shape. Moreover, the cardinality of $I(t, t')$ depends only on that shape.

Fact 2. $I(t, u, t', u')$ is non-empty if and only if $(t, u)$ and $(t', u')$ have the same rooted shape, and the cardinality of $I(t, u, t', u')$ depends only on that rooted shape. As a consequence, $I(t, u, t, v)$ is non-empty if and only if $v \in \mathrm{Eq}(u, t)$.

Fix $(t, u)$ and $(t', u')$ and suppose that they have the same rooted shape. If $\#\mathrm{Eq}(u, t) = 1$, then $\#I(t, u, t', u') = \#I(t, t')$. In general, we have that
$$\#I(t, t') = \sum_{u' \in V(t')} \#I(t, u, t', u') = \#I(t, u, t', u') \, \#\mathrm{Eq}(u, t).$$
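The rooted-shape equivalence classes $\mathrm{Eq}(u, t)$ appearing in Fact 2 can be computed with the classical AHU canonical encoding of rooted trees: two nodes $u, v$ lie in the same class exactly when the rooted trees $(t, u)$ and $(t, v)$ receive the same code. A minimal sketch (function names ours; the plain recursion suits small examples):

```python
def rooted_code(adj, root, parent=None):
    """Canonical (AHU) string encoding of the tree adj rooted at root."""
    codes = sorted(rooted_code(adj, w, root) for w in adj[root] if w != parent)
    return "(" + "".join(codes) + ")"

def eq_classes(adj):
    """Partition the nodes of adj into the classes Eq(u, t) of equal rooted shape."""
    groups = {}
    for u in adj:
        groups.setdefault(rooted_code(adj, u), []).append(u)
    return list(groups.values())

# Star on four nodes: the hub is alone in its class; the three leaves share one.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(eq_classes(adj))  # [[0], [1, 2, 3]]
```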
Recall that any history $\tilde t_1 \subset \tilde t_2 \subset \ldots \subset \tilde t_n$ can be represented as a pair $(t_n, \pi)$, where $t_n$ is a recursive tree such that $\mathrm{sh}(t_n) = \mathrm{sh}(\tilde t_n)$ and $\pi$ is a bijection from $[n]$ to $U_n$. Similarly, any pair $(t_n, \pi)$ can be represented as a history by taking $\tilde t_k = \pi t_k$ for all $k \in [n]$. We then have that $\pi \in I(t_n, 1, \tilde t_n, u)$ if and only if $(t_n, \pi) \in \mathrm{hist}(\tilde t_n, u)$. Let $\Pi$ be a random bijection distributed uniformly on $\Theta([n], U_n)$, independently of $T_n$, such that $\tilde T_n = \Pi T_n$. We have, for any $\tilde t_n$ with $V(\tilde t_n) = U_n$ and $u \in U_n$,
$$\sum_{(t_n, \pi) \in \mathrm{hist}(\tilde t_n, u)} P(\tilde T_1 = \pi t_1, \ldots, \tilde T_{n-1} = \pi t_{n-1}, \tilde T_n = \pi t_n) = \sum_{(t_n, \pi) \in \mathrm{hist}(\tilde t_n, u)} P(T_1 = t_1, \ldots, T_{n-1} = t_{n-1}, T_n = t_n) P(\Pi = \pi) = \frac{1}{n!} \sum_{(t_n, \pi) \in \mathrm{hist}(\tilde t_n, u)} P(T_1 = t_1, \ldots, T_{n-1} = t_{n-1}, T_n = t_n) = \frac{1}{n!} \sum_{t_n \in \mathrm{recur}(\tilde t_n, u)} P(T_n = t_n) \, \#I(t_n, 1, \tilde t_n, u) = \frac{\#I(t_n, \tilde t_n)}{\#\mathrm{Eq}(u, \tilde t_n) \, n!} \sum_{t_n \in \mathrm{recur}(\tilde t_n, u)} P(T_n = t_n) = \frac{\#I(t_n, \tilde t_n)}{n!} L(u, \tilde t_n).$$
Thus,
$$P(\tilde T_1 = \{u\} \mid \tilde T_n = \tilde t_n) = \frac{\sum_{(t_n, \pi) \in \mathrm{hist}(\tilde t_n, u)} P(\tilde T_1 = \pi t_1, \ldots, \tilde T_{n-1} = \pi t_{n-1}, \tilde T_n = \pi t_n)}{\sum_{(t_n, \pi) \in \mathrm{hist}(\tilde t_n)} P(\tilde T_1 = \pi t_1, \ldots, \tilde T_{n-1} = \pi t_{n-1}, \tilde T_n = \pi t_n)} = \frac{L(u, \tilde t_n)}{\sum_{v \in U_n} L(v, \tilde t_n)}.$$
The final equality in the statement of the theorem follows from the observation that
$$\frac{P((T_n, 1) \in \mathrm{sh}_0(\tilde t_n, u))}{\#\mathrm{Eq}(u, \tilde t_n)} = \frac{1}{\#\mathrm{Eq}(u, \tilde t_n)} \sum_{t_n \in \mathrm{recur}(\tilde t_n, u)} P(T_n = t_n) = L(u, \tilde t_n)$$
and that
$$P(T_n \in \mathrm{sh}(\tilde t_n)) = \sum_{v \in U_n} \frac{P((T_n, 1) \in \mathrm{sh}_0(\tilde t_n, v))}{\#\mathrm{Eq}(v, \tilde t_n)} = \sum_{v \in U_n} L(v, \tilde t_n),$$
where we divide by the size of the equivalent node class to adjust for double counting. The theorem then follows as desired.

S3 Confidence sets for arrival times

Recall that for a random recursive tree $T_n$, we define $\tilde T_1, \ldots, \tilde T_n$ as the corresponding label-randomized sequence of trees. For a given labeled tree $\tilde t_n$ with $V(\tilde t_n) = U_n$, a node $u \in U_n$, and $\epsilon \in (0, 1)$, define $B_\epsilon^{(u)}(\tilde t_n)$ as the smallest subset of $[n]$ such that
$$P(\Pi^{-1}(u) \in B_\epsilon^{(u)}(\tilde t_n) \mid \tilde T_n = \tilde t_n) = \sum_{t \in B_\epsilon^{(u)}(\tilde t_n)} P(u \in \tilde T_t \text{ and } u \notin \tilde T_{t-1} \mid \tilde T_n = \tilde t_n) \ge 1 - \epsilon, \tag{S3.3}$$
where $\Pi$ is a random bijection distributed uniformly on $\Theta([n], U_n)$ and independently of $T_n$ such that $\Pi T_n = \tilde T_n$, and where we take $\tilde T_0$ to be the empty set. Then we have the following guarantee.

Proposition S1. Let $T_n$ be a random recursive tree and let $T_n^* \in \mathrm{sh}(T_n)$ be any labeled representation such that $V(T_n^*) = U_n$. Then, for any $\rho \in \Theta([n], U_n)$ such that $\rho T_n = T_n^*$ and any $\epsilon \in (0, 1)$,
$$P(\mathrm{Arr}^{(u)}_\rho(T_n) \in B_\epsilon^{(u)}(T_n^*)) \ge 1 - \epsilon.$$
Moreover, $B_\epsilon^{(u)}(\cdot)$ is labeling-equivariant.

Proof. We closely follow the proof of Theorem 1. Let $\tilde t_n$ be a labeled tree with $V(\tilde t_n) = U_n$ and let $u \in U_n$ be a node. We first show labeling-equivariance: we claim that for any $\tau \in \Theta(U_n, U_n)$, $B_\epsilon^{(u)}(\tilde t_n) = B_\epsilon^{(\tau(u))}(\tau \tilde t_n)$. To see this, note that $\{\tau \tilde T_1, \ldots, \tau \tilde T_n\} \overset{d}{=} \{\tilde T_1, \ldots, \tilde T_n\}$ and thus
$$P(u \in \tilde T_t \text{ and } u \notin \tilde T_{t-1} \mid \tilde T_n = \tilde t_n) = P(u \in \tau^{-1} \tilde T_t \text{ and } u \notin \tau^{-1} \tilde T_{t-1} \mid \tau^{-1} \tilde T_n = \tilde t_n) = P(\tau(u) \in \tilde T_t \text{ and } \tau(u) \notin \tilde T_{t-1} \mid \tilde T_n = \tau \tilde t_n).$$
Now let $\Pi$ be a random bijection distributed uniformly on $\Theta([n], U_n)$, independently of $T_n$, and let $\tilde T_n = \Pi T_n$. We have, for any $\epsilon \in (0, 1)$,
$$P(\mathrm{Arr}^{(u)}_\rho(T_n) \in B_\epsilon^{(u)}(T_n^*)) = P(\rho^{-1}(u) \in B_\epsilon^{(u)}(\rho T_n)) = P(\Pi^{-1}(u) \in B_\epsilon^{(u)}(\Pi T_n) \mid \Pi = \rho) = P(\Pi^{-1}(u) \in B_\epsilon^{(u)}(\tilde T_n)) \ge 1 - \epsilon,$$
where the second-to-last equality holds because, by labeling-equivariance, the conditional probability is the same for every value of $\Pi$, and the last inequality follows because (S3.3) holds for every $\tilde t_n$.

S4 Lower bound on the constant C

In Section 5.1, we compare the size of our confidence sets against the bound $C \log^2(1/\epsilon)/\epsilon^4$ provided in Bubeck, Devroye and Lugosi (2017) for the linear preferential attachment setting. The value of the universal constant $C$ is difficult to determine since it depends on a non-normal limiting distribution described only through its characteristic function; see the discussion there for more details. We claim, however, that $C \ge 0.23$. To see this, note that when $\epsilon \le 0.5$, any confidence set must contain at least 2 nodes, since it is impossible to estimate the root with probability greater than 0.5. Therefore, with $\epsilon = 0.49$, $C \ge 2\epsilon^4 \log^2(1/\epsilon)^{-1} \approx 0.23$.
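Returning to the arrival-time sets of Proposition S1: in the shape-exchangeable case, where uniform histories are the exact conditional law, $B_\epsilon^{(u)}$ admits a simple Monte Carlo approximation. A sketch, reusing `sample_history` and the root weights `nodes`, `root_w` computed as in the earlier importance-sampling sketch (names ours):

```python
import random
from collections import Counter

def arrival_time_set(adj, u, nodes, root_w, eps=0.05, M=2000):
    """Estimate B_eps^{(u)}: sample histories, tabulate the arrival time of u,
    and keep the most probable times until their mass reaches 1 - eps."""
    counts = Counter()
    for _ in range(M):
        root = random.choices(nodes, weights=root_w, k=1)[0]
        h = sample_history(adj, root)
        counts[h.index(u) + 1] += 1       # arrival time of u in this history
    keep, mass = [], 0.0
    for t in sorted(counts, key=counts.get, reverse=True):
        keep.append(t)
        mass += counts[t] / M
        if mass >= 1 - eps:
            break
    return sorted(keep)
```

This mirrors the experiment of Figure 5, where the conditional arrival-time distributions of three highlighted nodes were estimated from 500 sampled histories.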
References

Emergence of scaling in random networks.
The degree sequence of a scale-free random graph process.
Finding Adam in random growing trees.
From trees to seeds: on the inference of the seed from large trees in the uniform attachment model.
On the influence of the seed graph in the preferential attachment model.
Network robustness and fragility: percolation on random graphs.
Recovering the past states of growing trees.
The ubiquitous Ewens sampling formula.
On the discovery of the seed in uniform attachment trees.
Predicting the sources of an outbreak with a spectral technique.
Statistical tables for biological, agricultural and medical research.
Consistent estimation in general sublinear preferential attachment trees.
Counting unlabelled subtrees of a tree is #P-complete.
Robust reconstruction and analysis of outbreak data: influenza A(H1N1)v transmission in a school-based population.
Persistence of centrality in random growing trees.
Networks and epidemic models.
Confidence sets for the source of a diffusion in regular trees.
Fundamental Algorithms.
Statistical Analysis of Network Data: Methods and Models.
TIMES: Temporal Information Maximally Extracted from Structures.
Rumors in a network: Who's the culprit?
Finding rumor sources on random trees.
Source detection of rumor in social network: a review.
Inferring temporal information from a snapshot of a dynamic network.
Choosing among alternative histories of a tree.
Phase transition in the recoverability of network history.