key: cord-0137442-44dtnkvf
authors: Wang, Lijia; Tong, Xin; Wang, Y. X. Rachel
title: Statistics in everyone's backyard: an impact study via citation network analysis
date: 2021-10-16
journal: nan
DOI: nan
sha: 43bd3cbbd499d7c49ebc799396ff6fd11c0754a0
doc_id: 137442
cord_uid: 44dtnkvf

The increasing availability of curated citation data provides a wealth of resources for analyzing and understanding the intellectual influence of scientific publications. In the field of statistics, current studies of citation data have mostly focused on the interactions between statistical journals and papers, limiting the measure of influence to mainly within statistics itself. In this paper, we take the first step towards understanding the impact statistics has made on other scientific fields in the era of Big Data. By collecting comprehensive bibliometric data from the Web of Science database for selected statistical journals, we investigate the citation trends and compositions of citing fields over time to show that their diversity has been increasing. Furthermore, we use the local clustering technique involving personalized PageRank with conductance for size selection to find the most relevant statistical research area for a given external topic of interest. We provide theoretical guarantees for the procedure and, through a number of case studies, show the results from our citation data align well with our knowledge and intuition about these external topics. Overall, we have found that the statistical theory and methods recently invented by the statistics community have made increasing impact on other scientific fields.

As a discipline that focuses on the collection, analysis and interpretation of data, statistics is outward facing and often serves as a tool in other scientific investigations. The age of Big Data has brought about new challenges and opportunities in many fields, where the postulation, verification and refinement of scientific models rely on empirical data. In this sense, one would expect statistics to play an increasingly important role in these fields as the need for methods and tools for handling large, complex data increases. On the other hand, much of the fundamental research in statistical theory and methods requires rigorous mathematical arguments and abstract formulations for generalizability. It can be argued that the technical nature of such works serves as a barrier, making direct adoption of research developments difficult in other fields. In this paper, we consider measuring the impact of theoretical and methodological research in statistics on other scientific disciplines in recent decades. As John Tukey deftly put it: "the best thing about being a statistician is that you get to play in everyone's backyard."

One direct way to measure the impact of academic works is through citation data. In the digital age, comprehensive bibliometric studies have been made possible by the existence of citation databases such as Web of Science and Scopus. From these databases, citations between papers can be extracted, represented as a network, and studied using network analysis techniques. These citation networks have been used to track the movements of ideas and measure the distance between different scientific fields [1, 2]. Coauthorship networks can also be constructed from publication records for studying the structure of collaboration patterns [3, 4]. More specifically in statistics, [5] and [6] used the Bradley-Terry model to measure the import and export of knowledge between statistical journals. [7] collected and analyzed citation and coauthorship networks for papers in top statistical journals. Rather than focusing on the structure of citation patterns inside statistics, we provide the first comprehensive study analyzing the connections between statistics and other fields.

We collect citation information for papers published in selected statistical journals from the Web of Science (WoS) Core Collection. These published papers are termed source papers for being the source of knowledge export; our complete data contains citations between source papers as well as their citations by papers (termed citing papers) in other journals and fields. Using descriptive statistics, we characterize the trends of citation volumes and compositions of citing fields for the source papers over time, paying attention to fields external to statistics. We compare the internal and external citations for highly cited source papers and identify the corresponding statistical research areas highly ranked by both criteria. Citation trend analysis of these areas allows us to associate them with external fields on which they have made an intellectual impact.

Given a network, one of the most commonly used analysis techniques is community detection, also known as node clustering. On the citation network for source papers, global clustering techniques can be used to partition the nodes into densely connected communities as has been done in [7], offering a global view of various research areas within statistics. However, in this paper, we are more interested in connecting these communities in statistics with research topics in other disciplines they have cast an influence on. That is, given an external research topic, we consider finding the most relevant community in statistics, with relevance measured by the citation data. A local clustering perspective is particularly suitable in this case since i) we expect the relevant community to be small compared with the whole network, making it challenging to detect by global clustering methods; and ii) the citations between the source papers and the citing papers give a natural way of finding "seed nodes" for local clustering algorithms.

A large class of local clustering algorithms is based on seed expansion: given a small subset of seed nodes from a community of interest, the rest of the community is detected by ranking the other nodes according to the landing probabilities of random walks started from the seeds. Different classes of algorithms correspond to different ways of combining these probabilities for random walks of different lengths, with the most popular ones being versions of personalized PageRank (PPR) [8, 9, 10] and heat kernels [11, 12] . These algorithms have been widely applied to large-scale real networks with much empirical success. More recently, attention has been paid to studying their theoretical properties on networks generated from the stochastic block model (SBM, [13] ) and its variant. [14] showed that PPR corresponds to the optimal linear classifier under a suitable two-block SBM. Using the more general degree-corrected SBM (DC-SBM, [15] ), [16] showed that PPR can include high-degree nodes outside the community of interest, while using the adjusted PPR (aPPR) algorithm in [17] can correct the degree bias, achieving consistency in the detection of the target community with high probability.

After the nodes have been ranked in terms of their relevance to the target community, it remains to choose the size of the local cluster and cut the sorted list of nodes at the desired size. A scoring function is thus needed to evaluate the quality of the communities found along the sorted list. One of the most widely used scoring functions is conductance [17, 18, 19], which measures the fraction of total edge volume that points outside the cluster. A smaller conductance indicates the cluster is more separated from the rest of the network, hence more likely to be a community on its own. Assessing the performance of various scoring functions on a large number of real networks, [18] showed conductance consistently gives good performance in identifying ground-truth communities. The theoretical properties of conductance, however, have not been investigated under the local clustering setting with generative network models. For our local clustering procedure, we adopt aPPR followed by conductance local minimization. Under the DC-SBM, we show that with high probability, this procedure finds all the nodes in the community to which the seed nodes belong.

The rest of the paper is organized as follows. In Section 2, we describe the data collection procedure and various covariates used in our analysis. We provide a summary of citation trends over time and citation distributions for each journal. In Section 3, we study the diversity of citing fields by grouping the citing papers according to the research areas they belong to. In particular, we determine which highly cited source papers have high citations both within statistics and outside statistics, as well as those that appear to have a larger impact on one side of the audience. In Section 4, we describe our local clustering procedure for finding the statistical community most relevant to an external research topic. We provide theoretical analysis of its behavior under the DC-SBM and demonstrate its performance on simulated data and a number of case studies from our citation network. We end the paper with a discussion of the merits and limitations of our study, pointing to directions in which it can be extended in the future.

2 Data collection and overview of citation trends

We conducted our study on all the papers published from 1995 to 2018 in five influential statistics journals: Annals of Applied Statistics (AOAS), Annals of Statistics (AOS), Biometrika, Journal of the American Statistical Association (JASA) and Journal of the Royal Statistical Society: Series B (JRSSB) 1 . Using a Python script, we crawled the bibliographic database Web of Science (WoS) Core Collection to collect citation data for a total of 9,338 papers published in these journals in the time span considered. We only included publications whose document types are listed as "article" in WoS. We call these publications source papers since they act as a source of knowledge for papers citing them. Among our selected journals, AOS, Biometrika, JASA, and JRSSB are considered by many researchers in the statistics community as top outlets for theory and method works. We have also included AOAS as a representative journal with a broad applied focus.

For each source paper, the WoS database provides a list of papers citing it and the corresponding publication information. We finished extracting these lists before December 2020. In addition to the citations between the source papers, 264,356 papers from other journals (or from the selected five statistics journals but published in 2019 and 2020) cited these source papers; these papers are called citing papers. 2 Rather than limiting to "article" as we did for the source papers, the citing papers can be of any document type. Based on the lists of citations, we build a citation network that consists of 273,694 nodes including all the source and citing papers, and edges representing citations between the source papers and from the citing papers to the source papers.

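The above citation network can be represented by a binary adjacency matrix $A \in \{0, 1\}^{273694 \times 9338}$, in which

$$A_{ij} = \begin{cases} 1 & \text{if paper } i \text{ cites source paper } j, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$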

In this matrix, we assign each source paper to an index in $I_s = \{1, \ldots, 9338\}$ and each citing paper to an index in $I_c = \{9339, \ldots, 273694\}$. Our current study does not contain citations from the source papers to the citing papers since we are primarily interested in the impact of source papers on other scientific works.
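As an illustration of this representation, the following minimal sketch stores the binary matrix sparsely, assuming the citation data are available as (citing index, cited source index) pairs with 1-based indices. The function names `build_adjacency` and `citation_counts` are hypothetical and not part of the original study.

```python
def build_adjacency(citations, n_source):
    """Sparse binary adjacency: row i -> set of source columns j with A[i][j] = 1.

    `citations` is an iterable of (i, j) pairs meaning "paper i cites source
    paper j"; indices are 1-based (an assumption made for this sketch).
    """
    rows = {}
    for i, j in citations:
        if not 1 <= j <= n_source:
            raise ValueError(f"column {j} is not a source paper index")
        # Using a set keeps the matrix binary even if a pair is repeated.
        rows.setdefault(i, set()).add(j)
    return rows


def citation_counts(rows, n_source):
    """Column sums of A: number of papers citing each source paper."""
    counts = [0] * (n_source + 1)  # position 0 is unused (1-based indexing)
    for cols in rows.values():
        for j in cols:
            counts[j] += 1
    return counts
```

With three source papers and two citing papers, `build_adjacency([(4, 1), (5, 1), (2, 1), (4, 3)], 3)` records that source paper 1 is cited by papers 2, 4 and 5, and source paper 3 by paper 4.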

We obtained the publication information for both source and citing papers from the WoS database. In particular, the following variables are central to our analysis: (1) article title, (2) publication source title (e.g., journal or conference names), (3) publication year, (4) author keywords, (5) abstract, (6) WoS categories (e.g., "Statistics & Probability" and "Mathematical & Computational Biology"), and (7) research areas (e.g., "Mathematics"). In our dataset, only 98 of all the papers do not have any specified categories (nor research areas), thus we label their categories (and research areas) as "NA". We use the broad research areas to classify the general field of each paper and the WoS categories to provide finer classifications when statistics needs to be distinguished from other research fields. More discussions on the division and field classification can be found in Section 3.

1 We include both publication names JRSSB used during 1995-1997.

2 The accessibility of citing papers depends on the university library VPN used to access the WoS database.

Furthermore, in Section 3, we use the variables (2) publication source title, (3) publication year and (7) research areas to illustrate the change of impact on external and internal areas over time for the selected five journals and their highly cited papers. In Section 4, we select papers from target topics based on (1) article title and (5) abstract, and validate the local community found using (4) author keywords. More details about the usage of these variables will be presented in the respective sections.

As shown in Figure 1a, the vast majority of source papers have fewer than 500 citations, with 1.92% of the source papers receiving zero citations. Figure 1b further plots the distribution of the citation counts for source papers with citations from 0 to 500. We observe that removing the zero-citation papers would lead to a power-law distribution of the citation counts. Notably, only 0.06% of the papers (6 papers) received more than 5,000 citations. Highly cited papers like these will be discussed in more detail in Section 3.2. Looking at the trends over the years, the total number of citations for each journal grows consistently (Supplementary Figure S1), and the growth is not due to the journals expanding their volumes of publications. In fact, there was no significant increase in the annual number of publications in each journal (Supplementary Figure S2) except AOAS. AOAS was established in 2007 and subsequently went through a fast growth period before stabilizing. To account for the effect of publication numbers, for each year T, we normalize the annual citation count for each journal by the total number of published papers from 1995 to T in that journal, since any citing paper published in year T is free to cite source papers in the period 1995-T. Figure 2a shows that the normalized citations still increase consistently over the years for all the journals, among which JRSSB enjoys substantially more citations per article after 2002. AOAS' normalized citations have been growing quickly as a relatively new journal. It is clear that citation counts are not distributed equally across all the papers, and one possible way to measure citation inequality is through the Lorenz curves [2, 7]. For journal j, define

$$L(p) = \frac{\sum_{i=1}^{\lfloor p N_j \rfloor} d_{(i)}}{\sum_{i=1}^{N_j} d_{(i)}},$$

where $N_j$ is the number of publications, $p$ is the percentage, and $d_{(1)}, d_{(2)}, \ldots, d_{(N_j)}$ are the citation numbers in a non-decreasing order of papers in journal $j$ published in 1995-2018. $L(p)$ calculates the percentage of citations shared by the bottom $p$ percent of papers as a measure of inequality. Figure 2b plots $L(p)$ as a Lorenz curve for each journal, with curves closer to the bottom right corner indicating a greater extent of inequality. Most journals have highly similar curves, while JRSSB appears to have the most significant inequality. This can be explained by the fact that there are four papers that each received more than 5,000 citations, accounting for 49.5% of the total citations towards JRSSB in this period. (Recall that we only have six papers in our dataset with citation numbers exceeding 5,000.) After removing these four papers, the normalized citation counts for JRSSB become much closer to the other journals but remain the highest of all journals (Supplementary Figure S3).
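The Lorenz curve described above can be sketched in a few lines. This is only an illustrative version: it treats p as a proportion in [0, 1] and rounds p N_j down to an integer, both of which are assumptions, and the function name `lorenz` is hypothetical.

```python
import math


def lorenz(citations, p):
    """Share of all citations held by the bottom fraction p of papers.

    `citations` holds one citation count per paper in a journal; p is a
    proportion in [0, 1] (an assumption for this sketch).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    d = sorted(citations)  # non-decreasing order d_(1) <= ... <= d_(N)
    total = sum(d)
    if total == 0:
        raise ValueError("the journal has no citations")
    k = math.floor(p * len(d))  # number of papers in the bottom group
    return sum(d[:k]) / total
```

For a toy journal with citation counts 0, 1, 1 and 8, the bottom half of the papers holds one tenth of the citations, which is the kind of concentration that pulls a Lorenz curve towards the bottom right corner.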

As the overall citations for each journal increase over the years, how much of the increase can be attributed to research fields outside statistics? In this section, we break down the citations by their research fields, paying attention to the distinction between internal and external citations.

As mentioned in Section 2.1, even though the WoS categories and research areas can help us identify the research field each paper belongs to, we still have to make a decision about whether a citation should be considered inside (internal) or outside (external) of statistics. This is a subjective decision in some sense given the interdisciplinary nature of many research topics in statistics and the overlap of statistics with fields such as mathematics, computational biology and econometrics. We take the following approach, which perhaps can be viewed as conservative in estimating external impact. We consider two types of internal papers. The first type includes papers containing the tag "Statistics & Probability" in their WoS categories, which applies to all the papers published in common statistics and/or probability journals. These papers are labeled as "STATS" in our subsequent plots. The second type includes papers whose WoS categories contain the keyword "math" (e.g., "Mathematics" and "Mathematical & Computational Biology"). Additional papers selected by this step are published in journals such as Journal of Econometrics and BMC Bioinformatics, and are thus from fields reasonably close to statistics. In what follows, these papers are labeled as "MATH" and counted as internal citations. The rest of the papers are considered as external. This procedure divides our dataset into 83,503 internal and 190,191 external papers.
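The two-step labeling rule can be written as a small function. The sketch below assumes each paper comes with a list of WoS category strings and that the "math" keyword is matched case-insensitively; the name `classify_paper` and the label "EXTERNAL" are hypothetical choices for illustration.

```python
def classify_paper(wos_categories):
    """Return "STATS", "MATH" or "EXTERNAL" for a list of WoS category strings."""
    cats = [c.strip() for c in wos_categories]
    # First type of internal paper: tagged "Statistics & Probability".
    if "Statistics & Probability" in cats:
        return "STATS"
    # Second type: any category containing the keyword "math"
    # (case-insensitive matching is an assumption of this sketch).
    if any("math" in c.lower() for c in cats):
        return "MATH"
    # Everything else is counted as external.
    return "EXTERNAL"


def is_internal(wos_categories):
    """STATS and MATH papers are both counted as internal citations."""
    return classify_paper(wos_categories) in ("STATS", "MATH")
```

Checking the statistics tag first means a paper tagged with both "Statistics & Probability" and "Mathematics" is labeled STATS rather than MATH.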

We use the research areas 3 provided by WoS to classify the external papers into five broad categories: arts & humanities (ART), life sciences & biomedicine (BIO), physical sciences (PHY), social sciences (SOC), and technology (TECH). We note that the finer divisions under research areas could also be used to search for the second type of internal papers described above with "math" as a keyword, but doing so would lead to a subset of the papers already selected by the WoS categories.

Using the category labels discussed above, Figure 3a shows the research area breakdowns for all the citations over the years. If an external paper lists multiple research areas, each area is weighted equally and contributes a fractional count to the total. As expected, in the earlier years of our period of study, most of the citations are from within statistics. However, the proportion of external citations soon begins to increase at a fast pace and finally exceeds half. Among the external citations, BIO and TECH have heavy weights. The same trend for each journal separately is presented in Supplementary Figure S4 . The proportion of external citations also increases over time for all the journals, with AOAS and JRSSB having larger proportions than the others. One way to summarize the distribution of proportions and put the diversity measure for each journal on the same scale is through the use of Gini concentration [5] . Let

where s i is the proportion of citations from research category i, and we consider the same categories as shown in Figure 3a except that we combine STATS and MATH into one internal category. Journals with more diverse citations by external categories have lower Gini concentrations. Figure 3b plots the change in the Gini concentration for each journal over the years. Overall the trends agree with our results in Figure 3a and Supplementary Figure S4 . All the journals have demonstrated increasing connections with external fields, with AOAS, JASA, and JRSSB being more diverse than the others.
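The equal weighting of multiple research areas used for the breakdown in Figure 3a, and the resulting proportions per category, can be sketched as follows. The input format (one list of category labels per citing paper) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def fractional_counts(papers):
    """Each paper contributes a total weight of 1, split equally over its labels."""
    totals = defaultdict(float)
    for labels in papers:
        if not labels:
            continue  # papers without labels are skipped in this sketch
        weight = 1.0 / len(labels)
        for label in labels:
            totals[label] += weight
    return dict(totals)


def proportions(totals):
    """Normalize fractional counts so that they sum to one."""
    grand_total = sum(totals.values())
    return {label: value / grand_total for label, value in totals.items()}
```

For example, a paper listing both BIO and TECH adds 0.5 to each of the two categories, so the counts over all categories still sum to the number of citing papers.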

In the previous section, we compared the proportions of internal and external citations at an aggregated level within each journal. Now we turn to examine the internal and external impact of some specific source papers selected based on their high citation counts. Do highly cited papers always have high impact both internally and externally? To this end, we first rank the source papers according to their internal and external citation counts separately. Focusing on papers in the top 20 list by either internal or external counts, Figure 4 shows their respective ranks internally and externally. One can see that most of these papers are ranked high under both criteria except for a few outliers. We focus on the most obvious two (boxed in red) and provide their information in Table 1 and further analysis below. The first paper [20] in Table 1 ranks in the top 20 based on the internal citation counts, but its external rank is relatively lower in comparison. Since the paper is about distribution theory, unsurprisingly we find most of the citations come from fields closely related to statistics. Supplementary Table S1 provides the top 10 WoS categories and their number of occurrences among the citations, with "Statistics & Probability" appearing most often. Also, most of these categories contain the keyword "math", which explains the higher internal rank. The other categories (e.g., "Computer Science, Interdisciplinary Applications") are still closely related to statistics or mathematics. Upon removing the internal papers, the occurrences of these categories other than statistics and mathematics decrease significantly (Supplementary Table S2), suggesting many of the previous counts are contributed by internal papers with multiple category labels. Overall, the paper has reached a larger audience within statistics and mathematics, most likely due to its technical nature.

The second paper [21] in Table 1 demonstrates the opposite pattern, with a high external rank but a low internal rank. This paper proposes a practical method of evaluating and adjusting for the possibility of publication bias (e.g., a preference for positive results), a well-known phenomenon in published academic research especially in meta-analysis, and thus has attracted wide scientific interest. Supplementary Table S3 lists the top 10 most frequent WoS categories among all the citations. One can see that the list is dominated by psychiatry and psychology, while statistics or mathematics related categories are not present. This list remains almost unchanged after removing all the internal papers from the citations (Supplementary Table S4). We have additionally searched for keywords related to publication bias in the title and author keywords of the internal papers. The search only returns 59 papers, confirming the topic is less explored internally and could be a potential area for further theoretical and methodological development in statistics. We note that Figure 4 has another paper [22] with a low internal rank (469) and a high external rank (12). The paper has a similar category profile to [21] (Supplementary Table S5).

Table 2: Papers whose internal and external citations both rank in the top 20.

As can be observed in Figure 4, most papers have both high internal and external ranks. Table 2 lists all the papers that are ranked in the top 20 both internally and externally. We classify these papers roughly into five topics: Markov chain Monte Carlo (MCMC), causal inference (causal), penalized regression, false discovery rate (FDR), Bayesian model selection. To investigate the influence of these papers on other fields, we consider the aggregated citations by the five topics and break down the citations by category labels, similar to Figure 3a. In this case, we have added two category labels: "BE" for the research area "Business & Economics" and "CS" for the research area "Computer Science", since we notice a considerable number of citations are from these two areas, especially for causal inference and penalized regression. To avoid double counting, papers with the BE (or CS) label will not be counted in SOC (or TECH), which is the broad category BE (or CS) belongs to in WoS. Similar to before, multiple labels for one paper are weighted equally. Figure 5 shows that the influence on other fields differs by statistical research topics. FDR and Bayesian model selection have always attracted a substantial proportion of citations from BIO, even from the earlier years. MCMC and penalized regression have more citations from CS than the others. On the other hand, causal inference has the largest proportion of citations from SOC and BE among the five topics.

[Figure 5, panel (d) FDR: count by year, 1997 to 2018.]

We have seen that different research topics in statistics often have different citation profiles by external fields, indicating they may have a heavier influence on some fields and topics and less so on others. This prompts us to consider the question, given a specific external research topic, can we identify the most relevant statistical research topic (with relevance measured by our collected citation data)? This section investigates a local clustering approach by aPPR followed by appropriate cutoff selection. We present theoretical studies under the DC-SBM and results on simulated data. More importantly, we demonstrate the efficacy of the procedure on our citation data through several detailed case studies.

A typical local clustering method starts from one or multiple seed nodes and performs a random walk in the neighborhood of the seeds to gather other relevant nodes. In our setting, we first use keyword search to select a subset of citing papers, $I_t \subset I_c$, from an external topic of interest (see details in Section 4.3). The seed nodes are constructed using citation information between the source papers $I_s$ and the topic papers in $I_t$, and the local clustering is performed on $I_s$ and their network $A_s$. For clustering purposes, we consider two papers as related in content if a citation exists between them; the direction of this citation is less important if we think of it as a form of association. For this reason, we treat $A_s$ as an undirected network in this section. That is,

$$(A_s)_{ij} = \begin{cases} 1 & \text{if } A_{ij} = 1 \text{ or } A_{ji} = 1, \\ 0 & \text{otherwise,} \end{cases}$$

for $i, j \in I_s = \{1, \ldots, 9338\}$.
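A minimal sketch of this symmetrization, assuming the citations among source papers are given as directed (citing, cited) pairs. Dropping self-citations and the name `symmetrize` are choices made for this illustration only.

```python
def symmetrize(directed_edges, n):
    """Undirected neighbour sets for nodes 1..n from directed citation pairs.

    Nodes i and j are linked whenever i cites j or j cites i.
    """
    adj = {i: set() for i in range(1, n + 1)}
    for i, j in directed_edges:
        if i == j:
            continue  # self-loops are ignored (an assumption of this sketch)
        adj[i].add(j)
        adj[j].add(i)
    return adj
```

Storing neighbour sets rather than a dense 9338 by 9338 array keeps the memory footprint proportional to the number of citations.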

Next we present the details of the local clustering procedure and its theoretical properties under a network model with community structure. Standard order notations O, Ω, O p and Ω p will be used throughout.

In order to analyze the behavior of local clustering, we adopt the popular DC-SBM [15] , which captures both node heterogeneity and community structure, as the underlying network model. While such a model may not capture all the features of our citation network, the presence of node heterogeneity is reflected by the uneven distribution of citation counts, and it is plausible to assume the underlying communities correspond to different research topics. For convenience of notation, we will describe the DC-SBM and local clustering procedure using a general symmetric adjacency matrix A and a general set of nodes I, with the understanding that they refer to A s and I s in our data analysis.

In the original SBM [13], $N$ nodes are assigned to $K$ blocks or communities, and the probability of an edge between two nodes only depends on their community memberships. To abbreviate notations, write the set $\{1, \ldots, n\}$ as $[n]$ for any integer $n$. The set of nodes $I = [N]$ is partitioned into $K$ blocks by the function $g : [N] \to [K]$. Let $n_k$ denote the size of block $k$ and $I_k$ denote the set of nodes in block $k$ for $k \in [K]$. The proportion of members in block $k$ is $\tau_k = n_k / N$. We consider the case that the number of blocks $K$ is fixed, and $\tau_k$ is bounded below by a constant for all $k \in [K]$. The probability of an edge between nodes $i$ and $j$ is

$$P(A_{ij} = 1) = B_{g(i) g(j)},$$

where $B \in [0, 1]^{K \times K}$ is the connectivity matrix. We adopt the common parametrization for $B$ as $B = \rho_N S$, where $S$ is a fixed $K \times K$ matrix, and $\rho_N$ is the average edge density satisfying $\rho_N \to 0$ at some rate as $N \to \infty$.

DC-SBM introduces node heterogeneity by adding a degree parameter $\theta_i$ for each node $i$, so that the probability of an edge between $i$ and $j$ becomes

$$P(A_{ij} = 1) = \theta_i \theta_j B_{g(i) g(j)}.$$

Some constraint is needed on $\theta_i$ for identifiability, and we adopt the constraint $\sum_{i \in I_k} \theta_i = n_k$ for all $k \in [K]$ following Karrer and Newman [15].

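The degree of node $i$ is defined as $d_i = \sum_{j \in I} A_{ij}$. The population adjacency matrix is the conditional expectation of $A$, i.e.,

$$\mathcal{A}_{ij} = E(A_{ij} \mid g, \theta) = \theta_i \theta_j B_{g(i) g(j)}.$$

It follows then the population node degrees are $\sum_{j \in I} \mathcal{A}_{ij}$, and the expected degree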

Given an adjacency matrix $A$, define the diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$ and the graph transition matrix $P = D^{-1} A$. The personalized PageRank (PPR) vector $p \in [0, 1]^N$ is the stationary distribution of the process

$$p^\top = \alpha \pi^\top + (1 - \alpha) p^\top P,$$

where $\alpha \in (0, 1]$ is the teleportation constant, and $\pi \in [0, 1]^N$ is a probability vector called the preference vector encoding one or multiple seed nodes. For example, if there is one seed node $v_0 = 1$, $\pi = (1, 0, \ldots, 0)$. Under a network model with community structure such as SBM or DC-SBM, the goal is to recover all the nodes with the same community membership as $v_0$ by ranking the elements in the PPR vector $p$.
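The PPR vector can be approximated by simple power iteration. The sketch below assumes the standard fixed-point form p = α π + (1 − α) Pᵀ p with P = D⁻¹A, an undirected graph stored as neighbour sets, and no isolated nodes; the function name and the convergence settings are illustrative choices, not the authors' implementation.

```python
def personalized_pagerank(adj, pi, alpha=0.15, tol=1e-12, max_iter=10000):
    """Power iteration for the personalized PageRank vector.

    adj: dict node -> set of neighbours (undirected, no isolated nodes).
    pi:  dict node -> preference weight; weights should sum to 1.
    """
    if any(len(nbrs) == 0 for nbrs in adj.values()):
        raise ValueError("isolated nodes make P = D^{-1} A undefined")
    p = {v: pi.get(v, 0.0) for v in adj}  # start from the preference vector
    for _ in range(max_iter):
        # Teleportation part: alpha * pi.
        new = {v: alpha * pi.get(v, 0.0) for v in adj}
        # Random-walk part: each node spreads (1 - alpha) of its mass
        # equally over its neighbours.
        for u, nbrs in adj.items():
            share = (1.0 - alpha) * p[u] / len(nbrs)
            for v in nbrs:
                new[v] += share
        change = sum(abs(new[v] - p[v]) for v in adj)
        p = new
        if change < tol:
            break
    return p
```

Because each step contracts the L1 distance by a factor of 1 − α, the iteration converges quickly for α = 0.15; on very large graphs a sparse linear solver or an approximate push algorithm would be the more usual choice.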

In our setting, we choose source papers that have high citation counts by a set of topic papers as the seed nodes. For a source paper $j \in I_s$ and a set of topic papers $I_t$, its citation count is $a_j = \sum_{i \in I_t} A_{ij}$, where $A$ is the citation network defined in Eq (1). The preference vector $\pi \in [0, 1]^{9338}$ is calculated as

Here t is a chosen threshold constant. We extend the setting of a single seed node in [14] and [16] to multiple seed nodes, but still make the assumption that they all belong to the same community. While it is unlikely that all papers cited by a specific topic come from the same community, the threshold t helps us prune the vector π and make the assumption more reasonable.
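The seed construction can be illustrated as follows. Since the exact formula for the preference vector is not reproduced in this text, the sketch assumes that every source paper with topic citation count at least t becomes a seed and that seeds receive equal weight; both the rule "a_j ≥ t" and the uniform weighting are assumptions, as are the function and argument names.

```python
def preference_vector(rows, topic_papers, n_source, t):
    """Seed weights over source papers 1..n_source.

    rows: dict citing paper i -> set of source papers j it cites (A[i][j] = 1).
    topic_papers: indices of the citing papers selected for the topic.
    t: threshold on the topic citation count a_j.
    """
    a = [0] * (n_source + 1)  # a[j] = citations of j by topic papers; a[0] unused
    for i in topic_papers:
        for j in rows.get(i, ()):
            a[j] += 1
    seeds = [j for j in range(1, n_source + 1) if a[j] >= t]  # assumed rule
    if not seeds:
        raise ValueError("no source paper reaches the threshold t")
    weight = 1.0 / len(seeds)  # uniform weights: an assumption of this sketch
    return {j: (weight if a[j] >= t else 0.0) for j in range(1, n_source + 1)}
```

Raising t shrinks the seed set to the source papers most heavily cited by the topic, which is the pruning effect the threshold is meant to achieve.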

Related to PPR, the adjusted personalized PageRank (aPPR) vector is defined as

$$p^*_i = \frac{p_i}{d_i}, \quad i \in [N], \tag{5}$$

where $p_i$ is the $i$th entry in the PPR vector. [16] showed that under the DC-SBM, adjusting by the degrees leads to a consistent ordering of the entries in $p^*$ so that entries with the highest values belong to the target community. Formally, let $n$ be a community size cutoff. Then the $n$ nodes with the largest $p^*_i$ values are selected as members in the target community, that is

$$C_n = \{ i \in I : p^*_i \geq p^*_{(n)} \}, \tag{6}$$

where $p^*_{(1)}, \ldots, p^*_{(N)}$ is the sorted list of $p^*$ in non-increasing order.
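The degree adjustment and the size-n cluster can be sketched as below, assuming the adjustment divides each PPR entry by the node degree. Ties are broken by node label here so that exactly n nodes are returned, which is a simplification; the function name is hypothetical.

```python
def appr_cluster(p, degrees, n):
    """Rank nodes by degree-adjusted PPR and keep the top n.

    p: dict node -> PPR value; degrees: dict node -> degree (> 0).
    Returns (ranked node list, cluster of the first n nodes, adjusted values).
    """
    adjusted = {v: p[v] / degrees[v] for v in p}  # p*_i = p_i / d_i
    # Non-increasing order of p*; ties broken by node label (an assumption).
    ranked = sorted(adjusted, key=lambda v: (-adjusted[v], v))
    return ranked, set(ranked[:n]), adjusted
```

In the toy call `appr_cluster({1: 0.5, 2: 0.3, 3: 0.2}, {1: 5, 2: 1, 3: 4}, 1)`, node 1 has the largest raw PPR value but also a high degree, so node 2 is ranked first after the adjustment.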

Corollary 1 in Chen et al. [16] shows that with $v_0 = 1$ (assuming without loss of generality it belongs to block 1), provided we know the correct size cutoff $n = n_1$ (recall $n_1 = |I_1|$), the aPPR clustering can recover all the nodes in block 1 with high probability, i.e., $C_{n_1} = I_1$. Since our setting makes use of multiple seed nodes, the following proposition extends their result to better fit our situation.

Then for sufficiently large N , with probability at least

Here $\Delta_\alpha$ is an increasing function of $\alpha$. The exact form of $\Delta_\alpha$ and the proof of Proposition 1 are deferred to Supplementary Materials B.1.

Given the result in Proposition 1, it remains to choose the correct size n for C n to fully recover the target community (block 1). To achieve this, an objective function is needed to evaluate the quality of the clusters found. Conductance is a popular objective function to be optimized either globally or locally [19, 18] and often used in conjunction with a local clustering algorithm like PPR [8, 33] . It tends to favor small clusters weakly connected to the rest of the graph, and one would expect such an assortative structure in citation networks with communities defined by research topics.

For a set of nodes $I' \subseteq I$, we define its conductance $\phi$ as

$$\phi(I') = \frac{\sum_{i \in I'} \sum_{j \in (I')^c} A_{ij}}{\sum_{i \in I'} A_{i\cdot}}, \tag{7}$$

where $A_{i\cdot} = \sum_{j \in I} A_{ij}$. The numerator is known as the cut of the graph partitioned by $I'$ and its complement $(I')^c$, while the denominator represents the volume $\mathrm{vol}(I', I)$.

We note that an alternative form of conductance has $\min\{\mathrm{vol}(I', I), \mathrm{vol}((I')^c, I)\}$ in the denominator. The two forms are equivalent when the size of $I'$ is smaller than that of $(I')^c$, a condition we expect to hold for $C_{n_1}$ and its neighborhood, given $n_1$ is small compared with $N$. Hence we choose the form in Eq (7) for easier bookkeeping.
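The conductance in Eq (7) is the number of edges leaving the set divided by the total degree of the set. A minimal sketch for an undirected graph stored as neighbour sets (the function name is illustrative):

```python
def conductance(adj, cluster):
    """Cut between `cluster` and its complement, divided by the cluster volume.

    adj: dict node -> set of neighbours (undirected, no self-loops).
    """
    members = set(cluster)
    volume = sum(len(adj[i]) for i in members)  # sum of degrees inside the set
    if volume == 0:
        raise ValueError("the cluster has zero volume")
    cut = sum(1 for i in members for j in adj[i] if j not in members)
    return cut / volume
```

On two triangles joined by a single edge, taking one triangle as the cluster gives a cut of 1 and a volume of 7, so the conductance is 1/7, whereas a single node of that triangle has conductance 1.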

Proposition 1 demonstrates that the aPPR vector sorts the nodes in terms of their relevance to the target community with high probability. The sorted list of nodes leads to a sequence of clusters $\{C_n\}_{n=1}^N$ and their conductance values $\{\phi(C_n)\}_{n=1}^N$. Our next theorem establishes that the correct choice of $n$ occurs at a local optimum along this sequence, justifying the practice of choosing the community size cutoff by inspecting the conductance plot.

Then for sufficiently large N, there exists n′ with n′ − n_1 = Ω(N) such that

uniformly for n ∈ [n′].

The proof of Theorem 1 has two major parts. The first part in Supplementary Materials B.2 analyzes the optimality properties of φ at the population level with the help of the result in Proposition 1. The second part in Supplementary Materials B.3 incorporates noise from the adjacency matrix and proves the local optimality result in the theorem. We make two remarks as follows.

a) The bound in Eq (9) and the lower bound on n′ − n_1 guarantee that the optimum at n_1 is well separated from its neighborhood and that this neighborhood is wide enough to be observed in a conductance plot.

b) As can be seen from the proofs in Supplementary Materials B.2, the larger the gap between S_11 min_{i>1} S̃_i· and 2 max_{j>1} S_1j S̃_1·, the more peaked and easier to spot the local optimum is. In an assortative graph with S_ii > max_{j≠i} S_ij, a smaller τ_1 will lead to a smaller S̃_1· in Eq (8), making the inequality more easily satisfied. Thus using conductance as an objective function is well suited to the situation where n_1 is a small fraction of N.

Algorithm 1: Local clustering
Input: adjacency matrix A, preference vector π, and teleportation constant α.
1. Compute the aPPR vector p* in Eq (5) based on (A, π, α).
2. Construct the sequence of clusters {C_n}_{n=1}^N according to Eq (6) and p*.
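The procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: we assume the aPPR vector is the personalized PageRank vector divided by node degree, that C_n consists of the n nodes with the largest aPPR values, and that the graph is undirected with no self-loops and no isolated nodes. The function names and the toy graph are hypothetical.

```python
import numpy as np

def appr_vector(A, pi, alpha, n_iter=200):
    """Degree-adjusted personalized PageRank by power iteration (assumed form)."""
    A = np.asarray(A, dtype=float)
    pi = np.asarray(pi, dtype=float)
    d = A.sum(axis=1)              # node degrees, assumed positive
    P = A / d[:, None]             # row-stochastic random-walk matrix
    p = pi.copy()
    for _ in range(n_iter):
        p = alpha * pi + (1.0 - alpha) * (p @ P)
    return p / d

def conductance_sweep(A, scores):
    """Sort nodes by decreasing score and return the conductance of each prefix."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    order = np.argsort(-scores, kind="stable")
    in_set = np.zeros(A.shape[0], dtype=bool)
    cut, vol = 0.0, 0.0
    phis = np.empty(A.shape[0])
    for n, v in enumerate(order):
        w_in = A[v, in_set].sum()  # weight from v to nodes already in the cluster
        cut += d[v] - 2.0 * w_in   # those edges leave the cut, the others join it
        vol += d[v]
        in_set[v] = True
        phis[n] = cut / vol        # conductance of the cluster of size n + 1
    return order, phis

# Toy graph (hypothetical): two triangles joined by the single edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
pi = np.zeros(6)
pi[0] = 1.0                        # single seed node 0

order, phis = conductance_sweep(A, appr_vector(A, pi, alpha=0.15))
print(order, np.round(phis, 3))
```

On this toy graph the conductance sequence has a local minimum at cluster size 3, which is the triangle containing the seed. The size cutoff is read off the sequence `phis` in the same way as from a conductance plot.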

In this section, we examine the performance of our local clustering procedure as summarized in Algorithm 1 on data simulated from the DC-SBM. We focus on the case where the target community (block 1) is small compared with the whole graph as we consider it to be more relevant to our real data structure.

Consider a DC-SBM with K = 2, n_1 = 50, n_2 = 3000, and

B = [ 0.05  0.01 ]
    [ 0.01  0.05 ].

To simulate the degree parameters, let η_i ∼ Uniform(1, 10) for i = 1, . . . , n_1 + n_2, and

θ_i = n_1 η_i / Σ_{j=1}^{n_1} η_j,   i = 1, . . . , n_1;
θ_i = n_2 η_i / Σ_{j=n_1+1}^{n_1+n_2} η_j,   i = n_1 + 1, . . . , n_1 + n_2,

so that they satisfy the identifiability constraint on θ_i. We investigate the effect of the teleportation constant α and the number of seed nodes, denoted m, on the accuracy of the local clustering results. When using m seed nodes, the corresponding preference vector has π_i = 1/m for i ∈ [m], and π_i = 0 for the other entries. To determine the community size, we search for the first obvious local minimum in the conductance plot, and we find these optimal points usually occur at n < 55. Supplementary Figure S5 provides examples of the conductance plots for α = 0.15 and different m values; the cases for other α are similar.

Table 3 shows the average precision and recall rates and their standard deviations for finding members in block 1 with α = 0.15 and five seed counts (1, 5, 10, 15, 20); each setting is repeated for 50 simulations. We can see that the precision increases as the number of seeds increases, since more seeds provide more initial information for the clustering. More seeds also reduce the variance of recall and increase the mean recall, though by a smaller margin. On the other hand, the influence of α is rather minimal. The results from different α values (0.05, 0.25) are presented in Supplementary Table S6. In Figure 6, we further illustrate the distributions of these precision and recall values for the case α = 0.15. For all the case studies from the citation data in the next section, we set α = 0.15 as in Chen et al. [16].
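The simulation setup can be sketched as follows. This is a hedged illustration rather than the authors' code: we assume undirected graphs without self-loops, independent Bernoulli edges with probability θ_i θ_j B_{g_i g_j}, and the block-wise normalization of θ described above. The helper names `simulate_dcsbm` and `precision_recall` are hypothetical, and the demo uses smaller block sizes than the paper's n_1 = 50 and n_2 = 3000 to keep it light.

```python
import numpy as np

def simulate_dcsbm(sizes, B, rng, eta_low=1.0, eta_high=10.0):
    """Draw one undirected DC-SBM graph (assumed edge model, see lead-in)."""
    sizes = np.asarray(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)   # block label of each node
    eta = rng.uniform(eta_low, eta_high, size=labels.size)
    theta = np.empty_like(eta)
    for k, n_k in enumerate(sizes):
        idx = labels == k
        theta[idx] = n_k * eta[idx] / eta[idx].sum()   # theta sums to n_k in block k
    prob = np.outer(theta, theta) * np.asarray(B)[np.ix_(labels, labels)]
    prob = np.clip(prob, 0.0, 1.0)
    upper = np.triu(rng.random(prob.shape) < prob, k=1)  # strictly upper triangle
    A = (upper | upper.T).astype(float)                  # symmetrize, no self-loops
    return A, labels, theta

def precision_recall(found, truth):
    """Precision and recall of a recovered node set against the true block."""
    found, truth = set(found), set(truth)
    tp = len(found & truth)
    return tp / len(found), tp / len(truth)

rng = np.random.default_rng(0)
B = [[0.05, 0.01], [0.01, 0.05]]
A, labels, theta = simulate_dcsbm([20, 200], B, rng)
print(A.shape, theta[labels == 0].sum(), theta[labels == 1].sum())
```

A recovered cluster can then be scored against block 1 with `precision_recall(cluster, np.flatnonzero(labels == 0))`.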

For the sake of completeness, in Supplementary Materials C.2, we compare the local clustering procedure with commonly used global clustering techniques including spectral clustering and SCORE [34] for different values of n 1 . As expected, local clustering is better suited to the situation with smaller n 1 .

Next we apply local clustering to our citation data and use the procedure to find the most relevant statistical research areas for given external topics. We choose three external topics (single-cell transcriptomics, labor economics and flu) of high general interest spanning biology, economics and epidemiology, and discuss the results in detail. More examples of topics and their clustering results can be found in Supplementary Materials C.4.

Before applying Algorithm 1, it remains to describe the selection of topic papers I_t and the construction of the preference vector π. (Recall that the adjacency matrix used here is A_s as described in Eq (2).) For each external topic, papers in I_t are chosen by keyword searches among the citing papers. More concretely, for the topics of single-cell transcriptomics and labor economics, we find citing papers that contain the relevant keywords in their abstracts. For a more accurate search result, we further restrict the labor economics papers to the category SOC using the labels in Section 3.1. The single-cell papers can come from a more diverse set of categories, and as shown in Supplementary Figure S8a

Supplementary Figure S9 contains the conductance plot for each topic and our choices of the local minimum. The size of the target community found for each topic is listed in Table 4. We can see that these subnetworks indeed have significantly denser connections (and in some cases, higher clustering coefficients) than the whole network. The subnetworks and the word clouds generated from the keywords of the subnetwork papers can be found in Figure 7. We discuss these in more detail below, interpreting the results with our understanding of the topics.

Table 4: Summary statistics for the subnetworks in Figure 7 compared with the global graph A_s.

Rapid advances in single-cell sequencing technologies in the past decade have enabled researchers to profile different aspects of an individual cell, in particular its transcriptome. After appropriate preprocessing, a single-cell transcriptomic dataset usually takes the form of a large, sparse matrix, with tens of thousands of rows representing genes and columns representing cells. The sparse, noisy and heterogeneous nature of such data has proved a fertile ground for the development of statistical and computational methods (see e.g. [35] for a review). Inspecting the subnetwork and word cloud in Figure 7a, perhaps unsurprisingly, a significant fraction of the papers selected are concerned with multiple testing and connected to the hub node 79 [31]. As an example, multiple testing is routinely performed in the analysis of single-cell RNA-seq (scRNA-seq) data for identifying differentially expressed genes, which involves applying a statistical test to a large number of genes to determine if their expression levels are significantly different between two sets of cells. The word cloud also suggests clustering as another main keyword; in the subnetwork, clustering is a topic shared by the set of papers tightly knit around nodes 35 [36] and 78 [37]. In the analysis pipeline of scRNA-seq data, clustering is applied to a dimension-reduced scRNA-seq matrix to identify distinct subpopulations of cells, which can correspond to different cell types or states. The related feature selection and model selection problems are highly relevant in this context, as they help researchers determine genes (features) that distinguish these subpopulations and the total number of subpopulations observed.
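To make the multiple testing step concrete, the following is a minimal sketch of the step-up procedure for controlling the false discovery rate from the paper cited as [31], applied to a vector of per-gene p-values. The function name, the level q and the p-values are made up for illustration; this is not part of the analysis in this paper.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean array marking which hypotheses are rejected at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of p-values, smallest first
    thresholds = q * np.arange(1, m + 1) / m     # step-up thresholds q * k / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()          # largest rank meeting its threshold
        reject[order[: k + 1]] = True            # reject everything up to that rank
    return reject

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05))
```

In practice the number of genes tested is in the thousands, which is why control of the false discovery rate rather than per-test error is the usual target.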

Labor economics aims to understand the functioning and dynamics of the markets for wage labor. Many fundamental questions in this subject (How does education affect income? How does healthcare affect income?) are causal in nature. Economists and governments would like to design policies that might achieve certain economic and social welfare goals based on causal analysis. Randomized controlled trials (RCTs) are usually not available for labor economics problems. Therefore, it is not surprising to see that an overwhelming majority of the statistics papers selected in the subnetwork and word cloud in Figure 7b are in the realm of causal inference. Concretely, in the word cloud, the frequently appearing keywords (minus "test") are all technical terms in causal inference: "propensity score", "instrumental variable", "structure model", "matched sampling", "treatment effect", "matching", and "observational study". Notably, node 78 [24], a hub in the subnetwork, links the structural equations framework in econometrics and the potential outcomes framework in statistics. The paper provides conditions for a causal interpretation of the instrumental variable (IV) estimand, and quantifies the bias resulting from violations of the critical assumptions. Moreover, many cited statistics papers (node 16 [38] and node 18 [39]) are rather recent, and they also appear in the subnetwork. This coincides with the recent surge of the study of causal inference in the statistical community in the last few years and offers some evidence that the new developments quickly penetrate into other research fields.

The global pandemic of Covid-19 has further ignited wide research interest in the modeling and prediction of the spread of an epidemic. We choose flu as an example of epidemics due to its longer history of study and frequent appearance in the literature of epidemiology. (The results from using Covid-19 as the topic are presented in Supplementary Materials C.4.) As expected, many of the keywords in Figure 7c are related to stochastic processes and state-space modeling. The word MCMC appears most often, being a commonly used technique for parameter estimation in these epidemic models. Looking more closely at the subnetwork, many of the papers focus on refining the susceptible-infectious-recovered (SIR) model for infectious diseases including flu and SARS. For the two hub nodes 28 [40] and 3 [41], the former is concerned with the parameter estimation problem for different types of observed data, while the latter extends the SIR model by incorporating an incubation stage and time dynamics to track the spread of flu.

In this paper, we study the citation network arising from selected statistical papers in the past two decades, a period coinciding with the rise of Big Data and statistics being perceived to play increasingly important roles in many scientific disciplines. Unlike previous studies on statistics citation networks, we focus on the connections between statistics and other disciplines and use citation data to investigate the external influence of various statistical works.

First performing descriptive analysis, we show that both the overall volume of citations and the diversity of citing fields have been increasing over time for all the journals considered. Even typical theoretical journals such as AOS have been attracting a significant proportion of external citations in recent years, which is quite encouraging. Next, by distinguishing between internal and external citations, we identify research areas in statistics that have high impact under both criteria. The most highly cited papers are ranked high both internally and externally. On the other hand, papers with a large number of external citations but relatively fewer internal citations can point to areas where future development in relevant theory and methods may be rewarded by immediate visibility outside statistics. Lastly, using the technique of local clustering, we identify the statistical research communities most relevant to various external topics of interest. Under the DC-SBM, we prove the combination of aPPR and conductance selects all nodes in the target community with high probability. We demonstrate the performance of the algorithm using simulated data, examining its stability with respect to the number of seeds and the teleportation constant. Presenting a number of case studies using external topics of high general interest, we show that the communities selected align well with our intuition and understanding of the topics.

Our study takes the first step toward understanding the influence of statistical works on other disciplines that use tools and methods from statistics to aid their discoveries. The data we have collected can be of independent interest, opening opportunities for further modeling and analysis from different perspectives. We also note that some of the limitations in our current study can be addressed by expanding the scope of the data. For example, in analyzing the trend of diversity of citing fields, it would be ideal to collect information about the number of published papers in each citing field and include it as a normalization factor. The data could also be expanded to include more journals and other types of source publications, such as conferences and books, over a longer period of time to allow for a more comprehensive historical view and richer analysis. We leave the collection and analysis of these more extensive data as future work.

Compared with global clustering, the theoretical properties of local clustering techniques are less well characterized under generative network models. Our application and theoretical results of local clustering can be extended to incorporate mixed membership modeling and temporal changes in the evolution of communities. We have currently used textual data (e.g., keywords) as a way to validate the target communities found; it would be more interesting to include such data as covariates in the network model subject to clustering analysis.

We end the discussion by acknowledging the limitations of citation itself as a form of data measuring intellectual influence, some of which have already been pointed out in previous studies [5, 6]. Not all citations carry the same weight: a paper could be mentioned just in the literature review or serve as the foundation that inspired the paper citing it; arguably the latter type of citation is more important. Citations are not always attributed to the correct source, and the modern style of research, which relies on search engines such as Google, is likely to be biased toward papers that already have high citation counts. Many data scientists and practitioners in industry do not necessarily publish their works but can still make use of ideas and tools in statistical papers, resulting in missing citations. Nevertheless, despite these limitations, citation data provide a useful and necessary first passage into investigating the intellectual influence of scientific works.

Under the DC-SBM, Chen et al. [16] constructed the "block-wise" population version of aPPR vector p * ∈ R K and proved that when there is a single seed node in block 1,

The separation between block 1 and the other blocks is defined as ∆ α ∈ [0, 1],

Note that ∆_α is an increasing function of α. This separation together with appropriate concentration analysis allowed them to show in their Corollary 1 that the sample aPPR vector can consistently recover all the nodes in block 1 given the correct size cutoff.

The following property of p * can be easily derived from the linearity of PPR vectors in general, p * (ω 1 π 1 + ω 2 π 2 ) = ω 1 p * (π 1 ) + ω 2 p * (π 2 ), where ω i ≥ 0 and ω 1 + ω 2 = 1 .
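The linearity property can be checked numerically at the node level. The sketch below is an illustration only: it uses the ordinary PPR vector on a small hypothetical graph rather than the block-wise population vector, and it assumes the standard fixed-point definition p^T = α π^T + (1 − α) p^T P with P the random-walk transition matrix. Dividing each entry by the node degree, as in aPPR, is a fixed linear map and therefore preserves the property.

```python
import numpy as np

def ppr(A, pi, alpha):
    """Personalized PageRank vector solving p^T = alpha*pi^T + (1-alpha)*p^T P."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    P = A / d[:, None]                          # row-stochastic transition matrix
    M = np.eye(A.shape[0]) - (1.0 - alpha) * P
    # p^T M = alpha * pi^T  is equivalent to  M^T p = alpha * pi
    return np.linalg.solve(M.T, alpha * np.asarray(pi, dtype=float))

# Hypothetical graph: a ring of 8 nodes with two chords.
N = 8
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
A[0, 4] = A[4, 0] = 1.0
A[1, 5] = A[5, 1] = 1.0

pi1, pi2 = np.eye(N)[0], np.eye(N)[1]          # two single-seed preference vectors
w1, w2 = 0.3, 0.7                              # nonnegative weights summing to one
lhs = ppr(A, w1 * pi1 + w2 * pi2, alpha=0.15)
rhs = w1 * ppr(A, pi1, alpha=0.15) + w2 * ppr(A, pi2, alpha=0.15)
print(np.allclose(lhs, rhs))
```

The check works because the PPR vector is the solution of a linear system whose right-hand side is the preference vector.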

This property enables us to extend Corollary 1 in Chen et al. [16] to the setting with multiple seed nodes in a straightforward way.

Proof of Proposition 1. We first check that their assumption max_{i∈I} d_i / min_{i∈I} d_i < c_0 for some constant c_0 holds under our assumption (c.1). We have

By assumption (c.1),

Since S is a fixed matrix, max_{i∈I} d_i / min_{i∈I} d_i is bounded above.

It remains to show that the inequality (S.1) holds for multiple seed nodes from the same block. Without loss of generality, we consider two seed nodes v_1 = 1 and v_2 = 2 from block 1, with corresponding preference vectors π_1 = (1, 0, 0, . . . , 0) and π_2 = (0, 1, 0, . . . , 0). When ω_1 + ω_2 = 1 and ω_i ≥ 0, ω_1 π_1 + ω_2 π_2 can be considered as a preference vector containing two seed nodes from the same block. Now Eq (S.1) applies to π_1 and π_2 separately, that is,

p*_1(π_1) > max{p*_k(π_1) | k = 2, . . . , K} and p*_1(π_2) > max{p*_k(π_2) | k = 2, . . . , K}.   (S.5)

By Eq (S.3) and Eq (S.5), we have

p*_1(ω_1 π_1 + ω_2 π_2) = ω_1 p*_1(π_1) + ω_2 p*_1(π_2)
  > ω_1 max{p*_k(π_1) | k = 2, . . . , K} + ω_2 max{p*_k(π_2) | k = 2, . . . , K}
  ≥ max{ω_1 p*_k(π_1) + ω_2 p*_k(π_2) | k = 2, . . . , K}
  = max{p*_k(ω_1 π_1 + ω_2 π_2) | k = 2, . . . , K}.   (S.6)

The rest of the proof is the same as that of Corollary 1 in Chen et al. [16] .

In this section, we analyze the optimality properties of the conductance function under the population version before we present the sample version in the next section. Such a technique has been widely used in a number of works (e.g., Bickel and Chen [42] and Zhao et al. [43] ); our case mostly differs in the construction of the confusion matrix and analysis of the population version of the objective function.

A major difference between the previous works and our analysis is that they aim to recover all the blocks, whereas we are only concerned about the target block (block 1). For a given cutoff set C n in Eq (6) (n ∈ [N ]), which essentially partitions all the nodes I into two sets, we consider the label assignment function z = h(C n ). More concretely,

for each node u ∈ I. In other words, we merge blocks 2, . . . , K into one block and collectively call them block 2. Therefore, the correct assignment z_0 (as far as block 1 is concerned) should have z_0(u) = 1 for u ∈ I_1, and z_0(u) = 2 for u ∉ I_1. Given the aPPR vector p* and the corresponding sequence {C_n}_{n∈[N]}, denote

which is the set of all possible labels generated from {C_n}_{n∈[N]}. We have |ζ| = N.

Recall that Proposition 1 establishes the aPPR vector recovers block 1 with high probability when n = n 1 , i.e., C n 1 = I 1 . We will show that under this high probability event, φ(C n 1 ) is a local minimum by analyzing the neighborhood around n 1 . It is easy to see this event also implies the following property for the set C n ,

In other words, all the nodes in C n are from block 1 when the cutoff n < n 1 ; C n is exactly block 1 when n = n 1 ; and the whole block 1 is contained in C n when n > n 1 .
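The nesting property has a simple consequence that is used later in the proofs: under the event C_{n_1} = I_1, the number of nodes whose label under z = h(C_n) differs from z_0 equals |n − n_1|. The sketch below checks this on a small hypothetical ordering; the function name and the ordering are illustrative and not from the paper.

```python
import numpy as np

def labels_from_cutoff(order, n):
    """z = h(C_n): label 1 for the first n nodes in the sorted order, 2 otherwise."""
    z = np.full(len(order), 2)
    z[np.asarray(order)[:n]] = 1
    return z

N, n1 = 12, 4
# Hypothetical aPPR ordering in which the block-1 nodes 0..3 come first,
# i.e. the event C_{n1} = I_1 holds.
order = np.array([2, 0, 3, 1, 7, 5, 11, 4, 6, 10, 8, 9])
z0 = np.where(np.arange(N) < n1, 1, 2)         # correct labels for block 1 vs the rest

# Number of mislabeled nodes for each cutoff n = 1, ..., N.
dist = [int((labels_from_cutoff(order, n) != z0).sum()) for n in range(1, N + 1)]
print(dist)
```

Each step away from n_1 in either direction mislabels exactly one more node, so the label error grows linearly in |n − n_1|.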

For clarity of description, we first study the properties of φ under the SBM with K blocks. Following the convention, let ‖z − z_0‖_1 = Σ_{u=1}^N 1{z(u) ≠ z_0(u)}, and for 1 ≤ a, b ≤ 2,

Define the confusion matrix R ∈ [0, 1]^{2×K},

where g denotes the correct labels as introduced in Section 4.1.1. Let R abbreviate R(z, g), and RSR^T abbreviate R(z, g) S R(z, g)^T. Note that R^T 1 = τ, where τ_k = |I_k|/N is the proportion of nodes in block k. Let μ_N = N^2 ρ_N, then

For convenience, for a general 2 × 2 matrix M , define

We immediately have

Moreover, we write G(z) = F(RSR^T(z)), which is the population version of Eq (S.12) and only depends on z. The following lemma shows that z_0 is a well separated local optimum in a suitable neighborhood defined around C_{n_1}.

Recall that we are working under the event C n 1 = I 1 so that Eq (S.8) holds. 

Proof of Lemma 1. We have

According to (S.8) , we consider the following cases.

Case 1: n = n_1. Then C_n = I_1 (i.e., {u | z(u) = 1} = {u | g(u) = 1}) and z = z_0. Note that R_1j = 0 for j ≠ 1, R_21 = 0, and R_11 = τ_1. By Eq (S.13),

Case 2: n < n_1. By Eq (S.8), C_n ⊊ I_1, that is, {u | z(u) = 1} ⊊ {u | g(u) = 1}. It follows that R_1j = 0 for j ≠ 1, and

(S.14)

We have

Therefore,

Case 3: n > n_1. By Eq (S.8), C_n ⊋ I_1, that is, {u | z(u) = 1} ⊋ {u | g(u) = 1}. Hence R_21 = 0, R_11 = τ_1, and

We have

Substituting,

For a small but fixed ε > 0 (to be specified later), by choosing n′ satisfying n′ − n_1 = Nε, we can restrict max_{j>1} R_1j ≤ ε for all N > 1/ε and n ∈ [n′] according to Eq (S.15). Then we have

According to assumption (c.3), there exists a constant c > 0 such that

Therefore,

The same result can be shown for a DC-SBM with K blocks by defining a confusion tensor, adding a dimension for the nodes; all the other notations remain the same unless otherwise specified. Define the confusion tensor T ∈ {0, 1/N}^{2×K×N} as

Then, define the degree-corrected confusion matrix R̃ ∈ [0, 1]^{2×K},

Let T abbreviate T(z, g), and R̃ abbreviate R̃(z, g). Now we have

The following lemma is similar to Lemma 1 but extends the result to the DC-SBM. 

Proof of Lemma 2. We have

Similar to the proof for Lemma 1, we consider the three cases in Eq (S.8).

Case 1: n = n_1. Again, we have C_n = I_1 and z = z_0, so R̃_1j = 0 for j ≠ 1, R̃_21 = 0, and R̃_11 = τ_1. By Eq (S.24), we have

Case 2: n < n_1. By Eq (S.8), C_n ⊊ I_1 (i.e., {u | z(u) = 1} ⊊ {u | g(u) = 1}), so R̃_1j = 0 for j ≠ 1, and

The last inequality holds by assumption (c.1). Also, we have

Case 3: n > n_1. By Eq (S.8), C_n ⊋ I_1 (i.e., {u | z(u) = 1} ⊋ {u | g(u) = 1}), so R̃_21 = 0, R̃_11 = τ_1, and

Similar to Eq (S.16),

By assumption (c.1),

Here R is the original confusion matrix defined in Eq (S.10). By the same argument as in Lemma 1, we can find a fixed ε > 0 such that by choosing n′ − n_1 = Nε, we can restrict max_{j>1} R_1j ≤ ε for all N > 1/ε and n ∈ [n′] according to Eq (S.15). Therefore, max_{j>1} R̃_1j ≤ U_θ ε.

Similar to Eq (S.17),

According to Eq (S.18), we have

Eq (S.27) has an upper bound similar to (S.19), that is,

N Ω(|n − n 1 |) for n ∈ {n 1 + 1, . . . , n } .

The proof of the main theorem relies on the optimality properties of the population version we derived in the previous section and the concentration inequalities in the following lemma.

and P( max

Proof. These are well-known inequalities that can be proved by Bernstein's inequality. For the sake of completeness, we present the details here. We have

Also, we have

µ N X ab is a sum of independent zero mean random variables bounded by 1. By Bernstein's inequality,

Since ε ≤ 6C S , for fixed a, b, z,

Therefore,

We have |ζ| = N, which establishes Eq (S.29). Moreover, according to Eq (S.7), we have |{z ∈ ζ : ‖z − z_0‖_1 = m}| ≤ 2, which establishes Eq (S.30). Now, we assume z(m + 1) = z_0(m + 1), . . . , z(N) = z_0(N). Then,

Based on the Bernstein inequality, we have

For ε ≤ 12mC S /N and fixed a, b, z, z 0 ,

Then,

which establishes Eq (S.31).

The proof of the main theorem combines the population version result in Lemma 2, which holds under the high probability event established in Proposition 1, and Lemma 3, which controls noise through concentration.

Proof of Theorem 1. According to Eq (S.12), our goal is equivalent to showing that there exists n′ with n′ − n_1 = Ω(N) such that

where z ∈ ζ′ and ζ′ = {z = h(C_n) | n ∈ [n′]} for λ_n satisfying (c.2). Note that ‖z − z_0‖_1 = |n − n_1| according to the definition in Eq (S.7).

The proof technique is similar to [44] . By Taylor expansion,

where ∂F ∂M is the partial derivative with respect to the vectorized M .

∂F ∂M is continuous with respect to M , so

Therefore, there exist positive constants C_1, C_2, C_3, C_4 such that

Under the high-probability event described in Eq (S.8), by Lemma 2, there exist n′ with n′ − n_1 = Ω(N) and a positive constant C_0 satisfying

The other terms can be bounded by concentration, noting that assumption (c.2) implies λ N → ∞. We write P max

By Eq (S.29), we have

According to Eq (S.31), we have

Then,

Also, by Eq (S.30), we have P max

Similar to Eq (S.38),

Again, by Eq (S.32),

Then, 

We compare the local clustering procedure with common global clustering techniques including the usual spectral clustering, SCORE [34] and the more recently proposed SCORE+ [45]. We find that SCORE+ performs better than SCORE in most of our experimental settings, so we only present the results of SCORE+ in what follows. We consider the three settings below.

Table S7: Means and standard deviations of precision and recall for local clustering, spectral clustering and SCORE+. Each setting is repeated in 50 simulations.

Setting 1: n_1 = 150. The preference vector has π_i = 1/50 for i = 1, . . . , 50, and π_i = 0 for the others. For the local clustering method, we observe that a local minimum usually occurs for n < 200. Thus we search for the minimum point in the range 1 to 200. Figure S6a is an example of a conductance plot under this setting. The local minimum is obvious at the point n = 129 in this example.

Setting 2: n_1 = 100. The preference vector has π_i = 1/30 for i = 1, . . . , 30, and π_i = 0 for the others. In this case, we search for the local minimum within the range 1 to 150. Figure S6b gives an example of a conductance plot under Setting 2. Again the local minimum is clear in the plot.

Setting 3: n_1 = 50. The preference vector has π_i = 1/20 for i = 1, . . . , 20, and π_i = 0 for the others. Here, we search for the local minimum in the range 1 to 55, as mentioned in Section 4.2.

All the other parameters (e.g., K, B and θ i ) are the same as in Section 4.2. The teleportation constant α is set to 0.15 as before. Note that the number of seeds in π decreases when the size of block 1 decreases, as we expect fewer seeds to be available for smaller community sizes.

For each setting, we calculate the precision and recall from 50 simulations and record their means and standard deviations in Table S7. All the methods have high precision and recall rates in Setting 1. However, as n_1 decreases and the two block sizes become more imbalanced in Settings 2 and 3, the performance of spectral clustering and SCORE+ becomes worse, whereas local clustering remains stable with high averages and small standard deviations. As shown in Table 3 and Table S6, which examine the effect of α and seed number under Setting 3, local clustering has slightly higher average precision and substantially higher average recall than the other two methods even when only a single seed is used. We also note that the standard deviations of local clustering are smaller than those of spectral clustering in all the settings. More detailed distributions of these precision and recall rates under the three settings can be found in the violin plots of Figure S7.

For each topic, we construct the preference vector by Eq (4). We choose the threshold t based on the citation counts from the topic papers to the source papers. For the topics "single-cell" and "labor economics", the top papers receive more than 90 citations; we set t = 10. For the topic "flu", the highest citation count is less than 90, and we set t = 5.

The conductance plot for each topic is shown in Figure S9. In most cases, there is an obvious local minimum leading to a reasonable community size. In (b), we choose the first minimum occurring at n ≥ 10 for a more plausible subnetwork size and clearer interpretation of the results.

Figure S10.

Because Covid-19 is a rapidly emerging topic of wide scientific interest and public relevance, we also apply our local clustering procedure to it. We search for papers that have the word "covid" in their abstracts and carry either the category label BIO or SOC. We treat these two sets of papers separately as we expect them to focus on different aspects of the pandemic in their studies. Note that as we finished the data collection process before December 2020, most of the research related to vaccines or new Covid-19 strains (e.g., the Delta variant) had not yet appeared. For Covid papers in BIO, a considerable number of papers found by the clustering procedure are on survival analysis. In particular, we can observe a hub at node 24 [46] in Figure S10a. This paper proposed the Fine-Gray method, which is popular in competing risk analysis (a type of survival analysis). The cause-specific hazard functions with explanatory covariates are commonly used in this type of analysis, but they often lack interpretations. As a result, clinicians prefer the cumulative incidence functions, which are the marginal probabilities of certain events. Reference [46] modeled the cumulative incidence function by a proportional hazards model, which helps analysts measure the effect of covariates. Another hub is node 39 [31], pointing to the need for multiple testing in many analyses in this field.

On the other hand, the papers found for Covid-19 SOC focus more on the societal reactions under a pandemic. In Figure S10b , we can observe a hub centered at node 10 [47] . This paper analyzed the effects of California Proposition 99 (Tobacco Tax and Health Protection Act of 1988) through the synthetic control method that is commonly used to evaluate the effect of an intervention in comparative case studies. At the beginning of the pandemic, the effect of various quarantine measures became of great public concern. Synthetic control methods are used to study the outcomes of different quarantine policies. 

[1] Weaving the fabric of science: Dynamic network models of science's unfolding structure

[2] Shorter distances between papers over time are due to more cross-field references and increased citation rate to higher-impact papers

[3] The structure of scientific collaboration networks

[4] Coauthorship and citation patterns in the Physical Review

[5] Citation patterns in the journals of statistics and probability

[6] Statistical modelling of citation exchange between statistics journals

[7] Coauthorship and citation networks for statisticians

[8] Communities from seed sets

[9] Overlapping community detection using seed set expansion

[10] Community membership identification from small seed sets

[11] A local graph partitioning algorithm using heat kernel PageRank

[12] Heat kernel based community detection

[13] Stochastic blockmodels: First steps

[14] Block models and personalized PageRank

[15] Stochastic blockmodels and community structure in networks

[16] Targeted sampling from massive block model graphs with personalized PageRank

[17] Local graph partitioning using PageRank vectors

[18] Defining and evaluating network communities based on ground-truth

[19] Local network community detection with continuous optimization of conductance and weighted kernel k-means

[20] The multivariate skew-normal distribution

[21] A nonparametric "trim and fill" method of accounting for publication bias in meta-analysis

[22] Testing the number of components in a normal mixture

[23] Reversible jump Markov chain Monte Carlo computation and Bayesian model determination

[24] Identification of causal effects using instrumental variables

[25] Least angle regression. The Annals of Statistics

[26] The control of the false discovery rate in multiple testing under dependency

[27] Model selection and estimation in regression with grouped variables

[28] Journal of the Royal Statistical Society: Series B (Statistical Methodology)

[29] A direct approach to false discovery rates

[30] Bayesian measures of model complexity and fit

[31] Controlling the false discovery rate: a practical and powerful approach to multiple testing

[32] Regression shrinkage and selection via the lasso

[33] Learning with partially absorbing random walks

[34] Fast community detection by SCORE

[35] The triumphs and limitations of computational methods for scRNA-seq

[36] Finding the number of clusters in a dataset: An information-theoretic approach

[37] Estimating the number of clusters in a data set via the gap statistic

[38] A permutation test for the regression kink design

[39] Balancing covariates via propensity score weighting

[40] Estimation in multitype epidemics

[41] Tracking epidemics with Google Flu Trends data and a state-space SEIR model

[42] A nonparametric view of network models and Newman-Girvan and other modularities

[43] Consistency of community detection in networks under degree-corrected stochastic block models

[44] Correction to the proof of consistency of community detection

[45] SCORE+ for network community detection

[46] A proportional hazards model for the subdistribution of a competing risk

[47] Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program

The authors would like to thank Dr. Tung-Yu Wu for help with the data collection process and Prof. Peter J. Bickel and Prof. Jingyi Jessica Li for many fruitful discussions. Y.X.R.W. gratefully acknowledges funding from the Australian Research Council DECRA Fellowship (DE180101252).