key: cord-0701564-s8j3ildz authors: Chung, Moo K.; Ombao, Hernando title: Lattice Paths for Persistent Diagrams with Application to COVID-19 Virus Spike Proteins date: 2021-05-01 journal: ArXiv DOI: nan sha: b6626cb792fb742e20feefe380da1cea89efd20d doc_id: 701564 cord_uid: s8j3ildz Topological data analysis, including persistent homology, has undergone significant development in recent years. However, one outstanding challenge is to build a coherent statistical inference procedure on persistent diagrams. The paired dependent data structure, which are the births and deaths in persistent diagrams, adds complexity to statistical inference. In this paper, we present a new lattice path representation for persistent diagrams. A new exact statistical inference procedure is developed for lattice paths via combinatorial enumerations. The proposed lattice path method is applied to study the topological characterization of the protein structures of the COVID-19 virus. We demonstrate that there are topological changes during the conformational change of spike proteins, a necessary step in infecting host cells. Despite its rigorous mathematical foundation developed since its introduction 20 years ago [11] , persistent homology still suffers from numerous statistical and computational problems. It has not yet become a standard tool in medical imaging. This approach has been applied to a wide variety of data including brain networks [9] , protein structures [12] , and RNA viruses [5] . However, most of these methods only serve as exploratory tools that provide descriptive summary statistics rather than formal inference. The main difficulty is due to the heterogeneous nature of topological features, which do not have a one-to-one correspondence across persistent diagrams. Motivated by these challenges, we propose a more principled topological inference procedure through lattice paths. Lattice paths are widely studied algebraic objects in combinatorics and may have potential applications in persistent homology [2, 16, 18] . Previously, they have been used to differentiate 0th and 1st Betti numbers in functional brain networks [9] . However, the method is limited to a specific type of filtration that produces only up to the 1st Betti numbers. Here, we propose a procedure that is generalizable to arbitrary persistent diagrams. The main advantage of the lattice path approach is that it provides probabilistic statements about the similarity of two persistent diagrams. This is often needed to produce some baseline quantitive measure, such as the p-value, commonly used in biomedical research [7, 9] . Existing methods for computing p-values usually rely on approximate resampling techniques: jackknife [8] , bootstrap [1] and the permutation test [9] . Right: Spike proteins of the three different corona viruses. The spike proteins consist of three similarly shaped interwinding substructures identified as A (blue), B (red) and C (green) domains. Our approach here is different because it is analytic and thus compute the exact probability . The main contributions of this study are the following: (1) a new data representation via Dyck and lattice paths; (2) the first analytic approach for computing probabilities without resampling; (3) the first topological study on the shape of COVID-19 virus spikes proteins. The proposed lattice path method was use din differentiating the conformational changes of the COVID-19 virus spike proteins (Figure 1 ). This demonstration is particularly relevant due to the potential for advancing vaccine development [19, 4] . High dimensional objects, such as proteins and molecules, can be modeled as a point cloud data V consisting of p number of points (atoms) indexed as V = {1, 2, · · · , p}. Suppose that the distance ρ ij between points i and j satisfies the metric properties. For proteins, we can simply use the Euclidean distance between atoms in a molecule. Then X = (V, ρ), ρ = (ρ ij ) is a metric space where we can build a filtration necessary for persistent homology. If we connect points following some criterion on the distance, they will form a simplicial complex which will follow the topological structure of the molecule [10, 15, 21] . Note that the k-simplex is the convex hull of k + 1 points in V . A simplicial complex is a finite collection of simplices such as points (0-simplex), lines (1-simplex), triangles (2-simplex) and higher dimensional counterparts. In particular, the Rips complex X is a simplicial complex, whose k-simplices are formed by (k + 1) points which are pairwise within distance [13] . While a graph has at most 1-simplices, the Rips complex has at most (p − 1)-simplices. The Rips complex induces a hierarchical nesting structure called the Rips filtration X 0 ⊂ X 1 ⊂ X 2 ⊂ · · · for filtration values 0 = 0 < 1 < 2 < · · · . The filtration is quantified through k-cycles where 0-cycles are the connected components, 1-cycles are loops while 2-cycles are a 3-simplices (tetrahedron) without interior. The Betti numbers β k , count the number of independent k-cycles. During the Rips filtration, the i-th k-cycles are born at filtration value b i and die at d i . The collection of all the paired filtration values {(b 1 , d 1 ), · · · , (b q , d q )} displayed as 1D intervals is called the barcode and displayed as scatter points in 2D plane is called the persistent diagram. Since b i < d i , the scatter points in the persistent diagram are traditionally displayed above the line y = x line by taking births in the x-axis and deaths in the y-axis. Dyck paths. The first step in the proposed lattice path method is to sort the set of all the birth and death values in the filtration as an order statistic c : c (1) < c (2) < · · · < c (2q) , where c (i) is one of the birth or death values. The subscript (i) denotes the i-th smallest value. We will simply call such sequence as the birth-death process. Every possible valid sequence of birth and death values can be viewed as forming a probability space, where each valid sequence is likely to happen with equal probability. During the filtration, the sequence of birth and death occurs somewhat randomly but still maintains a specific pairing structure. There exists a one-to-one relation between the ordering information and Dyck paths if we identify births with and deaths with [2, 16] . If we trace the arrows, we obtain the Dyck path ( Figure 2 ) [18] . A valid Dyck path always starts at y = 0 and ends at y = 0. At any moment during the filtration, a Dyck path cannot go below y = 0. The total number of Dyck paths is called the Catalan number κ p = 1 q+1 2q q . The first few Catalan numbers are κ 1 = 1, κ 2 = 2, κ 3 = 5 and κ 4 = 14. More rapid changes in the direction of Dyck paths imply more fleeting fluctuations indicative of smaller topological signals. Less fluctuations indicate larger persistence and thus larger topological structures. The first path in Figure 2 has larger persistence while the last path has smaller persistence. Lattice paths. If we rotate the Dyck paths clockwise at 45 • and flip vertically, we obtain equivalent monotone lattice paths consisting of a sequence of → (uparrow) and ↑ (downarrow). Figure 2 displays corresponding monotone lattice paths between (0, 0) and (q, q) in Z 2 where the path does not pass above the diagonal line y = x [18] . During the filtration, there cannot be more deaths than births and thus the path must lie below the diagonal line. The total area below Dyck paths can be used to quantify the Dyck paths [6] . Since Area below Dyck path = q 2 2 − total area of boxes below lattice path, we will simply use lattice paths for quantification. If we tabulate how the area of boxes change over the x-coordinate in a lattice path, it is monotone. In the first path in Figure 2 , the number of boxes below the first and the last lattice paths are (0, 0, 0, 0) and (0, 1, 2, 3). The area below the path is related to persistence. A barcode with smaller persistences (last path in Figure 2 ) will have more boxes (dark gray boxes) while longer persistences will have fewer boxes (first path in Figure 2 ). Given the sequence of heights of piled-up boxes, we can recover the corresponding lattice path by tracing the outline of boxes. We can further recover the original pairing information about births and deaths. Remarks. In the situation where the birth values are identical at −∞, persistent diagrams line up vertically as (0, d (i) ) and hence we simply augment them as . For instance, the Rips filtration for 0-cycles line up vertically. Weighted lattice paths. The lattice and Dyck path representations only encode the ordering information about how births and deaths are paired, and do not encode the actual filtration values. This is remedied by adaptively weighting the length of arrows in lattice paths. We sort the set of birth values b i and death values d i as the order statistics: We start at origin (0, 0). When we encounter a birth b (i) , we take the horizontal step to b (i) . When we encounter a death d (i) , take the vertical step to d (i) ( Figure 2 ). The weighted lattice paths contains the same topological information as the original persistent diagram. Using the weighted lattice paths, we map a birthdeath process to a monotone function φ: There exists a one-to-one map from a birth-death process to a monotone function φ with φ(0) = 0 and φ(1) = q. Proof. We explicitly construct such a function φ. Consider the sequence of heights (or death values) as we traverse the weighed lattice path: h : h 1 ≤ h 2 ≤ · · · ≤ h q . The heights may not strictly increase (Figure 2-right) . If births occurs r times sequentially in the birth-death process, the heights h will have r repeated identical death values d (i) as a subsequence. To make the subsequence strictly increasing, we simply add δ(0, 1, 2, · · · , r − 1)(d (i+1) − d (i) ) to the repetition for δ ≤ 1 r (Figure 3 for example). Denote the transformed sequence as h = (h 1 , · · · , h q ). φ(t) is given as a step function . From φ(t), the original sequence h and the original birth-death process can be recovered exactly. Such map from a birth-death process to φ is one-to-one. Remarks. Note h − h 2 → 0 as δ → 0. So by making δ as small as possible, we can construct a strictly monotone h to be arbitrarily close to h. The normalized step function φ(t)/q can be viewed as an empirical cumulative distribution and many statistical tools for analyzing distributions can be readily applied. Figure 4 -bottom displays the lattices paths and the normalized step functions of 1-cycles corresponding to the spike proteins used in the study. The proposed lattice path method uses φ(t)/q as a new topological functional descriptor -which is one of the main contributions in this paper. Exact topological inference. The task now is to develop a method for testing the topological equivalence of two birth-death processes: where each element is either a birth or death. Let φ 1 and φ 2 be the step functions corresponding to C 1 and C 2 . The topological distance will be used as the test statistic for testing the equivalence of C 1 and C 2 . The normalizing denominators q 1 and q 2 ensures that the value of step functions are in [0, 1]. The statistic D(φ 1 , φ 2 ) is the upper bound of area difference under φ 1 (t)/q 1 and φ 2 (t)/q 2 : Theorem 2. Under the null hypothesis of the equivalence of C 1 and C 2 , Proof. The statement can be proved similarly as the combinatorial construction of the Kolmogorov-Smirnov test that also uses the lattice path enufmeration [3, 14, 9] . First, we combine monotonically increasing sequences C 1 and C 2 and sort them into a bigger monotone sequence of size q 1 + q 2 . Then , we represent the combined sequence as the sequence of → and ↑ respectively depending on if they are coming from C 1 or C 2 . Under the null, there is no preference and they equally likely come from C 1 or C 2 . If we follow the sequence of arrows, it forms a monotone lattice path from (0, 0) to (q 1 , q 2 ). In total, there are q1+q2 q1 possible equally likely lattice paths that forms the sample space. This is equivalent to the total number of permutations in a two-sample test. From Theorem 1, the values of φ 1 (t) and φ 2 (t) are integers from 0 to q. Thus, (φ 1 (t), φ 2 (t)) are the (x, y)-coordinates in lattice paths. Note Such points are all the lattice points satisfying within the band bounded by two lines y/q 2 = x/q 1 ± d (dotted red lines in Figure 3 ). Thus, the probability of the event D < d is equal the number of valid paths over the every possible paths: where A u,v is the total number of valid paths from (0, 0) to (u, v). Since there are only two paths leading to (u, v), we write If there is no constraint, A u,v = u+v u . However, due to the constraint, A u,v has to be iteratively computed with the boundary condition A u,0 = A 0,v = 1 for all u and v. For numerical implementation, we start from the bottom left corner (0, 0) and move toward the top right corner (q 1 , q 2 ) in solving the recursion numerically: Computing A q1,q2 iteratively requires at most q 1 · q 2 operations while the permutation test will cause a computational bottleneck for large q 1 and q 2 . Thus, the proposed lattice path method computes the exact p-value substantially faster than the permutation test. Since most protein molecules consist of thousands of atoms, q 1 and q 2 should be sufficiently large to apply the asymptotic: The result follows by treating φ 1 /q 1 and φ 2 /q 2 as the empirical cumulative distribution functions and using the asymptotic expansion of the KS-test [14, 17] . Subsequently, the p-value under null is given by The proposed lattice path method is used to study the topological structure of the severe acute respiratory syndrome coronavirus 2 (SARS-Cov-2), which is often called COVID-19. Since the start of the global pandemic (approximately December 2019), COVID-19 has already caused 3.85 million deaths in the world as of June 2021. The COVID-19 virus is specific member of a much broader coronavirus family, which all have spike proteins surrounding the spherically shaped virus similar to the sun's corona. The glycoprotein spikes that bind with receptors on cells and consequently cause severe infection. The atomic structure of spike proteins can be determined through the cryogenic electron microscopy (cryo-EM) [19, 4] . Figure 1 -left illustrates spike proteins (colored red) that surrounds the spherically shaped virus. Each spike consists of three similarly shaped protein molecules with rotational symmetry often identified as A, B and C domains. The spike proteins have two distinct conformations identified as open and closed states, where the domain's opening is necessary for interfacing with the host cell's surface for membrane fusion and viral entry (Figure 1-right) . Indeed, most current vaccine efforts focus on preventing the open state from interfacing with the host cell. Hence, this line of research is of prime importance in vaccine development and therapeutics [4] . In this study, we analyzed the spikes of three different coronaviruses identified as 6VXX, 6VYB [19] and 6JX7 [20] . The 6VXX and 6VYB are the closed and open states of SARS-Cov-2 from human while 6JX7 is feline coronavirus ( Figure 1 ). All the domains of 6VXX have exactly 7604 atoms and are expected to be topologically identical. Applying the lattice path method to 1-cycles, we tested the topological equivalence of the B-domain and the A-and C-domains within 6VXX. The normalized step functions are almost identical and the observed topological distances are 0.0090 and 0.0936, which give the p-value of 1.00 each. As expected, the method concludes that they are topologically equivalent. The domain B of 6VXX is also compared against the domain B of feline coronavirus 6JX7 consisting of 9768 atoms. Since 6JX7 is not from human, it is expected that they are different. The topological distance is 0.9194 and p-values is 0.00×10 −38 confirming that the topological nature of spikes are different. This shows the biggest difference among all the comparisons done in this study. In this paper, we proposed a new representation of persistent diagrams using the Dyck and lattice paths. The novel representation enables us to perform the statistical inference combinatorially through the proposed lattice path method by enumerating every possible valid lattice paths analytically. The lattice path method is subsequently used to analyze the coronavirus spike proteins. The normalized step function φ(t)/q for all the spike proteins show fairly stable consistent global monotone pattern but with localized differences. We demonstrated the lattice path method has the ability to statistically discriminate between the conformational changes of the spike protein that is needed in the transmission of the virus. We hope that the our new representation enables scientists in their effort to automatically identify the different types and states of coronaviruses in a more principled manner. Local persistent homology based distance between maps Geometry of the space of phylogenetic trees A Kolmogorov-Smirnov test for r samples Distinct conformational states of SARS-CoV-2 spike protein Topology of viral evolution Moments of dyck paths. Discrete mathematics On the bootstrap for persistence diagrams and landscapes Persistent homology in sparse regression and its application to brain morphometry Exact topological inference of the resting-state brain networks in twins Computational topology: An introduction Topological persistence and simplification A topological measurement of protein compressibility Barcodes: The persistent topology of data Nonparametric Statistical Inference Computational topology for shape modeling Noncrossing partitions Estimate of deviation between empirical distribution functions in two independent samples Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein Cryo-EM analysis of a feline coronavirus spike protein reveals a unique structure and camouflaging glycans Topology for computing The schematic illustration of COVID-19 virus is provided by Alissa Eckert and Dan Higgins of Disease Control and Prevention (CDC), US. The proteins 6VXX and 6VYB are provided by Alexander Walls of University of Washington. The protein 6JX7 is provided by Tzu-Jing Yang of National Taiwan University. Figure 2 -left is modified from an image in Wikipedia. This study is supported by NIH R01 EB022856 and R01 EB028753, NSF MDS-2010778, and CRG from the King Abdullah University of Science and Technology (KAUST).