title: a comprehensive in silico method to study the qstr of the aconitine alkaloids for designing novel drugs
authors: wang, ming-yang; liang, jing-wei; mohamed olounfeh, kamara; sun, qi; zhao, nan; meng, fan-hao
journal: molecules
doi: . /molecules
abstract: a combined in silico method was developed to predict potential protein targets that are involved in cardiotoxicity induced by aconitine alkaloids and to study the quantitative structure–toxicity relationship (qstr) of these compounds. for the prediction research, a protein-protein interaction (ppi) network was built from the extraction of useful information about protein interactions connected with aconitine cardiotoxicity, based on nearly a decade of literature and the string database. the software cytoscape and the pharmmapper server were utilized to screen for essential proteins in the constructed network. the calcium/calmodulin-dependent protein kinase ii alpha (camk2a) and gamma (camk2g) were identified as potential targets. to obtain deeper insight into the relationship between the toxicity and the structure of aconitine alkaloids, the present study utilized qstr models built in sybyl software that possess internal robustness and high external predictivity. the molecular dynamics simulations carried out here demonstrated that aconitine alkaloids possess binding stability for the receptor camk2g. in conclusion, this comprehensive method will serve as a tool for guiding the structural modification of the aconitine alkaloids and will provide better insight into the cardiotoxicity induced by compounds with structures similar to these derivatives. the rhizomes and roots of aconitum species, a genus of the family ranunculaceae, are commonly used in the treatment of various illnesses such as collapse, syncope, rheumatic fever, joint pain, gastroenteritis, diarrhea, edema, bronchial asthma, and tumors. they are also involved in the management of endocrinal disorders such as irregular menstruation [ , ]. however, the usefulness of these components is intertwined with toxicity once they are administered to patients. several articles have recorded the misuse of aconitine medicinals and reported that such misuse can result in severe cardio- and neurotoxicity [ ][ ][ ][ ][ ]. in our past research, it was shown that the aconitine component is the main active ingredient in the roots and rhizomes of these species and is responsible for both the therapeutic and the toxic effects [ ]. this medicinal has been tested for oncological and dermatological activities: its application slowed tumor growth, cured serious cases of dermatosis, and was also found to provide postoperative analgesia [ ][ ][ ][ ]. however, a previous safety study revealed that aconitine toxicity is responsible for its restriction in clinical settings. further studies are needed to explain the cause of aconitine toxicity as well as to show whether the toxicity outweighs its usefulness. a combined network analysis and in silico study was therefore performed to gain insight into the relationship between aconitine alkaloid toxicity and aconitine structure; it has been found that the cardiotoxicity of aconitine is the primary cause of patient death.
aconitine-induced toxicity has been associated with several pivotal proteins such as the ryanodine receptors (ryr1 and ryr2), the gap junction α-1 protein (gja1), and the sodium-calcium exchanger (slc8a1) [ ][ ][ ][ ]. however, among all existing studies of aconitine medicinals, none has reported details of the specific target proteins whose binding is linked to toxicity. protein-protein interactions (ppis) participate in many metabolic processes occurring in living organisms such as cellular communication, immunological response, and gene expression control [ , ]. a systematic description of these interactions aids in the elucidation of the interrelationships among targets. the targeting of ppis with small-molecule compounds is becoming an essential step in mechanism studies [ ]. the present study was designed and undertaken to identify the critical proteins that can affect the cardiotoxicity of aconitine alkaloids. a ppi network built with the string database describes the specific physical and functional contacts established between protein molecules, derived from computational prediction, knowledge transfer between organisms, and interactions aggregated from other databases [ ]. the analysis of a ppi network is based on nodes and edges and is typically performed via cluster analysis and centrality measurements [ , ]. in cluster analysis, highly interconnected protein nodes are grouped into sub-graphs, and the reliability of the ppi network is assessed from the content of each sub-graph [ ]. centrality measurements quantify the relative weight of each protein target in the network [ ]. hence, a ppi network of protein targets related to aconitine alkaloid cardiotoxicity should enable us to find the most relevant protein for aconitine toxicity and to understand the mechanism at the network level. in our research, the evaluation and visualization analysis of the essential proteins related to cardiotoxicity in the ppi network were performed with the clusterone and cytonca plugins in cytoscape . , combined with the conventional pharmacophore matching technology built into the pharmmapper platform, to find the potential protein targets. structural modification of a familiar natural product, active compound, or clinical drug is an efficient method for designing a novel drug. the main purpose of structural modification is to reduce the toxicity of the target compound while enhancing the utility of the drug [ ]. the identification of the structure-function relationship is an essential step in drug discovery and design, and the determination of 3d protein structures is key to identifying the internal interactions in ligand-receptor complexes. x-ray crystallography and nmr were long the only accepted techniques for determining 3d protein structures. although the structures obtained by these two powerful techniques are accurate and reliable, they are time-consuming and costly [ ][ ][ ][ ][ ]. with the rapid development of structural bioinformatics and computer-aided drug design (cadd) techniques in the last decade, computational structures are becoming increasingly reliable. the application of structural bioinformatics and cadd techniques can improve the efficiency of this process [ ][ ][ ][ ][ ][ ][ ][ ][ ][ ].
the ligand-based quantitative structure-toxicity relationship (qstr) and receptor-based docking technology are regarded as effective and useful tools in the analysis of structure-function relationships [ ][ ][ ][ ]. the contour maps around the aconitine alkaloids generated by comparative molecular field analysis (comfa) and comparative molecular similarity index analysis (comsia) were combined with the interactions between ligand substituents and amino acids obtained from the docking results to gain insight into the relationship between the structure of aconitine alkaloids and their toxicity. scores from scoring functions were used to evaluate the docking results. the value-of-fit score in moe software reflects the binding stability and affinity of the ligand-receptor complexes. when screening for the most likely cardiotoxicity target, the experimental data were compared with the value-of-fit scores through the ndcg (normalized discounted cumulative gain); the better a protein's fit-score ranking agrees with the experimental data, the more likely it is to be a target of cardiotoxicity. since the pioneering paper entitled "the biological functions of low-frequency phonons" [ ] was published in , many investigations of biomacromolecules from a dynamic point of view have occurred. these studies have suggested that low-frequency (or terahertz frequency) collective motions do exist in proteins and dna [ ][ ][ ][ ][ ]. furthermore, many important biological functions in proteins and dna and their dynamic mechanisms, such as cooperative effects [ ], the intercalation of drugs into dna [ ], and the assembly of microtubules [ ], have been revealed by studying the low-frequency internal motions, as summarized in a comprehensive review [ ]. some scientists have even applied this kind of low-frequency internal motion to medical treatments [ , ]. investigation of the internal motion in biomacromolecules and its biological functions is deemed a "genuinely new frontier in biological physics," as announced in the mission of some biotech companies (see, e.g., vermont photonics). in addition to the static structural information of the ligand-receptor complex, dynamical information should also be considered in the process of drug discovery [ , ]. finally, molecular dynamics simulations were carried out to verify the binding affinity and stability of the aconitine alkaloids and the most likely target. the present study may be instrumental in our future studies of the synergism and attenuation of aconitine alkaloids and in the exploitation of their clinical application potential. a flowchart of the procedures in our study is shown in figure .
figure : the whole framework of the comprehensive in silico method for screening potential targets and studying the quantitative structure-toxicity relationship (qstr). the compounds were aligned under the superimposition of the common moiety onto template compound . the statistical parameters for the database alignment (q2, r2, f, and see) are listed in table . the comfa model with the optimal number of components presented a q2 of . , an r2 of . , an f of . , and an see of . , and the contributions of the steric and electrostatic fields were . and . , respectively. the comsia model with the optimal number of components presented a q2 of . , an r2 of . , an f of . , and an see of . , and the contributions of the steric, electrostatic, hydrophobic, hydrogen bond acceptor, and hydrogen bond donor fields were . , . , . , . , and . , respectively. these statistical results proved that the comfa and comsia qstr models of the aconitine alkaloids under database alignment have adequate predictability. the experimental and predicted pld50 values of both the training set and the test set are shown in figure ; the comfa (figure a) and comsia (figure b) models gave correlation coefficient (r2) values of . and . , respectively, which demonstrated the internal robustness and high external predictivity of the qstr models. residuals vs. leverage williams plots of the aconitine qstr models are shown in figure a,b. all standardized residuals fall between 3σ and −3σ, and the leverage values are less than h*, so the two models demonstrate good applicability and predictability. under mesh (medical subject headings), a total of articles ( from web of science and the others from pubmed) were retrieved. after selecting the cardiotoxicity-related articles and excluding repetitive ones, articles were used to extract the correlative proteins and pathways for building a ppi network in the string server.
the correlative proteins or pathways are shown in table . all proteins were taken as input proteins in the string database to find their direct and functional partners [ ], and the proteins and their partners were then imported into cytoscape . to generate the ppi network with nodes and edges (figure ).
table : proteins related to aconitine alkaloid-induced cardiotoxicity extracted from the articles.
ryr : ryanodine receptor
ryr : ryanodine receptor
gja : gap junction α- protein (connexin )
slc a : sodium/calcium exchanger
atp a : calcium transporting atpase, fast twitch
kcnh : potassium voltage-gated channel h
scn a : sodium voltage-gated channel type ,
scn a : sodium voltage-gated channel type
scn a : sodium voltage-gated channel type
scn a : sodium voltage-gated channel type
scn a : sodium voltage-gated channel type
kcnj : potassium inwardly-rectifying channel j
during the screening of the essential proteins in the ppi network, three centrality measurements (subgraph centrality, betweenness centrality, and closeness centrality) in cytonca were utilized to evaluate the weight of the nodes. after removing the central node "ac," the centrality measurements of the nodes were calculated by cytonca and documented in table s . the nodes in the top % for each of the three centrality measurements are painted with a different color in figure a. to screen for the nodes with high values of all three centrality measures, the nodes of the three colors were overlapped and merged into sub-networks in figure b. in the sub-networks, the voltage-gated calcium and sodium channels accounted for a large proportion, which is consistent with our clustering of the network (clusters , , and ). all proteins in the sub-networks were combined with the prediction results of the pharmmapper server to obtain the potential targets of the cardiotoxicity induced by aconitine alkaloids (figure a,b). in this way, v o (camk2g) and vz (camk2a) were identified as the potential targets with the highest fit scores.
all compounds were docked into the three potential targets. the values of the ndcg are shown in table . the docking studies gave ndcg values of . and . , respectively (the detailed docking results are shown in table s ), which proves that the docking result for v o is the most consistent with the experimental pld50, so the protein v o was used for the ligand interaction analysis.
table : ranking results by experimental and predicted pld50 and fit score (columns: experimental pld50, fit score (v o), fit score (vz ), and ndcg).
the 3d-qstr contour maps were utilized to visualize the information on the comfa and comsia model properties in three-dimensional space. these maps highlight the characteristics of the compounds that are crucial for activity and display the regions around the molecules where variations in activity are expected based on changes in the physicochemical properties of the molecules [ ]. the analysis of the favorable and unfavorable regions of the steric, electrostatic, hydrophobic, hbd, and hba fields contributes to understanding the relationship between the toxic activity of the aconitine alkaloids and their structure. steric and electrostatic contour maps of the comfa qstr model are shown in figure a,b, respectively. hydrophobic, hbd, and hba contour maps of the comsia qstr model are shown in figure c-e. compound has the most toxic activity, so it was chosen as the reference structure for the generation of the comfa and comsia contour maps. in the case of the comfa study, the steric contour map around compound is shown in figure a. the yellow regions near r , r , and r surround the substituents of the molecule, which indicates that these positions are not ideal for sterically bulky functional groups. therefore, compounds , , and (with pld50 values of . , . , and . , respectively), which bear bulky esterified moieties at positions r and r , were less toxic than compounds and (with pld50 values of . and . ), which are substituted by a small hydroxyl group, and compound (with a pld50 value of . ) has less toxic activity due to the esterified moiety at r . the green regions indicate positions where bulky substituents are sterically favorable. the comfa electrostatic contour map is shown in figure b.
the blue regions near the r and r substitutions revealed that substitution with electropositive groups favors toxicity. this is supported by the fact that the compounds with hydroxy at these two positions had higher pld50 values than the compounds with acetoxy or no substituents. the red regions surrounding the molecular scaffold were not distinct, which indicates that there is no clear connection between electronegative groups and toxicity. the comsia hydrophobic contour map is shown in figure c. the white regions around r , r , and r indicated that hydrophobic groups are unfavorable for toxicity, so esterification of the hydrophilic hydroxyl or dehydroxylation decreases the toxicity, which is consistent with the steric and electrostatic contour maps. the yellow contour near r showed that the hydrophilic hydroxy is unfavorable to toxicity at this position, which can be validated by the fact that the aconitine alkaloids with hydroxy substituents there (compound ) show lower pld50 values. the comsia contour map of the hbd field is shown in figure d. the cyan regions at r , r , and r represent favorable positions for hbd atoms, which clearly validates the fact that the compounds with hydroxy in these regions show potent toxicity. a purple region was found near r , which indicates that an hbd atom (hydroxyl) in this region has an adverse effect on toxicity. the hba contour map is shown in figure . the magenta region around the r substitution indicates that an hba atom is favorable at this position, so compounds , , , and , which bear an hba atom at r , exhibit more potent toxicity (with pld50 values of . , . , . , and . ) than the compounds with methoxymethyl substituents (compounds , , and , with pld50 values of . , . , and . ). the red contours, where hba atoms are unfavorable for toxicity, were positioned around r and r ; these contours are well validated by the lower pld50 values of the compounds with carbonyl groups at these positions. the ppi network of aconitine alkaloid cardiotoxicity was divided into nine clusters using clusterone. the statistical parameters are shown in figure . six clusters, namely clusters , , , , , and , which possess quality scores higher than . , a density higher than . , and a p-value less than . , were selected for further analysis (figure ). clusters , , and consisted of proteins mainly involved in the effects of various calcium, potassium, and sodium channels. cluster mainly consisted of three channel types related to the cardiotoxicity of aconitine alkaloids, cluster contained calcium and sodium channels and some channel exchangers (such as ryr1 and ryr2), and cluster mainly consisted of various potassium channels. all of these findings are consistent with previous research on the arrhythmogenic properties of aconitine alkaloid toxicity: aconitine binds to ion channels and affects their open state, and thus the corresponding ion influx into the cytosol [ ][ ][ ]. the channel exchangers play a crucial role in maintaining ion transport and homeostasis inside and outside of the cell.
cluster contained some regulatory proteins that can activate or repress the ion channels at the protein expression level. atp2a1, ryr1, ryr2, cacna1c, cacna1d, and cacna1s mediate the release of calcium, thereby playing a key role in triggering cardiac muscle contraction and maintaining calcium homeostasis [ , ]. aconitine may cause aberrant channel activation and lead to cardiac arrhythmia. clusters and consisted of camp-dependent protein kinase (capk), cgmp-dependent protein kinase (cgpk), and guanine nucleotide-binding proteins (g proteins). it has not been fully established whether the cardiotoxicity induced by aconitine alkaloids is linked to capk, cgpk, and the g proteins; however, some studies have shown that the cardiotoxicity-related protein kcnj (potassium inwardly-rectifying channel) is controlled by g proteins, and the cardiac sodium/calcium exchanger is said to be regulated by capk and cgpk [ , ]. the result of clusterone indicated that the constructed network is consistent with existing studies and that the network can be used to screen essential proteins with the cytonca plugin. the protein v o, belonging to the camkii (calcium/calmodulin (ca2+/cam)-dependent serine/threonine kinase ii) isozyme family, plays a central role in cellular signaling by transmitting ca2+ signals. the camkii enzymes transmit the calcium ion (ca2+) signals released inside the cell by regulating signal transduction pathways through phosphorylation. ca2+ first binds to the small regulatory protein cam, and this ca2+/cam complex then binds to and activates the kinase, which then phosphorylates other proteins such as the ryanodine receptor and the sodium/calcium exchanger. thus, these proteins are related to the cardiotoxicity induced by aconitine alkaloids [ ][ ][ ]. excessive activity of camkii has been observed in some structural heart diseases and arrhythmias [ ], and past findings demonstrate neuroprotection in neuronal cultures treated with inhibitors of camkii immediately prior to excitotoxic activation of camkii [ ]. the acute cardiotoxicity of the aconitine alkaloids is possibly related to this target. based on the analysis of the ppi network above, camkii was selected as the potential target for further molecular docking and dynamics simulation. the docking result for v o is shown in figure a. compound has the highest fit score, so it was selected as the template for conformational analysis. the mechanisms of camkii activation and inactivation are shown in figure b. compound affects the normal energy metabolism of the myocardial cell via binding in the atp-competitive site (figure c). the inactive state of camkii is regulated by cask-mediated t /t phosphorylation, and this state can be inhibited by the binding of compound in the atp-competitive site. such binding moves camkii toward a ca2+/cam-dependent active state through structural rearrangement of the inhibitory helix caused by ca2+/cam binding and the subsequent autophosphorylation of t [ ], which will induce excessive camkii activity and a dynamic imbalance of the calcium ions in the myocardial cell, eventually leading to heart disease and arrhythmias.
information about the binding pocket of a receptor for its ligand is very important for drug design, particularly for conducting mutagenesis studies [ ]. as has been reported in the past [ ], the binding pocket of a protein receptor for a ligand is usually defined by those residues that have at least one heavy atom within a distance of Å from a heavy atom of the ligand. such a criterion was originally used to define the binding pocket of atp in the cdk -nck a complex [ ], and it later proved to be very useful in identifying functional domains and stimulating the relevant truncation experiments. a similar approach has also been used to define the binding pockets of many other receptor-ligand interactions important for drug design [ , , , ][ ][ ][ ][ ]. information about the binding pocket of camkii for the aconitine alkaloids will serve as a guideline for designing drugs with similar scaffolds, particularly for conducting mutagenesis studies. in figure a, the four compounds with the top fit scores (compounds , , , and ) generated similar significant interactions with amino acid residues around the atp-competitive binding pocket. the four compounds formed many van der waals interactions within the noncompetitive inhibitor pocket through amino acid residues such as asp , lys , glu , lys , and leu . the ligand-receptor interactions showed that the hydroxy at r formed a side chain donor interaction with asp ; in addition, the hydroxy at r and r also formed side chain acceptor interactions with glu and ser , respectively (the docking results of compounds and in figure a). these results correspond to the comfa and comsia contour maps: the small electropositive and hydrophilic groups at r , r , and r enhance the toxicity to a certain extent. there were aromatic interactions between the phenyl group at r and amino acid residues: the phenyl group at r formed aromatic interactions with leu , leu , and phe , while the small hydroxyl group did not form any interaction with asp , which demonstrates that the bulky phenyl group is crucial to this binding pattern and to toxicity. this largely agrees with the comfa steric contour map, in which r is ideal for sterically favorable groups. the methoxymethyl group at r generated a backbone acceptor interaction with lys , which corresponds to the comsia hba contour map, in which r is favorable for an hba atom. figure : compound docked into v o; the atp-competitive pocket is painted green; the t , t , and t phosphorylation sites are painted green, orange, and yellow, respectively; the inhibitory helix is painted red. the result of the md simulation is shown in figure . the red plot represents the rmsd values of the docked protein. the rmsd reached . Å within . ns and then remained between and . Å throughout the simulation for up to ns. the average rmsd value was . Å. the md simulation demonstrated that the ligand was stabilized in the active site.
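the rmsd analysis above can be reproduced from exported trajectory coordinates with a few lines of numpy. the sketch below is a minimal illustration, assuming the frames have already been superposed onto a common reference (for example during the vmd/namd alignment step) and saved to a hypothetical .npy file; it is not the exact post-processing used in the study.

```python
import numpy as np

def rmsd(reference: np.ndarray, frame: np.ndarray) -> float:
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays.

    Both frames are assumed to be already superposed onto a common reference,
    so no rotational fit is applied here.
    """
    diff = frame - reference
    return float(np.sqrt((diff * diff).sum() / reference.shape[0]))

# hypothetical trajectory: a (n_frames, n_atoms, 3) array exported from VMD/NAMD
trajectory = np.load("docked_camkii_backbone_coords.npy")  # assumed file name
reference = trajectory[0]

rmsd_per_frame = np.array([rmsd(reference, frame) for frame in trajectory])
print("mean RMSD (A):", rmsd_per_frame.mean())
print("max  RMSD (A):", rmsd_per_frame.max())
```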
finally, we combined the ligand-based 3d-qstr analysis with the structure-based molecular docking study to identify the moieties essential to the cardiotoxicity mechanism of the aconitine alkaloids (figure ). figure : the crucial requirements of the cardiotoxicity mechanism obtained from the ligand-based 3d-qstr and structure-based molecular docking studies. to build the ppi network of the aconitine alkaloids, literature from january to february was retrieved from pubmed (http://pubmed.cn/) and web of science (http://www.isiknowledge.com/) with the mesh terms "aconitine" and "toxicity" and without language restriction. all documents about cardiotoxicity caused by aconitine alkaloids were collected. the proteins related to aconitine alkaloid cardiotoxicity reported in this decade were gathered and taken as the input proteins in the string (https://string-db.org/) database [ , ], which was used to search for related proteins or pathways that had been reported. finally, all the proteins and their partners were recorded in excel in order to import the information and build a ppi network in cytoscape. cytoscape is a free, open-source, java application for visualizing molecular networks and integrating them with gene expression profiles [ , ]. plugins are available for network and molecular profiling analyses, new layouts, additional file format support, making connections with databases, and searching within large networks [ ]. clusterone (clustering with overlapping neighborhood expansion) in cytoscape was utilized to cluster the ppi network into overlapping sub-graphs of highly interconnected nodes. clusterone is a plugin for detecting and clustering potentially overlapping protein complexes from ppi data. the quality of a group was assessed by the number of nodes in the sub-graph, the p-value, and the density. a cluster was discarded when the number of nodes in the sub-graph was smaller than , the density was less than . , the quality was less than . , or the p-value was greater than . [ ]. the clustering results of clusterone are instrumental in understanding how reliably the ppi network relates to aconitine alkaloid cardiotoxicity.
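as a small illustration of this filtering step, the snippet below screens a clusterone result table exported from cytoscape. the file name, column names, and threshold values (size, density, quality, p-value) are assumptions for illustration only, since the exact values used in the study are not recoverable from this text.

```python
import pandas as pd

# hypothetical export of the ClusterONE results table from Cytoscape
clusters = pd.read_csv("clusterone_results.csv")

# illustrative thresholds (assumed, not the paper's actual values)
MIN_SIZE, MIN_DENSITY, MIN_QUALITY, MAX_PVALUE = 3, 0.25, 0.4, 0.05

selected = clusters[
    (clusters["size"] >= MIN_SIZE)
    & (clusters["density"] >= MIN_DENSITY)
    & (clusters["quality"] >= MIN_QUALITY)
    & (clusters["p_value"] < MAX_PVALUE)
]

print(f"kept {len(selected)} of {len(clusters)} clusters")
print(selected[["size", "density", "quality", "p_value"]])
```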
cytonca is a plugin for cytoscape integrating calculation, evaluation, and visualization analysis for multiple centrality measures. there are eight centrality measurements provided by cytonca: betweenness, closeness, degree, eigenvector, local average connectivity-based, network, subgraph, and information centrality [ ]. the primary purpose of the centrality analysis was to confirm the essential proteins in the pre-built ppi network. three centrality measurements of the cytonca plugin (subgraph centrality, betweenness centrality, and closeness centrality) were used for evaluating and screening the essential proteins in the merged target network. the subgraph centrality characterizes the participation of each node in all subgraphs of the network; smaller subgraphs are given more weight than larger ones, which makes this measurement appropriate for characterizing network properties. the subgraph centrality of node u can be calculated by [ ]

SC(u) = \sum_{l=0}^{\infty} \frac{\mu_l(u)}{l!} = \sum_{j=1}^{n} (v_j^u)^2 e^{\lambda_j},

where μ_l(u) is the u-th diagonal entry of the l-th power of the weighted adjacency matrix a of the network, v_1, v_2, ..., v_n is an orthonormal basis of r^n composed of eigenvectors of a associated with the eigenvalues λ_1, λ_2, ..., λ_n, and v_j^u is the u-th component of v_j [ ]. the betweenness centrality finds a wide range of applications in network theory, including problems related to social networks, biology, transport, and scientific cooperation; it represents the degree to which a node stands between the other nodes. the betweenness centrality of a node u can be calculated by [ ]

BC(u) = \sum_{s \neq u \neq t} \frac{\rho(s, u, t)}{\rho(s, t)},

where ρ(s, t) is the total number of shortest paths from node s to node t and ρ(s, u, t) is the number of those paths that pass through u. the closeness centrality of a node is a measure of centrality based on the lengths of the shortest paths between the node and all other nodes in the graph; the more central a node is, the closer it is to all other nodes. the closeness centrality of a node u can be calculated by [ ]

CC(u) = \frac{|N_u|}{\sum_{v \in N_u} \mathrm{dist}(u, v)},

where |N_u| is the number of node u's neighbors (the nodes reachable from u) and dist(u, v) is the length of the shortest path from node u to node v. pharmmapper serves as a valuable tool for identifying potential targets for a novel synthetic compound, a newly isolated natural product, a compound with known biological activity, or an existing drug [ ]. of all the aconitine alkaloids in this research, compounds , , and exhibited the most toxic activity and were used for the potential target prediction. the mol files of the three compounds were submitted to the pharmmapper server. the parameters "generate conformers" and "maximum generated conformations" were set to on and , respectively; the other parameters used their default values. finally, the results of clusterone and pharmmapper were combined to select the potential targets for the following docking study [ ].
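the centrality screening described above (three centralities computed, top-ranked nodes intersected) can be prototyped outside cytoscape with networkx. the sketch below is a minimal illustration under stated assumptions: the edge-list file name is hypothetical, and the top fraction of 20% merely stands in for the percentage used in the study, which is not recoverable from this text.

```python
import networkx as nx

# edge list exported from STRING (one "proteinA proteinB" pair per line); assumed file name
G = nx.read_edgelist("string_ppi_edges.txt")
if "AC" in G:
    G.remove_node("AC")  # drop the central query node, as described above

def top_fraction(scores: dict, fraction: float = 0.2) -> set:
    """Return the nodes whose score is in the top `fraction` of the ranking."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[: max(1, int(len(ranked) * fraction))])

subgraph = nx.subgraph_centrality(G)
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)

# essential-protein candidates: nodes ranked in the top fraction of all three measures
candidates = (
    top_fraction(subgraph) & top_fraction(betweenness) & top_fraction(closeness)
)
print(sorted(candidates))
```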
comparative molecular field analysis (comfa) and comparative molecular similarity index analysis (comsia) are efficient tools in ligand-based drug design and are used for contour map generation and the identification of favorable and unfavorable regions around a moiety [ , ]. comfa produces steric and electrostatic fields of the molecules that are correlated with toxic activity, while comsia produces hydrophobic, hydrogen bond donor (hbd)/hydrogen bond acceptor (hba) [ ], and steric/electrostatic fields that are correlated with toxic activity. comfa and comsia were utilized to generate the 3d-qstr models [ ]. all molecular modeling and the generation of the 3d-qstr models were performed with sybyl x . . the ld50 values of the aconitine alkaloids in mice listed in table were extracted from recent literature [ ]. the ld50 values of all aconitine alkaloids were converted into pld50 values, which were used as the dependent variable, while the comfa and comsia descriptors were used as the independent variables. the sketch function of sybyl x . was utilized to draw the structures, and the charges were calculated by the gasteiger-huckel method. additionally, the tripos force field was utilized for the energy minimization of the aconitine alkaloid molecules [ ]. the molecules were divided into training and test sets at a ratio of : . the division was done in such a way that both datasets are balanced and contain both active and less active molecules [ ]. the reliability of the 3d-qstr model depends on the database molecular alignment. the most toxic aconitine alkaloid (compound ) was selected as the template molecule, and the tetradecahydro- h- , , -(epiethane[ , , ]triyl)- , -methanonaphtho[ , -b]azocine scaffold was selected as the common moiety. pls (partial least squares) techniques correlate the field descriptors with the activity values [ ] and yield statistics such as the leave-one-out (loo) values, the optimal number of components, the standard error of estimation (see), the cross-validated coefficient (q2), and the conventional coefficient (r2). these statistics are pivotal in the evaluation of the 3d-qstr model and can be worked out with the pls method [ ]. a model is said to be good when the q2 value is more than . and the r2 value is more than . ; the q2 and r2 values reflect a model's soundness. the best model has the highest q2 and r2 values, the lowest see, and an optimal number of components [ , , ]. in the comfa and comsia analyses, the optimal number of components, the see, and q2 were worked out by loo validation with "use sampls" turned on and the number of components set to , while in the calculation of r2, "use sampls" was turned off and the column filtering was set to . kcal mol−1 in order to speed up the calculation without sacrificing information content [ ][ ][ ][ ]. therefore, the numbers of components were set to and , respectively, which were the optimal numbers of components obtained from the sampls run. the see and r2 were utilized to assess the non-cross-validated models. the applicability domain (ad) of the comfa and comsia models was confirmed by the williams plot of residuals vs. leverage. the leverage of a query chemical is proportional to its mahalanobis distance from the centroid of the training set [ , ]. the leverages for a given dataset x are obtained from the leverage (hat) matrix h with the equation below:

H = X (X^T X)^{-1} X^T,

where x is the model matrix and x^T is its transpose. the plot of standardized residuals vs. leverage values was drawn, and compounds with standardized residuals greater than three standard deviation units (±3σ) were considered outliers [ ]. the critical leverage value h* is taken as 3p/n, where p is the number of model variables plus one and n is the number of objects used to calculate the model. h > 3p/n means that the predicted response is not acceptable [ ][ ][ ].
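the applicability-domain check above reduces to computing the diagonal of the hat matrix and comparing it with h* = 3p/n, alongside the standardized residuals. the sketch below is a minimal numpy illustration under stated assumptions: the descriptor matrix and residuals are placeholders for the sybyl output (hypothetical file names), and it is not the exact procedure used in the study.

```python
import numpy as np

def leverages(X: np.ndarray) -> np.ndarray:
    """Diagonal of the hat matrix H = X (X^T X)^(-1) X^T for a descriptor matrix X."""
    XtX_inv = np.linalg.pinv(X.T @ X)  # pseudo-inverse for numerical safety
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# X: training descriptor matrix (rows = compounds); residuals from the QSTR model.
# Both files are placeholders standing in for the SYBYL export.
X = np.loadtxt("comfa_descriptors.txt")
residuals = np.loadtxt("comfa_residuals.txt")

h = leverages(X)
h_star = 3 * (X.shape[1] + 1) / X.shape[0]  # critical leverage h* = 3p/n
std_res = residuals / residuals.std(ddof=1)

outliers = np.abs(std_res) > 3              # beyond +/- 3 sigma
outside_domain = h > h_star
print("critical leverage h*:", h_star)
print("response outliers:", np.where(outliers)[0])
print("outside applicability domain:", np.where(outside_domain)[0])
```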
moe (molecular operating environment) is a computer-aided drug design (cadd) software program that incorporates the functions of qsar, molecular docking, molecular dynamics, adme (absorption, distribution, metabolism, and excretion) prediction, and homology modeling. all of these functions are regarded as useful instruments in the fields of drug discovery and biochemistry. the molecular docking and dynamics studies were performed in moe software to assess the stability and affinity between the ligands and the predicted targets [ , ]. the docking process involves the prediction of the ligand conformation and orientation within a targeted binding site. docking analysis is an important step in this process; it has been widely used to study reasonable binding modes and to obtain information on the interactions between the amino acids in active protein sites and the ligands. the molecular docking analysis was carried out to determine the toxicity-related moieties of the aconitine alkaloids through the ligand-amino-acid interaction function in moe . the pdb files of v o and vz were downloaded from the pdb (protein data bank) database (https://www.rcsb.org/), and the mol files of the compounds were taken from the sybyl qstr study. the structure preparation function in moe was used to minimize the energy and optimize the structure of the protein skeleton. based on the london dg score and induced fit refinement, all compounds were docked into the active site of every potential target, taking the score values as the scoring function [ ]. the dcg (discounted cumulative gain) algorithm was utilized to examine the consistency between the ranking by experimental pld50 and the ranking produced in our research (the fit scores of the docking study). taking the experimental pld50 values as the relevance scores rel_i of the compounds in the order given by the fit scores, the dcg is

DCG = \sum_{i=1}^{n} \frac{rel_i}{\log_2(i + 1)},

the idcg (ideal dcg) is the dcg of the ideally ordered pld50 values, and ndcg = dcg/idcg. the closer the normalized discounted cumulative gain (ndcg) value is to 1, the better the consistency [ ]. preliminary md simulations for the model protein were performed using the program namd (nanoscale molecular dynamics program, v . ), and all input files were generated using visual molecular dynamics (vmd). namd is a freely available software package designed for the high-performance simulation of large biomolecular systems [ ]. during the md simulation, the minimization and equilibration of the original and docked proteins were performed in a Å water box. a charmm force field file was applied for the energy minimization and equilibration, with gasteiger-huckel charges and boltzmann initial velocities [ , ]. the integrator parameters included a time step of fs with rigid bonds; nonbonded interactions were evaluated within a Å cutoff, full electrostatic evaluations were performed within Å, and steps were used for each cycle [ ]. the particle mesh ewald method was used for the electrostatic interactions of the simulation system under periodic boundary conditions, with grid dimensions of . Å [ ]. the pressure was maintained at . kpa using the langevin piston, and the temperature was controlled at k using langevin dynamics. covalent bonds between hydrogen and heavy atoms were constrained using the shake/rattle algorithm. finally, ns md simulations of the original and docked proteins were carried out to compare and verify the binding affinity and stability of the ligand-receptor complex.
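the ndcg consistency check described above can be reproduced with a few lines of numpy. the sketch below follows the standard dcg formulation given earlier; the pld50 and fit-score values are illustrative placeholders, since the actual numbers are not recoverable from this text.

```python
import numpy as np

def dcg(relevances: np.ndarray) -> float:
    """Discounted cumulative gain of relevance scores taken in ranked order."""
    positions = np.arange(1, len(relevances) + 1)
    return float(np.sum(relevances / np.log2(positions + 1)))

def ndcg(experimental_pld50: np.ndarray, fit_scores: np.ndarray) -> float:
    """NDCG of the docking-score ranking, using experimental pLD50 as relevance."""
    order_by_fit = np.argsort(fit_scores)[::-1]      # rank compounds by fit score
    ideal_order = np.sort(experimental_pld50)[::-1]  # ideal ranking by pLD50
    return dcg(experimental_pld50[order_by_fit]) / dcg(ideal_order)

# illustrative values only (not the paper's data)
pld50 = np.array([4.9, 4.4, 4.1, 3.6, 3.2])
fit_scores_target = np.array([7.8, 7.1, 6.9, 6.0, 5.4])
print("NDCG:", ndcg(pld50, fit_scores_target))
```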
the method combining network analysis and in silico modeling was carried out to illustrate the qstr and the toxicity mechanisms of the aconitine alkaloids. the 3d-qstr models were built in sybyl with internal robustness and high external predictivity, enabling the identification of the pivotal molecular moieties related to toxicity in aconitine alkaloids. the comfa model had q2, r2, optimum component, and correlation coefficient (r2) values of . , . , , and . , respectively, and the comsia model had q2, r2, optimum component, and correlation coefficient (r2) values of . , . , , and . . the network was built with cytoscape and the string database, and its reliability was demonstrated by the cluster analysis. the v o and vz proteins were identified as potential targets with the cytonca plugin and the pharmmapper server, and the interactions between the aconitine alkaloids and the key amino acids were examined in the docking study. the result of the docking study is consistent with the experimental pld50. the md simulation indicated that the aconitine alkaloids exhibit potent binding affinity and stability toward the receptor camk2g. finally, we integrated the pivotal molecular moieties and the ligand-receptor interactions to characterize the qstr of the aconitine alkaloids. this research serves as a guideline for studies of toxicity, including neuro-, reproductive, and embryo-toxicity. with a deep understanding of the relationship between the toxicity and the structure of aconitine alkaloids, subsequent structural modification of the aconitine alkaloids can be carried out to enhance their efficacy and to reduce their toxic side effects. based on such research, aconitine alkaloids can be brought closer to medical and clinical applications. in addition, as pointed out in past research [ ], user-friendly and publicly accessible web servers represent the future direction for reporting various important computational analyses and findings [ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]. they have significantly enhanced the impact of computational biology on medical science [ , ]. the research in this paper will serve as a foundation for constructing web servers for qstr studies and target identification of compounds.
references
immunomodulating agents of plant origin. i: preliminary screening
chinese drugs of plant origin
aconitine poisoning: a global perspective
ventricular tachycardia after ingestion of ayurveda herbal antidiarrheal medication containing aconitum
fatal accidental aconitine poisoning following ingestion of chinese herbal medicine: a report of two cases
five cases of aconite poisoning: toxicokinetics of aconitines
a case of fatal aconitine poisoning by monkshood ingestion
determination of aconitine and hypaconitine in gucixiaotong ye by capillary electrophoresis with field-amplified sample injection
a clinical study in epidural injection with lappaconitine compound for post-operative analgesia
therapeutic effects of il- combined with benzoylmesaconine, a non-toxic aconitine-hydrolysate, against herpes simplex virus type infection in mice following thermal injury
aconitine: a potential novel treatment for systemic lupus erythematosus
aconitine-containing agent enhances antitumor activity of dichloroacetate against ehrlich carcinoma
complex discovery from weighted ppi networks
prediction and analysis of the protein interactome in pseudomonas aeruginosa to enable network-based drug target selection
the string database in : quality-controlled protein-protein association networks, made broadly accessible
identification of functional modules in a ppi network by clique percolation clustering
united complex centrality for identification of essential proteins from ppi networks
the ppi network and cluster one analysis to explain the mechanism of bladder cancer
the progress of novel drug delivery systems
mitochondrial uncoupling protein structure determined by nmr molecular fragment searching
structural basis for membrane anchoring of hiv- envelope spike
unusual architecture of the p channel from hepatitis c virus
architecture of the mitochondrial calcium uniporter
structure and mechanism of the m proton channel of influenza a virus
computer-aided drug design using sesquiterpene lactones as sources of new structures with potential activity against infectious neglected diseases
successful in silico discovery of novel nonsteroidal ligands for human sex hormone binding globulin
in silico discovery of novel ligands for antimicrobial lipopeptides for computer-aided drug design
structural bioinformatics and its impact to biomedical science
coupling interaction between thromboxane a receptor and alpha- subunit of guanine nucleotide-binding protein
prediction of the tertiary structure and substrate binding site of caspase-
study of drug resistance of chicken influenza a virus (h n ) from homology-modeled d structures of neuraminidases
insights from investigating the interaction of oseltamivir (tamiflu) with neuraminidase of the h n swine flu virus
prediction of the tertiary structure of a caspase- /inhibitor complex
design novel dual agonists for treating type- diabetes by targeting peroxisome proliferator-activated receptors with core hopping approach
heuristic molecular lipophilicity potential (hmlp): a d-qsar study to ladh of molecular family pyrazole and derivatives
fragment-based quantitative structure–activity relationship (fb-qsar) for fragment-based drug design
investigation into adamantane-based m inhibitors with fb-qsar
hp-lattice qsar for dynein proteins: experimental proteomics ( d-electrophoresis, mass spectrometry) and theoretic study of a leishmania infantum sequence
the biological functions of low-frequency phonons: . cooperative effects
low-frequency collective motion in biomacromolecules and its biological functions
quasi-continuum models of twist-like and accordion-like low-frequency motions in dna
collective motion in dna and its role in drug intercalation
biophysical aspects of neutron scattering from vibrational modes of proteins
biological functions of soliton and extra electron motion in dna structure
low-frequency resonance and cooperativity of hemoglobin
solitary wave dynamics as a mechanism for explaining the internal motion during microtubule growth
designed electromagnetic pulsed therapy: clinical applications
steps to the clinic with elf emf
molecular dynamics study of the connection between flap closing and binding of fullerene-based inhibitors of the hiv- protease
molecular dynamics studies on the interactions of ptp b with inhibitors: from the first phosphate-binding site to the second one
the cambridge structural database: a quarter of a million crystal structures and rising
molecular similarity indices in a comparative analysis (comsia) of drug molecules to correlate and predict their biological activity
single channel analysis of aconitine blockade of calcium channels in rat myocardiocytes
conversion of the sodium channel activator aconitine into a potent alpha -selective nicotinic ligand
aconitine blocks herg and kv . potassium channels
inactivation of ca + release channels (ryanodine receptors ryr and ryr ) with rapid steps in [ca + ] and voltage
targeted disruption of the atp a gene encoding the sarco(endo)plasmic reticulum ca + atpase isoform (serca ) impairs diaphragm function and is lethal in neonatal mice
cyclic gmp-dependent protein kinase activity in rat pulmonary microvascular endothelial cells
different g proteins mediate somatostatin-induced inward rectifier k + currents in murine brain and endocrine cells
cardiac myocyte calcium transport in phospholamban knockout mouse: relaxation and endogenous camkii effects
inhibition of camkii phosphorylation of ryr prevents induction of atrial fibrillation in fkbp . knock-out mice
regulation of ca + and electrical alternans in cardiac myocytes: role of camkii and repolarizing currents
the role of calmodulin kinase ii in myocardial physiology and disease
excitotoxic neuroprotection and vulnerability with camkii inhibition
structure of the camkiiδ/calmodulin complex reveals the molecular mechanism of camkii kinase activation
a model of the complex between cyclin-dependent kinase and the activation domain of neuronal cdk activator
binding mechanism of coronavirus main proteinase with ligands and its implication to drug design against sars
an in-depth analysis of the biological functional studies based on the nmr m channel structure of influenza a virus
molecular therapeutic target for type- diabetes
novel inhibitor design for hemagglutinin against h n influenza virus by core hopping method
the string database in : functional interaction networks of proteins, globally integrated and scored
cytoscape: a software environment for integrated models of biomolecular interaction networks
detecting overlapping protein complexes in protein-protein interaction networks
cytonca: a cytoscape plugin for centrality analysis and evaluation of protein interaction networks
subgraph centrality and clustering in complex hyper-networks
ranking closeness centrality for large-scale social networks
enhancing the enrichment of pharmacophore-based target prediction for the polypharmacological profiles of drugs
comparative molecular field analysis (comfa). . effect of shape on binding of steroids to carrier proteins
sample-distance partial least squares: pls optimized for many variables, with application to comfa
a qsar analysis of toxicity of aconitum alkaloids
recent advances in qsar and their applications in predicting the activities of chemical molecules, peptides and proteins for drug design
unified qsar approach to antimicrobials. .
multi-target qsar modeling and comparative multi-distance study of the giant components of antiviral drug-drug complex networks
comfa qsar models of camptothecin analogues based on the distinctive sar features of combined abc, cd and e ring substitutions
applicability domain for qsar models: where theory meets reality
comparison of different approaches to define the applicability domain of qsar models
molecular docking and qsar analysis of naphthyridone derivatives as atad bromodomain inhibitors: application of comfa, ls-svm, and rbf neural network
concise applications of molecular modeling software-moe
medicinal chemistry and the molecular operating environment (moe): application of qsar and molecular docking to drug discovery
qsar models of cytochrome p enzyme a inhibitors using comfa, comsia and hqsar
estimating a ranked list of human hereditary diseases for clinical phenotypes by using weighted bipartite network
biomolecular simulation on thousands of processors
molecular dynamics and docking investigations of several zoanthamine-type marine alkaloids as matrix metaloproteinase- inhibitors
salts influence cathechins and flavonoids encapsulation in liposomes: a molecular dynamics investigation
review: recent advances in developing web-servers for predicting protein attributes
irna-ai: identifying the adenosine to inosine editing sites in rna sequences
iss-psednc: identifying splicing sites using pseudo dinucleotide composition
irna-pseu: identifying rna pseudouridine sites
ploc-mplant: predict subcellular localization of multi-location plant proteins by incorporating the optimal go information into general pseaac
ploc-mhum: predict subcellular localization of multi-location human proteins via general pseaac to winnow out the crucial go information
iatc-misf: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals
psuc-lys: predict lysine succinylation sites in proteins with pseaac and ensemble random forest approach
irnam c-psednc: identifying rna -methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition
ikcr-pseens: identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier
iacp: a sequence-based tool for identifying anticancer peptides
ploc-meuk: predict subcellular localization of multi-label eukaryotic proteins by extracting the key go information into general pseaac
iatc-mhyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals
ihsp-pseraaac: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition
irna-psecoll: identifying the occurrence sites of different rna modifications by incorporating collective effects of nucleotides into pseknc
impacts of bioinformatics to medicinal chemistry
an unprecedented revolution in medicinal chemistry driven by the progress of biological science
this article is an open access article distributed under the terms and conditions of the creative commons attribution (cc by) license.
sri; saha, abhishek; bressan, stéphane title: hydrological process surrogate modelling and simulation with neural networks date: - - journal: advances in knowledge discovery and data mining doi: . / - - - - _ sha: doc_id: cord_uid: hgowgq environmental sustainability is a major concern for urban and rural development. actors and stakeholders need economic, effective and efficient simulations in order to predict and evaluate the impact of development on the environment and the constraints that the environment imposes on development. numerical simulation models are usually computation expensive and require expert knowledge. we consider the problem of hydrological modelling and simulation. with a training set consisting of pairs of inputs and outputs from an off-the-shelves simulator, we show that a neural network can learn a surrogate model effectively and efficiently and thus can be used as a surrogate simulation model. moreover, we argue that the neural network model, although trained on some example terrains, is generally capable of simulating terrains of different sizes and spatial characteristics. an article in the nikkei asian review dated september warns that both the cities of jakarta and bangkok are sinking fast. these iconic examples are far from being the only human developments under threat. the united nation office for disaster risk reduction reports that the lives of millions were affected by the devastating floods in south asia and that around , people died in the bangladesh, india and nepal [ ] . climate change, increasing population density, weak infrastructure and poor urban planning are the factors that increase the risk of floods and aggravate consequences in those areas. under such scenarios, urban and rural development stakeholders are increasingly concerned with the interactions between the environment and urban and rural development. in order to study such complex interactions, stakeholders need effective and efficient simulation tools. a flood occurs with a significant temporary increase in discharge of a body of water. in the variety of factors leading to floods, heavy rain is one of the prevalent [ ] . when heavy rain falls, water overflows from river channels and spills onto the adjacent floodplains [ ] . the hydrological process from rainfall to flood is complex [ ] . it involves nonlinear, time-varying interactions between rain, topography, soil types and other components associated with the physical process. several physics-based hydrological numerical simulation models, such as hec-ras [ ] , lisflood [ ] , lisflood-fp [ ] , are commonly used to simulate floods. however, such models are usually computation expensive and expert knowledge is required for both design and for accurate parameter tuning. we consider the problem of hydrological modelling and simulation. neural network models are known for their flexibility, efficient computation and capacity to deal with nonlinear correlation inside data. we propose to learn a flood surrogate model by training a neural network with pairs of inputs and outputs from the numerical model. we empirically demonstrate that the neural network can be used as a surrogate model to effectively and efficiently simulate the flood. the neural network model that we train learns a general model. with the trained model from a given data set, the neural network is capable of simulating directly spatially different terrains. 
moreover, while a neural network is generally constrained to a fixed size of its input, the model that we propose is able to simulate terrains of different sizes and spatial characteristics. this paper is structured as follows. section summarises the main related works regarding physics-based hydrological and flood models as well as statistical machine learning models for flood simulation and prediction. section presents our methodology. section presents the data set, parameters setting and evaluation metrics. section describes and evaluates the performance of the proposed models. section presents the overall conclusions and outlines future directions for this work. current flood models simulate the fluid movement by solving equations derived from physical laws with many hydrological process assumptions. these models can be classified into one-dimensional ( d), two-dimensional ( d) and threedimensional ( d) models depending on the spatial representation of the flow. the d models treat the flow as one-dimension along the river and solve d saint-venant equations, such as hec-ras [ ] and swmm [ ] . the d models receive the most attention and are perhaps the most widely used models for flood [ ] . these models solve different approximations of d saint-venant equations. two-dimensional models such as hec-ras d [ ] is implemented for simulating the flood in assiut plateau in southwestern egypt [ ] and bolivian amazonia [ ] . another d flow models called lisflood-fp solve dynamic wave model by neglecting the advection term and reduce the computation complexity [ ] . the d models are more complex and mostly unnecessary as d models are adequate [ ] . therefore, we focus our work on d flow models. instead of a conceptual physics-based model, several statistical machine learning based models have been utilised [ , ] . one state-of-the-art machine learning model is the neural network model [ ] . tompson [ ] uses a combination of the neural network models to accelerate the simulation of the fluid flow. bar-sinai [ ] uses neural network models to study the numerical partial differential equations of fluid flow in two dimensions. raissi [ ] developed the physics informed neural networks for solving the general partial differential equation and tested on the scenario of incompressible fluid movement. dwivedi [ ] proposes a distributed version of physics informed neural networks and studies the case on navier-stokes equation for fluid movement. besides the idea of accelerating the computation of partial differential equation, some neural networks have been developed in an entirely data-driven manner. ghalkhani [ ] develops a neural network for flood forecasting and warning system in madarsoo river basin at iran. khac-tien [ ] combines the neural network with a fuzzy inference system for daily water levels forecasting. other authors [ , ] apply the neural network model to predict flood with collected gauge measurements. those models, implementing neural network models for one dimension, did not take into account the spatial correlations. authors of [ , ] use the combinations of convolution and recurrent neural networks as a surrogate model of navier-stokes equations based fluid models with a higher dimension. the recent work [ ] develops a convolutional neural network model to predict flood in two dimensions by taking the spatial correlations into account. the authors focus on one specific region in the colorado river. 
it uses a convolutional neural network and a conditional generative adversarial network to predict water level at the next time step. the authors conclude neural networks can achieve high approximation accuracy with a few orders of magnitude faster speed. instead of focusing on one specific region and learning a model specific to the corresponding terrain, our work focuses on learning a general surrogate model applicable to terrains of different sizes and spatial characteristics with a datadriven machine learning approach. we propose to train a neural network with pairs of inputs and outputs from an existing flood simulator. the output provides the necessary supervision. we choose the open-source python library landlab, which is lisflood-fp based. we first define our problem in subsect. . . then, we introduce the general ideas of the numerical flood simulation model and landlab in subsect. . . finally, we present our solution in subsect. . . we first introduce the representation of three hydrological parameters that we use in the two-dimensional flood model. a digital elevation model (dem) d is a w × l matrix representing the elevation of a terrain surface. a water level h is a w × l matrix representing the water elevation of the corresponding dem. a rainfall intensity i generally varies spatially and should be a matrix representing the rainfall intensity. however, the current simulator assumes that the rainfall does not vary spatially. in our case, i is a constant scalar. our work intends to find a model that can represent the flood process. the flood happens because the rain drives the water level to change on the terrain region. the model receives three inputs: a dem d, the water level h t and the rainfall intensity i t at the current time step t. the model outputs the water level h t+ as the result of the rainfall i t on dem d. the learning process can be formulated as learning the function l: physics-driven hydrology models for the flood in two dimensions are usually based on the two-dimensional shallow water equation, which is a simplified version of navier-stokes equations with averaged depth direction [ ] . by ignoring the diffusion of momentum due to viscosity, turbulence, wind effects and coriolis terms [ ] , the two-dimensional shallow water equations include two parts: conservation of mass and conservation of momentum shown in eqs. and , where h is the water depth, g is the gravity acceleration, (u, v) are the velocity at x, y direction, z(x, y) is the topography elevation function and s fx , s fy are the friction slopes [ ] which are estimated with friction coefficient η as for the two-dimensional shallow water equations, there are no analytical solutions. therefore, many numerical approximations are used. lisflood-fp is a simplified approximation of the shallow water equations, which reduces the computational cost by ignoring the convective acceleration term (the second and third terms of two equations in eq. ) and utilising an explicit finite difference numerical scheme. the lisflood-fp firstly calculate the flow between pixels with mass [ ] . for simplification, we use the d version of the equations in x-direction shown in eq. , the result of d can be directly transferable to d due to the uncoupled nature of those equations [ ] . then, for each pixel, its water level h is updated as eq. , to sum up, for each pixel at location i, j, the solution derived from lisflood-fp can be written in a format shown in eq. 
, where h t i,j is the water level at location i, j of time step t, or in general as h t+ = Θ (d, h t , i t ) . however, the numerical solution as Θ is computationally expensive including assumptions for the hydrology process in flood. there is an enormous demand for parameter tuning of the numerical solution Θ once with high-resolution two-dimensional water level measurements mentioned in [ ] . therefore, we use such numerical model to generate pairs of inputs and outputs for the surrogate model. we choose the lisflood-fp based opensource python library, landlab [ ] since it is a popular simulator in regional two-dimensional flood studies. landlab includes tools and process components that can be used to create hydrological models over a range of temporal and spatial scales. in landlab, the rainfall and friction coefficients are considered to be spatially constant and evaporation and infiltration are both temporally and spatially constant. the inputs of the landlab is a dem and a time series of rainfall intensity. the output is a times series of water level. we propose here that a neural network model can provide an alternative solution for such a complex hydrology dynamic process. neural networks are well known as a collection of nonlinear connected units, which is flexible enough to model the complex nonlinear mechanism behind [ ] . moreover, a neural network can be easily implemented on general purpose graphics processing units (gpus) to boost its speed. in the numerical solution of the shallow water equation shown in subsect. . , the two-dimensional spatial correlation is important to predict the water level in flood. therefore, inspired by the capacity to extract spatial correlation features of the neural network, we intend to investigate if a neural network model can learn the flood model l effectively and efficiently. we propose a small and flexible neural network architecture. in the numerical solution eq. , the water level for each pixel of the next time step is only correlated with surrounding pixels. therefore, we use, as input, a × sliding window on the dem with the corresponding water levels and rain at each time step t. the output is the corresponding × water level at the next time step t + . the pixels at the boundary have different hydrological dynamic processes. therefore, we pad both the water level and dem with zero values. we expect that the neural network model learns the different hydrological dynamic processes at boundaries. one advantage of our proposed architecture is that the neural network is not restricted by the input size of the terrain for both training and testing. therefore, it is a general model that can be used in any terrain size. figure illustrates the proposed architecture on a region with size × . in this section, we empirically evaluate the performance of the proposed model. in subsect. . , we describe how to generate synthetic dems. subsect. . presents the experimental setup to test our method on synthetic dems as a micro-evaluation. subsect. . presents the experimental setup on the case in onkaparinga catchment. subsect. . presents details of our proposed neural network. subsect. . shows the evaluation metrics of our proposed model. in order to generate synthetic dems, we modify alexandre delahaye's work . we arbitrarily set the size of the dems to × and its resolution to metres. we generate three types of dems in our data set that resembles real world terrains surface as shown in fig. 
a , namely, a river in a plain, a river with a mountain on one side and a plain on the other and a river in a valley with mountains on both sides. we evaluate the performance in two cases. in case , the network is trained and tested with one dem. this dem has a river in the valley with mountains on both sides, as shown in fig. a right. in case , the network is trained and tested with different synthetic dems. the data set is generated with landlab. for all the flood simulations in landlab, the boundary condition is set to be closed on four sides. this means that rainfall is the only source of water in the whole region. the roughness coefficient is set to be . . we control the initial process, rainfall intensity and duration time for each sample. the different initial process is to ensure different initial water level in the whole region. after the initial process, the system run for h with no rain for stabilisation. we run the simulation for h and record the water levels every min. therefore, for one sample, we record a total of time steps of water levels. table summarises the parameters for generating samples in both case and case . the onkaparinga catchment, located at lower onkaparinga river, south of adelaide, south australia, has experienced many notable floods, especially in and . many research and reports have been done in this region [ ] . we get two dem data with size × and × from the australia intergovernmental committee on surveying and mapping's elevation information system . figure b shows the dem of lower onkaparinga river. we implement the neural network model under three cases. in case , we train and test on × onkaparinga river dem. in case , we test × onkaparinga river dem directly with case trained model. in case , we test × onkaparinga river dem directly with case trained model. we generate the data set for both × and × dem from landlab. the initial process, rainfall intensity and rain duration time of both dem are controlled the same as in case . the architecture of the neural network model is visualized as in fig. . it firstly upsamples the rain input into × and concatenates it with × water level input. then, it is followed by several batch normalisation and convolutional layers. the activation functions are relu and all convolutional layers have the same size padding. the total parameters for the neural network are . the model is trained by adam with the learning rate as − . the batch size for training is . the data set has been split with ratio : : for training, validation and testing. the training epoch is for case and case and for case . we train the neural network model on a machine with a ghz amd ryzen tm - -core processor. it has a gb ddr memory and an nvidia gtx ti gpu card with cuda cores and gb memory. the operating system is ubuntu . os. in order to evaluate the performance of our neural network model, we use global measurements metrics for the overall flood in the whole region. these metrics are global mean squared error: case is to test the scalability of our model for the different size dem. in table b , for global performance, the mape of case is around % less than both case and case , and for local performance, the mape of case is . %. similarly, without retraining the existed model, the trained neural network from case can be applied directly on dem with different size with a good global performance. we present the time needed for the flood simulation of one sample in landlab and in our neural network model (without the training time) in table . 
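before turning to those timings, the per-window architecture and the mape metric described above can be made concrete with a short sketch. this is an illustrative reconstruction in pytorch, not the authors' implementation; the hidden-channel width, the exact number of layers and the helper names (windowsurrogate, mape) are assumptions made only for illustration.

```python
# Illustrative sketch (not the authors' code) of the 3x3 sliding-window
# surrogate described above: rain is upsampled to 3x3, concatenated with the
# DEM and current water-level windows, and passed through BatchNorm + Conv
# layers with ReLU and 'same' padding to predict the 3x3 water level at t+1.
import torch
import torch.nn as nn

class WindowSurrogate(nn.Module):
    def __init__(self, hidden_channels=8):          # width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(3),                       # channels: DEM, h_t, rain
            nn.Conv2d(3, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(hidden_channels),
            nn.Conv2d(hidden_channels, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, dem_win, h_win, rain):
        # dem_win, h_win: (batch, 3, 3); rain: (batch,) scalar intensity
        rain_win = rain.view(-1, 1, 1, 1).expand(-1, 1, 3, 3)   # "upsample" rain to 3x3
        x = torch.stack([dem_win, h_win], dim=1)                # (batch, 2, 3, 3)
        x = torch.cat([x, rain_win], dim=1)                     # (batch, 3, 3, 3)
        return self.net(x).squeeze(1)                           # predicted h_{t+1} window

def mape(pred, target, eps=1e-6):
    # global mean absolute percentage error over all pixels
    return (torch.abs(pred - target) / (torch.abs(target) + eps)).mean() * 100

# toy usage: 4 windows drawn from zero-padded DEM / water-level grids
model = WindowSurrogate()
dem = torch.rand(4, 3, 3)
h_t = torch.rand(4, 3, 3)
rain = torch.rand(4)
h_next = model(dem, h_t, rain)
print(h_next.shape, mape(h_next, h_t).item())
```

in an actual run, the windows would be extracted from the zero-padded dem and water-level grids and the network trained with adam on pairs generated by landlab, as described above.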
the average time of the neural network model for a × dem is around . s, while it takes s in landlab. furthermore, for a × dem, landlab takes more time than the neural network model. though the training of the neural network model is time consuming, it can be reused without further training or tuning terrains of different sizes and spatial characteristics. it remains effective and efficient (fig. ). we propose a neural network model, which is trained with pairs of inputs and outputs of an off-the-shelf numerical flood simulator, as an efficient and effective general surrogate model to the simulator. the trained network yields a mean absolute percentage error of around %. however, the trained network is at least times faster than the numerical simulator that is used to train it. moreover, it is able to simulate floods on terrains of different sizes and spatial characteristics not directly represented in the training. we are currently extending our work to take into account other meaningful environmental elements such as the land coverage, geology and weather. hec-ras river analysis system, user's manual, version the landlab v . overlandflow component: a python tool for computing shallow-water flow across watersheds improving the stability of a simple formulation of the shallow water equations for -d flood modeling a review of surrogate models and their application to groundwater modeling learning data-driven discretizations for partial differential equations a simple raster-based model for flood inundation simulation a simple inertial formulation of the shallow water equations for efficient two-dimensional flood inundation modelling rainfall-runoff modelling: the primer hec-ras river analysis system hydraulic userś manual numerical solution of the two-dimensional shallow water equations by the application of relaxation methods distributed physics informed neural network for data-efficient solution to partial differential equations integrating gis and hec-ras to model assiut plateau runoff flood hydrology processes and their variabilities application of surrogate artificial intelligent models for real-time flood routing extreme flood estimation-guesses at big floods? water down under : surface hydrology and water resources papers the data-driven approach as an operational real-time flood forecasting model analysis of flood causes and associated socio-economic damages in the hindukush region deep fluids: a generative network for parameterized fluid simulations fully convolutional networks for semantic segmentation optimisation of the twodimensional hydraulic model lisfood-fp for cpu architecture neural network modeling of hydrological systems: a review of implementation techniques physics informed data driven model for flood prediction: application of deep learning in prediction of urban flood development application of d numerical simulation for the analysis of the february bolivian amazonia flood: application of the new hec-ras version physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations storm water management model-user's manual v. . . 
us environmental protection agency hydrologic engineering center hydrologic modeling system, hec-hms: interior flood modeling decentralized flood forecasting using deep neural networks flood inundation modelling: a review of methods, recent advances and uncertainty analysis accelerating eulerian fluid simulation with convolutional networks comparison of the arma, arima, and the autoregressive artificial neural network models in forecasting the monthly inflow of dez dam reservoir lisflood: a gis-based distributed model for river basin scale water balance and flood simulation real-time waterlevel forecasting using dilated causal convolutional neural networks latent space physics: towards learning the temporal evolution of fluid flow in-situ water level measurement using nirimaging video camera acknowledgment. this work is supported by the national university of singapore institute for data science project watcha: water challenges analytics. abhishek saha is supported by national research foundation grant number nrf vsg-at dcm - . key: cord- -cxua o t authors: wang, rui; jin, yongsheng; li, feng title: a review of microblogging marketing based on the complex network theory date: - - journal: international conference in electrics, communication and automatic control proceedings doi: . / - - - - _ sha: doc_id: cord_uid: cxua o t microblogging marketing which is based on the online social network with both small-world and scale-free properties can be explained by the complex network theory. through systematically looking back at the complex network theory in different development stages, this chapter reviews literature from the microblogging marketing angle, then, extracts the analytical method and operational guide of microblogging marketing, finds the differences between microblog and other social network, and points out what the complex network theory cannot explain. in short, it provides a theoretical basis to effectively analyze microblogging marketing by the complex network theory. as a newly emerging marketing model, microblogging marketing has drawn the domestic academic interests in the recent years, but the relevant papers are scattered and inconvenient for a deep research. on the microblog, every id can be seen as a node, and the connection between the different nodes can be seen as an edge. these nodes, edges, and relationships inside form the social network on microblog which belongs to a typical complex network category. therefore, reviewing the literature from the microblogging marketing angle by the complex network theory can provide a systematic idea to the microblogging marketing research. in short, it provides a theoretical basis to effectively analyze microblogging marketing by the complex network theory. the start of the complex network theory dates from the birth of small-world and scale-free network model. these two models provide the network analysis tools and information dissemination interpretation to the microblogging marketing. "six degrees of separation" found by stanley milgram and other empirical studies show that the real network has a network structure of high clustering coefficient and short average path length [ ] . watts and strogatz creatively built the smallworld network model with this network structure (short for ws model), reflecting human interpersonal circle focus on acquaintances to form the high clustering coefficient, but little exchange with strangers to form the short average path length [ ] . 
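the two properties that define the ws model, high clustering coefficient and short average path length, can be checked directly with standard graph tools. the following is a minimal sketch, assuming the networkx library and arbitrary parameter values, added purely for illustration.

```python
# Minimal illustration of the WS small-world model: compared with a same-size
# random graph, a Watts-Strogatz network keeps a high clustering coefficient
# while its average path length stays short.
import networkx as nx

n, k, p = 1000, 10, 0.1                       # nodes, neighbours, rewiring prob (arbitrary)
ws = nx.watts_strogatz_graph(n, k, p, seed=1)
er = nx.gnm_random_graph(n, ws.number_of_edges(), seed=1)

for name, g in [("WS", ws), ("ER", er)]:
    if nx.is_connected(g):
        L = nx.average_shortest_path_length(g)
    else:
        giant = g.subgraph(max(nx.connected_components(g), key=len))
        L = nx.average_shortest_path_length(giant)
    print(name, "clustering=%.3f" % nx.average_clustering(g),
          "avg path length=%.2f" % L)
```

on typical runs the ws graph keeps a clustering coefficient far above that of the same-size random graph while its average path length remains comparably short.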
every id on microblog has strong ties with acquaintances and weak ties with strangers, which matches the ws model, but individuals can have a large number of weak ties on the internet, so the online microblog differs from the real network. barabási and albert built a model with a growth mechanism and a preferential connection mechanism to reflect that the real network has a degree distribution following the exponential distribution and a power law. because the power law has no characteristic scale in the degree distribution, this model is called the scale-free network model (short for ba model) [ ] . the exponential distribution shows that most nodes have low degree and weak impact while a few nodes have high degree and strong impact, confirming the "matthew effect" in sociology and matching the microblog structure in which celebrities have much greater influence than grassroots users, which the small-world model cannot describe. in brief, the complex network theory pioneered by the small-world and scale-free network models overcomes the constraints on network size and structure of the regular network and the random network, and describes the basic structural features of high clustering coefficient, short average path length, power-law degree distribution, and scale-free characteristics. the existing literature analyzing microblogging marketing by the complex network theory is scarce, which makes it worth further study. the complex network theory has evolved from the small-world and scale-free models to some major models such as the epidemic model and the game model. the study of diffusion behavior on these evolutionary complex network models is valuable and can reveal the spread of microblogging marketing concepts in depth. the epidemic model divides the crowd into three basic types: susceptible (s), infected (i), and removed (r), and builds models according to the relationships among the different types during disease spread in order to analyze the disease transmission rate, infection level, and infection threshold to control the disease. typical epidemic models are the sir model and the sis model. the difference lies in that the infected (i) in the sir model becomes removed (r) after recovery, so the sir model is used for immunizable diseases, while the infected (i) in the sis model has no immunity and only becomes susceptible (s) again after recovery; therefore, the sis model is used for unimmunizable diseases. these two models have been extended into other epidemic models: the sir model changes into the sirs model when the removed (r) can become susceptible (s) again; the sis model changes into the si model, describing a disease that breaks out in a short time, when the infected (i) is incurable. the epidemic model can be widely seen in complex networks, such as in the dissemination of computer viruses [ ] , information [ ] , and knowledge [ ] . guimerà et al. find hierarchical and community structure in the social network [ ] . due to the hierarchical structure, barthélemy et al. indicate that disease outbreaks follow hierarchical dissemination from the large-node-degree group to the small-node-degree group [ ] . due to the community structure, liu et al. indicate that the community structure has a lower threshold and a greater steady-state density of infection, and is in favor of the infection [ ] ; fu finds that the real interpersonal social network has a positive correlation of the node degree distribution, but the online interpersonal social network has a negative one [ ] .
the former expresses circles can be formed in celebrities except grassroots, but the latter expresses contacts can be formed in celebrities and grassroots on the microblog. the game theory combined with the complex network theory can explain the interpersonal microlevel interaction such as tweet release, reply, and retweet because it can analyze the complex dynamic process between individuals such as the game learning model, dynamic evolutionary game model, local interaction model, etc.( ) game learning model: individuals make the best decision by learning from others in the network. learning is a critical point to decision-making and game behavior, and equilibrium is the long-term process of seeking the optimal results by irrational individuals [ ] . bala and goyal draw the "neighbor effect" showing the optimal decision-making process based on the historical information from individuals and neighbors [ ] . ( ) dynamic evolutionary game model: the formation of the social network seems to be a dynamic outcome due to the strategic choice behavior between edge-breaking and edge-connecting based on the individual evolutionary game [ ] . fu et al. add reputation to the dynamic evolutionary game model and find individuals are more inclined to cooperate with reputable individuals in order to form a stable reputation-based network [ ] . ( ) local interaction model: local network information dissemination model based on the strong interactivity in local community is more practical to community microblogging marketing. li et al. restrain preferential connection mechanism in a local world and propose the local world evolutionary network model [ ] . burke et al. construct a local interaction model and find individual behavior presents the coexistence of local consistency and global decentrality [ ] . generally speaking, microblog has characteristics of the small-world, scale-free, high clustering coefficient, short average path length, hierarchical structure, community structure, and node degree distribution of positive and negative correlation. on one hand, the epidemic model offers the viral marketing principles to microblogging marketing, such as the sirs model can be used for the long-term brand strategy and the si model can be used for the short-term promotional activity; on the other hand, the game model tells microblogging marketing how to find opinion leaders in different social circles to develop strategies for the specific community to realize neighbor effect and local learning to form global microblog coordination interaction. rationally making use of these characteristics can preset effective strategies and solutions for microblogging marketing. the complex network theory is applied to biological, technological, economic, management, social, and many other fields by domestic scholars. zhou hui proves the spread of sars rumors has a typical small-world network features [ ] . duan wenqi studies new products synergy diffusion in the internet economy by the complex network theory to promote innovation diffusion [ ] . wanyangsong ( ) analyzes the dynamic network of banking crisis spread and proposes the interbank network immunization and optimization strategy [ ] . although papers explaining microblogging marketing by the complex network theory have not been found, these studies have provided the heuristic method, such as the study about the online community. based on fu's study on xiao nei sns network [ ] , hu haibo et al. 
carry out a case study on ruo lin sns network and conclude that the online interpersonal social network not only has almost the same network characteristics as the real interpersonal social network, but also has a negative correlation of the node degree distribution while the real interpersonal social network has positive. this is because the online interpersonal social network is more easier for strangers to establish relationships so that that small influence people can reach the big influence people and make weak ties in plenty through breaking the limited range of real world [ ] . these studies can be used to effectively develop marketing strategies and control the scope and effectiveness of microblogging marketing. there will be a great potential to research on the emerging microblog network platform by the complex network theory. the complex network theory describes micro and macro models analyzing the marketing process to microblogging marketing. the complex network characteristics of the small-world, scale-free, high clustering coefficient, short average path length, hierarchical structure, community structure, node degree distribution of positive and negative correlation and its application in various industries provide theoretical and practical methods to conduct and implement microblogging marketing. the basic research idea is: extract the network topology of microblog by the complex network theory; then, analyze the marketing processes and dissemination mechanism by the epidemic model, game model, or other models while taking into account the impact of macro and micro factors; finally, find out measures for improving or limiting the marketing effect in order to promote the beneficial activities and control the impedimental activities for enterprizes' microblogging marketing. because the macro and micro complexity and uncertainty of online interpersonal social network, the previous static and dynamic marketing theory cannot give a reasonable explanation. based on the strong ties and weak ties that lie in individuals of the complex network, goldenberg et al. find: ( ) after the external short-term promotion activity, strong ties and weak ties turn into the main force driving product diffusion; ( ) strong ties have strong local impact and weak transmission ability, while weak ties have strong transmission ability and weak local impact [ ] . therefore, the strong local impact of strong ties and strong transmission ability of weak ties are required to be rationally used for microblogging marketing. through system simulation and data mining, the complex network theory can provide explanation framework and mathematical tools to microblogging marketing as an operational guide. microblogging marketing is based on online interpersonal social network, having difference with the nonpersonal social network and real interpersonal social network. therefore, the corresponding study results cannot be simply mixed if involved with human factors. pastor-satorras et al. propose the target immunization solution to give protection priority to larger degree node according to sis scale-free network model [ ] . this suggests the importance of cooperation with the large influential ids as opinion leaders in microblogging marketing. remarkably, the large influential ids are usually considered as large followers' ids on the microblog platform that can be seen from the microblog database. 
the trouble is, as scarce resources, the large influential ids have a higher cooperative cost, but the large followers' ids are not all large influential ids due to the online public relations behaviors such as follower purchasing and watering. this problem is more complicated than simply the epidemic model. the complex network theory can be applied in behavior dynamics, risk control, organizational behavior, financial markets, information management, etc.. microblogging marketing can learn the analytical method and operational guide from these applications, but the complex network theory cannot solve all the problems of microblogging marketing, mainly: . the complexity and diversity of microblogging marketing process cannot completely be explained by the complex network theory. unlike the natural life-like virus, individuals on microblog are bounded rational, therefore, the decisionmaking processes are impacted by not only the neighbor effect and external environment but also by individuals' own values, social experience, and other subjective factors. this creates a unique automatic filtering mechanism of microblogging information dissemination: information recipients reply and retweet the tweet or establish and cancel contact only dependent on their interests, leading to the complexity and diversity. therefore, interaction-worthy topics are needed in microblogging marketing, and the effective followers' number and not the total followers' number of id is valuable. this cannot be seen in disease infection. . there are differences in network characteristics between microblog network and the real interpersonal social network. on one hand, the interpersonal social network is different from the natural social network in six points: ( ) social network has smaller network diameter and average path length; ( ) social network has higher clustering coefficient than the same-scale er random network; ( ) the degree distribution of social network has scale-free feature and follows power-law; ( ) interpersonal social network has positive correlation of node degree distribution but natural social network has negative; ( ) local clustering coefficient of the given node has negative correlation of the node degree in social network; ( ) social network often has clear community structure [ ] . therefore, the results of the natural social network are not all fit for the interpersonal social network. on the other hand, as the online interpersonal social network, microblog has negative correlation of the node degree distribution which is opposite to the real interpersonal social network. this means the results of the real interpersonal social network are not all fit for microblogging marketing. . there is still a conversion process from information dissemination to sales achievement in microblogging marketing. information dissemination on microblog can be explained by the complex network models such as the epidemic model, but the conversion process from information dissemination to sales achievement cannot be simply explained by the complex network theory, due to not only individual's external environment and neighborhood effect, but also consumer's psychology and willingness, payment capacity and convenience, etc.. according to the operational experience, conversion rate, retention rates, residence time, marketing topic design, target group selection, staged operation program, and other factors are needed to be analyzed by other theories. 
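to make the epidemic analogy discussed above concrete, the following minimal sketch simulates an sir-style diffusion on a ba scale-free graph used as a stand-in for a microblog following network. the library (networkx), the transmission and recovery probabilities and the network size are illustrative assumptions, not values from the reviewed studies.

```python
# Illustrative SIR spread (susceptible/infected/removed) on a BA scale-free
# graph, a rough stand-in for retweet-style diffusion on a microblog network.
# beta (per-contact transmission) and gamma (recovery) are arbitrary values.
import random
import networkx as nx

def sir_spread(g, beta=0.05, gamma=0.2, seeds=5, steps=50, rng=random.Random(0)):
    state = {v: "S" for v in g}
    for v in rng.sample(list(g), seeds):
        state[v] = "I"
    history = []
    for _ in range(steps):
        new_state = dict(state)
        for v in g:
            if state[v] == "I":
                for u in g.neighbors(v):
                    if state[u] == "S" and rng.random() < beta:
                        new_state[u] = "I"
                if rng.random() < gamma:
                    new_state[v] = "R"
        state = new_state
        history.append(sum(1 for s in state.values() if s != "S"))
    return history  # cumulative "ever reached" count per step

g = nx.barabasi_albert_graph(2000, 3, seed=1)
reached = sir_spread(g)
print("final fraction reached: %.2f" % (reached[-1] / g.number_of_nodes()))
```

sweeping beta in such a simulation is one simple way to observe the spreading-threshold behaviour discussed above.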
above all, microblogging marketing which attracts the booming social attention cannot be analyzed by regular research theories. however, the complex network theory can provide the analytical method and operational guide to microblogging marketing. it is believed that microblogging marketing on the complex network theory has a good study potential and prospect from both theoretical and practical point of view. the small world problem collective dynamics of 'small-world' networks emergence of scaling in random networks how viruses spread among computers and people information exchange and the robustness of organizational networks network structure and the diffusion of knowledge team assembly mechanisms determine collaboration network structure and team performance romualdo pastor-satorras, alessandro vespignani: velocity and hierarchical spread of epidemic outbreaks in scale-free networks epidemic spreading in community networks social dilemmas in an online social network: the structure and evolution of cooperation the theory of learning in games learning from neighbors a strategic model of social and economic networks reputation-based partner choice promotes cooperation in social networks a local-world evolving network model the emergence of local norms in networks research of the small-world character during rumor's propagation study on coordinated diffusion of new products in internet market doctoral dissertation of shanghai jiaotong university structural analysis of large online social network talk of the network: a complex systems look at the underlying process of word-of-mouth immunization of complex networks meeting strangers and friends of friends: how random are socially generated networks key: cord- -lg ha authors: kang, nan; zhang, xuesong; cheng, xinzhou; fang, bingyi; jiang, hong title: the realization path of network security technology under big data and cloud computing date: - - journal: signal and information processing, networking and computers doi: . / - - - - _ sha: doc_id: cord_uid: lg ha this paper studies the cloud and big data technology based on the characters of network security, including virus invasion, data storage, system vulnerabilities, network management etc. it analyzes some key network security problems in the current cloud and big data network. above all, this paper puts forward technical ways of achieving network security. cloud computing is a service that based on the increased usage and delivery of the internet related services, it promotes the rapidly development of the big data information processing technology, improves the processing and management abilities of big data information. with tie rapid development of computer technology, big data technology brings not only huge economic benefits, but the evolution of social productivity. however, serials of safety problems appeared. how to increase network security has been become the key point. this paper analyzes and discusses the technical ways of achieving network security. cloud computing is a kind of widely-used distributed computing technology [ ] [ ] [ ] . its basic concept is to automatically divide the huge computing processing program into numerous smaller subroutines through the network, and then hand the processing results back to the user after searching, calculating and analyzing by a large system of multiple servers [ ] [ ] [ ] . 
with this technology, web service providers can process tens of millions, if not billions, of pieces of information in a matter of seconds, reaching a network service as powerful as a supercomputer [ , ] . cloud computing is a resource delivery and usage model: resources (hardware, software) are obtained via the network. the network providing the resources is called the 'cloud'. the hardware resources in the 'cloud' appear infinitely scalable and can be used whenever needed [ ] [ ] [ ] . cloud computing is the product of the rapid development of computer science and technology. however, the computer network security problems that arise in the context of cloud computing bring a lot of trouble to people's life, work and study [ ] [ ] [ ] . therefore, scientific and effective management measures should be taken, in combination with the characteristics of cloud computing technology, to minimize computer network security risks and improve the stability and security of computer networks. this paper briefly introduces cloud computing, analyzes the computer network security problems under cloud computing, and expounds the network security protection measures under cloud computing. processing data by cloud computing can save energy expenditure and reduce the cost of handling big data, which promotes the healthy development of cloud computing technology. big data analysis by cloud computing can be represented by a directed acyclic data flow graph g = (v, e), where the cloud service modules in the parallel selection mechanism form the node set v = {i | i = 1, 2, ..., v} and the remote data transfer channels form the edge set e = {(i, j) | i, j ∈ v}. assuming the data transmission structure of the data flow model in the c/s framework is described by the directed graph model gp = (vp, ep, scap), ep represents the link set, vp the set of physical nodes bearing the cross channels, and scap the data-unit capacity of each physical node. besides, the undirected graph gs = (vs, es, sars) expresses the data packet markers input by the application. the link mapping between cloud computing components and the overall architecture can then be explained as follows: for different customer demands, an optimized resource-allocation model is built to obtain the application model for big data processing. the built-in network link structure for big data information processing is as follows: in fig. , the i-th transmission package in the cloud computer is denoted i_th, and t_i represents the transmission time of i_th. the interval in which a component is mapped to a thread or process is given by j_i = t_i − t_d; when j_i = t_i − t_d lies in the range (−∞, ∞), the weight of node i is w_i, which is its computing time. the detailed application model of big data information processing is shown in fig.
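the data-flow notation above is only partially recoverable from the source, so the following is a generic, hedged sketch of the underlying idea: a cloud application represented as a directed acyclic graph whose nodes carry computing-time weights w_i and whose edges carry transfer costs. the networkx library and all numeric values are assumptions for illustration only.

```python
# Generic illustration (assumed structure, not the paper's exact model) of a
# data-flow application as a directed acyclic graph: nodes carry a computing
# time w_i, edges carry a data-transfer cost, and a topological order gives a
# valid execution schedule for the component subroutines.
import networkx as nx

G = nx.DiGraph()
G.add_nodes_from([(1, {"w": 2.0}), (2, {"w": 1.5}), (3, {"w": 3.0}), (4, {"w": 0.5})])
G.add_edges_from([(1, 2, {"transfer": 0.4}),
                  (1, 3, {"transfer": 0.6}),
                  (2, 4, {"transfer": 0.2}),
                  (3, 4, {"transfer": 0.3})])

assert nx.is_directed_acyclic_graph(G)
order = list(nx.topological_sort(G))
total_compute = sum(G.nodes[v]["w"] for v in G)
total_transfer = sum(d["transfer"] for _, _, d in G.edges(data=True))
print("schedule:", order, "compute:", total_compute, "transfer:", total_transfer)
```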
in the mobile cloud system model, the grid architecture relies on local computing resources and the wireless network to build the cloud, and selects which components of the data flow graph are migrated to the cloud. for the formula modelling of computer data processing in cloud computing, {g(v, e), s_i, d_i, j} is the given data flow application; assuming that the channel capacity is infinite, the problem of using cloud computing technology to optimize big data information processing is described as a maximization over the assignment variables x_i and y_{i,j}; among them, the energy overhead of data flow migrating between groups in mobile cloud computing is modelled in a similar way. main characteristics of network security technology: in the context of big data and cloud computing, users can save the data in the cloud and then process and manage the data. compared with the original network technology, it carries certain data network risks, but its security coefficient is higher. cloud security technology can utilize modern network security technology to realize centralized upgrades and guarantee the overall security of big data. since the data is stored in the cloud, enhancing cloud management is the only way to ensure the security of the data. big data stored in the cloud usually affects network data. most enterprises will connect multiple servers so as to build computing terminals with strong performance. cloud computing itself is convenient: customers of its hardware facilities do not need to purchase additional services; they only need to purchase storage and computing services. due to its particularity, cloud computing can effectively reduce resource consumption and is also a new form of energy conservation and environmental protection. when local computers encounter risks, data stored in the cloud will not be affected, nor will it be lost, and at the same time these data can be shared. the sharing and transfer of raw data is generally based on physical connections, and then data transfer is implemented. compared with the original data research, data sharing in big data cloud computing can be realized by using the cloud. users can collect data with the help of various terminals, so as to have a strong data sharing function. most computer networks face risks from system vulnerabilities. criminals use illegal means to exploit system vulnerabilities to invade other systems. system vulnerabilities not only include the vulnerabilities of the computer network system itself; the computer system can also easily be affected by the user's downloading of unknown plug-ins, thus causing system vulnerability problems. with the continuous development of the network, virus forms are also diverse, but a virus mainly refers to a destructive program created by human factors. due to the diversity of viruses, the degree of impact also differs. customer information and files of enterprises can be stolen by viruses, resulting in huge economic losses, and some viruses are highly destructive and will not only damage the relevant customer data but also cause network system paralysis. in the context of big data cloud computing, external storage of the cloud computing platform can be realized through various distributed facilities. the service characteristic index of the system is mainly evaluated in terms of efficiency, security and stability. storage security plays a very important role in the computer network system. computer network systems are of different kinds, their storage is large, and the data has diversified characteristics.
traditional storage methods have been unable to meet the needs of social development, and simply optimizing data encryption methods cannot meet the demands of the network. the deployment of cloud computing therefore requires data storage with a certain stability and security, so as to avoid economic losses to the user. in order to ensure data security, it is necessary to strengthen computer network management. all computer managers and application personnel are the main body of computer network security management. if the network management personnel do not have a comprehensive understanding of their responsibilities and adopt an unreasonable management method, data leakage will occur. especially for enterprise, government and other information management, network security management is very important. in practice, many computer users do not pay enough attention to network security management, leading to the risk of computer intrusion and thus causing data exposure problems. ways to achieve network security: one of the main factors influencing a big data cloud storage system is data layout. exploring it at the present stage is usually combined with the characteristics of the data to implement a unified layout. management and preservation functions are carried out through data type distribution, and the data is encrypted. the original data is stored in more than one cloud, and different data management levels have different abilities to resist attacks. for cloud computing, data storage, transmission and sharing can all apply encryption technology. during data transmission, the party receiving the data can decrypt the encrypted data, so as to prevent the data from being damaged or stolen during the transmission. the intelligent firewall can identify the data through statistics, decision-making, memory and other means, and achieve the effect of access control. by using mathematical concepts, it can eliminate the large-scale computing methods applied in the matching verification process and realize the mining of the network's own characteristics, so as to achieve the effect of direct access control. intelligent firewall technology includes risk identification, data intrusion prevention and warnings about malicious actors. compared with the original firewall technology, the intelligent firewall technology can further prevent the network system from being damaged by human factors and improve the security of network data. system encryption technology is generally divided into public-key and private-key schemes, using encryption algorithms to prevent the system from being attacked. meanwhile, service operators should pay full attention to monitoring the network operation and improving the overall security of the network. in addition, users should improve their operational management of data. in the process of being attacked by viruses, static and dynamic technologies are used; dynamic technologies are efficient in operation and can support multiple types of resources. the security isolation system is usually called a virtualized distributed firewall (vdfw). it is made up of the security isolation system's centralized management center and security service virtual machines (svm). the main role of this system is to achieve network security. the key functions of the system are as follows. the access control function analyzes source/destination ip addresses, mac address, port and protocol, time, application characteristics, virtual machine object, user and other dimensions based on stateful inspection.
meanwhile, it supports many functions, including the access control policy grouping, search, conflict detection. intrusion prevention module judge the intrusion behavior by using protocol analysis and pattern recognition, statistical threshold and comprehensive technical means such as abnormal traffic monitoring. it can accurately block eleven categories of more than kinds of network attacks, including overflow attacks, rpc attack, webcgi attack, denial of service, trojans, worms, system vulnerabilities. moreover, it supports custom rules to detect and alert network attack traffic, abnormal messages in traffic, abnormal traffic, flood and other attacks. it can check and kill the trojan, worm, macro, script and other malicious codes contained in the email body/attachments, web pages and download files based on streaming and transparent proxy technology. it supports ftp, http, pop , smtp and other protocols. it identifies the traffic of various application layers, identify over protocols; its built-in thousands of application recognition feature library. this paper studies the cloud and big data technology. in the context of large data cloud computing, the computer network security problem is gradually a highlight, and in this case, the computer network operation condition should be combined with the modern network frame safety technology, so as to ensure the security of the network information, thus creating a safe network operation environment for users. application and operation of computer network security prevention under the background of big data era research on enterprise network information security technology system in the context of big data self-optimised coordinated traffic shifting scheme for lte cellular systems network security technology in big data environment data mining for base station evaluation in lte cellular systems user-vote assisted self-organizing load balancing for ofdma cellular systems discussion on network information security in the context of big data telecom big data based user offloading self-optimisation in heterogeneous relay cellular systems application of cloud computing technology in computer secure storage user perception aware telecom data mining and network management for lte/lte-advanced networks selfoptimised joint traffic offloading in heterogeneous cellular networks network information security control mechanism and evaluation system in the context of big data mobility load balancing aware radio resource allocation scheme for lte-advanced cellular networks wcdma data based lte site selection scheme in lte deployment key: cord- -l wo t authors: gao, chao; liu, jiming; zhong, ning title: network immunization and virus propagation in email networks: experimental evaluation and analysis date: - - journal: knowl inf syst doi: . /s - - - sha: doc_id: cord_uid: l wo t network immunization strategies have emerged as possible solutions to the challenges of virus propagation. in this paper, an existing interactive model is introduced and then improved in order to better characterize the way a virus spreads in email networks with different topologies. the model is used to demonstrate the effects of a number of key factors, notably nodes’ degree and betweenness. experiments are then performed to examine how the structure of a network and human dynamics affects virus propagation. the experimental results have revealed that a virus spreads in two distinct phases and shown that the most efficient immunization strategy is the node-betweenness strategy. 
moreover, those results have also explained why old virus can survive in networks nowadays from the aspects of human dynamics. the internet, the scientific collaboration network and the social network [ , ] . in these networks, nodes denote individuals (e.g. computers, web pages, email-boxes, people, or species) and edges represent the connections between individuals (e.g. network links, hyperlinks, relationships between two people or species) [ ] . there are many research topics related to network-like environments [ , , ] . one interesting and challenging subject is how to control virus propagation in physical networks (e.g. trojan viruses) and virtual networks (e.g. email worms) [ , , ] . currently, one of the most popular methods is network immunization where some nodes in a network are immunized (protected) so that they can not be infected by a virus or a worm. after immunizing the same percentages of nodes in a network, the best strategy can minimize the final number of infected nodes. valid propagation models can be used in complex networks to predict potential weaknesses of a global network infrastructure against worm attacks [ ] and help researchers understand the mechanisms of new virus attacks and/or new spreading. at the same time, reliable models provide test-beds for developing or evaluating new and/or improved security strategies for restraining virus propagation [ ] . researchers can use reliable models to design effective immunization strategies which can prevent and control virus propagation not only in computer networks (e.g. worms) but also in social networks (e.g. sars, h n , and rumors). today, more and more researchers from statistical physics, mathematics, computer science, and epidemiology are studying virus propagation and immunization strategies. for example, computer scientists focus on algorithms and the computational complexities of strategies, i.e. how to quickly search a short path from one "seed" node to a targeted node just based on local information, and then effectively and efficiently restrain virus propagation [ ] . epidemiologists focus on the combined effects of local clustering and global contacts on virus propagation [ ] . generally speaking, there are two major issues concerning virus propagation: . how to efficiently restrain virus propagation? . how to accurately model the process of virus propagation in complex networks? in order to solve these problems, the main work in this paper is to ( ) systematically compare and analyze representative network immunization strategies in an interactive email propagation model, ( ) uncover what the dominant factors are in virus propagation and immunization strategies, and ( ) improve the predictive accuracy of propagation models through using research from human dynamics. the remainder of this paper is organized as follows: sect. surveys some well-known network immunization strategies and existing propagation models. section presents the key research problems in this paper. section describes the experiments which are performed to compare different immunization strategies with the measurements of the immunization efficiency, the cost and the robustness in both synthetic networks (including a synthetic community-based network) and two real email networks (the enron and a university email network), and analyze the effects of network structures and human dynamics on virus propagation. section concludes the paper. 
in this section, several popular immunization strategies and typical propagation models are reviewed. an interactive email propagation model is then formulated in order to evaluate different immunization strategies and analyze the factors that influence virus propagation. network immunization is one of the well-known methods to effectively and efficiently restrain virus propagation. it cuts epidemic paths through immunizing (injecting vaccines or patching programs) a set of nodes from a network following some well-defined rules. the immunized nodes, in most published research, are all based on node degrees that reflect the importance of a node in a network, to a certain extent. in this paper, the influence of other properties of a node (i.e. betweenness) on immunization strategies will be observed. pastor-satorras and vespignani have studied the critical values in both random and targeted immunization [ ] . the random immunization strategy treats all nodes equally. in a largescale-free network, the immunization critical value is g c → . simulation results show that % of nodes need to be immunized in order to recover the epidemic threshold. dezso and barabasi have proposed a new immunization strategy, named as the targeted immunization [ ] , which takes the actual topology of a real-world network into consideration. the distributions of node degrees in scale-free networks are extremely heterogeneous. a few nodes have high degrees, while lots of nodes have low degrees. the targeted immunization strategy aims to immunize the most connected nodes in order to cut epidemic paths through which most susceptible nodes may be infected. for a ba network [ ] , the critical value of the targeted immunization strategy is g c ∼ e − mλ . this formula shows that it is always possible to obtain a small critical value g c even if the spreading rate λ changes drastically. however, one of the limitations of the targeted immunization strategy is that it needs to know the information of global topology, in particular the ranking of the nodes must be clearly defined. this is impractical and uneconomical for handling large-scale and dynamic-evolving networks, such as p p networks or email networks. in order to overcome this shortcoming, a local strategy, namely the acquaintance immunization [ , ] , has been developed. the motivation for the acquaintance immunization is to work without any global information. in this strategy, p % of nodes are first selected as "seeds" from a network, and then one or more of their direct acquaintances are immunized. because a node with higher degree has more links in a scale-free network, it will be selected as a "seed" with a higher probability. thus, the acquaintance immunization strategy is more efficient than the random immunization strategy, but less than the targeted immunization strategy. moreover, there is another issue which limits the effectiveness of the acquaintance immunization: it does not differentiate nodes, i.e. randomly selects "seed" nodes and their direct neighbors [ ] . another effective distributed strategy is the d-steps immunization [ , ] . this strategy views the decentralized immunization as a graph covering problem. that is, for a node v i , it looks for a node to be immunized that has the maximal degree within d steps of v i . this method only uses the local topological information within a certain range (e.g. the degree information of nodes within d steps). thus, the maximal acquaintance strategy can be seen as a -step immunization. 
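as a rough illustration of how the degree-based strategies summarized above differ in the nodes they protect, the following python sketch (using networkx) selects immunized sets under the random, targeted, acquaintance, and d-steps rules. the function names, the ba example graph, and the immunization budget are ours and not taken from the cited papers; it is only a sketch of the selection rules, not the authors' implementation.

    import random
    import networkx as nx

    def targeted_nodes(g, n_immune):
        # targeted immunization: protect the most connected nodes first
        return [v for v, _ in sorted(g.degree, key=lambda x: x[1], reverse=True)[:n_immune]]

    def random_nodes(g, n_immune):
        # random immunization: all nodes are treated equally
        return random.sample(list(g.nodes), n_immune)

    def acquaintance_nodes(g, n_immune):
        # pick random "seed" nodes and immunize one randomly chosen neighbour of each
        chosen, nodes = set(), list(g.nodes)
        while len(chosen) < n_immune:
            seed = random.choice(nodes)
            neighbours = list(g.neighbors(seed))
            if neighbours:
                chosen.add(random.choice(neighbours))
        return list(chosen)

    def d_steps_nodes(g, n_immune, d=2):
        # d-steps immunization: for each random seed, immunize the highest-degree
        # node within d steps; with a one-step radius this reduces to the maximal
        # acquaintance strategy mentioned above (assuming the elided value is one)
        chosen, nodes = set(), list(g.nodes)
        while len(chosen) < n_immune:
            seed = random.choice(nodes)
            ball = nx.single_source_shortest_path_length(g, seed, cutoff=d)
            chosen.add(max(ball, key=lambda v: g.degree[v]))
        return list(chosen)

    g = nx.barabasi_albert_graph(1000, 3)
    print(len(targeted_nodes(g, 50)), len(acquaintance_nodes(g, 50)), len(d_steps_nodes(g, 50)))

note that the d-steps rule above still leaves open how the radius d should be chosen, which is the limitation discussed next.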
however, it does not take into account domain-specific heuristic information, nor is it able to decide what the value of d should be in different networks. the immunization strategies described in the previous section are all based on node degrees. the way different immunized nodes are selected is illustrated in fig. an illustration of different strategies. the targeted immunization will directly select v as an immunized node based on the degrees of nodes. suppose that v is a "seed" node. v will be immunized based on the maximal acquaintance immunization strategy, and v will be indirectly selected as an immunized node based on the d-steps immunization strategy, where d = fig. an illustration of betweenness-based strategies. if we select one immunized node, the targeted immunization strategy will directly select the highest-degree node, v . the node-betweenness strategy will select v as it has the highest node betweenness. the edge-betweenness strategy will select one of v , v and v because the edges, l and l , have the highest edge betweenness the highest-degree nodes from a network, many approaches cut epidemic paths by means of increasing the average path length of a network, for example by partitioning large-scale networks based on betweenness [ , ] . for a network, node (edge) betweenness refers to the number of the shortest paths that pass through a node (edge). a higher value of betweenness means that the node (edge) links more adjacent communities and will be frequently used in network communications. although [ ] have analyzed the robustness of a network against degree-based and betweenness-based attacks, the spread of a virus in a propagation model is not considered, so the effects of different measurements on virus propagation is not clear. is it possible to restrain virus propagation, especially from one community to another, by immunizing nodes or edges which have higher betweenness. in this paper, two types of betweenness-based immunization strategies will be presented, i.e. the node-betweenness strategy and the edge-betweenness strategy. that is, the immunized nodes are selected in the descending order of node-and edge-betweenness, in an attempt to better understand the effects of the degree and betweenness centralities on virus propagation. figure shows that if v is immunized, the virus will not propagate from one part of the network to another. the node-betweenness strategy will select v as an immunized node, which has the highest node betweenness, i.e. . the edge-betweenness strategy will select the terminal nodes of l or l (i.e. v , v or v , v ) as they have the highest edge betweenness. as in the targeted immunization, the betweenness-based strategies also require information about the global betweenness of a network. the experiments presented in this paper is to find a new measurement that can be used to design a highly efficient immunization strategy. the efficiency of these strategies is compared both in synthetic networks and in real-world networks, such as the enron email network described by [ ] . in order to compare different immunization strategies, a propagation model is required to act as a test-bed in order to simulate virus propagation. currently, there are two typical models: ( ) the epidemic model based on population simulation and ( ) an interactive email model which utilizes individual-based simulation. 
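the betweenness-based selection introduced above can be sketched as follows (the propagation models used to evaluate it are discussed next). this is our own minimal networkx example rather than the authors' code; the edge-betweenness variant simply immunizes the endpoints of the highest-betweenness edges, as described in the text.

    import networkx as nx

    def node_betweenness_immunization(g, n_immune):
        # immunize nodes in descending order of node betweenness
        bc = nx.betweenness_centrality(g)
        return sorted(bc, key=bc.get, reverse=True)[:n_immune]

    def edge_betweenness_immunization(g, n_immune):
        # immunize the terminal nodes of the edges with the highest edge betweenness
        ebc = nx.edge_betweenness_centrality(g)
        chosen = []
        for (u, v), _ in sorted(ebc.items(), key=lambda x: x[1], reverse=True):
            for node in (u, v):
                if node not in chosen:
                    chosen.append(node)
                if len(chosen) == n_immune:
                    return chosen
        return chosen

    g = nx.karate_club_graph()
    print(node_betweenness_immunization(g, 2))
    print(edge_betweenness_immunization(g, 2))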
lloyd and may have proposed an epidemic propagation model to characterize virus propagation, a typical mathematical model based on differential equations [ ] . some specific epidemic models, such as si [ , ] , sir [ , ] , sis [ ] , and seir [ , ] , have been developed and applied in order to simulate virus propagation and study the dynamic characteristics of whole systems. however, these models are all based on the mean-filed theory, i.e. differential equations. this type of black-box modeling approach only provides a macroscopic understanding of virus propagation-they do not give much insight into microscopic interactive behavior. more importantly, some assumptions, such as a fully mixed (i.e. individuals that are connected with a susceptible individual will be randomly chosen from the whole population) [ ] and equiprobable contacts (i.e. all nodes transmit the disease with the same probability and no account is taken of the different connections between individuals) may not be valid in the real world. for example, in email networks and instant message (im) networks, communication and/or the spread of information tend to be strongly clustered in groups or communities that have more closer relationships rather than being equiprobable across the whole network. these models may also overestimate the speed of propagation [ ] . in order to overcome the above-mentioned shortcomings, [ ] have built an interactive email model to study worm propagation, in which viruses are triggered by human behavior, not by contact probabilities. that is to say, the node will be infected only if a user has checked his/her email-box and clicked an email with a virus attachment. thus, virus propagation in the email network is mainly determined by two behavioral factors: email-checking time intervals (t i ) and email-clicking probabilities (p i ), where i ∈ [ , n ] , n is the total number of users in a network. t i is determined by a user's own habits; p i is determined both by user security awareness and the efficiency of the firewall. however, the authors do not provide much information about how to restrain worm propagation. in this paper, an interactive email model is used as a test-bed to study the characteristics of virus propagation and the efficiency of different immunization strategies. it is readily to observe the microscopic process of worm propagating through this model, and uncover the effects of different factors (e.g. the power-law exponent, human dynamics and the average path length of the network) on virus propagation and immunization strategies. unlike other models, this paper mainly focuses on comparing the performance of degree-based strategies and betweenness-based strategies, replacing the critical value of epidemics in a network. a detailed analysis of the propagation model is given in the following section. an email network can be viewed as a typical social network in which a connection between two nodes (individuals) indicates that they have communicated with each other before [ , ] . generally speaking, a network can be denoted as e = v, l , where v = {v , v , . . . , v n } is a set of nodes and l = { v i , v j | ≤ i, j ≤ n} is a set of undirected links (if v i in the hit-list of v j , there is a link between v i and v j ). a virus can propagate along links and infect more nodes in a network. in order to give a general definition, each node is represented as a tuple . -id: the node identifier, v i .i d = i. 
-state: the node state: i f the node has no virus, danger = , i f the node has virus but not in f ected, in f ected = , i f the node has been in f ected, immuni zed = , i f the node has been immuni zed. -nodelink: the information about its hit-list or adjacent neighbors, i.e. v i .n odelink = { i, j | i, j ∈ l}. -p behavior : the probability that a node will to perform a particular behavior. -b action : different behaviors. -virusnum: the total number of new unchecked viruses before the next operation. -newvirus: the number of new viruses a node receives from its neighbors at each step. in addition, two interactive behaviors are simulated according to [ ] , i.e. the emailchecking time intervals and the email-clicking probabilities both follow gaussian distributions, if the sample size goes to infinity. for the same user i, the email-checking interval t i (t) in [ ] has been modeled by a poisson distribution, i.e. t i (t) ∼ λe −λt . thus, the formula for p behavior in the tuple can be written as p behavior = click prob and p behavior = checkt ime. -clickprob is the probability of an user clicking a suspected email, -checkrate is the probability of an user checking an email, -checktime is the next time the email-box will be checked, v i .p behavior = v i .checkt ime = ex pgenerator(v i .check rate). b action can be specified as b action = receive_email, b action = send_email, and b action = update_email. if a user receives a virus-infected email, the corresponding node will update its state, i.e. v i .state ← danger. if a user opens an email that has a virus-infected attachment, the node will adjust its state, i.e. v i .state ← in f ected, and send this virus email to all its friends, according to its hit-list. if a user is immunized, the node will update its state to v i .state ← immuni zed. in order to better characterize virus propagation, some assumptions are made in the interactive email model: -if a user opens an infected email, the node is infected and will send viruses to all the friends on its hit-list; -when checking his/her mailbox, if a user does not click virus emails, it is assumed that the user deletes the suspected emails; -if nodes are immunized, they will never send virus emails even if a user clicks an attachment. the most important measurement of the effectiveness of an immunization strategy is the total number of infected nodes after virus propagation. the best strategy can effectively restrain virus propagation, i.e. the total number of infected nodes is kept to a minimum. in order to evaluate the efficiency of different immunization strategies and find the relationship between local behaviors and global dynamics, two statistics are of particular interest: . sid: the sum of the degrees of immunized nodes that reflects the importance of nodes in a network . apl: the average path length of a network. this is a measurement of the connectivity and transmission capacity of a network where d i j is the shortest path between i and j. if there is no path between i and j, d i j → ∞. in order to facilitate the computation, the reciprocal of d i j is used to reflect the connectivity of a network: if there is no path between i and j, d − i j = . based on these definitions, the interactive email model given in sect. . can be used as a test-bed to compare different immunization strategies and uncover the effects of different factors on virus propagation. the specific research questions addressed in this paper can be summarized as follows: . 
how to evaluate network immunization strategies? how to determine the performance of a particular strategy, i.e. in terms of its efficiency, cost and robustness? what is the best immunization strategy? what are the key factors that affect the efficiency of a strategy? . what is the process of virus propagation? what effect does the network structure have on virus propagation? . what effect do human dynamics have on virus propagation? the simulations in this paper have two phases. first, a existing email network is established in which each node has some of the interactive behaviors described in sect. . . next, the virus propagation in the network is observed and the epidemic dynamics are studied when applying different immunization strategies. more details can be found in sect. . in this section, the simulation process and the structures of experimental networks are presented in sects. . and . . section . uses a number of experiments to evaluate the performance (e.g. efficiency, cost and robustness) of different immunization strategies. specifically, the experiments seek to address whether or not betweenness-based immunization strategies can restrain worm propagation in email networks, and which measurements can reflect and/or characterize the efficiency of immunization strategies. finally, sects. . and . presents an in-depth analysis in order to determine the effect of network structures and human dynamics on virus propagation. the experimental process is illustrated in fig. . some nodes are first immunized (protected) from the network using different strategies. the viruses are then injected into the network in order to evaluate the efficiency of those strategies by comparing the total number of infected nodes. two methods are used to select the initial infected nodes: random infection and malicious infection, i.e. infecting the nodes with maximal degrees. the user behavior parameters are based on the definitions in sect. . , where μ p = . , σ p = . , μ t = , and σ t = . since the process of email worm propagation is stochastic, all results are averaged over runs. the virus propagation algorithm is specified in alg. . many common networks have presented the phenomenon of scale-free [ , ] , where nodes' degrees follow a power-law distribution [ ] , i.e. the fraction of nodes having k edges, p(k), decays according to a power law p(k) ∼ k −α (where α is usually between and ) [ ] . recent research has shown that email networks also follow power-law distributions with a long tail [ , ] . therefore, in this paper, three synthetic power-law networks and a synthetic community-based network, generated using the glp algorithm [ ] where the power can be tuned. the three synthetic networks all have nodes with α = . , . , and . , respectively. the statistical characteristics and visualization of the synthetic community-based network are shown in table and fig. c , f, respectively. in order to reflect the characteristics of a real-world network, the enron email network which is built by andrew fiore and jeff heer, and the university email network which is complied by the members of the university rovira i virgili (tarragona) will also be studied. the structure and degree distributions of these networks are shown in table and fig. . in particular, the cumulative distributions are estimated with maximum likelihood using the method provided by [ ] . the degree statistics are shown in table . in this section, a comparison is made of the effectiveness of different strategies in an interactive email model. 
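since the glp generator used for the synthetic networks is not described here, the sketch below builds comparable power-law networks with a tunable exponent using a configuration-model stand-in; the exponents, network size, and degree cut-offs are illustrative placeholders rather than the paper's (elided) settings.

    import random
    import networkx as nx

    def power_law_degree_sequence(n, alpha, k_min=2, k_max=200):
        # rough inverse-transform sampling of integer degrees with p(k) ~ k^(-alpha)
        seq = []
        while len(seq) < n:
            u = random.random()
            k = int(k_min * (1 - u) ** (-1.0 / (alpha - 1.0)))
            if k <= k_max:
                seq.append(k)
        if sum(seq) % 2:          # the configuration model needs an even degree sum
            seq[0] += 1
        return seq

    def synthetic_power_law_network(n, alpha):
        g = nx.configuration_model(power_law_degree_sequence(n, alpha))
        g = nx.Graph(g)           # drop parallel edges
        g.remove_edges_from(nx.selfloop_edges(g))
        return g

    for alpha in (2.1, 2.5, 2.9):  # illustrative exponents only
        g = synthetic_power_law_network(2000, alpha)
        degrees = [d for _, d in g.degree]
        print(alpha, g.number_of_nodes(), g.number_of_edges(), max(degrees))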
experiments are then used to evaluate the cost and robustness of each strategy. input: nodedata[nodenum] stores the topology of an email network. timestep is the system clock. v is the set of initially infected nodes. output: simnum[timestep] [k] stores the number of infected nodes in the network in the k th simulation. ( ) for k= to runtime //we run times to obtain an average value ( ) nodedata[nodenum] ← initializing an email network as well as users' checking time and clicking probability; ( ) nodedata[nodenum] ← choosing immunized nodes based on different immunization strategies and adjusting their states; ( ) while timestep < endsimul //there are steps at each time ( ) for i= to nodenum ( ) if nodedata[i].checktime== ( ) prob← computing the probability of opening a virus-infected email based on user's clickprob and virusnum ( ) if send a virus to all friends according to its hit-list ( ) endif ( ) endif ( ) endfor ( ) for i= to nodenum ( ) update the next checktime based on user's checkrate ( ) nodedata the immunization efficiency of the following immunization strategies are compared: the targeted and random strategies [ ] , the acquaintance strategy (random and maximal neighbor) [ , ] , the d-steps strategy (d = and d = ) [ , ] (which is introduced in sect. . ), the bridges between different communities: the whole network: α= . , k = . and the proposed betweenness-based strategy (node-and edge-betweenness). in the initial set of experiments, the proportion of immunized nodes ( , , and %) are varied in the synthetic networks and the enron email network. table shows the simulation results in the enron email network which is initialized with two infected nodes. figure shows the average numbers of infected nodes over time. tables , , and show the numerical results in three synthetic networks, respectively. the simulation results show that the node-betweenness immunization strategy yields the best results (i.e. the minimum number of infected nodes, f) except for the case where % of the nodes in the enron network are immunized under a malicious attack. the average degree of the enron network is k = . . this means that only a few nodes have high degrees, others have low degrees (see table ). in such a network, if nodes with maximal degrees are infected, viruses will rapidly spread in the network and the final number of infected nodes will be larger than in other cases. the targeted strategy therefore does not perform any better than the node-betweenness strategy. in fact, as the number of immunized nodes increases, the efficiency of the node-betweenness immunization increases proportionally there are two infected nodes with different attack modes. if there is no immunization, the final number of infected nodes is with a random attack and with a malicious attack, and ap l = . ( − ). the total simulation time t = more than the targeted strategy. therefore, if global topological information is available, the node-betweenness immunization is the best strategy. the maximal s i d is obtained using the targeted immunization. however, the final number of infected nodes (f) is consistent with the average path length (ap l) but not with the s i d. that is to say, controlling a virus epidemic does not depend on the degrees of immunized nodes but on the path length of a whole network. this also explains why the efficiency of the node-betweenness immunization strategy is better than that of the targeted immunization strategy. 
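the algorithm listed above is easiest to read as runnable code. the sketch below is a self-contained approximation of it: the behaviour parameters, the gaussian draws for checking times, and the rule that the probability of opening at least one virus email is 1 − (1 − clickprob)^virusnum are all assumptions on our part, since the paper only states that this probability depends on clickprob and virusnum.

    import random

    def simulate_email_worm(adj, seeds, immunized=(), steps=200,
                            mu_p=0.5, sigma_p=0.15, mu_t=12, sigma_t=3):
        # adj: dict node -> list of neighbours (hit-lists); seeds: initially infected nodes
        # states: 0 healthy, 1 danger (unchecked virus email), 2 infected, 3 immunized
        # mu_p/sigma_p and mu_t/sigma_t stand in for the (elided) behaviour parameters
        state = {v: 0 for v in adj}
        for v in immunized:
            state[v] = 3
        click_prob = {v: min(max(random.gauss(mu_p, sigma_p), 0.0), 1.0) for v in adj}
        next_check = {v: max(1, int(random.gauss(mu_t, sigma_t))) for v in adj}
        mailbox = {v: 0 for v in adj}

        def deliver(sender):
            # an infected, non-immunized node sends the virus to every friend on its hit-list
            for w in adj[sender]:
                if state[w] in (0, 1):
                    mailbox[w] += 1
                    state[w] = 1

        for s in seeds:
            if state[s] != 3:
                state[s] = 2
                deliver(s)

        for t in range(1, steps + 1):
            for v in adj:
                if t < next_check[v]:
                    continue
                if state[v] == 1 and mailbox[v] > 0:
                    # chance that at least one waiting virus email is clicked (assumption)
                    if random.random() < 1 - (1 - click_prob[v]) ** mailbox[v]:
                        state[v] = 2
                        deliver(v)
                    else:
                        state[v] = 0      # the user deletes the suspected emails
                mailbox[v] = 0
                next_check[v] = t + max(1, int(random.gauss(mu_t, sigma_t)))
        return sum(1 for v in adj if state[v] == 2)

    adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
    print(simulate_email_worm(adj, seeds=[0], immunized=[1]))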
the node-betweenness immunization selects nodes based on the average path length, while the targeted immunization strategy selects based on the size of degrees. a more in-depth analysis is undertaken by comparing the change of the ap l with respect to the different strategies used in the synthetic networks. the results are shown in fig. . figure a , b compare the change of the final number of infected nodes over time, which correspond to fig. c , d, respectively. these numerical results validate the previous assertion that the average path length can be used as a measurement to design an effective immunization strategy. the best strategy is to divide the whole network into different sub-networks and increase the average path length of a network, hence cut the epidemic paths. in this paper, all comparative results are the average over runs using the same infection model (i.e. the virus propagation is compared for both random and malicious attacks) and user behavior model (i.e. all simulations use the same behavior parameters, as shown in sect. . ). thus, it is more reasonable and feasible to just evaluate how the propagation of a virus is affected by immunization strategies, i.e. avoiding the effects caused by the stochastic process, the infection model and the user behavior. it can be seen that the edge-betweenness strategy is able to find some nodes with high degrees of centrality and then integrally divide a network into a number of sub-networks (e.g. v in fig. ) . however, compared with the nodes (e.g. v in fig. ) selected by the node-betweenness strategy, the nodes with higher edge betweenness can not cut the epidemic paths as they can not effectively break the whole structure of a network. in fig. , the synthetic community-based network and the university email network are used as examples to illustrate why the edge-betweenness strategy can not obtain the same immunization efficiency as the node-betweenness strategy. to select two nodes as immunized nodes from fig. , the node-betweenness immunization will select {v , v } by using the descending order of node betweenness. however, the edge-betweenness strategy can select {v , v } or {v , v } because the edges, l and l , have the highest edge betweenness. this result shows that the node-betweenness strategy can not only effectively divide the whole network into two communities, but also break the interior structure of communities. although the edgebetweenness strategy can integrally divided the whole network into two parts, viruses can also propagate in each community. many networks commonly contain the structure shown in fig. , for example, the enron email network and university email networks. table and fig. present the results of the synthetic community-based network. table compares different strategies in the university email network, which also has some self-similar community structures [ ] . these results further validate the analysis stated above. from the above experiments, the following conclusions can be made: tables - , ap l can be used as a measurement to evaluate the efficiency of an immunization strategy. thus, when designing a distributed immunization strategy, attentions should be paid on those nodes that have the largest impact on the apl value. . if the final number of infected nodes is used as a measure of efficiency, then the nodebetweenness immunization strategy is more efficient than the targeted immunization strategy. . 
the power-law exponent (α) affects the edge-betweenness immunization strategy, but has a little impact on other strategies. in the previous section, the efficiency of different immunization strategies is evaluated in terms of the final number of infected nodes when the propagation reaches an equilibrium state. by doing experiments in synthetic networks, synthetic community-based network, the enron email network and the university email network, it is easily to find that the node-betweenness immunization strategy has the highest efficiency. in this section, the performance of the different strategies will be evaluated in terms of cost and robustness, as in [ ] . it is well known that the structure of a social network or an email network constantly evolves. it is therefore interesting to evaluate how changes in structure affect the efficiency of an immunization strategy. -the cost can be defined as the number of nodes that need to be immunized in order to achieve a given level of epidemic prevalence ρ. generally, ρ → . there are some parameters which are of particular interest: f is the fraction of nodes that are immunized; f c is the critical value of the immunization when ρ → ; ρ is the infection density when no immunization strategy is implemented; ρ f is the infection density with a given immunization strategy. figure shows the relationship between the reduced prevalence ρ f /ρ and f. it can be seen that the node-betweenness immunization has the lowest prevalence for the smallest number of protected nodes. the immunization cost increases as the value of α increases, i.e. in order to achieve epidemic prevalence ρ → , the node-betweenness immunization strategy needs , , and % of nodes to be immunized, respectively, in the three synthetic networks. this is because the node-betweenness immunization strategy can effectively break the network structure and increase the path length of a network with the same number of immunized nodes. -the robustness shows a plot of tolerance against the dynamic evolution of a network, i.e. the change of power-law exponents (α). figure shows the relationship between the immunized threshold f c and α. a low level of f c with a small variation indicates that the immunization strategy is robust. the robustness is important when an immunization strategy is deployed into a scalable and dynamic network (e.g. p p and email networks). figure also shows the robustness of the d-steps immunization strategy is close to that of the targeted immunization; the node-betweenness strategy is the most robust. [ ] have compared virus propagation in synthetic networks with α = . and α = . , and pointed out that initial worm propagation has two phases. however, they do not give a detailed explanation of these results nor do they compare the effect of the power-law exponent on different immunization strategies during virus propagation. table presents the detailed degree statistics for different networks, which can be used to examine the effect of the power-law exponent on virus propagation and immunization strategies. first, virus propagation in non-immunized networks is discussed. figure a shows the changes of the average number of infected nodes over time; fig. b gives the average degree of infected nodes at each time step. from the results, it can be seen that . the number of infected nodes in non-immunized networks is determined by attack modes but not the power-law exponent. in figs. a , b, three distribution curves (α = . , . , and . 
) overlap with each other in both random and malicious attacks. the difference between them is that the final number of infected nodes with a malicious attack is larger than that with a random attack, as shown in fig. a , reflecting the fact that a malicious attack is more dangerous than a random attack. . a virus spreads more quickly in a network with a large power-law exponent than that with a small exponent. because a malicious attack initially infects highly connected nodes, the average degree of the infected nodes decreases in a shorter time comparing to a random attack (t < t ). moreover, the speed and range of the infection is amplified by those highly connected nodes. in phase i, viruses propagate very quickly and infect most nodes in a network. however, in phase ii, the number of total infected nodes grows slowly (fig. a) , because viruses aim to infect those nodes with low degrees (fig. b) , and a node with fewer links is more difficult to be infected. in order to observe the effect of different immunization strategies on the average degree of infected nodes in different networks, % of the nodes are initially protected against random and malicious attacks. figure shows the simulation results. from this experiment, it can be concluded that . the random immunization has no effect on restraining virus propagation because the curves of the average degree of the infected nodes are basically coincident with the curves in the non-immunization case. . comparing fig. a , b, c and d, e, f, respectively, it can be seen that the peak value of the average degree is the largest in the network with α= . and the smallest in the network with α= . . this is because the network with a lower exponent has more highly connected nodes (i.e. the range of degrees is between and ), which serve as amplifiers in the process of virus propagation. . as α increases, so does the number of infected nodes and the virus propagation duration (t < t < t ). because a larger α implies a larger ap l , the number of infected nodes will increase; if the network has a larger exponent, a virus need more time to infect those nodes with medium or low degrees. fig. the average number of infected nodes and the average degree of infected nodes, with respect to time when virus spreading in different networks. we apply the targeted immunization to protect % nodes in the network first, consider the process of virus propagation in the case of a malicious attack where % of the nodes are immunized using the edge-betweenness immunization strategy. there are two intersections in fig. a . point a is the intersection of two curves net and net , and point b is the intersection of net and net . under the same conditions, fig. a shows that the total number of infected nodes is the largest in net in phase i. corresponding to fig. b , the average degree of infected nodes in net is the largest in phase i. as time goes on, the rate at which the average degree falls is the fastest in net , as shown in fig. b . this is because there are more highly connected nodes in net than in the others (see table ). after these highly connected nodes are infected, viruses attempt to infect the nodes with low degrees. therefore, the average degree in net that has the smallest power-law exponent is larger than those in phases ii and iii. the total number of infected nodes in net continuously increases, exceeding those in net and net . the same phenomenon also appears in the targeted immunization strategy, as shown in fig. . 
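the analyses above repeatedly fall back on apl, in its reciprocal form, as the quantity that a good immunization strategy should degrade. a minimal way to compute it, and to compare targeted and node-betweenness immunization by that yardstick, is sketched below; treating immunized nodes as removed from the graph is our simplification, not something stated in the paper.

    import itertools
    import networkx as nx

    def reciprocal_apl(g):
        # mean of 1/d_ij over all node pairs, unreachable pairs contributing 0;
        # larger values mean higher connectivity, so immunization should lower it
        nodes = list(g.nodes)
        lengths = dict(nx.all_pairs_shortest_path_length(g))
        total, pairs = 0.0, 0
        for i, j in itertools.combinations(nodes, 2):
            pairs += 1
            if j in lengths[i]:
                total += 1.0 / lengths[i][j]
        return total / pairs if pairs else 0.0

    g = nx.barabasi_albert_graph(300, 2)
    top_degree = [v for v, _ in sorted(g.degree, key=lambda x: x[1], reverse=True)[:15]]
    bc = nx.betweenness_centrality(g)
    top_betweenness = sorted(bc, key=bc.get, reverse=True)[:15]

    print("no immunization:", reciprocal_apl(g))
    for name, protected in (("targeted", top_degree), ("node-betweenness", top_betweenness)):
        h = g.copy()
        h.remove_nodes_from(protected)
        print(name, reciprocal_apl(h))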
the email-checking intervals in the above interactive email model (see sect. . ) is modeled using a poisson process. the poisson distribution is widely used in many real-world models to statistically describe human activities, e.g. in terms of statistical regularities on the frequency of certain events within a period of time [ , ] . statistics from user log files to databases that record the information about human activities, show that most observations on human behavior deviate from a poisson process. that is to say, when a person engages in certain activities, his waiting intervals follow a power-law distribution with a long tail [ , ] . vazquez et al. [ ] have tried to incorporate an email-sending interval distribution, characterized by a power-law distribution, into a virus propagation model. however, their model assumes that a user is instantly infected after he/she receives a virus email, and ignores the impact of anti-virus software and the security awareness of users. therefore, there are some gaps between their model and the real world. in this section, the statistical properties associated with a single user sending emails is analyzed based on the enron dataset [ ] . the virus spreading process is then simulated using an improved interactive email model in order to observe the effect of human behavior on virus propagation. research results from the study of statistical regularities or laws of human behavior based on empirical data can offer a valuable perspective to social scientists [ , ] . previous studies have also used models to characterize the behavioral features of sending emails [ , , ] , but their correctness needs to be further empirically verified, especially in view of the fact that there exist variations among different types of users. in this paper, the enron email dataset is used to identify the characteristics of human email-handling behavior. due to the limited space, table presents only a small amount of the employee data contained in the database. as can be seen from the table, the interval distribution of email sent by the same user is respectively measured using different granularities: day, hour, and minute. figure shows that the waiting intervals follow a heavy-tailed distribution. the power-law exponent as the day granularity is not accurate because there are only a few data points. if more data points are added, a power-law distribution with long tail will emerge. note that, there is a peak at t = as measured at an hour granularity. eckmann et al. [ ] have explained that the peak in a university dataset is the interval between the time people leave work and the time they return to their offices. after curve fitting, see fig. , the waiting interval exponent is close to . , i.e. α ≈ . ± . . although it has been shown that an email-sending distribution follows a power-law by studying users in the enron dataset, it is still not possible to assert that all users' waiting intervals follow a power-law distribution. it can only be stated that the distribution of waiting intervals has a long-tail characteristic. it is also not possible to measure the intervals between email checking since there is no information about login time in the enron dataset. however, combing research results from human web browsing behavior [ ] and the effect of non-poisson activities on propagation in the barabasi group [ ] , it can be found that there are similarities between the distributions of email-checking intervals and email-sending intervals. 
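the contrast between poisson and heavy-tailed checking behaviour can be reproduced with two small samplers. the exponent and rate below are placeholders (the fitted values are elided above), and the pareto-style inverse-transform draw is only one convenient way to obtain a power-law tail.

    import random

    def power_law_interval(alpha=2.5, t_min=1.0):
        # inverse-transform sampling of a pareto-type waiting time p(t) ~ t^(-alpha), t >= t_min
        u = random.random()
        return t_min * (1.0 - u) ** (-1.0 / (alpha - 1.0))

    def poisson_interval(rate=0.2):
        # exponential waiting time, i.e. the interval distribution of a poisson process
        return random.expovariate(rate)

    heavy = [power_law_interval() for _ in range(100000)]
    expo = [poisson_interval() for _ in range(100000)]
    print("power-law mean/max:", sum(heavy) / len(heavy), max(heavy))
    print("poisson   mean/max:", sum(expo) / len(expo), max(expo))

the occasional very long gaps produced by the heavy-tailed sampler are exactly the periods of inactivity during which a virus can stay latent in a mailbox.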
the following section uses a power-law distribution to characterize the behavior associated with email-checking in order to observe the effect human behavior has on the propagation of an email virus. based on the above discussions, a power-law distribution is used to model the email-checking intervals of a user i, instead of the poisson distribution used in [ ] , i.e. t i (τ ) ∼ τ −α . an analysis of the distribution of the power-law exponent (α) for different individuals in web browsing [ ] and in the enron dataset shows that the power-law exponent is approximately . . in order to observe and quantitatively analyze the effect that the email-checking interval has on virus propagation, the email-clicking probability distribution (p i ) in our model is consistent with the one used by [ ] , i.e. the security awareness of different users in the network follows a normal distribution, p i ∼ n ( . , . ). figure shows that following a random attack viruses quickly propagate in the enron network if the email-checking intervals follow a power-law distribution. the results are consistent with the observed trends in real computer networks [ ] , i.e. viruses initially spread explosively, then enter a long latency period before becoming active again following user activity. the explanation for this is that users frequently have a short period of focused activity followed by a long period of inactivity. thus, although old viruses may be killed by anti-virus software, they can still intermittently break out in a network. that is because some viruses are hidden by inactive users, and cannot be found by anti-virus software. when the inactive users become active, the virus will start to spread again. the effect of human dynamics on virus propagation in three synthetic networks is also analyzed by applying the targeted [ ] , d-steps [ ] and aoc-based strategy [ ] . the numerical results are shown in table. and fig. . from the above experiments, the following conclusions can be made: . based on the enron email dataset and recent research on human dynamics, the emailchecking intervals in an interactive email model should be assigned based on a power-law distribution. . viruses can spread very quickly in a network if users' email-checking intervals follow a power-law distribution. in such a situation, viruses grow explosively at the initial stage and then grow slowly. the viruses remain in a latent state and await being activated by users. in this paper, a simulation model for studying the process of virus propagation has been described, and the efficiency of various existing immunization strategies has been compared. in particular, two new betweenness-based immunization strategies have been presented and validated in an interactive propagation model, which incorporates two human behaviors based on [ ] in order to make the model more practical. this simulation-based work can be regarded as a contribution to the understanding of the inter-reactions between a network structure and local/global dynamics. the main results are concluded as follows: . some experiments are used to systematically compare different immunization strategies for restraining epidemic spreading, in synthetic scale-free networks including the community-based network and two real email networks. the simulation results have shown that the key factor that affects the efficiency of immunization strategies is apl, rather than the sum of the degrees of immunized nodes (sid). 
that is to say, immunization strategy should protect nodes with higher connectivity and transmission capability, rather than those with higher degrees. . some performance metrics are used to further evaluate the efficiency of different strategies, i.e. in terms of their cost and robustness. simulation results have shown that the d-steps immunization is a feasible strategy in the case of limited resources and the nodebetweenness immunization is the best if the global topological information is available. . the effects of power-law exponents and human dynamics on virus propagation are analyzed. more in-depth experiments have shown that viruses spread faster in a network with a large power-law exponent than that with a small one. especially, the results have explained why some old viruses can still propagate in networks up till now from the perspective of human dynamics. the mathematical theory of infectious diseases and its applications emergence of scaling in random networks the origin of bursts and heavy tails in human dynamics cluster ranking with an application to mining mailbox networks small worlds' and the evolution of virulence: infection occurs locally and at a distance on distinguishing between internet power law topology generators power-law distribution in empirical data efficient immunization strategies for computer networks and populations halting viruses in scale-free networks dynamics of information access on the web a simple model for complex dynamical transitions in epidemics distance-d covering problem in scalefree networks with degree correlation entropy of dialogues creates coherent structure in email traffic epidemic threshold in structured scale-free networks on power-law relationships of the internet topology improving immunization strategies immunization of real complex communication networks self-similar community structure in a network of human interactions attack vulnerability of complex networks targeted local immunization in scale-free peer-to-peer networks the large scale organization of metabolic networks probing human response times periodic subgraph mining in dynamic networks. 
knowledge and information systems autonomy-oriented search in dynamic community networks: a case study in decentralized network immunization characterizing web usage regularities with information foraging agents how viruses spread among computers and people on universality in human correspondence activity enhanced: simple rules with complex dynamics network motifs simple building blocks of complex networks epidemics and percolation in small-world network code-red: a case study on the spread and victims of an internet worm the structure of scientific collaboration networks the spread of epidemic disease on networks the structure and function of complex networks email networks and the spread of computer viruses partitioning large networks without breaking communities epidemic spreading in scale-free networks epidemic dynamics and endemic states in complex networks immunization of complex networks computer virus propagation models the enron email dataset database schema and brief statistical report exploring complex networks modeling bursts and heavy tails in human dynamics impact of non-poissonian activity patterns on spreading process predicting the behavior of techno-social systems a decentralized search engine for dynamic web communities a twenty-first century science an environment for controlled worm replication and analysis modeling and simulation study of the propagation and defense of internet e-mail worms chao gao is currently a phd student in the international wic institute, college of computer science and technology, beijing university of technology. he has been an exchange student in the department of computer science, hong kong baptist university. his main research interests include web intelligence (wi), autonomy-oriented computing (aoc), complex networks analysis, and network security. department at hong kong baptist university. he was a professor and the director of school of computer science at university of windsor, canada. his current research interests include: autonomy-oriented computing (aoc), web intelligence (wi), and self-organizing systems and complex networks, with applications to: (i) characterizing working mechanisms that lead to emergent behavior in natural and artificial complex systems (e.g., phenomena in web science, and the dynamics of social networks and neural systems), and (ii) developing solutions to large-scale, distributed computational problems (e.g., distributed scalable scientific or social computing, and collective intelligence). prof. liu has contributed to the scientific literature in those areas, including over journal and conference papers, and authored research monographs, e.g., autonomy-oriented computing: from problem solving to complex systems modeling (kluwer academic/springer) and spatial reasoning and planning: geometry, mechanism, and motion (springer). prof. liu has served as the editor-in-chief of web intelligence and agent systems, an associate editor of ieee transactions on knowledge and data engineering, ieee transactions on systems, man, and cybernetics-part b, and computational intelligence, and a member of the editorial board of several other international journals. laboratory and is a professor in the department of systems and information engineering at maebashi institute of technology, japan. he is also an adjunct professor in the international wic institute. 
he has conducted research in the areas of knowledge discovery and data mining, rough sets and granular-soft computing, web intelligence (wi), intelligent agents, brain informatics, and knowledge information systems, with more than journal and conference publications and books. he is the editor-in-chief of web intelligence and agent systems and annual review of intelligent informatics, an associate editor of ieee transactions on knowledge and data engineering, data engineering, and knowledge and information systems, a member of the editorial board of transactions on rough sets. key: cord- - clslvqb authors: wang, xiaoqi; yang, yaning; liao, xiangke; li, lenli; li, fei; peng, shaoliang title: selfrl: two-level self-supervised transformer representation learning for link prediction of heterogeneous biomedical networks date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: clslvqb predicting potential links in heterogeneous biomedical networks (hbns) can greatly benefit various important biomedical problem. however, the self-supervised representation learning for link prediction in hbns has been slightly explored in previous researches. therefore, this study proposes a two-level self-supervised representation learning, namely selfrl, for link prediction in heterogeneous biomedical networks. the meta path detection-based self-supervised learning task is proposed to learn representation vectors that can capture the global-level structure and semantic feature in hbns. the vertex entity mask-based self-supervised learning mechanism is designed to enhance local association of vertices. finally, the representations from two tasks are concatenated to generate high-quality representation vectors. the results of link prediction on six datasets show selfrl outperforms state-of-the-art methods. in particular, selfrl reveals great performance with results close to in terms of auc and aupr on the neodti-net dataset. in addition, the pubmed publications demonstrate that nine out of ten drugs screened by selfrl can inhibit the cytokine storm in covid- patients. in summary, selfrl provides a general frame-work that develops self-supervised learning tasks with unlabeled data to obtain promising representations for improving link prediction. in recent decades, networks have been widely used to represent biomedical entities (as nodes) and their relations (as edges). predicting potential links in heterogeneous biomedical networks (hbns) can be beneficial to various significant biology and medicine problems, such as target identification, drug repositioning, and adverse drug reaction predictions. for example, network-based drug repositioning methods have already offered promising insights to boost the effective treatment of covid- disease (zeng et al. ; xiaoqi et al. ) , since it outbreak in december of . many network-based learning approaches have been developed to facilitate link prediction in hbns. in particularly, network representation learning methods, that aim at converting high-dimensionality networks into a low-dimensional space while maximally preserve structural * to whom correspondence should be addressed: fei li (pitta-cus@gmail.com) , and shaoliang peng (slpeng@hnu.edu.cn). properties (cui et al. ) , have provided effective and potential paradigms for link prediction li et al. ) . nevertheless, most of the network representation learning-based link prediction approaches heavily depend on a large amount of labeled data. 
the requirement of large-scale labeled data may not be met in many real link prediction for biomedical networks (su et al. ) . to address this issue, many studies have focused on developing unsupervised representation learning algorithms that use the network structure and vertex attributes to learn low-dimension vectors of nodes in networks (yuxiao et al. ) , such as grarep (cao, lu, and xu ) , tadw (cheng et al. ) , line (tang et al. ) , and struc vec (ribeiro, saverese, and figueiredo ) . however, these network presentation learning approaches are aimed at homogeneous network, and cannot applied directly to the hbns. therefore, a growth number of studies have integrated meta paths, which are able to capture topological structure feature and relevant semantic, to develop representation learning approaches for heterogeneous information networks. dong et al. used meta path based random walk and then leveraged a skip-gram model to learn node representation (dong, chawla, and swami ). shi et al. proposed a fusion approach to integrate different representations based on different meta paths into a single representation (shi et al. ). ji et al. developed the attention-based meta path fusion for heterogeneous information network embedding (ji, shi, and wang ) . wang et al. proposed a meta path-driven deep representation learning for a heterogeneous drug network (xiaoqi et al. ) . unfortunately, most of the meta path-based network representation approaches focused on preserving vertex-level information by formalizing meta paths and then leveraging a word embedding model to learn node representation. therefore, the global-level structure and semantic information among vertices in heterogeneous networks is hard to be fully modeled. in addition, these representation approaches is not specially designed for link prediction, thus resulting in learning an inexplicit representation for link prediction. on the other hand, self-supervised learning, which is a form of unsupervised learning, has been receiving more and more attention. self-supervised representation learning for-mulates some pretext tasks using only unlabeled data to learn representation vector without any manual annotations (xiao et al. ) . self-supervised representation learning technologies have been widely use for various domains, such as natural language processing, computer vision, and image processing. however, very few approaches have been generalized for hbns because the structure and semantic information of heterogeneous networks is significantly differ between domains, and the model trained on a pretext task may be unsuitable for link prediction tasks. based on the above analysis, there are two main problems in link prediction based on network representation learning. the first one is how to design a self-supervised representation learning approach based on a great amount of unlabeled data to learn low-dimension vectors that integrate the differentview structure and semantic information of hbns. the second one is how to ensure the pretext tasks in self-supervised representation learning be beneficial for link prediction of hbns. in order to overcome the mentioned issues, this study proposes a two-level self-supervised representation learning (selfrl) for link prediction in heterogeneous biomedical networks. first, a meta path detection self-supervised learning mechanism is developed to train a deep transformer encoder for learning low-dimensional representations that capture the path-level information on hbns. 
meanwhile, sel-frl integrates the vertex entity mask task to learn local association of vertices in hbns. finally, the representations from the entity mask and meta path detection is concatenated for generating the embedding vectors of nodes in hbns. the results of link prediction on six datasets show that the proposed selfrl is superior to state-of-the-art methods. in summary, the contributions of the paper are listed below: • we proposed a two-level self-supervised representation learning method for hbns, where this study integrates the meta path detection and vertex entity mask selfsupervised learning task based on a great number of unlabeled data to learn high quality representation vector of vertices. • the meta path detection self-supervised learning task is developed to capture the global-level structure and semantic feature of hbns. meanwhile, vertex entity-masked model is designed to learn local association of nodes. therefore, the representation vectors of selfrl integrate two-level structure and semantic feature of hbns. • the meta path detection task is specifically designed for link prediction. the experimental results indicate that selfrl outperforms state-of-the-art methods on six datasets. in particular, selfrl reveals great performance with results close to in terms of auc and aupr on the neodti-net dataset. heterogeneous biomedical network a heterogeneous biomedical network is defined as g = (v, e) where v denotes a biomedical entity set, and e rep-resents a biomedical link set. in a heterogeneous biomedical network, using a mapping function of vertex type φ(v) : v → a and a mapping function of relation type ψ(e) : e → r to associate each vertex v and each edge e, respectively. a and r denote the sets of the entity and relation types, where |a| + |r| > . for a given heterogeneous network g = (v, e), the network schema t g can be defined as a directed graph defined over object types a and link types r, that is, t g = (a, r). the schema of a heterogeneous biomedical network expresses all allowable relation types between different type of vertices, as shown in figure . figure : schema of the heterogeneous biomedical network that includes four types of vertices (i.e., drug, protein, disease, and side-effect). network representation learning plays a significant role in various network analysis tasks, such as community detection, link prediction, and node classification. therefore, network representation learning has been receiving more and more attention during recent decades. network representation learning aims at learning low-dimensional representations of network vertices, such that proximities between them in the original space are preserved (cui et al. ). the network representation learning approaches can be roughly categorized into three groups: matrix factorizationbased network representation learning approaches, random walk-based network representation learning approaches, and neural network-based network representation learning approaches (yue et al. ). the matrix factorization-based network representation learning methods extract an adjacency matrix, and factorize it to obtain the representation vectors of vertices, such as, laplacian eigenmaps (belkin and niyogi ) and the locally linear embedding methods (roweis and saul ) . the traditional matrix factorization has many variants that often focus on factorizing the high-order data matrix, such as, grarep (cao, lu, and xu ) and hope (ou et al. ) . inspired by the word vec (mikolov et al. ) ... 
), node vec (grover and leskovec ) , and metap-ath vec/metapath vec++ (dong, chawla, and swami ), in which a network is transformed into node sequences. these models were later extended by struc vec (ribeiro, saverese, and figueiredo ) for the purpose of better modeling the structural identity. over the past years, neural network models have been widely used in various domains, and they have also been applied to the network representation learning areas. in neural network-based network representation learning, different methods adopt different learning architectures and various network information as input. for example, the line (tang et al. ) aims at embedding by preserving both local and global network structure properties. the sdne (wang, cui, and zhu ) and dngr (cao ) were developed using deep autoencoder architecture. the graphgan (wang et al. ) adopts generative adversarial networks to model the connectivity of nodes. predicting potential links in hbns can greatly benefit various important biomedical problems. this study proposes selfrl that is a two-level self-supervised representation learning algorithm, to improve the quality of link prediction. the flowchart of the proposed selfrl is shown in figure . considering meta path reflecting heterogeneous characteristics and rich semantics, selfrl first uses a random walk strategy guided by meta-paths to generate node sequences that are treated as the true paths of hbns. meanwhile, an equal number of false paths is produced by randomly replacing some of the nodes in each of true path. then, based on the true paths, this work proposes vertex entity masked as self-supervised learning task to train deep transformer encoder for learning entity-level representations. in addition, a meta path detection-based self-supervised learning task based on all true and false paths is designed to train a deep transformer encoder for learning path-level representation vectors. finally, the representations obtained from the twolevel self-supervised learning task are concatenated to generate the embedding vectors of vertices in hbns, and then are used for link prediction. true path generation a meta-path is a composite relation denoting a sequence of adjacent links between nodes a and a i in a heterogeneous network, and can be expressed in the where r i represents a schema between two objects. different adjacent links indicate distinct semantics. in this study, all the meta paths are reversible, and no longer than four nodes. this is based on the results of the previous studies that meta paths longer than four nodes may be too long to contribute to the informative feature (fu et al. ). in addition, sun et al. have suggested that short meta paths are good enough, and that long meta paths may even reduce the quality of semantic meanings (sun et al. ) . in this work, each network vertex and meta path are regarded as vocabulary and sentence, respectively. indeed, a large percentage of meta paths are biased to highly visible objects. therefore, three key steps are defined to keep a balance between different semantic types of meta paths, and they are as follows: ( ) generate all sequences according to meta paths whose positive and reverse directional sampling probabilities are the same and equal to . . ( ) count the total number of meta paths of each type, and calculate their median value (n ); ( ) randomly select n paths if the total number of meta paths of each type is larger than n ; otherwise, select all sequences. 
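a minimal sketch of the true-path generation just described, combining a meta path guided walk with the median-based balancing rule, is given below; the toy graph, identifiers, and path counts are invented, and the paper's equal-probability forward/reverse sampling is simplified away.

    import random
    from statistics import median

    def metapath_walk(nbrs_by_type, start, metapath, length):
        # nbrs_by_type[(v, t)]: neighbours of v whose type is t; metapath is a list of
        # node types whose first and last elements coincide, e.g. ["drug", "protein", "drug"]
        pattern = metapath[1:]
        walk, v, step = [start], start, 0
        while len(walk) < length:
            candidates = nbrs_by_type.get((v, pattern[step % len(pattern)]), [])
            if not candidates:
                break
            v = random.choice(candidates)
            walk.append(v)
            step += 1
        return walk

    def balanced_true_paths(paths_by_type):
        # keep at most the median number of paths per meta path type
        n = int(median(len(p) for p in paths_by_type.values()))
        balanced = []
        for paths in paths_by_type.values():
            balanced.extend(random.sample(paths, n) if len(paths) > n else paths)
        return balanced

    # toy heterogeneous graph; all identifiers are invented for illustration
    nbrs = {("d1", "protein"): ["p1"], ("d2", "protein"): ["p1"],
            ("p1", "drug"): ["d1", "d2"], ("p1", "disease"): ["e1"],
            ("e1", "protein"): ["p1"]}
    paths_by_type = {
        "d-p-d": [metapath_walk(nbrs, "d1", ["drug", "protein", "drug"], 3) for _ in range(40)],
        "d-p-e": [metapath_walk(nbrs, "d1", ["drug", "protein", "disease"], 3) for _ in range(10)],
    }
    print(len(balanced_true_paths(paths_by_type)))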
the selected paths are able to reflect topological structures and interaction mechanisms between vertices in hbns, and will be used to design selfsupervised learning task to learn low-dimensional representations of network vertices. false path generation the paths selected using the above procedure are treated as the true paths in hbns. the equal number of false paths are produced by randomly replacing some nodes in each of the true paths. in other words, each true path corresponds to a false path. there is no relation between the permutation nodes and context in false paths, and the number of replaced nodes is less than the length of the current path. for instance, a true path (i.e., d p d e ) is shown in figure (b). during the generation procedure of false paths, the st and rd tokens that correspond to d and d , respectively, are randomly chosen, and two nodes from the hbns which correspond to d and d , respectively, are also randomly chosen. if there is a relationship between d and p , node d is replaced with p . if there is a relationship between d and p , another node from the network is chosen until the mentioned conditions are satisfied. similarly, node d is replaced with d , because there are no relations between d and e (or p ). finally, the path (i.e., d p d e ) is treated as a false path. meta path detection in general language understanding evaluation, the corpus of linguistic acceptability (cola) is a binary classification task, where the goal is to predict whether a sentence is linguistically acceptable or not ). in addition, perozzi et al. have suggested that paths generated by short random walks can be regarded as short sentences (perozzi, alrfou, and skiena ) . inspired by their work, this study assumes that true paths can be treated as linguistically acceptable sentences, while the false paths can be regarded as linguistically unacceptable sentences. based on this hypothesis, we proposes the meta path detection task where the goal is to predict whether a path is acceptable or not. in the proposed selfrl, a set of true and false paths is fed into the deep transformer encoder for learning path-level representation vector. selfrl maps a path of symbol representations to the output vector of continuous representations that is fed into the softmax function to predict whether a path is a true or false path. apparently, the only distinction between true and false paths is whether there is an association between nodes of path sequence. therefore, the meta path detection task is the extension of the link prediction to a certain extent. especially, when a path includes only two nodes, the meta path detection is equal to the link prediction. for instance, judging whether a path is a true or false path, e.g., d s in figure b , is the same as predicting whether there is a relation between d and s . however, the meta path detection task is generally more difficult compared to link prediction, because it requires the understanding of long-range composite relationships between vertices of hbns. therefore, the meta path detection-based self-supervised learning task encourages to capturing high-level structure and semantic information in hbns, thus facilitating the performance of link prediction. in order to capture the local information on hbns, this study develops the vertex entity mask-based self-supervised learning task, where nodes in true paths are randomly masked, and then predicting those masked nodes. 
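before turning to the details of the masking task, here is a small sketch of the false-path generation just described. the adjacency structure, the rejection loop, and the cap on retries are illustrative choices, not taken from the paper.

```python
import random

def make_false_path(true_path, adj, max_tries=100):
    """corrupt a true path into a false path: replace a random, proper subset
    of its nodes with nodes that have no link to the remaining context."""
    nodes = list(adj)
    n_replace = random.randint(1, len(true_path) - 1)   # fewer than the path length
    positions = set(random.sample(range(len(true_path)), n_replace))
    false_path = list(true_path)
    context = [v for i, v in enumerate(true_path) if i not in positions]
    for pos in positions:
        for _ in range(max_tries):
            candidate = random.choice(nodes)
            # accept only if the candidate is unrelated to every context node
            if candidate not in context and all(candidate not in adj[c] for c in context):
                false_path[pos] = candidate
                break
    return false_path

# one false path per true path, as in the procedure described above
# false_paths = [make_false_path(p, adj) for p in true_paths]
```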
the vertex entity mask task has been widely applied in natural language processing. however, using the vertex entity mask task to drive a heterogeneous biomedical network representation model is a much less explored direction. in this work, the vertex entity mask task follows the implementation described in bert, and the implementation is almost identical to the original (devlin et al. ). in brief, % of the vertex entities in true paths are randomly chosen for prediction. for each selected vertex entity, there are three different operations for improving the model generalization performance. the selected vertex entity is replaced with the <mask> token % of the time, and is replaced with a random node % of the time. furthermore, there is a % chance of keeping the original vertex. finally, the masked path is used for training a deep transformer encoder model according to the vertex entity mask task, where the last hidden vectors corresponding to the masked vertex entities are fed into the softmax function to predict their original vertices with a cross-entropy loss. the vertex entity mask task can keep a local contextual representation of every vertex. the vertex entity mask-based self-supervised learning task captures the local association of the vertices in hbns. the meta path detection-based self-supervised learning task enhances the global-level structure and semantic features of the hbns. therefore, the two-level representations are concatenated as the final embedding vectors that integrate structure and semantics information on hbns from different levels, as shown in figure (f).
the model of selfrl is a deep transformer encoder, and the implementation is almost identical to the original (vaswani et al. ). selfrl follows the overall architecture that includes the stacked self-attention and point-wise fully connected layers, and a softmax function, as shown in the figure.
multi-head attention
an attention function can be described as mapping a query vector and a set of key-value pairs to an output vector. the multi-head attention concatenates the attention functions computed in several subspaces, multihead(q, k, v) = concat(h_1, ..., h_h) w_o, where w_o is a parameter matrix and h_i is the attention function of the i-th subspace, given as h_i = softmax(q_i k_i^t / sqrt(d_k)) v_i. here q_i = q w_i^q, k_i = k w_i^k, and v_i = v w_i^v respectively denote the query, key, and value representations of the i-th subspace, the w_i are parameter matrices that transform q, k, and v into the i-th subspace, and d and d_k represent the dimensionality of the model and of each subspace.
table: the node and edge statistics of the datasets. here, ddi, dti, dsa, dda, pda, and ppi represent the drug-drug interaction, drug-target interaction, drug-side-effect association, drug-disease association, protein-disease association, and protein-protein interaction, respectively.
position-wise feed-forward network
in addition to the multi-head attention layers, the proposed selfrl model includes a fully connected feed-forward network, which consists of two linear transformations with a relu activation in between: ffn(x) = max(0, x w_1 + b_1) w_2 + b_2. the same linear transformations are applied at all positions, while they use different parameters from layer to layer.
residual connection and layer normalization
for each sub-layer, a residual connection and a normalization mechanism are employed. that is, the output of each sub-layer is layernorm(x + f(x)), where x and f(x) stand for the input and the transformation implemented by the sub-layer, respectively. in this work, the performance of selfrl is evaluated comprehensively by link prediction on six datasets.
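as a concrete illustration of the masking procedure, here is a minimal sketch. the exact proportions are elided in the text above; because the authors state the implementation is almost identical to bert, the sketch assumes bert's usual 15%/80%/10%/10% split, which is an assumption rather than a figure taken from this paper.

```python
import random

MASK = "<mask>"

def mask_vertices(path, vocabulary, select_prob=0.15):
    """bert-style vertex-entity masking of one true path.  returns the
    corrupted path and a dict position -> original vertex to be predicted."""
    corrupted, targets = list(path), {}
    for i, v in enumerate(path):
        if random.random() >= select_prob:
            continue
        targets[i] = v                                # the encoder must recover this vertex
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK                       # replace with the mask token
        elif r < 0.9:
            corrupted[i] = random.choice(vocabulary)  # replace with a random vertex
        # otherwise: keep the original vertex unchanged
    return corrupted, targets
```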
the results of selfrl is also compared with the results of methods. for neodti-net datasets, the performance of selfrl is compared with those of seven state-of-the-art methods, including mscmf ( . the details on how to set the hyperparameters in above baseline approaches can be found in neodti (wan et al. ) . for deepdr-net datasets, the link prediction results generated by selfrl are compared with that of seven baseline algorithms, including deepdr (zeng et al. ) , dtinet (luo et al. ) , kernelized bayesian matrix factorization (kbmf) (gonen and kaski ) , support vector machine (svm) (cortes and vapnik ) , random forest (rf) (l ), random walk with restart (rwr) (cao et al. ) , and katz (singhblom et al. ) . the details of the baseline approaches and hyperparameters selection can be seen in deepdr (zeng et al. ) . for single network datasets, selfrl is compared with network representation methods, that is laplacian (belkin and niyogi ) , singular value decomposition (svd), graph factorization (gf) (ahmed et al. ) , hope (ou et al. ) , grarep (cao, lu, and xu ) , deepwalk (perozzi, alrfou, and skiena ) , node vec (grover and leskovec ) , struc vec (ribeiro, saverese, and figueiredo ) , line (tang et al. ) , sdne (wang, cui, and zhu ) , and gae (kipf and welling ) . more implementation details can be found in bionev (yue et al. ) . the hyperparameters selection of baseline methods were set to default values, and the original data of neodti (wan et al. ) , deepdr (zeng et al. ) , and bionev (yue et al. ) were used in the experiments. the parameters of the proposed selfrl follows those of the bert (devlin et al. ) which the number of transformer blocks (l), the number of self-attention heads (a), and the hidden size (h) is set to , , and , respectively. for the neodti-net dataset, the embedding vectors are fed into the inductive matrix completion model (imc) (jain and dhillon ) to predict dti. the number of negative samples that are randomly chosen from negative pairs, is ten times that of positive samples according to the guidelines in neodti (wan et al. ) . then, to reduce the data bias, the ten-fold cross-validation is performed repeatedly ten times, and the average value is calculated. for the deepdr-net dataset, a collective variational autoencoder (cvae) is used to predict dda. all positive samples and the same number of negative samples that is randomly selected from unknown pairs are used to train and test the model according to the guidelines in deepdr (zeng et al. ) . then, five-fold crossvalidation is performed repeatedly times. for neodti-net and deepdr-net datasets, the area under precision recall (aupr) curve and the area under receiver operating characteristic (auc) curve are adopted to evaluate the link prediction performance generated by all approaches. for other datasets, the representation vectors are fed into the logistic regression binary classifier for link prediction, the training set ( %) and the testing set ( %) consisted of the equal number of positive samples and negative samples that is randomly selected from all the unknown interactions according to the guidelines in bionev. the performance of different methods is evaluated by accuracy (acc), auc, and f score. the overall performances of all methods for dti prediction on the neodti-net dataset are presented in figure . selfrl shows great results with the auc and aupr value close to , and significantly outperformed the baseline methods. 
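for the single-network datasets, the evaluation protocol described above (node embeddings combined into edge features, a logistic-regression classifier, and acc/auc/f1 on a held-out split) could be sketched as follows. the hadamard product for edge features and the 80/20 split are illustrative assumptions, since those details are elided in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.model_selection import train_test_split

def evaluate_link_prediction(embeddings, pos_pairs, neg_pairs, test_size=0.2, seed=0):
    """embeddings: dict node -> vector; pos/neg_pairs: lists of (u, v) pairs."""
    def edge_features(pairs):
        return np.array([embeddings[u] * embeddings[v] for u, v in pairs])  # hadamard product
    X = np.vstack([edge_features(pos_pairs), edge_features(neg_pairs)])
    y = np.concatenate([np.ones(len(pos_pairs)), np.zeros(len(neg_pairs))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    prob = clf.predict_proba(X_te)[:, 1]
    pred = (prob >= 0.5).astype(int)
    return {"acc": accuracy_score(y_te, pred),
            "auc": roc_auc_score(y_te, prob),
            "f1": f1_score(y_te, pred)}
```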
in particular, neodti and dtinet were specially developed for the neodti-net dataset. however, selfrl is still superior to both neodti and dtinet, improving the aupr by approximately % and %, respectively. the results of dda prediction of selfrl and the baseline methods are presented in the figure. these experimental results demonstrate that selfrl generates better dda prediction results on the deepdr-net dataset than the baseline methods. however, selfrl achieves improvements of less than % in terms of auc and aupr. a major reason for such a modest advantage of selfrl over the other methods is that selfrl considers only four types of objects and edges, whereas deepdr included types of vertices and types of edges of drug-related data. in addition, deepdr specially integrated a multi-modal deep autoencoder (mda) and a cvae model to improve dda prediction on the deepdr-net dataset. unfortunately, the selfrl+cvae combination may reduce the original balance between the mda and cvae. the above results and analysis indicate that the proposed selfrl is a powerful network representation approach for complex heterogeneous networks, and can achieve very promising results in link prediction. such good performance of the proposed selfrl is due to the following facts: ( ) selfrl designs a two-level self-supervised learning task to integrate the local association of a node and the global-level information of hbns. ( ) the meta path detection self-supervised learning task, which is an extension of link prediction, is specially designed for link prediction. in particular, path detection of two nodes is equal to link prediction. therefore, the representation generated by meta path detection is able to facilitate the link prediction performance. ( ) selfrl uses meta paths to integrate the structural and semantic features of hbns. in this section, the link prediction results on four single network datasets are presented to further verify the representation performance. the results are listed in the table, and the best results are marked in boldface. selfrl shows higher accuracy in link prediction on the four single networks compared to the other baseline approaches. especially, the proposed selfrl achieves an approximately % improvement in terms of auc and acc over the second best method on the string-ppi dataset. the auc value of link prediction on the ndfrt-dda dataset is improved from . to . when selfrl is compared with grarep. however, grarep only achieves an enhancement of . compared to line, which is the third best method on the string-ppi dataset. therefore, the improvement of selfrl is significant in comparison to the enhancement of grarep over line. meanwhile, we also notice that selfrl has only a small advantage over the second best method on the ctd-dda and drugbank-ddi datasets. one possible reason for this result can be that the structure and semantics of the ctd-dda and drugbank-ddi datasets are simple and monotonous, so most of the network representation approaches are able to achieve good performance on them. consequently, the proposed selfrl is a potential representation method for the single network datasets, and can contribute to link prediction by introducing a two-level self-supervised learning task. in neodti and deepdr, low-dimensional representations of nodes in hbns are first learned by network representation approaches, and then are fed into classifier models for predicting potential links among vertices.
to further examine the contribution of the network representation approaches, the low-dimensional representation vectors are fed into an svm, a traditional and popular classifier, for link prediction. the experimental results of these combinations are shown in the table. selfrl achieves the best performance in link prediction for complex heterogeneous networks, providing a great improvement of over % with regard to auc and aupr compared to neodti and deepdr. with the change of classifiers, the result of selfrl in link prediction reduced from . to . on the neodti-net dataset, while the auc value of neodti reduced by approximately %. interestingly, the results on the deepdr-net dataset are similar. therefore, the experimental results indicate that the network representation performance of selfrl is more robust and better than those of the other embedding approaches. this is mainly because selfrl integrates a two-level self-supervised learning model to fuse the rich structure and semantic information from different views. meanwhile, path detection is an extension of link prediction, yielding better representations for link prediction. the emergence and rapid expansion of covid- have posed a global health threat. recent studies have demonstrated that the cytokine storm, namely the excessive inflammatory response, is a key factor leading to death in patients with covid- . therefore, it is urgent and important to discover potential drugs that prevent the cytokine storm in covid- patients. meanwhile, it has been proven that interleukin (il)- is a potential target of the anti-inflammatory response, and drugs targeting il- are promising agents for blocking the cytokine storm in severe covid- patients (mehta et al. ). in the experiments, selfrl is used for drug repositioning for covid- disease, which aims to discover agents binding to il- for blocking the cytokine storm in patients. the low-dimensional representation vectors generated by selfrl are fed into the imc algorithm for predicting the confidence scores between il- and each drug in the neodti-net dataset. then, the top- agents with the highest confidence scores are selected as potential therapeutic agents for covid- patients. the candidate drugs and their anti-inflammatory mechanisms of action in silico are shown in the table. the knowledge from pubmed publications demonstrates that nine out of ten drugs are able to reduce the release and expression of il- for exerting anti-inflammatory effects in silico. meanwhile, there are three drugs (i.e., dasatinib, carvedilol, and indomethacin) that inhibit the release of il- by reducing the mrna levels of il- . however, imatinib inhibits the function of human monocytes to prevent the expression of il- . in addition, although the anti-inflammatory mechanisms of action of five agents (i.e., arsenic trioxide, irbesartan, amiloride, propranolol, sorafenib) are uncertain, these agents can still reduce the release or expression of il- for performing anti-inflammatory effects. therefore, the top ten agents predicted by selfrl-based drug repositioning can be used for inhibiting the cytokine storm in patients with covid- , and should be taken into consideration in clinical studies on covid- . these results further indicate that the proposed selfrl is a powerful network representation learning approach, and can facilitate link prediction in hbns. in this study, selfrl uses transformer encoders to learn representation vectors by the proposed vertex entity mask and meta path detection tasks.
meanwhile, the results for the entity- and path-level representations are compared in the table. selfrl achieves the best performance. meanwhile, the results show that the two-level representations are superior to the single-level representations. interestingly, for each level of representation, concatenating the vectors from the last two hidden layers is more beneficial to the link prediction performance than taking the mean of the vectors from those layers. this is intuitive since the two-level representation can fuse the structural and semantic information from different views in hbns. meanwhile, a larger number of dimensions can provide more and richer information.
table: the dti and dda prediction results of selfrl and baseline methods on the neodti-net and deepdr-net datasets. mlth and clth stand for the mean and concatenation of the representations from the last two hidden layers, respectively. atlre denotes the mean value of the two-level representation from the last hidden layer.
this study proposes a two-level self-supervised representation learning method, termed selfrl, for link prediction in heterogeneous biomedical networks. the proposed selfrl designs a meta path detection-based self-supervised learning task, and integrates a vertex entity-level mask task to capture the rich structure and semantics from two-level views of hbns. the results of link prediction indicate that selfrl is superior to state-of-the-art approaches on six datasets. in the future, we will design more self-supervised learning tasks with unlabeled data to improve the representation performance of the model. in addition, we will also develop an effective multi-task learning framework in the proposed model.
references
distributed large-scale natural graph factorization
drug-target interaction prediction through domain-tuned network-based inference
laplacian eigenmaps and spectral techniques for embedding and clustering
laplacian eigenmaps for dimensionality reduction and data representation
new directions for diffusion-based network prediction of protein function: incorporating pathways with confidence
deep neural network for learning graph representations
grarep: learning graph representations with global structural information
network representation learning with rich text information
support-vector networks
a survey on network embedding
bert: pre-training of deep bidirectional transformers for language understanding
predicting drug target interactions using meta-path based semantic network analysis
kernelized bayesian matrix factorization
node2vec: scalable feature learning for networks
provable inductive matrix completion
attention based meta path fusion for heterogeneous information network embedding
variational graph auto-encoders. arxiv
random forests
deepcas: an end-to-end predictor of information cascades
predicting drug-target interaction using a novel graph neural network with d structure-embedded graph representation
a network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information
covid- : consider cytokine storm syndromes and immunosuppression. the lancet
drug-target interaction prediction by learning from local information and neighbors
distributed representations of words and phrases and their compositionality
asymmetric transitivity preserving graph embedding
deepwalk: online learning of social representations
struc2vec: learning node representations from structural identity. in knowledge discovery and data mining
nonlinear dimensionality reduction by locally linear embedding
heterogeneous information network embedding for recommendation
prediction and validation of gene-disease associations using methods inspired by social network analyses
network embedding in biomedical data science
pathsim: meta path-based top-k similarity search in heterogeneous information networks
line: large-scale information network embedding
neodti: neural integration of neighbor information from a heterogeneous network for discovering new drug-target interactions
glue: a multi-task benchmark and analysis platform for natural language understanding
structural deep network embedding
graphgan: graph representation learning with generative adversarial nets
shine: signed heterogeneous information network embedding for sentiment link prediction
semisupervised drug-protein interaction prediction from heterogeneous biological spaces
self-supervised learning: generative or contrastive. arxiv
network representation learning-based drug mechanism discovery and anti-inflammatory response against
a novel approach for drug response prediction in cancer cell lines via network representation learning
graph embedding on biomedical networks: methods, applications and evaluations
heterogeneous network representation learning using deep learning
deepdr: a network-based deep learning approach to in silico drug repositioning
key: cord- -fgk n z authors: holme, petter title: objective measures for sentinel surveillance in network epidemiology date: - - journal: nan doi: . /physreve. . sha: doc_id: cord_uid: fgk n z
assume one has the capability of determining whether a node in a network is infectious or not by probing it. the problem of optimizing sentinel surveillance in networks is then to identify the nodes to probe such that an emerging disease outbreak can be discovered early or reliably. whether the emphasis should be on early or reliable detection depends on the scenario in question. we investigate three objective measures from the literature quantifying the performance of nodes in sentinel surveillance: the time to detection or extinction, the time to detection, and the frequency of detection. as a basis for the comparison, we use the susceptible-infectious-recovered model on static and temporal networks of human contacts. we show that, for some regions of parameter space, the three objective measures can rank the nodes very differently. this means sentinel surveillance is a class of problems, and solutions need to choose an objective measure for the particular scenario in question. as opposed to other problems in network epidemiology, we draw similar conclusions from the static and temporal networks. furthermore, we do not find one type of network structure that predicts the objective measures; i.e., the best predictor depends both on the data set and the sir parameter values. infectious diseases are a big burden to public health. their epidemiology is a topic wherein the gap between the medical and theoretical sciences is not so large. several concepts of mathematical epidemiology-like the basic reproductive number or core groups [ ] [ ] [ ] -have entered the vocabulary of medical scientists. traditionally, authors have modeled disease outbreaks in society by assuming any person to have the same chance of meeting anyone else at any time. this is of course not realistic, and improving this point is the motivation for network epidemiology: epidemic simulations between people connected by a network [ ] .
one can continue increasing the realism in the contact patterns by observing that the timing of contacts can also have structures capable of affecting the disease. studying epidemics on time-varying contact structures is the basis of the emerging field of temporal network epidemiology [ ] [ ] [ ] [ ] . one of the most important questions in infectious disease epidemiology is to identify people, or in more general terms, units, that would get infected early and with high likelihood in an infectious outbreak. this is the sentinel surveillance problem [ , ] . it is the aspect of node importance that is most actively used in public health practice. typically, it works by selecting some hospitals (clinics, cattle farms, etc.) to screen, or more frequently test, for a specific infection [ ] . defining an objective measure-a quantity to be maximized or minimized-for sentinel surveillance is not trivial. it depends on the particular scenario one considers and the means of interventions at hand. if the goal for society is to detect as many outbreaks as possible, it makes sense to choose sentinels to maximize the fraction of detected outbreaks [ ] . if the objective rather is to discover outbreaks early, then one could choose sentinels that, if infected, are infected early [ , ] . finally, if the objective is to stop the disease as early as possible, it makes sense to measure the time to extinction or detection (infection of a sentinel) [ ] . see fig. for an illustration. to restrict ourselves, we will focus on the case of one sentinel. if one has more than one sentinel, the optimal set will most likely not be the top nodes of a ranking according to the three measures above. their relative positions in the network also matter (they should not be too close to each other) [ ] . in this paper, we study and characterize our three objective measures. we base our analysis on empirical data sets of contacts between people. we analyze them both in temporal and static networks. the reason we use empirical contact data, rather than generative models, as the basis of this study is twofold. first, there are so many possible structures and correlations in temporal networks that one cannot tune them all in models [ ] . it is also hard to identify the most important structures for a specific spreading phenomenon [ ] . second, studying empirical networks makes this paper-in addition to elucidating the objective measures of sentinel surveillance-a study of human interaction. we can classify data sets with respect to how the epidemic dynamics propagate on them. as mentioned above, in practical sentinel surveillance, the network in question is rather one of hospitals, clinics or farms. one can, however, also think of sentinel surveillance of individuals, where high-risk individuals would be tested extra often for some diseases. in the remainder of the paper, we will describe the objective measures, the structural measures we use for the analysis, and the data sets, and we will present the analysis itself. we will primarily focus on the relation between the measures, secondarily on the structural explanations of our observations. assume that the objective of society is to end outbreaks as soon as possible. if an outbreak dies by itself, that is fine. otherwise, one would like to detect it so it could be mitigated by interventions.
in this scenario, a sensible objective measure would be the time for a disease to either go extinct or be detected by a sentinel: the time to detection or extinction t x [ ] . suppose that, in contrast to the situation above, the priority is not to save society from the epidemics as soon as possible, but just to detect outbreaks fast. this could be the case if one would want to get a chance to isolate a pathogen, or start producing a vaccine, as early as possible, maybe to prevent future outbreaks of the same pathogen at the earliest possibility. then one would seek to minimize the time for the outbreak to be detected, conditioned on the fact that it is detected: the time to detection t d . for the time to detection, it does not matter how likely it is for an outbreak to reach a sentinel. if the objective is to detect as many outbreaks as possible, the corresponding measure should be the expected frequency of outbreaks to reach a node: the frequency of detection f d . note that for this measure a large value means the node is a good sentinel, whereas for t x and t d a good sentinel has a low value. this means that when we correlate the measures, a similar ranking between t x and f d or t d and f d yields a negative correlation coefficient. instead of considering the inverse times, or similar, we keep this feature and urge the reader to keep this in mind. there are many possible ways to reduce our empirical temporal networks to static networks. the simplest method would be to just include a link between any pair of nodes that has at least one contact during the course of the data set. this would however make some of the networks so dense that the static network structure of the node-pairs most actively in contact would be obscured. for our purpose, we primarily want our network to span many types of network structures that can impact epidemics. without any additional knowledge about the epidemics, the best option is to threshold the weighted graph, where an edge (i, j) means that i and j had more than θ contacts in the data set. in this work, we assume that we do not know what the per-contact transmission probability β is (this would anyway depend on both the disease and precise details of the interaction). rather, we scan through a very large range of β values. since we do that anyway, there is no need either to base the choice of θ on some epidemiological argument, or to rescale β after the thresholding. note that the rescaled β would be a non-linear function of the number of contacts between i and j. (assuming no recovery, for an isolated link with ν contacts, the transmission probability is 1 − (1 − β)^ν.) for our purpose the only thing we need is that the rescaled β is a monotonic function of β for the temporal network (which is true). to follow a simple principle, we omit all links with a weight less than the median weight θ. we simulate disease spreading by the sir dynamics, the canonical model for diseases that give immunity upon recovery [ , ] . for static networks, we use the standard markovian version of the sir model [ ] . that is, we assume that diseases spread over links between susceptible and infectious nodes within an infinitesimal time interval dt with probability β dt. then, an infectious node recovers after a time that is exponentially distributed with average 1/ν. the parameters β and ν are called the infection rate and recovery rate, respectively. we can, without loss of generality, put ν = 1/t (where t is the duration of the sampling).
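to make the three measures concrete, here is a rough sketch of how they could be tallied from repeated stochastic simulations on a time-stamped contact list. the event handling is simplified (a recovery time is drawn when a node is infected, one randomly seeded outbreak per run), and all names are illustrative; this is a schematic of the definitions above, not the author's simulation code.

```python
import math
import random

def run_sir(contacts, nodes, sentinel, beta, nu, T):
    """one stochastic outbreak on a time-ordered contact list [(t, i, j), ...].
    returns (detected, elapsed): elapsed time until the sentinel is infected
    if detected, otherwise until extinction (or the end of the data)."""
    seed = random.choice(list(nodes))
    t0 = random.uniform(0, T)                        # outbreak introduced at a random time
    if seed == sentinel:
        return True, 0.0
    recovery = {seed: t0 + random.expovariate(nu)}   # infection time + exponential duration
    for (t, i, j) in contacts:
        if t < t0:
            continue
        if all(t >= r for r in recovery.values()):
            return False, max(recovery.values()) - t0    # extinct before detection
        for u, v in ((i, j), (j, i)):
            if u in recovery and t < recovery[u] and v not in recovery \
                    and random.random() < beta:
                if v == sentinel:
                    return True, t - t0                  # detected
                recovery[v] = t + random.expovariate(nu)
    return False, T - t0

def objective_measures(contacts, nodes, sentinel, beta, nu, T, runs=1000):
    """estimate t_x (mean time to detection or extinction), t_d (mean time to
    detection over detected runs) and f_d (fraction of detected runs)."""
    outcomes = [run_sir(contacts, nodes, sentinel, beta, nu, T) for _ in range(runs)]
    t_x = sum(t for _, t in outcomes) / runs
    detected = [t for d, t in outcomes if d]
    t_d = sum(detected) / len(detected) if detected else math.inf
    f_d = len(detected) / runs
    return t_x, t_d, f_d
```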
for other ν values, the ranking of the nodes would be the same (but the values of the t x and t d would be rescaled by a factor ν). we will scan an exponentially increasing progression of values of β, from − to . the code for the disease simulations can be downloaded [ ] . for the temporal networks, we use a definition as close as possible to the one above. we assume an exponentially distributed duration of the infectious state with mean /ν. we assume a contact between an infectious and susceptible node results in a new infection with probability β. in the case of temporal networks, one cannot reduce the problem to one parameter. like for static networks, we sample the parameter values in exponential sequences in the intervals . β and . ν/t respectively. for temporal networks, with our interpretation of a contact, β > makes no sense, which explains the upper limit. furthermore, since temporal networks usually are effectively sparser (in terms of the number of possible infection events per time), the smallest β values will give similar results, which is the reason for the higher cutoff in this case. for both temporal and static networks, we assume the outbreak starts at one randomly chosen node. analogously, in the temporal case we assume the disease is introduced with equal probability at any time throughout the sampling period. for every data set and set of parameter values, we sample runs of epidemic simulations. as motivated in the introduction, we base our study on empirical temporal networks. all networks that we study record contacts between people and falls into two classes: human proximity networks and communication networks. proximity networks are, of course, most relevant for epidemic studies, but communication networks can serve as a reference (and it is interesting to see how general results are over the two classes). the data sets consist of anonymized lists of two identification numbers in contact and the time since the beginning of the contact. many of the proximity data sets we use come from the sociopatterns project [ ] . these data sets were gathered by people wearing radio-frequency identification (rfid) sensors that detect proximity between and . m. one such datasets comes from a conference, hypertext , (conference ) [ ] , another two from a primary school (primary school) [ ] and five from a high school (high school) [ ] , a third from a hospital (hospital) [ ] , a fourth set of five data sets from an art gallery (gallery) [ ] , a fifth from a workplace (office) [ ] , and a sixth from members of five families in rural kenya [ ] . the gallery data sets consist of several days where we use the first five. in addition to data gathered by rfid sensors, we also use data from the longer-range (around m) bluetooth channel. the cambridge [ ] and [ ] datasets were measured by the bluetooth channel of sensors (imotes) worn by people in and around cambridge, uk. st andrews [ ] , conference [ ] , and intel [ ] are similar data sets tracing contacts at, respectively, the university of st. andrews, the conference infocom , and the intel research laboratory in cambridge, uk. the reality [ ] and copenhagen bluetooth [ ] data sets also come from bluetooth data, but from smartphones carried by university students. in the romania data, the wifi channel of smartphones was used to log the proximity between university students [ ] , whereas the wifi dataset links students of a chinese university that are logged onto the same wifi router. 
for the diary data set, a group of colleagues and their family members were self-recording their contacts [ ] . our final proximity data, the prostitution network, comes from self-reported sexual contacts between female sex workers and their male sex buyers [ ] . this is a special form of proximity network since contacts represent more than just proximity. among the data sets from electronic communication, facebook comes from the wall posts at the social media platform facebook [ ] . college is based on communication at a facebook-like service [ ] . dating shows interactions at an early internet dating website [ ] . messages and forum are similar records of interaction at a film community [ ] . copenhagen calls and copenhagen sms consist of phone calls and text messages gathered in the same experiment as copenhagen bluetooth [ ] . finally, we use four data sets of e-mail communication. one, e-mail , recorded all e-mails to and from a group of accounts [ ] . the other three, e-mail [ ] , [ ] , and [ ] , recorded e-mails within a set of accounts. we list basic statistics-sizes, sampling durations, etc.-of all the data sets in table i.
table i. basic statistics of the empirical temporal networks. n is the number of nodes, c is the number of contacts, t is the total sampling time, t is the time resolution of the data set, m is the number of links in the projected and thresholded static networks, and θ is the threshold.
to gain further insight into the network structures promoting the objective measures, we correlate the objective measures with quantities describing the position of a node in the static networks. since many of our networks are fragmented into components, we restrict ourselves to measures that are well defined for disconnected networks. otherwise, in our selection, we strive to cover as many different aspects of node importance as we can. degree is simply the number of neighbors of a node. it is usually presented as the simplest measure of centrality and one of the most discussed structural predictors of importance with respect to disease spreading [ ] . (centrality is a class of measures of a node's position in a network that try to capture what a "central" node is; i.e., ultimately centrality is not more well-defined than the vernacular word.) it is also a local measure in the sense that a node is able to estimate its degree, which could be practical when evaluating sentinel surveillance in real networks. subgraph centrality is based on the number of closed walks a node is a member of. (a walk is a path that could be overlapping itself.) the number of closed walks of length λ from node i to itself is given by (a^λ)_ii, where a is the adjacency matrix. reference [ ] argues that the best way to weigh walks of different lengths together is to sum the counts (a^λ)_ii weighted by 1/λ!, so that the subgraph centrality of node i equals the i-th diagonal element of the matrix exponential of a. as mentioned, several of the data sets are fragmented (even though the largest connected component dominates components of other sizes). in the limit of high transmission probabilities, all nodes in the component of the infection seed will be infected. in such a case it would make sense to place a sentinel in the largest component (where the disease most likely starts). closeness centrality builds on the assumption that a node that has, on average, short distances to other nodes is central [ ] . here, the distance d(i, j) between nodes i and j is the number of links in a shortest path between the nodes.
the classical measure of closeness centrality of a node i is the reciprocal average distance between i and all other nodes. in a fragmented network, for all nodes, there will be some other node that it does not have a path to, meaning that the closeness centrality is ill defined. (assigning the distance infinity to disconnected pairs would give the closeness centrality zero for all nodes.) a remedy for this is, instead of measuring the reciprocal average of distances, measuring the average reciprocal distance [ ] , where the reciprocal distance d^(-1)(i, j) = 0 if i and j are disconnected. we call this the harmonic closeness by analogy to the harmonic mean. vitality measures are a class of network descriptors that capture the impact of deleting a node on the structure of the entire network [ , ] . specifically, we measure the harmonic closeness vitality, or harmonic vitality for short. this compares the sum of reciprocal distances of the graph with and without the node in question (thus, by analogy to the harmonic closeness, it is well defined even for disconnected graphs); the denominator concerns the graph g with the node i deleted. if deleting i breaks many shortest paths, then this denominator decreases, and thus c v (i) increases. a node whose removal disrupts many shortest paths would thus score high in harmonic vitality. our sixth structural descriptor is coreness. this measure comes out of a procedure called k-core decomposition. first, remove all nodes with degree k = 1. if this would create new nodes with degree one, delete them too. repeat this until there are no nodes of degree one. then, repeat the above steps for larger k values. the coreness of a node is the last level when it is present in the network during this process [ ] . like for the static networks, in the temporal networks we measure the degree of the nodes. to be precise, we define the degree as the number of distinct other nodes a node is in contact with within the data set. strength is the total number of contacts a node has participated in throughout the data set. unlike degree, it takes the number of encounters into account. temporal networks, in general, tend to be more disconnected than static ones. for node i to be connected to j in a temporal network, there has to be a time-respecting path from i to j, i.e., a sequence of contacts increasing in time that (if time is projected out) is a path from i to j [ , ] . thus two interesting quantities-corresponding to the component sizes of static networks-are the fraction of nodes reachable from a node by time-respecting paths forward in time (downstream component size) and backward in time (upstream component size) [ ] . if a node only exists in the very early stage of the data, the sentinel will likely not be active by the time the outbreak happens. if a node is active only at the end of the data set, it would also be too late to discover an outbreak early. for these reasons, we measure statistics of the times of the contacts of a node. we measure the average time of all contacts a node participates in; the first time of a contact (i.e., when the node enters the data set); and the duration of the presence of a node in the data (the time between the first and last contact it participates in). we use a version of the kendall τ coefficient [ ] to elucidate both the correlations between the three objective measures, and between the objective measures and the network structural descriptors.
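before moving to the correlation analysis, here is a short illustration of the distance-based descriptors defined above, using networkx. the normalisation of the harmonic closeness and the exact form of the vitality (here simply the drop in the total reciprocal distance after removing a node) are assumptions made for the sketch and may differ from the paper's precise definitions.

```python
import networkx as nx

def harmonic_closeness(G):
    """average reciprocal distance to all other nodes (0 for unreachable pairs)."""
    n = G.number_of_nodes()
    return {v: c / max(n - 1, 1) for v, c in nx.harmonic_centrality(G).items()}

def total_reciprocal_distance(G):
    # sum of 1/d(i, j) over ordered reachable pairs with i != j
    return sum(nx.harmonic_centrality(G).values())

def harmonic_vitality(G):
    """drop in the total reciprocal distance when a node is removed;
    nodes whose removal breaks many short paths score high."""
    base = total_reciprocal_distance(G)
    vit = {}
    for v in G.nodes():
        H = G.copy()
        H.remove_node(v)
        vit[v] = base - total_reciprocal_distance(H)
    return vit

# degree, component size and coreness come directly from networkx, e.g.:
# dict(G.degree()), {v: len(c) for c in nx.connected_components(G) for v in c},
# and nx.core_number(G)
```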
in its basic form, the kendall τ measures the difference between the number of concordant pairs (with a positive slope between them) and discordant pairs, relative to all pairs. there are a few different versions that handle ties in different ways. we count a pair of points whose error bars overlap as a tie and calculate τ = (n c − n d) / (n c + n d + n t), where n c is the number of concordant pairs, n d is the number of discordant pairs, and n t is the number of ties. we start investigating the correlation between the three objective measures throughout the parameter space of the sir model for all our data sets. we use the time to detection or extinction as our baseline and compare the other two objective measures with that. in fig. , we plot the τ coefficient between t x and t d and between t x and f d . we find that for low enough values of β, the τ for all objective measures coincide. for very low β the disease just dies out immediately, so the measures are trivially equal: all nodes would be as good sentinels in all three aspects. for slightly larger β-for most data sets . < β < . -both τ (t x , t d ) and τ (t x , f d ) are negative. this is a region where outbreaks typically die out early. for a node to have a low t x , it needs to be where outbreaks are likely to survive, at least for a while. this translates to a large f d , while for t d , it would be beneficial to be as central as possible. if there are no extinction events at all, t x and t d are the same. for this reason, it is no surprise that, for most of the data sets, τ (t x , t d ) becomes strongly positive for large β values. the τ (t x , f d ) correlation is negative (of a similar magnitude), meaning that for most data sets the different methods would rank the possible sentinels in the same order. for some of the data sets, however, the correlation never becomes positive even for large β values (like copenhagen calls and copenhagen sms). these networks are the most fragmented ones, meaning that a single sentinel would be unlikely to detect the outbreak (since it probably happens in another component). this makes t x rank the important nodes in a way similar to f d , but since diseases that do reach a sentinel do it faster in a small component than a large one, t x and t d become anticorrelated.
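a direct, quadratic-time sketch of the tie-aware τ used throughout these comparisons is given below. the overlap test on the error bars (half-widths added) is the obvious one, which is an assumption about a detail the text leaves open.

```python
def kendall_tau_with_ties(x, y, x_err, y_err):
    """tie-aware kendall tau: a pair counts as a tie whenever the error bars
    of either coordinate overlap (half-widths in x_err, y_err)."""
    n_c = n_d = n_t = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            tied = (abs(x[i] - x[j]) <= x_err[i] + x_err[j]
                    or abs(y[i] - y[j]) <= y_err[i] + y_err[j])
            if tied:
                n_t += 1
            elif (x[i] - x[j]) * (y[i] - y[j]) > 0:
                n_c += 1
            else:
                n_d += 1
    total = n_c + n_d + n_t
    return (n_c - n_d) / total if total else 0.0
```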
, we show the correlation between our three objective measures and the structural descriptors as a function of β for the office data set. panel (a) shows the results for the time to detection or extinction. there is a negative correlation between this measure and traditional centrality measures like degree or subgraph centrality. this is because t x is a quantity one wants to minimize to find the optimal sentinel, whereas for all the structural descriptors a large value means that a node is a candidate sentinel node. we see that degree and subgraph centrality are the two quantities that best predict the optimal sentinel location, while coreness is also close (at around − . ). this in line with research showing that certain biological problems are better determined by degree than more elaborate centrality measures [ ] . over all, the τ curves are rather flat. this is partly explained by τ being a rank correlation for t d [ fig. (b) ], most curves change behavior around β = . . this is the region when larger outbreaks could happen, so one can understand there is a transition to a situation similar to t x [ fig. (a) ]. f d [fig. (c) ] shows a behavior similar to t d in that the curves start changing order, and what was a correlation at low β becomes an anticorrelation at high β. this anticorrelation is a special feature of this particular data set, perhaps due to its pronounced community structure. nodes of degree , , and have a strictly increasing values of f d , but for some of the high degree nodes (that all have f d close to one) the ordering gets anticorrelated with degree which makes kendall's τ negative. since rank-based correlations are more principled for skew-distributed quantities common in networks, we keep them. we currently investigate what creates these unintuitive anticorrelations among the high degree nodes in this data set. next, we proceed with an analysis of all data sets. we summarize plots like fig. by the structural descriptor with the largest magnitude of the correlation |τ |. see fig. . we can see, that there is not one structural quantity that uniquely determines the ranking of nodes, there is not even one that dominates over ( ) degree is the strongest structural determinant of all objective measures at low β values. this is consistent with ref. [ ] . ( ) component size only occurs for large β. in the limit of large β, f d is only determined by component size (if we would extend the analysis to even larger β, subgraph centrality would have the strongest correlation for the frequency of detection). ( ) harmonic vitality is relatively better as a structural descriptor for t d , less so for t x and f d . t x and f d capture the ability of detecting an outbreak before it dies, so for these quantities one can imagine more fundamental quantities like degree and the component size are more important. ( ) subgraph centrality often shows the strongest correlation for intermediate values of β. this is interesting, but difficult to explain since the rationale of subgraph centrality builds on cycle counts and there is no direct process involving cycles in the sir model. ( ) harmonic closeness rarely gives the strongest correlation. if it does, it is usually succeeded by coreness and the data set is typically rather large. ( ) datasets from the same category can give different results. perhaps college and facebook is the most conspicuous example. in general, however, similar data sets give similar results. the final observation could be extended. 
we see that, as β increases, one color tends to follow another. this is summarized in fig. , where we show transition graphs of the different structural descriptors such that the size corresponds to their frequency in fig. , and the size of the arrows show how often one structural descriptor is succeeded by another as β is increased. for t x , the degree and subgraph centrality are the most important structural descriptors, and the former is usually succeeded by the latter. for t d , there is a common peculiar sequence of degree, subgraph centrality, coreness component size, and harmonic vitality that is manifested as the peripheral, clockwise path of fig. (b) . finally, f d is similar to t x except that there is a rather common transition from degree to coreness, and harmonic vitality is, relatively speaking, a more important descriptor. in fig. , we show the figure for temporal networks corresponding to fig. . just like the static case, even though every data set and objective measure is unique, we can make some interesting observations. ( ) strength is most important for small ν and β. this is analogous to degree dominating the static network at small parameter values. ( ) upstream component size dominates at large ν and β. this is analogous to the component size of static networks. since temporal networks tend to be more fragmented than static ones [ ] , this dominance at large outbreak sizes should be even more pronounced for temporal networks. ( ) most of the variation happens in the direction of larger ν and β. in this direction, strength is succeeded by degree which is succeeded by upstream component size. ( ) like the static case, and the analysis of figs. and , t x and f d are qualitatively similar compared to t d . ( ) temporal quantities, such as the average and first times of a node's contacts, are commonly the strongest predictors of t d . ( ) when a temporal quantity is the strongest predictor of t x and f d it is usually the duration. it is understandable that this has little influence on t d , since the ability to be infected at all matters for these measures; a long duration is beneficial since it covers many starting times of the outbreak. ( ) similar to the static case, most categories of data sets give consistent results, but some differ greatly (facebook and college is yet again a good example). the bigger picture these observations paint is that, for our problem, the temporal and static networks behave rather similarly, meaning that the structures in time do not matter so much for our objective measures. at the same time, there is not only one dominant measure for all the data sets. rather are there several structural descriptors that correlate most strongly with the objective measures depending on ν and β. in this paper, we have investigated three different objective measures for optimizing sentinel surveillance: the time to detection or extinction, the time to detection (given that the detection happens), and the frequency of detection. each of these measures corresponds to a public health scenario: the time to detection or extinction is most interesting to minimize if one wants to halt the outbreak as quickly as possible, and the frequency of detection is most interesting if one wants to monitor the epidemic status as accurately as possible. the time to detection is interesting if one wants to detect the outbreak early (or else it is not important), which could be the case if manufacturing new vaccine is relatively time consuming. 
we investigate these cases for temporal network data sets and static networks derived from the temporal networks. our most important finding is that, for some regions of parameter space, our three objective measures can rank nodes very differently. this comes from the fact that sir outbreaks have a large chance of dying out in the very early phase [ ] , but once they get going they follow a deterministic path. for this reason, it is important to be aware of what scenario one is investigating when addressing the sentinel surveillance problem. another conclusion is that, for this problem, static and temporal networks behave reasonably similarly (meaning that the temporal effects do not matter so much). naturally, some of the temporal networks respond differently than the static ones, but compared to, e.g., the outbreak sizes or time to extinction [ ] [ ] [ ] , the differences are small. among the structural descriptors of network position, there is no particular one that dominates throughout the parameter space. rather, local quantities like degree or strength (for the temporal networks) have a higher predictive power at low parameter values (small outbreaks). for larger parameter values, descriptors capturing the number of nodes reachable from a specific node correlate most with the objective measure rankings. also in this sense, the static network quantities dominate the temporal ones, which is in contrast to previous observations (e.g., refs. [ ] [ ] [ ] ). for the future, we anticipate work on the problem of optimizing sentinel surveillance. an obvious continuation of this work would be to establish the differences between the objective metrics in static network models. to do the same in temporal networks would also be interesting, although more challenging given the large number of imaginable structures. yet an open problem is how to distribute sentinels if there are more than one. it is known that they should be relatively far away [ ] , but more precisely where should they be located?
references
modern infectious disease epidemiology
infectious diseases in humans
temporal network epidemiology
a guide to temporal networks
principles and practices of public health surveillance
stochastic epidemic models and their statistical analysis
pretty quick code for regular (continuous time, markovian) sir on networks, github.com/pholme/sir
proceedings, acm sigcomm workshop on challenged networks (chants)
crawdad dataset st_andrews/sassy
third international conference on emerging intelligent data and web technologies
proc. natl. acad. sci. usa
proceedings of the nd acm workshop on online social networks, wosn '
proceedings of the tenth acm international conference on web search and data mining, wsdm '
proceedings of the th international conference
networks: an introduction
network analysis: methodological foundations
distance in graphs
acknowledgments: we thank sune lehmann for providing the copenhagen data sets. this work was supported by jsps kakenhi grant no. jp h .
key: cord- -e q e v authors: mishra, shreya; srivastava, divyanshu; kumar, vibhor title: improving gene-network inference with graph-wavelets and making insights about ageing associated regulatory changes in lungs date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: e q e v
using gene-regulatory-networks based approach for single-cell expression profiles can reveal unprecedented details about the effects of external and internal factors.
however, noise and batch effect in sparse single-cell expression profiles can hamper correct estimation of dependencies among genes and regulatory changes. here we devise a conceptually different method using graph-wavelet filters for improving gene-network (gwnet) based analysis of the transcriptome. our approach improved the performance of several gene-network inference methods. most importantly, gwnet improved consistency in the prediction of generegulatory-network using single-cell transcriptome even in presence of batch effect. consistency of predicted gene-network enabled reliable estimates of changes in the influence of genes not highlighted by differential-expression analysis. applying gwnet on the single-cell transcriptome profile of lung cells, revealed biologically-relevant changes in the influence of pathways and master-regulators due to ageing. surprisingly, the regulatory influence of ageing on pneumocytes type ii cells showed noticeable similarity with patterns due to effect of novel coronavirus infection in human lung. inferring gene-regulatory-networks and using them for system-level modelling is being widely used for understanding the regulatory mechanism involved in disease and development. the interdependencies among variables in the network is often represented as weighted edges between pairs of nodes, where edge weights could represent regulatory interactions among genes. gene-networks can be used for inferring causal models [ ] , designing and understanding perturbation experiments, comparative analysis [ ] and drug discovery [ ] . due to wide applicability of network inference, many methods have been proposed to estimate interdependencies among nodes. most of the methods are based on pairwise correlation, mutual information or other similarity metrics among gene expression values, provided in a different condition or time point. however, resulting edges are often influenced by indirect dependencies owing to low but effective background similarity in patterns. in many cases, even if there is some true interaction among a pair of nodes, its effect and strength is not estimated properly due to noise, background-pattern similarity and other indirect dependencies. hence recent methods have started using alternative approaches to infer more confident interactions. such alternative approach could be based on partial correlations [ ] or aracne's method of statistical threshold of mutual information [ ] . single-cell expression profiles often show heterogeneity in expression values even in a homogeneous cell population. such heterogeneity can be exploited to infer regulatory networks among genes and identify dominant pathways in a celltype. however, due to the sparsity and ambiguity about the distribution of gene expression from single-cell rna-seq profiles, the optimal measures of gene-gene interaction remain unclear. hence recently, sknnider et al. [ ] evaluated measures of association to infer gene co-expression based network. in their analysis, they found two measures of association, namely phi and rho as having the best performance in predicting co-expression based gene-gene interaction using scrna-seq profiles. in another study, chen et al. [ ] performed independent evaluation of a few methods proposed for genenetwork inference using scrna-seq profiles such as scenic [ ] , scode [ ] , pidc [ ] . chen et al. 
found that for single-cell transcriptome profiles either generated from experiments or simulations, these methods had a poor performance in reconstructing the network. performance of such methods can be improved if gene-expression profiles are denoised. thus the major challenge of handling noise and dropout in scrna-seq profile is an open problem. the noise in single-cell expression profiles could be due to biological and technical reasons. the biological source of noise could include thermal fluctuations and a few stochastic processes involved in transcription and translation such as allele specific expression [ ] and irregular binding of transcription factors to dna. whereas technical noise could be due to amplification bias and stochastic detection due to low amount of rna. raser and o'shea [ ] used the term noise in gene expression as measured level of its variation among cells supposed to be identical. raser and o'shea categorised potential sources of variation in geneexpression in four types : (i) the inherent stochasticity of biochemical processes due to small numbers of molecules; (ii) heterogeneity among cells due to cell-cycle progression or a random process such as partitioning of mitochondria (iii) subtle micro-environmental differences within a tissue (iv) genetic mutation. overall noise in gene-expression profiles hinders in achieving reliable inference about regulation of gene activity in a cell-type. thus, there is demand for pre-processing methods which can handle noise and sparsity in scrna-seq profiles such that inference of regulation can be reliable. the predicted gene-network can be analyzed further to infer salient regulatory mechanisms in a celltype using methods borrowed from graph theory. calculating gene-importance in term of centrality, finding communities and modules of genes are common downstream analysis procedures [ ] . just like gene-expression profile, inferred gene network could also be used to find differences in two groups of cells(sample) [ ] to reveal changes in the regulatory pattern caused due to disease, environmental exposure or ageing. in particular, a comparison of regulatory changes due to ageing has gained attention recently due to a high incidence of metabolic disorder and infection based mortality in the older population. especially in the current situation of pandemics due to novel coronavirus (sars-cov- ), when older individuals have a higher risk of mortality, a question is haunting researchers. that question is: why old lung cells have a higher risk of developing severity due to sars-cov- infection. however, understanding regulatory changes due to ageing using gene-network inference with noisy single-cell scrna-seq profiles of lung cells is not trivial. thus there is a need of a noise and batch effect suppression method for investigation of the scrna-seq profile of ageing lung cells [ ] using a network biology approach. here we have developed a method to handle noise in gene-expression profiles for improving genenetwork inference. our method is based on graphwavelet based filtering of gene-expression. our approach is not meant to overlap or compete with existing network inference methods but its purpose is to improve their performance. hence, we compared other output of network inference methods with and without graph-wavelet based pre-processing. we have evaluated our approach using several bulk sample and single-cell expression profiles. 
we further investigated how our denoising approach influences the estimation of graph-theoretic properties of gene-network. we also asked a crucial question: how the gene regulatory-network differs between young and old individual lung cells. further, we compared the pattern in changes in the influence of genes due to ageing with differential expression in covid infected lung. our method uses a logic that cells (samples) which are similar to each other, would have a more similar expression profile for a gene. hence, we first make a network such that two cells are connected by an edge if one of them is among the top k nearest neighbours (knn) of the other. after building knn-based network among cells (samples), we use graph-wavelet based approach to filter expression of one gene at a time (see fig. ). for a gene, we use its expression as a signal on the nodes of the graph of cells. we apply a graph-wavelet transform to perform spectral decomposition of graph-signal. after graph-wavelet transformation, we choose the threshold for wavelet coefficients using sureshrink and bayesshrink or a default percentile value determined after thorough testing on multiple data-sets. we use the retained values of the coefficient for inverse graph-wavelet transformation to reconstruct a filtered expression matrix of the gene. the filtered gene-expression is used for gene-network inference and other down-stream process of analysis of regulatory differences. for evaluation purpose, we have calculated inter-dependencies among genes using different co-expression measurements, namely pearson and spearman correlations, φ and ρ scores and aracne. the biological and technical noise can both exist in a bulk sample expression profile ( [ ] ). in order to test the hypothesis that graph-based denoising could improve gene-network inference, we first evaluated the performance of our method on bulk expression data-set. we used data-sets made available by dream challenge consortium [ ] . three data-sets were based on the original expression profile of bacterium escherichia coli and the single-celled eukaryotes saccharomyces cerevisiae and s aureus. while the fourth data-set was simulated using in silico network with the help of genenetweaver, which models molecular noise in transcription and translation using chemical langevin equation [ ] . the true positive interactions for all the four data-sets are also available. we compared graph fourier based low passfiltering with graph-wavelet based denoising using three different approaches to threshold the waveletcoefficients. we achieved - % improvement in score over raw data based on dream criteria [ ] with correlation, aracne and rho based network prediction. with φ s based gene-network prediction, there was an improvement in out of dream data-sets ( fig. a) . all the network inference methods showed improvement after graphwavelet based denoising of simulated data (in silico) from dream consortium ( fig. a) . moreover, graph-wavelet based filtering had better performance than chebyshev filter-based low pass filtering in graph fourier domain. it highlights the fact that even bulk sample data of gene-expression can have noise and denoising it with graph-wavelet after making knn based graph among samples has the potential to improve gene-network inference. moreover, it also highlights another fact, well known in the signal processing field, that wavelet-based filtering is more adaptive than low pass-filtering. 
in comparison to bulk samples, there is a higher level of noise and dropout in single-cell expression profiles. dropouts are caused by non-detection of true expression due to technical issues. using low-pass filtering after graph-fourier transform seems to be an obvious choice as it fills in a background signal at missing values and suppresses high-frequency outlier-signal [ ] . however, in the absence of information about cell-type and cellstates, a blind smoothing of a signal may not prove to be fruitful. hence we applied graph-wavelet based filtering for processing gene-expression dataset from the scrna-seq profile. we first used scrna-seq data-set of mouse embryonic stem cells (mescs) [ ] . in order to evaluate network inference in an unbiased manner, we used gene regulatory interactions compiled by another research group [ ] . our approach of graph-wavelet based pre-processing of mesc scrna-seq data-set improved the performance of gene-network inference methods by - percentage (fig. b) . however, most often, the gold-set of interaction used for evaluation of gene-network inference is incomplete, which hinders the true assessment of improvement. figure : the flowchart of gwnet pipeline. first, a knn based network is made between samples/cell. a filter for graph wavelet is learned for the knn based network of samples/cells. gene-expression of one gene at a time is filtered using graph-wavelet transform. filtered gene-expression data is used for network inference. the inferred network is used to calculate centrality and differential centrality among groups of cells. figure : improvement in gene-network inference by graph-wavelet based denoising of gene-expression (a) performance of network inference methods using bulk gene-expression data-sets of dream challenge. three different ways of shrinkage of graph-wavelet coefficients were compared to graph-fourier based low pass filtering. the y-axis shows fold change in area under curve(auc) for receiver operating characteristic curve (roc) for overlap of predicted network with golden-set of interactions. for hard threshold, the default value of % percentile was used. (b) performance evaluation using single-cell rna-seq (scrna-seq) of mouse embryonic stem cells (mescs) based network inference after filtering the gene-expression. the gold-set of interactions was adapted from [ ] (c) comparison of graph wavelet-based denoising with other related smoothing and imputing methods in terms of consistency in the prediction of the gene-interaction network. here, phi (φ s ) score was used to predict network among genes. for results based on other types of scores see supplementary figure s . predicted networks from two scrna-seq profile of mesc were compared to check robustness towards the batch effect. hence we also used another approach to validate our method. for this purpose, we used a measure of overlap among network inferred from two scrna-seq data-sets of the same cell-type but having different technical biases and batch effects. if the inferred networks from both data-sets are closer to true gene-interaction model, they will show high overlap. for this purpose, we used two scrnaseq data-set of mesc generated using two different protocols(smartseq and drop-seq). for comparison of consistency and performance, we also used a few other imputation and denoising methods proposed to filter and predict the missing expression values in scrna-seq profiles. 
we evaluated other such methods; graph-fourier based filtering [ ] , magic [ ] , scimpute [ ] , dca [ ] , saver [ ] , randomly [ ] , knn-impute [ ] . graphwavelet based denoising provided better improvement in auc for overlap of predicted network with known interaction than other methods meant for imputing and filtering scrna-seq profiles (supplementary figure s a ). similarly in comparison to graph-wavelet based denoising, the other methods did not provided substantial improvement in auc for overlap among gene-network inferred by two data-sets of mesc (fig. c , supplementary figure s b ). however, graph wavelet-based filtering improved the overlap between networks inferred from different batches of scrna-seq profile of mesc even if they were denoised separately (fig. c , supplementary figure s b ). with φ s based edge scores the overlap among predicted gene-network increased by % due to graph-wavelet based denoising (fig. c ). the improvement in overlap among networks inferred from two batches hints that graph-wavelet denoising is different from imputation methods and has the potential to substantially improve gene-network inference using their expression profiles. improved gene-network inference from single-cell profile reveal agebased regulatory differences improvement in overlap among inferred genenetworks from two expression data-set for a cell type also hints that after denoising predicted networks are closer to true gene-interaction profiles. hence using our denoising approach before estimat-ing the difference in inferred gene-networks due to age or external stimuli could reflect true changes in the regulatory pattern. such a notion inspired us to compare gene-networks inferred for young and old pancreatic cells using their scrna-seq profile filtered by our tool [ ] . martin et al. defined three age groups, namely juvenile ( month- years), young adult ( - years) and aged ( - years) [ ] . we applied graph-wavelet based denoising of pancreatic cells from three different groups separately. in other words, we did not mix cells from different age groups while denoising. graph-wavelet based denoising of a singlecell profile of pancreatic cells caused better performance in terms of overlap with protein-protein interaction (ppi) (fig. a , supplementary figure s a ). even though like chen et al. [ ] we have used ppi to measure improvement in genenetwork inference, it may not be reflective of all gene-interactions. hence we also used the criteria of increase in overlap among predicted networks for same cell-types to evaluate our method for scrnaseq profiles of pancreatic cells. denoising scrnaseq profiles also increased overlap between inferred gene-network among pancreatic cells of the old and young individuals (fig. b , supplementary figure s b ). we performed quantile normalization of original and denoised expression matrix taking all age groups together to bring them on the same scale to calculate the variance of expression across cells of every gene. the old and young pancreatic alpha cells had a higher level of median variance of expression of genes than juvenile. however, after graph-wavelet based denoising, the variance level of genes across all the age groups became almost equal and had similar median value (fig. c ). notice that, it is not trivial to estimate the fraction of variances due to transcriptional or technical noise. nonetheless, graph-wavelet based denoising seemed to have reduced the noise level in single-cell expression profiles of old and young adults. 
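the consistency comparison used above (overlap between gene-networks inferred independently from two batches of the same cell-type) can be illustrated with a small sketch in python. this is not the authors' evaluation code: here the overlap is summarized with a jaccard index over the top-ranked edges, whereas the paper reports auc-based overlap, and the cutoff k is purely illustrative.

import numpy as np

def top_edges(weight_matrix, k=1000):
    """return the k strongest undirected edges (i, j) with i < j."""
    n = weight_matrix.shape[0]
    iu = np.triu_indices(n, k=1)
    weights = np.abs(weight_matrix[iu])
    order = np.argsort(weights)[::-1][:k]
    return {(iu[0][idx], iu[1][idx]) for idx in order}

def edge_jaccard(net_a, net_b, k=1000):
    """overlap of the top-k edges of two inferred networks."""
    ea, eb = top_edges(net_a, k), top_edges(net_b, k)
    return len(ea & eb) / len(ea | eb)

# usage: net_a and net_b would be gene x gene co-expression matrices inferred
# separately from, e.g., the smart-seq and drop-seq mesc batches.
rng = np.random.default_rng(0)
net_a = rng.normal(size=(50, 50)); net_a = (net_a + net_a.T) / 2
net_b = net_a + 0.1 * rng.normal(size=(50, 50))
print(edge_jaccard(net_a, net_b, k=100))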
differential centrality in the co-expression network has been used to study changes in the influence of genes. however, noise in single-cell expression profiles can cause spurious differences in centrality. hence we visualized the differential degree of genes in network inferred using young and old cells scrna-seq profiles. the networks inferred from non-filtered expression had a much higher number of non-zero differential degree values in comparison to the de-noised version (fig. d, supplementary figure s c ). thus denoising seems to reduce differences among centrality, which could be due to randomness of noise. next, we analyzed the properties of genes whose variance dropped most due to graphwavelet based denoising. surprisingly, we found that top genes with the highest drop in variance due to denoising in old pancreatic beta cells were significantly associated with diabetes mellitus and hyperinsulinism. whereas, top genes with the highest drop in variance in young pancreatic beta cells had no or insignificant association with diabetes (fig. e) . a similar trend was observed with pancreatic alpha cells (supplementary figure s d ) . such a result hint that ageing causes increase in stochasticity of the expression level of genes associated with pancreas function and denoising could help in properly elucidating their dependencies with other genes. improvement in gene-network inference for studying regulatory differences among young and old lung cells. studying cell-type-specific changes in regulatory networks due to ageing has the potential to provide better insight about predisposition for disease in the older population. hence we inferred genenetwork for different cell-types using scrna-seq profiles of young and old mouse lung cells published by kimmel et al. [ ] .the lower lung epithelia where a few viruses seem to have the most deteriorating effect consists of multiple types of cells such as bronchial epithelial and alveolar epithelial cells, fibroblast, alveolar macrophages, endothelial and other immune cells. the alveolar epithelial cells, also called as pneumocytes are of two major types. the type alveolar (at ) epithelial cells for major gas exchange surface of lung alveolus has an important role in the permeability barrier function of the alveolar membrane. type alveolar cells (at ) are the progenitors of type cells and has the crucial role of surfactant production. at cells ( or pneumocytes type ii) cells are a prime target of many viruses; hence it is important to understand the regulatory patterns in at cells, especially in the context of ageing. we applied our method of denoising on scrnaseq profiles of cells derived from old and young mice lung [ ] . graph wavelet based denoising lead to an increase in consistency among inferred genenetwork for young and old mice lung for multiple cell-types (fig. a) . graph-wavelet based denoising also lead to an increase in consistency in predicted gene-network from data-sets published by two different groups (fig. b) . the increase in overlap of gene-networks predicted for old and young cells scrna-seq profile, despite being denoised separately, hints about a higher likelihood of predicting true interactions. hence the chances of finding gene-network based differences among old and young cells were less likely to be dominated by noise. we studied ageing-related changes in pagerank centrality of nodes(genes). 
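the differential degree and differential pagerank comparisons described here can be sketched as follows. this is an illustration only: it assumes that co-expression matrices are thresholded into unweighted graphs, the cutoff value is arbitrary, and networkx is used as a convenient stand-in for whatever graph library the authors actually employed.

import numpy as np
import networkx as nx

def centralities(coexpr, cutoff=0.5):
    """build a graph from a gene x gene co-expression matrix and return
    per-gene degree and pagerank."""
    adj = (np.abs(coexpr) > cutoff).astype(int)
    np.fill_diagonal(adj, 0)
    g = nx.from_numpy_array(adj)
    return dict(g.degree()), nx.pagerank(g)

def differential_centrality(coexpr_old, coexpr_young, cutoff=0.5):
    """difference (old minus young) in degree and pagerank for every gene."""
    deg_o, pr_o = centralities(coexpr_old, cutoff)
    deg_y, pr_y = centralities(coexpr_young, cutoff)
    genes = range(coexpr_old.shape[0])
    diff_degree = {i: deg_o[i] - deg_y[i] for i in genes}
    diff_pagerank = {i: pr_o[i] - pr_y[i] for i in genes}
    return diff_degree, diff_pagerank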
since pagerank centrality provides a measure of "popularity" of nodes, studying its change has the potential to highlight the change in the influence of genes. first, we calculated the differential pagerank of genes among young and old at cells (supporting file- ) and performed gene-set enrichment analysis using enrichr [ ] . the top genes with higher pagerank in young at cells had enriched terms related to integrin signalling, ht type receptor mediated signalling, h histamine receptor-mediated signalling pathway, vegf, cytoskeleton regulation by rho gtpase and thyrotropin activating receptor signalling (fig. c) . we ignored oxytocin and thyrotropin-activating hormone-receptor mediated signalling pathways as an artefact, as the expression of oxytocin and trh receptors in at cells was low. moreover, genes appearing for the terms "oxytocin receptor-mediated signalling" and "thyrotropin activating hormone-mediated signalling" were also present in the gene-set for ht type receptor-mediated signalling pathway. we found literature support for activity in at cells for most of the enriched pathways. however, there were very few studies which showed their differential importance in old and young cells; for example, bayer et al. demonstrated mrna expression of several -htr including -ht , ht and ht in alveolar epithelial type ii (at ) cells and their role in calcium ion mobilization. similarly, chen et al. [ ] showed that a histamine receptor antagonist reduced pulmonary surfactant secretion from adult rat alveolar at cells in primary culture. the vegf pathway is active in at cells, and it is known that ageing has an effect on vegf mediated angiogenesis in lung. moreover, vegf based angiogenesis is known to decline with age [ ] .

figure legend (continued): for comparing two networks it is important to reduce differences due to noise. hence the plot here shows the similarity of predicted networks before and after graph-wavelet based denoising. the results shown here are for the correlation-based co-expression network, while similar results are shown using the ρ score in supplementary figure s . (c) variances of expression of genes across single-cells before and after denoising (filtering) are shown here. variances of genes in a cell-type were calculated separately for different stages of ageing (young, adult and old). the variance (estimate of noise) is higher in older alpha and beta cells compared to young. however, after denoising the variance of genes in all ageing stages becomes equal. (d) the effect of noise on estimated differential centrality is shown here. the difference in the degree of genes in the network estimated for old and young pancreatic beta cells is shown here. the number of non-zero differential-degree values estimated using denoised expression is lower than with unfiltered expression based networks. (e) enriched panther pathway terms for top genes with the highest drop in variance after denoising in old and young pancreatic beta cells.

we further performed gene-set enrichment analysis for genes with increased pagerank in older mice at cells. for the top genes with higher pagerank in old at cells, the terms which appeared among the most enriched in both kimmel et al. and angelids et al. data-sets were t cell activation, b cell activation, cholesterol biosynthesis and fgf signaling pathway, angiogenesis and cytoskeletal regulation by rho gtpase (fig. d) . thus, there was % overlap in results from kimmel et al. and angelids et al.
data-sets in terms of enrichment of pathway terms for genes with higher pagerank in older at cells (supplementary figure s a , supporting file- , supporting file- ). overall in our analysis, inflammatory response genes showed higher importance in older at cells. the increase in the importance of cholesterol biosynthesis genes hand in hand with higher inflammatory response points towards the influence of ageing on the quality of pulmonary surfactants released by at . al saedy et al. recently showed that high level of cholesterol amplifies defects in surface activity caused by oxidation of pulmonary surfactant [ ] . we also performed enrichr based analysis of differentially expressed genes in old at cells (supporting file- ). for genes up-regulated in old at cells compared to young, terms which reappeared were cholesterol biosynthesis, t cell and b cell activation pathways, angiogenesis and inflammation mediated by chemokine and cytokine signalling. whereas few terms like ras pathway, jak/stat signalling and cytoskeletal signalling by rho gt-pase did not appear as enriched for genes upregulated in old at cells ( figure b , supporting file- ). however previously, it has been shown that the increase in age changes the balance of pulmonary renin-angiotensin system (ras), which is correlated with aggravated inflammation and more lung injury [ ] . jak/stat pathway is known to be involved in the oxidative-stress induced decrease in the expression of surfactant protein genes in at cells [ ] . overall, these results indicate that even though the expression of genes involved in relevant pathways may not show significant differences due to ageing, but their regulatory influence could be changing substantially. in order to further gain insight, we analyzed the changes in the importance of transcription factors in ageing at cells. among top genes with higher pagerank in old at cells, we found several relevant tfs. however, to make a stringent list, we considered only those tfs which had nonzero value for change in degree among gene-network for old and young at cells. overall, with kimmel at el. data-set, we found tfs with a change in pagerank and degree (supplementary table- ) due to ageing for at cells (fig. e) . the changes in centrality (pagerank and degree) of tfs with ageing was coherent with pathway enrichment results. such as etv which has higher degree and pagerank in older cells, is known to be stabilized by ras signalling in at cells [ ] . in the absence of etv at cell differentiate to at cells [ ] . another tf jun (c-jun) having stronger influence in old at cells, is known to regulate inflammation lung alveolar cells [ ] . we also found jun to be having co-expression with jund and etv in old at cell (supplementary figure s ) . jund whose influence seems to increase in aged at cells is known to be involved in cytokine-mediated inflammation. among the tfs stat - which are involved in jak/stat signalling, stat showed higher degree and pagerank in old at . androgen receptor(ar) also seem to have a higher influence in older at cells (fig. e ). androgen receptor has been shown to be expressed in at cells [ ] . we further performed a similar analysis for the scrna-seq profile of interstitial macrophages(ims) in lungs and found literature support for the activity of enriched pathways (supporting file- ). 
whereas gene-set enrichment output for important genes in older ims had some similarity with results from at cells as both seem to have higher pro-inflammatory response pathway such as t cell activation and jak/stat signalling. however, unlike at cells, ageing in ims seem to cause an increase in glycolysis and pentose phosphate pathway. higher glycolysis and pentose phosphate pathway activity levels have been previously reported to be involved in the pro-inflammatory response in macrophages by viola et al. [ ] . in our results, ras pathway was not enriched significantly for genes with a higher importance in older macrophages. such results show that the pro-inflammatory pathways activated due to aging could vary among different cell-types in lung. for the same type of cells, the predicted networks for old and young cells seem to have higher overlap after graph-wavelet based filtering. the label "raw" here means that, both networks (for old and young) were inferred using unfiltered scrna-seq profiles. wheres the same result from denoised scrna-seq profile is shown as filtered. networks were inferred using correlation-based co-expression. in current pandemic due to sars-cov- , a trend has emerged that older individuals have a higher risk of developing severity and lung fibrosis than the younger population. since our analysis revealed changes in the influence of genes in lung cells due to ageing, we compared our results with expression profiles of lung infected with sars-cov- published by blanco-melo et al. [ ] . recently it has been shown that at cells predominantly express ace , the host cell surface receptor for sars-cov- attachment and infection [ ] . thus covid infection could have most of the dominant effect on at cells. we found that genes with significant upregulation in sars-cov- infected lung also had higher pagerank in gene-network inferred for older at cells (fig. a) . we also repeated the process of network inference and calculating differential centrality among old and young using all types of cells in the lung together (supporting file- ). we performed gene-set enrichment for genes up-regulated in sars-cov- infected lung. majority of the panther pathway terms enriched for genes up-regulated in sars-cov- infected lung also had enrichment for genes with higher pagerank in old lung cells (combined). total out of significantly enriched panther pathways for genes up-regulated in covid- infected lung, were also enriched for genes with higher pagerank in older at cells in either of the two data-sets used here ( in angelids et al., in kimmel et al. data-based results). among the top enriched wikipathway terms for genes up-regulated in covid infected lung, has significant enrichment for genes with higher pagerank in old at cells (supporting file- ). however, the term type-ii interferon signalling did not have significant enrichment for genes with higher pagerank in old at cells. we further investigated enriched motifs of transcription factors in promoters of genes up-regulated in covid infected lungs (supplementary methods). for promoters of genes up-regulated in covid infected lung top two enriched motifs belonged to irf (interferon regulatory factor) and ets family tfs. notice that etv belong to sub-family of ets groups of tfs. further analysis also revealed that most of the genes whose expression is positively cor-related with etv in old at cells is up-regulated in covid infected lung. 
in contrast, genes with negative correlation with etv in old at cells were mostly down-regulated in covid infected lung. a similar trend was found for stat gene. however, for erg gene with higher pagerank in young at cell, the trend was the opposite. in comparison to genes with negative correlation, positively correlated genes with erg in old at cell, had more downregulation in covid infected lung. such trend shows that a few tfs like etv , stat with higher pagerank in old at cells could be having a role in poising or activation of genes which gain higher expression level on covid infection. inferring regulatory changes in pure primary cells due to ageing and other conditions, using singlecell expression profiles has tremendous potential for various applications. such applications could be understanding the cause of development of a disorder or revealing signalling pathways and master regulators as potential drug targets. hence to support such studies, we developed gwnet to assist biologists in work-flow for graph-theory based analysis of single-cell transcriptome. gwnet improves inference of regulatory interaction among genes using graph-wavelet based approach to reduce noise due to technical issues or cellular biochemical stochasticity in gene-expression profiles. we demonstrated the improvement in gene-network inference using our filtering approach with benchmark data-sets from dream consortium and several single-cell expression profiles. using different ways for inferring network, we showed how our approach for filtering gene-expression can help genenetwork inference methods. our results of comparison with other imputation, smoothing methods and graph-fourier based filtering showed that graph-wavelet is more adaptive to changes in the expression level of genes with changing neighborhood of cells. thus graph-wavelet based denoising is a conceptually different approach for preprocessing of gene-expression profiles. there is a huge body of literature on inferring gene-networks from bulk gene-expression profile and utilizing it to find differences among two groups of samples. however, applying classical procedures on single- shown for erg, which have higher pagerank in young at cells. most of the genes which had a positive correlation with etv and stat expression in old murine at cells were up-regulated in covid infected lung. whereas for erg the trend is the opposite. genes positively correlated with erg genes in old at had more down-regulation than genes with negative correlation. such results hint that tfs whose influence (pagerank) increase during ageing could be involved activating or poising the genes up-regulated in covid infection. cell transcriptome profiles has not proved to be effective. our method seems to resolve this issue by increasing consistency and overlap among gene-networks inferred using an expression from different sources (batches) for the same cell-type even if each data-sets was filtered independently. such an increase in overlap among predicted network from independently processed data-sets from different sources hint that estimated dependencies among genes reach closer to true values after graphwavelet based denoising of expression profiles. having network prediction closer to true values increases the reliability of comparison of a regulatory pattern among two groups of cells. 
moreover, recently chow and chen [ ] have shown that age-associated genes identified using bulk expression profiles of the lung are enriched among those induced or suppressed by sars-cov- infection. however, they did not perform analysis with systems-level approach. our analysis highlighted ras and jak/stat pathways to be enriched for genes with stronger influence in old at cells and genes up-regulated in covid infected lung. ras/mapk signalling is considered essential for self-renewal of at cell [ ] . similarly, jak/stat pathway is known to be activated in the lung during injury [ ] and influence surfactant quality [ ] . we have used murine aging-lung scrna-seq profiles however our analysis provides an important insight that regulatory patterns and master-regulators in old at cells are in such a configuration that they could be predisposing it for a higher level of ras and jak/stat signalling. androgen receptor (ar) which has been implicated in male pattern baldness and increased risk of males towards covid infection [ ] had higher pagerank and degree in old at cells. however, further investigation is needed to associate ar with severity on covid infection due to ageing. on the other hand, in young at cells, we find a high influence of genes involved in histamine h receptor-mediated signalling, which is known to regulate allergic reactions in lungs [ ] . another benefit of our approach of analysis is that it can highlight a few specific targets of further study for therapeutics. such as a kinase that binds and phosphorylates c-jun called as jnk is being tested in clinical trials for pulmonary fibrosis [ ] . androgen deprivation therapy has shown to provide partial protection against sars-cov- infection [ ] . on the same trend, our analysis hints that etv could also be considered as drug-target to reduce the effect of ageing induced ras pathway activity in the lung. we used the term noise in gene-expression according to its definition by several researchers such as raser and o'shea [ ] ; as the measured level of variation in gene-expression among cells supposed to be identical. hence we first made a base-graph (networks) where supposedly identical cells are connected by edges. for every gene we use this basegraph and apply graph-wavelet transform to get an estimate of variation of its expression in every sample (cells) with respect to other connected samples at different levels of graph-spectral resolution. for this purpose, we first calculated distances among samples (cells). to get a better estimate of distances among samples (cells) one can perform dimension reduction of the expression matrix using tsne [ ] or principal component analysis. we considered every sample (cell) as a node in the graph and connected two nodes with an edge only when one of them was among k-nearest neighbors of the other. here we decide the value of k in the range of - , based on the number of samples(cells) in the expression data-sets. thus we calculated the preliminary adjacency matrix using k-nearest neighbours (knn) based on euclidean distance metric between samples of the expression matrix. we used this adjacency matrix to build a base-graph. thus each vertex in the base-graph corresponds to each sample and edge weights to the euclidean distance between them. 
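the knn base-graph construction described above can be sketched in a few lines of python. the use of scikit-learn's nearest-neighbour search is an assumption made for illustration (the text also mentions that distances may optionally be computed on a tsne or pca reduced matrix), and the choice of k is left to the caller.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_base_graph(expr_cells_by_features, k=10):
    """return a symmetric n x n weighted adjacency matrix over cells, where
    two cells are connected if either is among the k nearest neighbours of
    the other, and the edge weight is their euclidean distance."""
    n = expr_cells_by_features.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(expr_cells_by_features)
    dist, idx = nn.kneighbors(expr_cells_by_features)   # column 0 is the cell itself
    adjacency = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):        # skip the self-neighbour
            adjacency[i, j] = d
            adjacency[j, i] = d                           # symmetrize (knn of either cell)
    return adjacency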
the weighted graph g built using the knn based adjacency matrix comprises a finite set of vertices v, which corresponds to cells (samples), a set of edges e denoting connections between samples (if they exist), and a weight function which gives non-negative weighted connections between cells (samples). this weighted graph can also be defined as an n x n (n being the number of cells) weighted adjacency matrix a, where a_ij = 0 if there is no edge between cells i and j, and a_ij = weight(i, j) if there exists an edge between i and j. the degree of a cell in the graph is the sum of weights of edges incident on that cell. the diagonal degree matrix d of this graph has entries d_ij = d(i) if i = j, and 0 otherwise. a non-normalized graph laplacian operator l for a graph is defined as l = d − a. the normalized form of the graph laplacian operator is defined as l_norm = d^(−1/2) l d^(−1/2) = i − d^(−1/2) a d^(−1/2). both laplacian operators produce different eigenvectors [ ] . however, we have used the normalized form of the laplacian operator for the graph between cells. the graph laplacian is further used for graph fourier transformation of signals on nodes (see supplementary methods) ([ ] [ ] ). for filtering in the fourier domain, we used a chebyshev filter for the gene expression profile. we took the expression of each gene at a time, considering it as a signal, and projected it onto the raw graph (where each vertex corresponds to each sample) object [ ] . we took the forward fourier transform of the signal, filtered the signal using a chebyshev filter in the fourier domain, and then inverse transformed the signal to calculate the filtered expression. this same procedure was repeated for every gene, which finally gave us the filtered gene expression. spectral graph wavelets entail choosing a non-negative real-valued kernel function g which behaves as a band-pass filter, analogous to the kernel in the classical fourier transform. the re-scaled kernel function of the graph laplacian gives the wavelet operator, which eventually produces graph wavelet coefficients at each scale. using continuous functional calculus, one can define a function of a self-adjoint operator on the basis of the spectral representation of the graph; for a graph with a finite-dimensional laplacian, this can be achieved using the eigenvalues λ_l and eigenvectors u_l of the laplacian l [ ] . the wavelet operator is given by t_g = g(l), and t_g f gives wavelet coefficients for a signal f at unit scale. this operator acts on the eigenvectors u_l as t_g u_l = g(λ_l) u_l. hence, for any graph signal, the operator t_g acts by adjusting each graph fourier coefficient as (t_g f)^(λ_l) = g(λ_l) f̂(λ_l), which by the inverse fourier transform gives t_g f(n) = Σ_l g(λ_l) f̂(λ_l) u_l(n). the wavelet operator at scale s is given as t_g^s = g(sl). these wavelet operators are localized to obtain individual wavelets by applying them to δ_n, with δ_n being a signal with value 1 on vertex n and zero otherwise [ ] . thus, considering the coefficients at every scale, the original signal can be recovered through the corresponding inverse wavelet transform. here, instead of filtering in the fourier domain, we took wavelet coefficients of each gene expression signal at different scales. thresholding was applied on each scale to filter the wavelet coefficients. we applied both hard and soft thresholding on the wavelet coefficients. for soft thresholding, we implemented the well-known methods sureshrink and bayesshrink. finding an optimal threshold for wavelet coefficients for denoising linear signals and images has remained a subject of intensive research. we evaluated both soft and hard thresholding approaches and also tested an information-theoretic criterion known as the minimum description length (mdl) principle.
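a simplified, self-contained sketch of this graph-wavelet denoising of one gene's expression signal is given below. it uses a full eigendecomposition of the normalized laplacian, a simple band-pass kernel g(x) = x·exp(−x), three arbitrary scales, a percentile hard threshold and least-squares reconstruction; the authors' implementation (kernel choice, scales, chebyshev polynomial approximation, scaling function, sureshrink/bayesshrink and the default percentile cutoff) may differ, so this is only an illustration of the idea, not the published method.

import numpy as np

def normalized_laplacian(adjacency):
    deg = adjacency.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return np.eye(len(deg)) - d_inv_sqrt @ adjacency @ d_inv_sqrt

def kernel(x):
    return x * np.exp(-x)            # non-negative band-pass kernel (illustrative choice)

def denoise_signal(adjacency, signal, scales=(5.0, 2.0, 1.0), percentile=70):
    """filter one gene's expression (signal over cells) with graph wavelets."""
    lap = normalized_laplacian(adjacency)
    eigval, eigvec = np.linalg.eigh(lap)
    # wavelet operators t_g^s = u g(s * lambda) u^t, stacked over scales
    operators = [eigvec @ np.diag(kernel(s * eigval)) @ eigvec.T for s in scales]
    frame = np.vstack(operators)                     # (n_scales * n) x n
    coeffs = frame @ signal                          # graph-wavelet coefficients
    cutoff = np.percentile(np.abs(coeffs), percentile)
    coeffs[np.abs(coeffs) < cutoff] = 0.0            # hard thresholding
    # least-squares (pseudoinverse) reconstruction from the retained coefficients
    return np.linalg.pinv(frame) @ coeffs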
using our tool gwnet, the user can choose from multiple options for finding the threshold, such as visushrink, sureshrink and mdl. here, we have used hard thresholding for most of the data-sets, as proper soft thresholding of graph-wavelet coefficients is itself a topic of intensive research and may need further fine-tuning. one can also use a hard-threshold value based on the best overlap between the predicted gene-network and protein-protein interactions (ppi). while applying it on multiple data-sets, we realized that the threshold cutoffs estimated by the mdl criterion and by the best overlap of the predicted network with known interactions and ppi were in the range of - percentile. for comparing predicted networks from multiple data-sets, we needed a uniform percentile cutoff to threshold graph-wavelet coefficients. hence, for uniform analysis of several data-sets, we have set a default threshold value of percentile, so that in default mode wavelet coefficients with absolute value below the percentile cutoff were made equal to zero. the gwnet tool is flexible, and any network inference method can be plugged into it for making regulatory inferences using a graph-theoretic approach. here, for single-cell rna-seq data, we have used gene-expression values in the form of fpkm (fragments per kilobase of exon model per million reads mapped). we pre-processed single-cell gene expression by quantile normalization and log transformation. to start with, we used spearman and pearson correlation to achieve a simple estimate of the measure of inter-dependencies among genes. we also used aracne (algorithm for the reconstruction of accurate cellular networks) to infer the network among genes. aracne first computes mutual information for each gene pair. then it considers all possible triplets of genes and applies the data processing inequality (dpi) to remove indirect interactions. according to the dpi, if gene i and gene j do not interact directly with each other but show dependency via gene k, the following inequality holds: i(g_i, g_j) ≤ min[i(g_i, g_k), i(g_k, g_j)], where i(g_i, g_j) represents the mutual information between gene i and gene j. aracne also removes interactions with mutual information less than a particular threshold eps, and we have used a fixed eps value for this threshold. recently, skinnider et al. [ ] showed the superiority of two measures of proportionality, rho (ρ) and phi (φ_s) [ ] , for estimating gene co-expression networks using single-cell transcriptome profiles. hence we also evaluated the benefit of graph-wavelet based denoising of gene-expression with the measures of proportionality ρ and φ_s. the measure of proportionality φ can be defined as φ(g_i, g_j) = var(g_i − g_j) / var(g_i), where g_i is the vector containing log values of expression of gene i across multiple samples (cells) and var() represents the variance function. the symmetric version of φ can be written as φ_s(g_i, g_j) = var(g_i − g_j) / var(g_i + g_j), whereas rho can be defined as ρ(g_i, g_j) = 1 − var(g_i − g_j) / (var(g_i) + var(g_j)). to estimate both measures of proportionality, ρ and φ, we used the 'propr' package [ ] . the networks inferred from filtered and unfiltered gene-expression were compared to the ground truth. ground truth for the dream challenge data-sets was already available, while for single-cell expression, we assembled the ground truth from the hippie (human integrated protein-protein interaction reference) database [ ] . we considered all possible edges in the network and sorted them based on the significance of edge weights. we calculated the area under the receiver operator curve for both raw and filtered networks by comparing against edges in the ground truth.
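the proportionality measures defined above translate directly into a few lines of numpy, shown below for reference. the 'propr' r package is the reference implementation, and the φ_s and ρ expressions follow the standard definitions from that literature (the exact formulas were dropped in this extraction), so this sketch is only illustrative.

import numpy as np

def phi(gi, gj):
    """phi(g_i, g_j) = var(g_i - g_j) / var(g_i); gi, gj are log-expression vectors."""
    return np.var(gi - gj) / np.var(gi)

def phi_s(gi, gj):
    """symmetric phi: var(g_i - g_j) / var(g_i + g_j)."""
    return np.var(gi - gj) / np.var(gi + gj)

def rho(gi, gj):
    """rho(g_i, g_j) = 1 - var(g_i - g_j) / (var(g_i) + var(g_j))."""
    return 1.0 - np.var(gi - gj) / (np.var(gi) + np.var(gj))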
receiver operator is a standard performance evaluation metrics from the field of machine learning, which has been used in the dream evaluation method with some modifications. the modification for receiver operating curve here is that for x-axis instead of false-positive rate, we used a number of edges sorted according to their weights. for evaluation all possible edges sorted based on their weights in network are taken from the gene-network inferred from filtered and raw graphs. we calculated improvement by measuring fold change between raw and filtered scores. we compared the results of our approach of graphwavelet based denoising with other methods meant for imputation or reducing noise in scrna-seq profiles. for comparison we used graph-fourier based filtering [ ] , magic [ ] , scimpute [ ] , dca [ ] , saver [ ] , randomly [ ] , knn-impute [ ] . brief descriptions and corresponding parameters used for other methods are written in supplementary method. the bulk gene-expression data used here evaluation was download from dream portal (http://dreamchallenges.org/project/dream- network-inference-challenge/). the single-cell expression profile of mesc generated using different protocols [ ] was downloaded for geo database (geo id: gse ). single-cell expression profile of pancreatic cells from individuals with different age groups was downloaded from geo database (geo id:gse ). the scrna-seq profile of murine aging lung published by kimmel et al. [ ] is available with geo id : gse . while aging lung scrna-seq data published by angelids et al. [ ] is available with geo id: gse . the code for graph-wavelet based filtering of gene-expression is available at http://reggen. iiitd.edu.in: /graphwavelet/index.html. the codes are present at https://github. com/reggenlab/gwnet/ and supporting files are present at https://github.com/reggenlab/ gwnet/tree/master/supporting$_$files. 
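the edge-ranking evaluation described above (score every possible gene pair, label it by presence in the gold-standard interaction set, summarize with an area under the curve) can be sketched as follows. note that this uses scikit-learn's standard roc auc, whereas the paper's modified curve puts the number of sorted edges rather than the false-positive rate on the x-axis, so the numbers are illustrative rather than a reproduction of the dream-style score.

import numpy as np
from sklearn.metrics import roc_auc_score

def edge_ranking_auc(weight_matrix, gold_edges):
    """weight_matrix: gene x gene inferred scores; gold_edges: set of (i, j) pairs with i < j."""
    n = weight_matrix.shape[0]
    iu = np.triu_indices(n, k=1)
    scores = np.abs(weight_matrix[iu])
    labels = np.array([(i, j) in gold_edges for i, j in zip(iu[0], iu[1])], dtype=int)
    return roc_auc_score(labels, scores)

# fold-change improvement of the filtered over the raw network, as in the text:
# auc_fold_change = edge_ranking_auc(net_filtered, gold) / edge_ranking_auc(net_raw, gold)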
an integrative approach for causal gene identification and gene regulatory pathway inference singlecell transcriptomics unveils gene regulatory network plasticity chemogenomic profiling of plasmodium falciparum as a tool to aid antimalarial drug discovery supervised, semi-supervised and unsupervised inference of gene regulatory networks reverse engineering cellular networks evaluating measures of association for single-cell transcriptomics evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data scenic: single-cell regulatory network inference and clustering scode: an efficient regulatory network inference algorithm from single-cell rna-seq during differentiation gene regulatory network inference from single-cell data using multivariate information measures characterizing noise structure in single-cell rna-seq distinguishes genuine from technical stochastic allelic expression noise in gene expression: origins, consequences, and control, science comparative assessment of differential network analysis methods murine single-cell rna-seq reveals cellidentity-and tissue-specific trajectories of aging wisdom of crowds for robust gene network inference genenetweaver: in silico benchmark generation and performance profiling of network inference methods enhancing experimental signals in single-cell rna-sequencing data using graph signal processing comparative analysis of single-cell rna sequencing methods a gene regulatory network in mouse embryonic stem cells recovering gene interactions from single-cell data using data diffusion an accurate and robust imputation method scimpute for single-cell rna-seq data single-cell rna-seq denoising using a deep count autoencoder saver: gene expression recovery for singlecell rna sequencing a random matrix theory approach to denoise single-cell data missing value estimation methods for dna microarrays single-cell analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns enrichr: interactive and collaborative html gene list enrichment analysis tool histamine stimulation of surfactant secretion from rat type ii pneumocytes aging impairs vegf-mediated, androgen-dependent regulation of angiogenesis dysfunction of pulmonary surfactant mediated by phospholipid oxidation is cholesterol-dependent age-dependent changes in the pulmonary renin-angiotensin system are associated with severity of lung injury in a model of acute lung injury in rats mapk and jak-stat signaling pathways are involved in the oxidative stress-induced decrease in expression of surfactant protein genes transcription factor etv is essential for the maintenance of alveolar type ii cells, proceedings of the national academy of sciences of the united states of targeted deletion of jun/ap- in alveolar epithelial cells causes progressive emphysema and worsens cigarette smoke-induced lung inflammation androgen receptor and androgen-dependent gene expression in lung the metabolic signature of macrophage responses imbalanced host response to sars-cov- drives development of covid- single cell rna sequencing of human tissues identify cell types and receptors of human coronaviruses the aging transcriptome and cellular landscape of the human lung in relation to sars-cov- jak-stat pathway activation in copd, the european androgen hazards with covid- the h histamine receptor regulates allergic lung responses late breaking abstract -evaluation of the jnk inhibitor, cc- , in a phase b pulmonary 
fibrosis trial androgen-deprivation therapies for prostate cancer and risk of infection by sars-cov- : a population-based study (n = ) visualizing data using t-sne discrete signal processing on graphs: frequency analysis wavelets on graphs via spectral graph theory how should we measure proportionality on relative gene expression data? propr: an r-package for identifying proportionally abundant features using compositional data analysis hippie v . : enhancing meaningfulness and reliability of protein-protein interaction networks an atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics we thank dr gaurav ahuja for providing us valuable advice on analysis of single-cell expression profile of ageing cells. none declared.vibhor kumar is an assistant professor at iiit delhi, india. he is also an adjunct scientist at genome institute of singapore. his interest include genomics and signal processing.divyanshu srivastava completed his thesis on graph signal processing for masters degree at computational biology department in iiit delhi, india. he has applied graph signal processing on protein structures and gene-expression data-sets.shreya mishra is a phd student at computational biology department in iiit delhi, india. her interest include data sciences and genomics. • we found that graph-wavelet based denoising of gene-expression profiles of bulk samples and singlecells can substantially improve gene-regulatory network inference.• more consistent prediction of gene-network due to denoising lead to reliable comparison of predicted networks from old and young cells to study the effect of ageing using single-cell transcriptome.• our analysis revealed biologically relevant changes in regulation due to aging in lung pneumocyte type ii cells, which had similarity with effects of covid infection in human lung.• our analysis highlighted influential pathways and master regulators which could be topic of further study for reducing severity due to ageing. key: cord- -vlklgd x authors: kim, yushim; kim, jihong; oh, seong soo; kim, sang-wook; ku, minyoung; cha, jaehyuk title: community analysis of a crisis response network date: - - journal: soc sci comput rev doi: . / sha: doc_id: cord_uid: vlklgd x this article distinguishes between clique family subgroups and communities in a crisis response network. then, we examine the way organizations interacted to achieve a common goal by employing community analysis of an epidemic response network in korea in . the results indicate that the network split into two groups: core response communities in one group and supportive functional communities in the other. the core response communities include organizations across government jurisdictions, sectors, and geographic locations. other communities are confined geographically, homogenous functionally, or both. we also find that whenever intergovernmental relations were present in communities, the member connectivity was low, even if intersectoral relations appeared together within them. other or are friends, know each other, etc." which generally refers to a social circle (mokken, , p. ) , while a community is formed through concrete social relationships (e.g., high school friends) or sets of people perceived to be similar, such as the italian community and twitter community (gruzd, wellman, & takhteyev, ; hagen, keller, neely, depaula, & robert-cooperman, ) . in social network analysis, a clique is operationalized as " . . . 
a subset of actors in which every actor is adjacent to every other actor in the subset (borgatti, everett, & johnson, , p. ) , while communities refer to " . . . groups within which the network connections are dense, but between which they are sparser" (newman & girvan, , p. ) . the clique and its variant definitions (e.g., n-cliques and k-cores) focus on internal edges, while the community is a concept based on the distinction between internal edges and the outside. we argue that community analysis can provide useful insights about the interrelations among diverse organizations in the ern. we have not yet found any studies that have investigated cohesive subgroups in large multilevel, multisectoral erns through a community lens. with limited guidance from the literature on erns, we lack specific expectations or hypotheses about what the community structure in the network may look like. therefore, our study focuses on identifying and analyzing communities in the middle east respiratory syndrome coronavirus (mers) response in south korea as a case study. we address the following research questions: ( ) in what way were distinctive communities divided in the ern? and ( ) how did the interorganizational relations relate to the internal characteristics of the communities? by detecting and analyzing the community structure in an ern, we offer insights for future empirical studies on erns. the interrelations in erns have been examined occasionally by analyzing the entire network's structure. for example, the katrina case exhibited a large and sparse network, in which a small number of nodes had a large number of edges and a large number of nodes had a small number of edges (butts, acton, & marcum, ) . the katrina response network can be thought of as " . . . a loosely connected set of highly cohesive clusters, surrounded by an extensive 'halo' of pendant trees, small independent components, and isolates" (butts et al., , p. ) . the network was sparse and showed a tree-like structure but also included cohesive substructures. other studies on the katrina response network have largely concurred with these observations (comfort & haase, ; kapucu, arslan, & collins, ) . in identifying cohesive subgroups in the katrina response network, these studies rely on the analysis of cliques: "a maximal complete subgraph of three or more nodes" (wasserman & faust, , p. ) or clique-like (n-cliques or k-cores). the n-cliques can include nodes that are not in the clique but are accessible. similarly, k-cores refer to maximal subgraphs with a minimum degree of at least k. many cliques were identified in the katrina response network, in which federal and state agencies appeared frequently (comfort & haase, ; kapucu, ) . using k-cores analysis, butts, acton, and marcum ( ) suggest that the katrina response network's inner structure was built around a small set of cohesive subgroups that was divided along institutional lines corresponding to five state clusters (alabama, colorado, florida, georgia, and virginia), a cluster of u.s. federal organizations, and one of nongovernmental organizations. while these studies suggest the presence of cohesive subgroups in erns, we have not found any research that thoroughly discussed subsets of organizations' significance in erns. from the limited literature, we identify two different, albeit related, reasons that cohesive subgroups have interested ern researchers. 
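as a concrete illustration of the clique, k-core and community notions contrasted above, the toy example below uses networkx on a small made-up graph; the organization names are hypothetical and not drawn from any katrina or mers data set.

import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("fema", "state_ema"), ("fema", "red_cross"), ("state_ema", "red_cross"),   # a 3-clique
    ("state_ema", "county_ema"), ("county_ema", "fire_dept"),
    ("fire_dept", "police"), ("police", "county_ema"),
])

cliques = [c for c in nx.find_cliques(g) if len(c) >= 3]   # maximal cliques of size >= 3
core_2 = nx.k_core(g, k=2)                                 # maximal subgraph with minimum degree 2

print("cliques:", cliques)
print("2-core nodes:", sorted(core_2.nodes()))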
in their analysis of cohesive subgroups using cliques, comfort and haase ( ) assume that a cohesive subgroup can facilitate achieving shared tasks as a group, but it can be less adept at managing the full flow of information and resources across groups and thus decreasing the entire network's coherence. kapucu and colleagues ( ) indicate that the recurrent patterns of interaction among the sets of selected organizations may be the result of excluding other organizations in decision-making, which may be a deterrent to all organizations' harmonious concerted efforts in disaster responses. comfort and haase ( ) view cliques as an indicator of " . . . the difficulty of enabling collective action across the network" (p. ), and others have adhered closely to this perspective (celik & corbacioglu, ; hossain & kuti, ; kapucu, ) . cohesive subgroups such as cliques are assumed to be a potential hindrance to the entire network's performance. the problem with this perspective is that one set of eyes can perceive cohesive subgroups in erns as a barrier, while another can regard them as a facilitator of an effective response. while disaster and emergency response plans are inherently limited and not implemented in practice as intended (clarke, ) , stakeholder organizations' responses may be performed together with presumed structures, particularly in a setting in which government entities are predominant. for example, the incident command system (ics) was designed to improve response work's efficiency by constructing a standard operating procedure (moynihan, ). structurally, one person serves as the incident commander who is responsible for directing all other responders (kapucu & garayev, ) . ics is a somewhat hierarchical command-and-control system with functional arrangements in five key resources and capabilities-that is, command, operations, planning, logistics, and finance (kapucu & garayev, ) . in an environment in which such an emergency response model is implemented, it is realistic to expect clusters and subgroups to reflect the model's structural designs and arrangements, and they may be intentionally designed to facilitate coordination, communication, and collaboration with other parts or subgroups efficiently in a large response network. others are interested in identifying cohesive subgroups because they may indicate a lack of cross-jurisdictional and cross-sectoral collaboration in erns. during these responses, public organizations in different jurisdictions participate, and a sizable number of organizations from nongovernmental sectors also become involved (celik & corbacioglu, ; comfort & haase, ; kapucu et al., ; spiro, acton, & butts, ) . organizational participation by multiple government levels and sectors is often necessary because knowledge, expertise, and resources are distributed in society. participating organizations must collaborate and coordinate their efforts. however, studies have suggested that interactions in erns are limited and primarily occur among similar organizations, particularly within the same jurisdiction. that is, public organizations tend to interact more frequently with other public organizations in specific geographic locations (butts et al., ; hossain & kuti, ; kapucu, ; tang, deng, shao, & shen, ) . 
these studies indicate that organizations have been insufficiently integrated across government jurisdictions (tang et al., ) or sectors (butts et al., ; hossain & kuti, ) , and the identification of cliques composed of similar organizations reinforces such a concern. in our view, there is a greater, or perhaps more interesting, question related to the crossjurisdictional and cross-sectoral integration in interorganizational response networks: how are intergovernmental relations mixed with intersectoral relations in erns? here, we use the term interorganizational relations to refer to both intergovernmental and intersectoral relations. intergovernmental relations refer to the interaction among organizations across different government levels (local, provincial, and national) , and intersectoral relations involve the interaction among organizations across different sectors (public, private, nonprofit, and civic sectors). recent studies have suggested that both intergovernmental and intersectoral relations shape erns (kapucu et al., ; kapucu & garayev, ; tang et al., ) , but few have analyzed the way the two interorganizational relations intertwine. if the relation interdependencies in the entire network are of interest to ern researchers, as is the case in this article, focusing on cliques may not necessarily be the best approach to the question because clique analysis may continue to find sets of selected organizations that are tightly linked for various reasons. the analysis of cliques is a very strict way of operationalizing cohesive subgroups from a social network perspective (moody & coleman, ) , and there are two issues with using it to identify cohesive subgroups in erns. first, clique analysis assumes complete connections of three or more subgroup members, while real-world networks tend to have many small overlapping cliques that do not represent distinct groups (moody & coleman, ) . even if substantively meaningful cliques appear, they may not necessarily imply a lack of information flow across subgroups or other organizations' exclusion, as previous ern studies have assumed (comfort & haase, ; kapucu et al., ) . second, clique analysis assumes no internal differentiation in members' structural position within the subgroup (wasserman & faust, ) . in a task-oriented network such as an ern, organizations within a subgroup may look similar (e.g., all fire organizations). however, this does not imply that they are identical in their structural positions. when these assumptions in clique analysis do not hold, identifying cohesive subgroups as cliques is inappropriate (wasserman & faust, ) . similarly, other clique-like approaches (n-cliques and k-cores) demand an answer to the question: "what is the n-or k-?" the clique and clique-like approaches have a limited ability to define and identify cohesive subgroups in a task-oriented network because they do not clearly explain why the subgroups need to be defined and identified in such a manner. we proposed a different way of thinking about and finding subsets of organizations in erns: community. when a network consists of subsets of nodes with many edges that connect nodes of the same subset, but few that lay between subsets, the network is said to have a community structure (wilkinson & huberman, ) . network researchers have developed methods with which to detect communities (fortunato, latora, & marchiori, ; latora & marchiori, ; lim, kim, & lee, ; newman & girvan, ; yang & leskovec, ) . 
optimization approaches, such as the louvain and leiden methods, which we use in this article, sort nodes into communities by maximizing a clustering objective function (e.g., modularity). beginning with each node in its own group, the algorithm joins groups together in pairs, choosing the pairs that maximize the increase in modularity (moody & coleman, ). this method performs an iterative process of node assignments until modularity is maximized and leads to a hierarchical nesting of nodes (blondel, guillaume, lambiotte, & lefebvre, ). recently, the louvain algorithm was upgraded and improved as the leiden algorithm, which addresses some issues in the louvain algorithm (traag, waltman, & van eck, ). modularity (q), which shows the quality of partitions, is measured and assessed quantitatively as

$q = \sum_i \left[ e_{ii} - \left( \sum_j e_{ij} \right)^{2} \right],$

in which e_ii is the fraction of the intra-edges of community i over all edges, and e_ij is the fraction of the inter-edges between community i and community j over all edges. modularity scores are used to compare assignments of nodes into different communities and also the final partitions. it is calculated as a normalized index value: if there is only one group in a network, q takes the value of zero; if all ties are within separate groups, q takes the maximum value of one. thus, a higher q indicates a greater portion of intra- than inter-edges, implying a network with a strong community structure (fortunato et al., ). currently, there are two challenges in community detection studies. first, the modular structure in complex networks usually is not known beforehand (traag et al., ). we know the community structure only after it is identified. second, there is no formal definition of community in a graph (reichardt & bornholdt, ; wilkinson & huberman, ); it simply is a concept of relative density (moody & coleman, ). a high modularity score ensures only that " . . . the groups as observed are distinct, not that they are internally cohesive" (moody & coleman, , p. ) and does not guarantee any formal limit on the subgroup's internal structure. thus, internal structure must be examined, especially in such situations as erns. despite these limitations, efforts to reveal underlying community structures have been undertaken with a wide range of systems, including online and off-line social systems, such as an e-mail corpus of a million messages in organizations (tyler, wilkinson, & huberman, ), zika virus conversation communities on twitter (hagen et al., ), and jazz musician networks (gleiser & danon, ). further, one can exploit complex networks by identifying their community structure. for example, salathé and jones ( ) showed that community structures in human contact networks significantly influence infectious disease dynamics. their findings suggest that, in a network with a community structure, targeting individuals who bridge communities for immunization is better than intervening with highly connected individuals. we exploit community detection and analysis to understand an ern's substructure in the context of an infectious disease outbreak. it is difficult to know the way communities in erns will form beforehand without examining clusters and their compositions and connectivity in the network. 
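as an illustration of the optimization approach described above, the short python sketch below detects communities with the leiden algorithm and reports the modularity q of the resulting partition. it is a minimal sketch, not the code used in this study; it assumes the leidenalg and python-igraph packages and uses a standard toy graph in place of the response network.

```python
import igraph as ig
import leidenalg

# toy undirected graph standing in for the interorganizational response network
g = ig.Graph.Famous("Zachary")

# leiden optimization of modularity; each node starts in its own community
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition, seed=0)

q = g.modularity(partition.membership)
print(len(partition), "communities detected, modularity q =", round(q, 3))
```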
we may expect to observe communities that consist of diverse organizations because organizations' shared goal in erns is to respond to a crisis by performing necessary tasks (e.g., providing mortuary and medical services as well as delivering materials) through concerted efforts on the part of those with different capabilities (moynihan, ; waugh, ) . organizations that have different information, skills, and resources may frequently interact in a disruptive situation because one type alone, such as the government or organizations in an affected area, cannot cope effectively with the event (waugh, ) . on the other hand, we also cannot rule out the possibility shown in previous studies (butts et al., ; comfort & haase, ; kapucu, ) . organizations that work closely in normal situations because of their task similarity, geographic locations, or jurisdictions may interact more frequently and easily, even in disruptive situations (hossain & kuti, ) , and communities may be identified that correspond to those factors. a case could be made that communities in erns consist of heterogeneous organizations, but a case could also be made that communities are made up of homogeneous organizations with certain characteristics. it is equally difficult to set expectations about communities' internal structure in erns. we can expect that, regardless of their types, sectors, and locations, some organizations work and interact closely-perhaps even more so in such a disruptive situation. emergent needs for coordination, communication, and collaboration also can trigger organizational interactions that extend beyond the usual or planned structure. thus, the relations among organizations become dense and evolve into the community in which every member is connected. on the other hand, a community in the task network may not require all of the organizations within it to interact. for example, if a presumed structure is strongly established, organizations are more likely to interact with others within the planned structure following the chain of command and control. even without such a structure, government organizations may coordinate their responses following the existing chain of command and control in their routine. we may expect to observe communities with a sparse connection among organizations. thus, the way communities emerge in erns is an open empirical question that can be answered by examining the entire network. several countries have experienced novel infectious disease outbreaks over the past decade (silk, ; swaan et al., ; williams et al., ) and efforts to control such events have been more or less successful, depending upon the instances and countries. in low probability, high-consequence infectious diseases such as the mers outbreak in south korea, a concerted response among individuals and organizations is virtually the only way to respond because countermeasures-such as vaccines-are not readily available. thus, to achieve an effective response, it is imperative to understand the way individuals and organizations mobilize and respond in public health emergencies. 
however, the response system for a national or global epidemic is highly complex (hodge, ; sell et al., ; williams et al., ) because of several factors: ( ) the large number of organizations across multiple government levels and sectors, ( ) the diversity of and interactions among organizations for the necessary (e.g., laboratory testing) or emergent (e.g., hospital closure) tasks, and ( ) concurrent outbreaks or treatments at multiple locations attributable to the virus's rapid spread. all of these factors create challenges when responding to public health emergencies. we broadly define a response network as the relations among organizations that can act as critical channels for information, resources, and support. when two organizations engage in any mers-specific response interactions, they are considered to be related in the response. examples of interactions include taking joint actions, communicating with each other, or sharing crucial information and resources (i.e., exchanging patient information, workforce, equipment, or financial support) related to performing the mers tasks, as well as having meetings among organizations to establish a collaborative network. we collected response network data from the following two archival sources: ( ) news articles from south korea's four major newspapers published between may , , and december , (the outbreak period), and ( ) a postevent white paper that the ministry of health and welfare published in december . in august , hanyang university's research center in south korea provided an online tagging tool for every news article in the country's news articles database that included the term "mers (http://naver.com)." a group of researchers at the korea institute for health and social affairs wrote the white paper ( pages, plus appendices) based on their comprehensive research using multiple data sources and collection methods. the authors of this article and graduate research assistants, all of whom are fluent in korean, were involved in the data collection process from august to september . because of the literature's lack of specific guidance on the data to collect from archival materials to construct interorganizational network data, we collected the data through trial and error. we collected data from news articles through two separate trials (a total of , articles from the four newspapers). the authors and a graduate assistant then ran a test trial between august and april . in july , the authors developed a data collection protocol based on the test trial experience collecting the data from the news articles and white paper. then, we recollected the data from the news articles between august and september using the protocol. when we collected data by reviewing archival sources, we first tagged all apparent references within the source text to organizations' relational activities. organizations are defined as "any named entity that represents (directly or indirectly) multiple persons or other entities, and that acts as a de facto decision making unit within the context of the response" (butts et al., , p. ) . if we found an individual's name on behalf of the individual's organization (e.g., the secretary of the ministry of health and welfare), we coded the individual as the organization's representative. these organizational interactions were coded for a direct relation based on "whom" to "whom" and for "what purpose." then, these relational activity tags were rechecked. 
all explicit mentions of relations among organizations referred to in the tagged text were extracted into a sociomatrix of organizations. we also categorized individual organizations into different "groups" using the following criteria. first, we distinguished the entities in south korea from those outside the country (e.g., world health organization [who], centers for disease control and prevention [cdc]). second, we sorted governmental entities by jurisdiction (e.g., local, provincial/metropolitan, or national) and then also by the functions that each organization performs (e.g., health care, police, fire). for example, we categorized local fire stations differently from provincial fire headquarters because these organizations' scope and role differ within the governmental structure. we categorized nongovernmental entities in the private, nonprofit, or civil society sectors that provide primary services in different service areas (e.g., hospitals, medical waste treatment companies, professional associations). at the end of the data collection process, organizational groups from , organizations were identified (see appendix). we employed the leiden algorithm using python (traag et al., ), which we discussed in the previous section. the leiden algorithm is also available for gephi as a plugin (https://gephi.org/). after identifying communities, the network can be reduced to these communities. in generating the reduced graph, each community appears within a circle, the size of which varies according to the number of organizations in the community. the links between communities indicate the connections among community members. the thickness of the lines varies in proportion to the number of pairs of connected organizations. this process improves the ability to understand the network structure drastically and provides an opportunity to analyze the individual communities' internal characteristics, such as the organizations' diversity and their connectivity for each community. shannon's diversity index (h) is used as a measure of diversity because uncertainty increases as species' diversity in a community increases (dejong, ). the h index accounts for both species' richness and evenness in a community (organizational groups in a community in our case). s indicates the total number of species. the fraction of the population that constitutes a species i is represented by p_i and is multiplied by the natural logarithm of the proportion (ln p_i); the resulting products are then summed across species and multiplied by −1:

$h = -\sum_{i=1}^{s} p_i \ln p_i .$

high h values represent more diverse communities. shannon's e is calculated by $e = h / \ln s$, which indicates various species' equality in a community. when all of the species are equally abundant, maximum evenness (i.e., 1) is obtained. while limited, density and the average clustering coefficient can capture the basic idea of a subgraph's structural cohesion or "cliquishness" (moody & coleman, ). a graph's density (d) is the proportion of possible edges present in the graph, which is the ratio between the number of edges present and the maximum possible. it ranges from 0 (no edges) to 1 (if all possible lines are present). a graph's clustering coefficient (c) is the probability that two neighbors of a node are neighbors themselves. it essentially measures the way a node's neighbors form a 3-clique. c is 1 in a fully connected graph. the mers response network in the data set consists of , organizations and , edges. 
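the diversity and cohesion measures described above can be computed in a few lines of python. the sketch below is a minimal illustration written for this text (not the original analysis code); it assumes networkx is available, computes shannon's h and evenness e from a list of organizational group labels, and evaluates density and average clustering on the subgraph induced by one community's members.

```python
from collections import Counter
from math import log

import networkx as nx

def shannon_diversity(group_labels):
    """shannon's h and evenness e for the organizational groups in one community."""
    counts = Counter(group_labels)
    total = sum(counts.values())
    h = -sum((c / total) * log(c / total) for c in counts.values())
    e = h / log(len(counts)) if len(counts) > 1 else 1.0
    return h, e

def internal_structure(g, members):
    """density and average clustering coefficient of one community's subgraph."""
    sub = g.subgraph(members)
    return nx.density(sub), nx.average_clustering(sub)
```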
table shows that most of the organizations were government organizations (approximately %) and % were nongovernmental organizations from different sectors. local government organizations constituted the largest proportion of organizations ( %). further, one international organization (i.e., who) and foreign government agencies or foreign medical centers (i.e., cdc, erasmus university medical center) appeared in the response network. organizations coordinated with approximately three other organizations (average degree: . ). however, six organizations coordinated with more than others. the country's health authorities, such as the ministry of health and welfare (mohw: edges), central mers management headquarters (cmmh: edges), and korea centers for disease control and prevention (kcdc: edges), were found to have a large number of edges. the ministry of environment ( edges) also coordinated with many other organizations in the response. the national medical center had edges, and the seoul metropolitan city government had . the leiden algorithm detected communities in the network, labeled as through in figures - and tables and . the final modularity score (q) was . , showing that the community detection algorithm partitioned and identified the communities in the network reasonably well. in real-world networks, modularity scores " . . . typically fall in the range from about . to . . high values are rare" (newman & girvan, , p. ) . the number of communities was also consistent in the leiden and louvain algorithms ( communities in the louvain algorithm). the modularity score was slightly higher in the leiden algorithm than the q ¼ . in the louvain. figure presents the mers response network with communities in different colors to show the organizations' clustering using forceatlas layout in gephi. in figure , the network's community structure is clear to the human eye. from the figures (and the community analysis in table ), we find that the mers response network was divided into two sets of communities according to which communities were at the center of the network and their nature of activity in the response, core response communities in one group and supportive functional communities in the other. the two core communities ( and ) at the center of the response network included a large number of organizations, with a knot of intergroup coordination among the groups surrounding those two. these communities included organizations across government jurisdictions, sectors, and geographic locations ( table , description) and were actively involved in the response during the mers outbreak. while not absolute, we observe that the network of a dominating organization had a "mushroom" shape of interactions with other organizations within the communities (also see figure a ). the dominant organizations were the central government authorities such as the mohw, the cmmh, and kcdc. the national health authorities led the mers response. other remaining communities were ( ) confined geographically, ( ) oriented functionally, or ( ) both. first, some communities consisted of diverse organizations in the areas where two mers hospitals are located-seoul metropolitan city and gyeonggi province (communities and ). organizations in these communities span government levels and sectors within the areas affected. second, two communities consisted of organizations with different functions and performed supportive activities (community , also see figure b ). 
other supportive functional communities that focus on health (community , see figure c ) or foreign affairs (community ) had a "spiderweb" shape of interactions among organizations within the communities. third, several communities consisted of a relatively small number of organizations connected to one in the center (communities , , , and ) . these consisted of local fire organizations in separate jurisdictions (see figure d ) that were both confined geographically and oriented functionally. table summarizes the characteristics of the communities in the response network. in table , we also note distinct interorganizational relations present within the communities. the two core response communities include both intergovernmental and intersectoral relations. that is, organizations across government jurisdictions or sectors were actively involved in response to the epidemic in the communities. while diverse organizations participated in these core communities, the central government agencies led and directed other organizations, which reduced member connectivity. among the supportive functional communities, those that are confined geographically showed relatively high diversity but low connectivity (communities , , and through ). these communities included intergovernmental relations within geographic locations. secondly, communities of organizations with a specialized function showed relatively high diversity or connectivity. these included organizations from governmental and nongovernmental sectors and had no leading or dominating organizations. for example, communities and had intersectoral relations but no intergovernmental relations. thirdly, within each community of fire organizations in different geographic locations, one provincial or metropolitan fire headquarters was linked to multiple local fire stations in a star network. these communities, labeled igf, had low member diversity and member connectivity, while they were organizationally and functionally coherent. table summarizes the results elaborated above. in addition to the division of communities along the lines of the nature of their response activities, we observe that the structural characteristics of communities with only intersectional or international relations showed high diversity and high connectivity. whenever intergovernmental relations were present in communities, however, the member connectivity was low, even if intersectoral relations appeared together within them. we use the community detection method to gain a better understanding of the patterns of associations among diverse response organizations in an epidemic response network. the large data sets available and increased computational power significantly transform the study of social networks and can shed light on topics such as cohesive subgroups in large networks. network studies today involve mining enormous digital data sets such as collective behavior online (hagen et al., ) , an e-mail corpus of a million messages (tyler, wilkinson, & buberman, ) , or scholars' massive citation data (kim & zhang, ) . the scale of erns in large disasters and emergencies is noteworthy (moynihan, ; waugh, ) , and over , organizations appeared in butts et al. ( ) study as well as in this research. their connections reflect both existing structural forms by design and by emergent needs. the computational power needed to analyze such large relational data is ever higher and the methods simpler now, which allows us to learn about the entire network. 
we find two important results. first, the national public health ern in korea split largely into two groups. the core response communities' characteristics were that ( ) they were not confined geographically, ( ) organizations were heterogeneous across jurisdictional lines as well as sectors, and ( ) the community's internal structure was sparse even if intersectoral relations were present. on the other hand, supportive functional communities' characteristics were that ( ) they were communities of heterogeneous organizations in the areas affected that were confined geographically; ( ) the communities of intersectoral, professional organizations were heterogeneous, densely connected, and not confined geographically; and ( ) the communities of traditional emergency response organizations (e.g., fire) were confined geographically, homogeneous, and connected sparsely in a centralized fashion. these findings show distinct features of the response to emerging infectious diseases. the core response communities suggest that diverse organizations across jurisdictions, sectors, and functions actually performed active and crucial mers response activities. however, these organizations' interaction and coordination inside the communities were found to be top down from the key national health authorities to all other organizations. this observation does not speak to the quality of interactions in the centralized top-down structure, but one can also ask how effective such a structure can be in a setting where diverse organizations must share authority, responsibilities, and resources. second, infectious diseases spread rapidly and can break out in multiple locations simultaneously. the subgroup patterns in response networks to infectious diseases can differ from those of location-bound natural disasters such as hurricanes and earthquakes. while some organizations may not be actively or directly involved in the response, communities of these organizations can be formed to prepare for potential outbreaks or provide support to the core response communities during the event. second, we also find that the communities' internal characteristics (diversity and connectivity) differed depending upon the types of interorganizational relations that appeared within the communities. based on these analytical results, two propositions about the community structure in the ern can be developed: ( ) if intergovernmental relations operate in a community, the community's member connectivity may be low, regardless of member diversity. ( ) if community members are functionally similar, (a) professional organization communities' (e.g., health or foreign affairs) member connectivity may be dense and (b) emergency response organization communities' (e.g., fire) member connectivity may be sparse. the results suggest that the presence of intergovernmental relations within the communities in erns may be associated with low member connectivity. however, this finding does not imply that those communities with intergovernmental relations are not organizationally or functionally cohesive. instead, we may expect a different correlation between members' functional similarity and their member connectivity depending upon the types of professions, as seen in (a) and (b). organizations' concerted efforts during a response to an epidemic is a prevalent issue in many countries (go & park, ; hodge, gostin, & vernick, ; seo, lee, kim, & lee, ; swaan et al., ) . 
the mers outbreak in south korea led to , suspected cases, infected cases, and deaths in the country (korea centers for disease control and prevention, ) . the south korean government's response to it was severely criticized for communication breakdowns, lack of leadership, and information secrecy (korea ministry of health and welfare, ). the findings of this study offer a practical implication for public health emergency preparedness and response in the country studied. erns' effective structure has been a fundamental question and a source of continued debate (kapucu et al., ; nowell, steelman, velez, & yang, ). the answer remains unclear, but the recent opinion leans toward a less centralized and hierarchical structure, given the complexity of making decisions in disruptive situations (brooks, bodeau, & fedorowicz, ; comfort, ; hart, rosenthal, & kouzmin, ) . our analysis shows clearly that the community structure and structures within communities in the network were highly centralized (several mushrooms) and led by central government organizations. given that the response to the outbreak was severely criticized for its poor communication and lack of coordination, it might be beneficial to include more flexibility and openness in the response system in future events. we suggest taking advice from the literature above conservatively because of the contextual differences in the event and setting. this study's limitations also deserve mention. several community detection methods have been developed with different assumptions for network partition. some algorithms take deterministic group finding approaches that partition the network based on betweenness centrality edges (girvan & newman, ) or information centrality edges (fortunato et al., ) . other algorithms take the optimization approaches we use in this article. in our side analyses, we tested three algorithms with the same data set: g-n, louvain, and leiden. the modularity scores were consistent, as reported in this article, but the number of communities in g-n and the other two algorithms differed. the deterministic group finding approach (g-n) found a substantively high number of communities. the modularity score can help make sense of the partition initially, but the approach is limited (reichardt & bornholdt, ) . thus, two questions remain: which algorithm do we choose and how do we know whether the community structure is robust (karrer, levina, & newman, ) ? in their nature, these questions do not differ from which statistical model to use given the assumptions and types of data in hand. the algorithms also require further examination and tests. while we reviewed the data sources carefully multiple times to capture the response coordination, communication, and collaboration, the process of collecting and cleaning data can never be free from human error. it was a time-consuming, labor-intensive process that required trial and error. further, the original written materials can have their own biases that reflect the source's perspective. government documents may provide richer information about the government's actions but less so about other social endeavors. media data, such as newspapers, also have their limitations as information sources to capture rich social networks. accordingly, our results must be interpreted in the context of these limitations. in conclusion, this article examines the community structure in a large ern, which is a quite new, but potentially fruitful, approach to the field. 
we tested a rapidly developing analytical approach to the ern to generate theoretical insights and find paths to exploit such insights for better public health emergency preparedness and response in the future. much work remains to build and refine the theoretical propositions on crisis response networks drawn from this rich case study.

notes:
the katrina response network consisted of , organizations and connections with a mean degree .
except for the quote, comfort and haase ( ) do not provide further explanation.
the incident command system was established originally for the response to fire and has been expanded to other disaster areas.
in the end, we found that the process was not helpful because of the volume and redundancy of content in news articles different newspapers published, which is not an issue in analysis because it can be filtered and handled easily using network analysis tool.
because we had not confronted previous disaster response studies that collected network data from text materials, such as news articles and situation reports, and reported their reliability.
we also classified organizations based on specialty, such as quarantine, economy, police, tourism, and so on regardless of jurisdictions. twenty-seven specialty areas were classified. we note that the result of diversity analysis using the specialty areas did not differ from that using the organizational groups. the correlation of the diversity indices based on the two different classification criteria was r = . . we report the result based on organization groups because the classification criterion can indicate better the different types of .
we did not measure the frequency, intensity, or quality of interorganizational relations but only the presence of either or both relations within the communities.

references:
fast unfolding of communities in large networks
organising for effective emergency management: lessons from research
analyzing social networks
network management in emergency response: articulation practices of state-level managers-interweaving up, down, and sideways
interorganizational collaboration in the hurricane katrina response
from linearity to complexity: emergent characteristics of the avian influenza response system in turkey
comparing coordination structures for crisis management in six countries
mission improbable: using fantasy documents to tame disaster
crisis management in hindsight: cognition, coordination, communication
communication, coherence, and collective action
a comparison of three diversity indices based on their components of richness and evenness
method to find community structures based on information centrality
community structure in social and biological networks
community structure in jazz
a comparative study of infectious disease government in korea: what we can learn from the sars and the mers outbreak
imagining twitter as an imagined community
crisis communications in the age of social media: a network analysis of zika-related tweets
crisis decision making: the centralization revisited
global and domestic legal preparedness and response: ebola outbreak
pandemic and all-hazards preparedness act
disaster response preparedness coordination through social networks
interorganizational coordination in dynamic context: networks in emergency response management
examining intergovernmental and interorganizational response to catastrophic disasters: toward a network-centered approach
collaborative decision-making in emergency and disaster management
structure and network performance: horizontal and vertical networks in emergency management
robustness of community structure in networks
digital government and wicked problems
subgroup analysis of an epidemic response network of organizations: mers outbreak in korea
middle east respiratory syndrome coronavirus outbreak in the republic of korea
the mers white paper. seoul, south korea: ministry of health and welfare
efficient behavior of small-world networks
blackhole: robust community detection inspired by graph drawing
cliques, clubs and clans
clustering and cohesion in networks: concepts and measures
the network governance of crisis response: case studies of incident command systems
finding and evaluating community structure in networks
the structure of effective governance of disaster response networks: insights from the field
when are networks truly modular?
dynamics and control of diseases in networks with community structure
public health resilience checklist for high-consequence infectious diseases-informed by the domestic ebola response in the united states
epidemics crisis management systems in south korea
infectious disease threats and opportunities for prevention
extended structures of mediation: re-examining brokerage in dynamic networks
ebola preparedness in the netherlands: the need for coordination between the public health and the curative sector
leveraging intergovernmental and cross-sectoral networks to manage nuclear power plant accidents: a case study from
from louvain to leiden: guaranteeing well-connected communities
e-mail as spectroscopy: automated discovery of community structure within organizations
social network analysis: methods and applications
terrorism, homeland security and the national emergency management network
a method for finding communities of related genes
cdc's early response to a novel viral disease, middle east respiratory syndrome coronavirus
structure and overlaps of communities in networks

author biographies: yushim kim is an associate professor at the school of public affairs at arizona state university and a coeditor of journal of policy analysis and management. her research examines environmental and urban policy issues and public health emergencies from a systems perspective. jihong kim is a graduate student at the department of . seong soo oh is an associate professor of public administration at hanyang university, korea. his research interests include public management and public sector human resource management. he is an associate editor of information sciences and comsis journal. his research interests include data mining and databases. her research focuses on information and knowledge management in the public sector and its impact on society, including organizational learning, the adoption of technology in the public sector, public sector data management, and data-driven decision-making in government. jaehyuk cha is a professor at the department of computer and software, hanyang university, korea. his research interests include dbms, flash storage system.

the authors appreciate research assistance from jihyun byeon and useful comments from chan wang, haneul choi, and young jae won. the early idea of this article using partial data from news articles was presented at the dg.o research conference and published as conference proceeding (kim, kim, oh, kim, & ku, ). data are available from the author at ykim@asu.edu upon request. we used python to employ the leiden community detection algorithm (see the source code: https://github.com/vtraag/leidenalg). 
network measures, such as density and clustering coefficient, as well as the diversity index were calculated using python libraries (networkx, math, pandas, numpy). we used gephi . . for figures and mendeley for references. the authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. the authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: this work was supported by the national research foundation of korea grant funded by the korean government (ministry of science and ict; no. r a a ). supplemental material for this article is available online.

key: cord- - hp yt authors: coelho, flávio c; cruz, oswaldo g; codeço, cláudia t title: epigrass: a tool to study disease spread in complex networks date: - - journal: source code biol med doi: . / - - - sha: doc_id: cord_uid: hp yt background: the construction of complex spatial simulation models such as those used in network epidemiology is a daunting task due to the large amount of data involved in their parameterization. such data, which frequently resides on large geo-referenced databases, has to be processed and assigned to the various components of the model. all this just to construct the model; then it still has to be simulated and analyzed under different epidemiological scenarios. this workflow can only be achieved efficiently by computational tools that can automate most, if not all, of these time-consuming tasks. in this paper, we present a simulation software, epigrass, aimed to help designing and simulating network-epidemic models with any kind of node behavior. results: a network epidemiological model representing the spread of a directly transmitted disease through a bus-transportation network connecting mid-size cities in brazil. results show that the topological context of the starting point of the epidemic is of great importance from both control and preventive perspectives. conclusion: epigrass is shown to facilitate greatly the construction, simulation and analysis of complex network models. the output of model results in standard gis file formats facilitates the post-processing and analysis of results by means of sophisticated gis software.

epidemic models describe the spread of infectious diseases in populations. more and more, these models are being used for predicting, understanding and developing control strategies. to be used in specific contexts, modeling approaches have shifted from "strategic models" (where a caricature of real processes is modeled in order to emphasize first principles) to "tactical models" (detailed representations of real situations). tactical models are useful for cost-benefit and scenario analyses. good examples are the foot-and-mouth epidemic models for the uk, triggered by the need for a response to the epidemic [ , ], and the simulation of pandemic flu in different scenarios helping authorities to choose among alternative intervention strategies [ , ]. in realistic epidemic models, a key issue to consider is the representation of the contact process through which a disease is spread, and network models have arisen as good candidates [ ]. this has led to the development of "network epidemic models". network is a flexible concept that can be used to describe, for example, a collection of individuals linked by sexual partnerships [ ], a collection of families linked by sharing workplaces/schools [ ], a collection of cities linked by air routes [ ]. 
any of these scales may be relevant to the study and control of disease spread [ ]. networks are made of nodes and their connections. one may classify network epidemic models according to node behavior. one example would be a classification based on the states assumed by the nodes: networks with discrete-state nodes have nodes characterized by a discrete variable representing their epidemiological status (for example, susceptible, infected, recovered). the state of a node changes in response to the state of neighbor nodes, as defined by the network topology and a set of transmission rules. networks with continuous-state nodes, on the other hand, have the node's state described by a quantitative variable (number of susceptibles, density of infected individuals, for example), modelled as a function of the history of the node and its neighbors. the importance of the concept of neighborhood in any kind of network epidemic model stems from its large overlap with the concept of transmission. in network epidemic models, transmission either defines or is defined/constrained by the neighborhood structure. in the latter case, a neighborhood structure is given a priori which will influence transmissibility between nodes. the construction of complex simulation models such as those used in network epidemic models is a daunting task due to the large amount of data involved in their parameterization. such data frequently resides on large geo-referenced databases. this data has to be processed and assigned to the various components of the model. all this just to construct the model; then it still has to be simulated and analyzed under different epidemiological scenarios. this workflow can only be achieved efficiently by computational tools that can automate most if not all of these time-consuming tasks. in this paper, we present a simulation software, epigrass, aimed at helping design and simulate network-epidemic models with any kind of node behavior. without such a tool, implementing network epidemic models is not a simple task, requiring a reasonably good knowledge of programming. we expect that this software will stimulate the use and development of network models for epidemiological purposes. the paper is organized as follows: first we describe the software and how it is organized, with a brief overview of its functionality. then we demonstrate its use with an example. the example simulates the spread of a directly transmitted infectious disease in brazil through its transportation network. the velocity of spread of new diseases in a network of susceptible populations depends on their spatial distribution, size, susceptibility and patterns of contact. at a spatial scale, climate and environment may also impact the dynamics of geographical spread as they introduce temporal and spatial heterogeneity. understanding and predicting the direction and velocity of an invasion wave is key for emergency preparedness. epigrass is a platform for network epidemiological simulation and analysis. it enables researchers to perform comprehensive spatio-temporal simulations incorporating epidemiological data and models for disease transmission and control in order to create complex scenario analyses. epigrass is designed towards facilitating the construction and simulation of large-scale metapopulational models. each component population of such a metapopulational model is assumed to be connected through a contact network which determines migration flows between populations. 
this connectivity model can be easily adapted to represent any type of adjacency structure. epigrass is entirely written in the python language, which contributes greatly to the flexibility of the whole system due to the dynamical nature of the language. the geo-referenced networks over which epidemiological processes take place can be very straightforwardly represented in an object-oriented framework. consequently, the nodes and edges of the geographical networks are objects with their own attributes and methods (figure ). once the archetypal node and edge objects are defined with appropriate attributes and methods, a code representation of the real system can be constructed, where nodes (representing people or localities) and contact routes are instances of node and edge objects, respectively. the whole network is also an object with its own set of attributes and methods. in fact, epigrass also allows for multiple edge sets in order to represent multiple contact networks in a single model.
figure. architecture of an epigrass simulation model. a simulation object contains the whole model and all other objects representing the graph, sites and edges. site objects contain model objects, which can be one of the built-in epidemiological models or a custom model written by the user.
these features lead to a compact and hierarchical computational model consisting of a network object containing a variable number of node and edge objects. it also does not pose limitations to encapsulation, potentially allowing for networks within networks, if desirable. this representation can also be easily distributed over a computational grid or cluster, if the dependency structure of the whole model does not prevent it (this feature is currently being implemented and will be available on a future release of epigrass). for the end-user, this hierarchical, object-oriented representation is not an obstacle since it reflects the natural structure of the real system. even after the model is converted into a code object, all of its component objects remain accessible to one another, facilitating the exchange of information between all levels of the model, a feature the user can easily include in his/her custom models. nodes and edges are dynamical objects in the sense that they can be modified at runtime, altering their behavior in response to user-defined events. in epigrass it is very easy to simulate any dynamical system embedded in a network. however, it was designed with epidemiological models in mind. this goal led to the inclusion of a collection of built-in epidemic models which can be readily used for the intra-node dynamics (sir model family). epigrass users are not limited to basing their simulations on the built-in models. user-defined models can be developed in just a few lines of python code. all simulations in epigrass are done in discrete time. however, custom models may implement finer dynamics within each time step, by implementing ode models at the nodes, for instance. 
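to give a flavour of the kind of intra-node dynamics referred to above, the sketch below implements a single discrete-time sir update for one locality. it is a generic illustration written for this text; it does not use the actual epigrass custom-model api (see the epigrass documentation for that), and the parameter values are arbitrary.

```python
def sir_step(s, i, r, beta=0.4, gamma=0.2, visiting_infectious=0):
    """one discrete-time sir update for a single locality (a generic sketch).

    visiting_infectious stands for infectious individuals temporarily present
    in the locality, contributing to transmission but not to its demography.
    """
    n = s + i + r
    new_infections = beta * s * (i + visiting_infectious) / (n + visiting_infectious)
    new_recoveries = gamma * i
    s -= new_infections
    i += new_infections - new_recoveries
    r += new_recoveries
    return s, i, r
```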
the simulator then builds a code representation of the entire model, simulates it, and stores the results in the database or in a couple of csv files. this output will contain the full time series of the variables in the model. additionally, a map layer (in shapefile and kml format) is also generated with summary statitics for the model (figure ). the results of an epigrass simulation can be visualized in different ways. a map with an animation of the resulting timeseries is available directly through the gui (figure ). other types of static visualizations can be generated through gis software from the shapefiles generated. the kml file can also be viewed in google earth™ or google maps™ (figure ). epigrass also includes a report generator module which is controlled through a parameter in the ".epg" file. epigrass is capable of generating pdf reports with summary statistics from the simulation. this module requires a latex installation to work. reports are most useful for general verification of expected model behavior and network structure. however, the latex source files generated workflow for a typical epigrass simulation figure workflow for a typical epigrass simulation. this diagram shows all inputs and outputs typical of an epigrass simulation session. epigrass graphical user interface figure epigrass graphical user interface. by the module may serve as templates that the user can edit to generate a more complete document. building a model in epigrass is very simple, especially if the user chooses to use one of the built-in models. epigrass includes different epidemic models ready to be used (see manual for built-in models description). to run a network epidemic model in epigrass, the user is required to provide three separate text files (optionally, also a shapefile with the map layer): . node-specification file: this file can be edited on a spreadsheet and saved as a csv file. each row is a node and the columns are variables describing the node. . edge-specification file: this is also a spreadsheet-like file with an edge per row. columns contain flow variables. . model-specification file: also referred to as the ".epg" file. this file specifies the epidemiological model to be run at the nodes, its parameters, flow model for the edges, and general parameters of the simulation. the ".epg" file is normally modified from templates included with epigrass. nodes and edges files on the other hand, have to be built from scratch for every new network. details of how to construct these files, as well as examples, can be found in the documentation accompanying the software, which is available at at the project's website [ ] in the example application, the spread of a respiratory disease through a network of cities connected by bus transportation routes is analyzed. the epidemiological scenario is one of the invasion of a new influenza-like virus. one may want to simulate the spread of this disease through the country by the transportation network to evaluate alternative intervention strategies (e.g. different vaccination strategies). in this problem, a network can be defined as a set of nodes and links where nodes represent cities and links represents transportation routes. some examples of this kind of model are available in the literature [ , ] . one possible objective of this model is to understand how the spread of such a disease may be affected by the pointof-entry of the disease in the network. 
to that end, we may look at variables such as the speed of the epidemic, the number of cases after a fixed amount of time, the distribution of cases in time and the path taken by the spread. the example network was built from the largest cities of brazil (>= k habs). the bus routes between those cities formed the connections between the nodes of the network. the number of edges in the network, derived from the bus routes, is .
figure. epigrass output visualized on google-earth.
figure. epigrass animation output. sites are color coded (from red to blue) according to infection times. bright red is the seed site (on the ne).
these bus routes are registered with the national agency of terrestrial transportation (antt), which provided the data used to parameterize the edges of the network. the epidemiological model used consisted of a metapopulation system with a discrete-time seir model (eq. ). for each city, s_t is the number of susceptibles in the city at time t, e_t is the number of infected but not yet infectious individuals, i_t is the number of infectious individuals resident in the locality, n is the population residing in the locality (assumed constant throughout the simulation), n_t is the number of individuals visiting the locality, and Θ_t is the number of visitors who are infectious. the parameters used were taken from lipsitch et al. ( ) [ ] to represent a disease like sars with an estimated basic reproduction number (r_0) of . to . (table ). to simulate the spread of infection between cities, we used the concept of a "forest fire" model [ ]. an infected individual, traveling to another city, acts as a spark that may trigger an epidemic in the new locality. this approach is based on the assumption that individuals commute between localities and contribute temporarily to the number of infected in the new locality, but not to its demography. implications of this approach are discussed in grenfell et al. ( ) [ ]. the number of individuals arriving in a city (n_t) is based on the annual total number of passengers arriving through all bus routes leading to that city, as provided by the antt (brazilian national agency for terrestrial transportation). the annual number of passengers is used to derive an average daily number of passengers simply by dividing it by 365. stochasticity is introduced in the model at two points: the number of new cases in each city is drawn from a poisson distribution whose intensity follows from the seir model, and the number of infectious visitors is modelled as a binomial process summed over all neighboring cities k,

$\Theta_t = \sum_k \mathrm{Binomial}\!\left(n,\; i_{k,t-\delta}/n_k\right),$

where n is the total number of passengers arriving from a given neighboring city; i_{k,t} and n_k are the current number of infectious individuals and the total population size of city k, respectively; and δ is the delay associated with the duration of each bus trip. the delay δ was calculated as the number of days (rounded down) that a bus, traveling at an average speed of km/h, would take to complete a given trip. the lengths in kilometers of all bus routes were also obtained from the antt. vaccination campaigns in specific (or all) cities can be easily attained in epigrass, with individual coverages for each campaign on each city. we use this feature to explore vaccination scenarios in this model (figures and ).
figure. cost in vaccines applied vs. benefit in cases avoided, for a simulated epidemic starting at the highest degree city (são paulo).
figure. cost in vaccines applied vs. benefit in cases avoided, for a simulated epidemic starting at a relatively low degree city (salvador).
the files with this model's definition (the sites, edges and ".epg" files) are available as part of the additional files , and for this article. 
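the two stochastic components described above can be sketched in a few lines of numpy. this is a simplified illustration of the mechanism, not epigrass source code; in particular, the poisson intensity used below (a frequency-dependent force of infection) is an assumption made for the example, and the data layout is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def infectious_visitors(neighbors, t, rng):
    """theta_t: infectious passengers arriving from all neighboring cities."""
    theta = 0
    for nb in neighbors:  # nb: {"n": daily passengers, "N": population, "I": infectious series, "delay": days}
        past = max(t - nb["delay"], 0)
        prevalence = nb["I"][past] / nb["N"]
        theta += rng.binomial(nb["n"], prevalence)
    return theta

def new_cases(beta, s, i, n, n_t, theta, rng):
    """poisson draw of new cases; the intensity below is an assumed force of infection."""
    lam = beta * s * (i + theta) / (n + n_t)
    return rng.poisson(lam)
```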
to determine the importance of the point of entry in the outcome of the epidemic, the model was run times, randomizing the point of entry of the virus. the seeding site was chosen with a probability proportional to the log of its population size. these replicates were run using epigrass' built-in support for repeated runs with the option of randomizing the seeding site. for every simulation, statistics about each site, such as the time it got infected, and time series of incidence were saved. the time required for the epidemic to infect % of the cities was chosen as a global index of network susceptibility to invasion. to compare the relative exposure of cities to disease invasion, we also calculated the inverse of the time elapsed from the beginning of the epidemic until the city registered its first indigenous case as a local measure of exposure. except for population size, all other epidemiological parameters were the same for all cities, that is, disease transmissibility and recovery rate. some positional features of each node were also derived: centrality, which is a measure derived from the average distance of a given site to every other site in the network; betweenness, which is the number of times a node figures in the shortest path between any other pair of nodes; and degree, which is the number of edges connected to a node. in order to analyze the path of the epidemic spread, we also recorded which cities provided the infectious cases which were responsible for the infection of each other city. if more than one source of infection exists, epigrass selects the city which contributed with the largest number of infectious individuals at that time-step as the most likely infector. 
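the positional features listed above map onto standard network measures; the sketch below computes them with networkx on a toy graph (the text's "centrality", based on the average distance of a site to every other site, corresponds to closeness centrality). the graph used here is only a stand-in for the city network.

```python
import networkx as nx

g = nx.random_geometric_graph(100, 0.15, seed=1)   # toy stand-in for the city network

degree = dict(g.degree())                    # number of routes touching each city
betweenness = nx.betweenness_centrality(g)   # how often a city sits on shortest paths
closeness = nx.closeness_centrality(g)       # inverse of the average distance to all other cities
```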
it should not be taken as a careful and complete analysis of a real epidemic. despite that, some features of the simulated epidemic are worth discussing. for example, the spread speed of the epidemic, measured as the time taken to infect % of the cities, was found to be influenced by the centrality and degree of the entry node (figures and ). the dispersion tree corresponding to the epidemic is greatly influenced by the degree of the point of entry of the disease in the network. figure shows the tree for the dispersion from the city of salvador. (figure: spread of the epidemic starting at the city of salvador, a city with relatively small degree, that is, a small number of neighbors; the number next to the boxes indicates the day when each city developed its first indigenous case. figure: effect of degree (a) and betweenness (b) of the entry node on the speed of the epidemic. figure: effect of betweenness of the entry node on the speed of the epidemic.) vaccination strategies must take into consideration network topology. figures and show cost-benefit plots for the three vaccination strategies investigated: uniform vaccination, top- degree sites only and top- degree sites only. vaccination of higher order sites offers cost/benefit advantages only in scenarios where the disease enters the network through one of these sites. epigrass greatly facilitates the simulation and analysis of complex network models. the output of model results in standard gis file formats facilitates the post-processing and analysis of results by means of sophisticated gis software. the non-trivial task of specifying the network over which the model will be run is left to the user, but epigrass allows this structure to be provided as a simple list of sites and edges in text files, which can easily be constructed by the user using a spreadsheet, with no need for special software tools. besides invasion, network epidemiological models can also be used to understand patterns of geographical spread of endemic diseases [ ] [ ] [ ] [ ] . many infectious diseases can only be maintained in an endemic state in cities with population size above a threshold, or under appropriate environmental conditions (climate, availability of a reservoir, vectors, etc.). the variables and the magnitudes associated with the endemicity threshold depend on the natural history of the disease [ ] . these magnitudes may vary from place to place as they depend on the contact structure of the individuals. predicting which cities are sources for the endemicity and understanding the path of recurrent traveling waves may help us to design optimal surveillance and control strategies.
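as an illustration of how such a plain-text network specification can be turned into a graph, the sketch below assumes hypothetical column names (name, lat, lon, population, source, target, passengers, length_km); the actual epigrass file layout may differ.

```python
import csv
import networkx as nx

def load_network(sites_path, edges_path):
    """Build the model network from plain-text site and edge lists."""
    g = nx.Graph()
    with open(sites_path, newline="") as f:
        for row in csv.DictReader(f):
            g.add_node(row["name"], population=int(row["population"]),
                       lat=float(row["lat"]), lon=float(row["lon"]))
    with open(edges_path, newline="") as f:
        for row in csv.DictReader(f):
            g.add_edge(row["source"], row["target"],
                       passengers=int(row["passengers"]),
                       length_km=float(row["length_km"]))
    return g
```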
modelling vaccination strategies against foot-and-mouth disease; optimal reactive vaccination strategies for a foot-and-mouth outbreak in the uk; strategy for distribution of influenza vaccine to high-risk groups and children; containing pandemic influenza with antiviral agents; space and contact networks: capturing the locality of disease transmission; interval estimates for epidemic thresholds in two-sex network models; applying network theory to epidemics: control measures for mycoplasma pneumoniae outbreaks; assessing the impact of airline travel on the geographic spread of pandemic influenza; modeling control strategies of respiratory pathogens; epigrass website; containing pandemic influenza at the source; transmission dynamics and control of severe acute respiratory syndrome; travelling waves and spatial hierarchies in measles epidemics; travelling waves in the occurrence of dengue haemorrhagic fever in thailand; modelling disease outbreaks in realistic urban social networks; on the dynamics of flying insects populations controlled by large scale information; large-scale spatial-transmission models of infectious disease; disease extinction and community size: modeling the persistence of measles. the authors would like to thank the brazilian research council (cnpq) for financial support to the authors. fcc contributed with the software development, model definition and analysis as well as general manuscript conception and writing. ctc contributed with model definition and implementation, as well as with writing the manuscript. ogc contributed with data analysis and writing the manuscript. all authors have read and approved the final version of the manuscript. key: cord- -d zqixs authors: da fontoura costa, luciano; sporns, olaf; antiqueira, lucas; das graças volpe nunes, maria; oliveira, osvaldo n. title: correlations between structure and random walk dynamics in directed complex networks date: - - journal: appl phys lett doi: . / . sha: doc_id: cord_uid: d zqixs in this letter the authors discuss the relationship between structure and random walk dynamics in directed complex networks, with an emphasis on identifying whether a topological hub is also a dynamical hub. they establish the necessary conditions for networks to be topologically and dynamically fully correlated (e.g., word adjacency and airport networks), and show that in this case zipf's law is a consequence of the match between structure and dynamics. they also show that real-world neuronal networks and the world wide web are not fully correlated, implying that their more intensely connected nodes are not necessarily highly active. we address the relationship between structure and dynamics in complex networks by taking the steady-state distribution of the frequency of visits to nodes-a dynamical feature-obtained by performing random walks along the networks.
a complex network is taken as a graph with directed edges and associated weights, which are represented in terms of the weight matrix w. the n nodes in the network are numbered as i = 1, 2, ..., n, and a directed edge with weight m, extending from node j to node i, is represented as w(i, j) = m. no self-connections (loops) are considered. the in and out strengths of a node i, abbreviated as is(i) and os(i), correspond to the sum of the weights of its in- and outbound connections, respectively. the stochastic matrix s for such a network has elements s(i, j) = w(i, j)/os(j). the matrix s is assumed to be irreducible, i.e., any of its nodes can be accessible from any other node, which allows the definition of a unique and stable steady state. an agent, placed at any initial node j, chooses among the adjacent outbound edges of node j with probability equal to s(i, j). this step is repeated a large number of times t, and the frequency of visits to each node i is calculated as v(i) = (number of visits during the walk)/t. in the steady state (i.e., after a long time period t), v = sv, and the frequency of visits to each node along the random walk may be calculated in terms of the eigenvector associated with the unit eigenvalue (e.g., ref. ). for proper statistical normalization we set ∑_p v(p) = 1. the dominant eigenvector of the stochastic matrix has theoretically and experimentally been verified to be remarkably similar to the corresponding eigenvector of the weight matrix, implying that the adopted random walk model shares several features with other types of dynamics, including linear and nonlinear summations of activations and flow in networks. in addition to providing a modeling approach intrinsically compatible with dynamics involving successive visits to nodes by a single or multiple agents, such as is the case with world wide web (www) navigation, text writing, and transportation systems, random walks are directly related to diffusion. more specifically, as time progresses, the frequency of visits to each network node approaches the activity values which would be obtained by the traditional diffusion equation. a full congruence between such frequencies and activity diffusion is obtained at the equilibrium state of the random walk process. therefore, random walks are also directly related to the important phenomenon of diffusion, which plays an important role in a large number of linear and nonlinear dynamic systems including disease spreading and pattern formation. random walks are also intrinsically connected to markov chains, electrical circuits, and flows in networks, and even dynamical models such as ising. for such reasons, random walks have become one of the most important and general models of dynamics in physics and other areas, constituting a primary choice for investigating dynamics in complex networks. the correlations between activity (the frequency of visits to nodes v) and topology (out strength os or in strength is) can be quantified in terms of the pearson correlation coefficient r. for full activity-topology correlation in directed networks, i.e., |r| = 1 between v and os or between v and is, it is enough that (i) the network must be strongly connected, i.e., s is irreducible, and (ii) for any node, the in strength must be equal to the out strength. the proof of the statement above is as follows. because the network is strongly connected, its stochastic matrix s has a unit eigenvector in the steady state, i.e., v = sv.
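the construction above can be made concrete in a few lines; the sketch below assumes the same convention (w[i, j] is the weight of the edge from j to i) and a strongly connected network, and is an illustration rather than code from the original work.

```python
import numpy as np

def steady_state_visits(W):
    """Frequency of visits v for a random walk on a weighted digraph.

    W[i, j] is the weight of the edge from node j to node i.
    Assumes the network is strongly connected (S irreducible).
    """
    out_strength = W.sum(axis=0)            # os(j): column sums
    S = W / out_strength                    # S[i, j] = W[i, j] / os(j)
    eigvals, eigvecs = np.linalg.eig(S)
    k = np.argmin(np.abs(eigvals - 1.0))    # eigenvector of the unit eigenvalue
    v = np.real(eigvecs[:, k])
    return v / v.sum()                      # normalize so that sum_p v(p) = 1
```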
since s(i, j) = w(i, j)/os(j), the ith element of the vector s·os is given as ∑_j s(i, j) os(j) = ∑_j w(i, j) = is(i). by hypothesis, is(i) = os(i) for any i and, therefore, both os and is are eigenvectors of s associated with the unit eigenvalue. then os = is = v, implying full correlation between frequency of visits and both in and out strengths. an implication of this derivation is that for perfectly correlated networks, the frequency of symbols produced by random walks will be equal to the out strength or in strength distributions. therefore, an out strength scale-free network must produce sequences obeying zipf's law and vice versa. if, on the other hand, the node distribution is gaussian, the frequency of visits to nodes will also be a gaussian function; that is to say, the distribution of nodes is replicated in the node activation. although the correlation between node strength and random walk dynamics in undirected networks has been established before (including full correlation), the findings reported here are more general since they are related to any directed weighted network, such as the www and the airport network. indeed, the correlation conditions for undirected networks can be understood as a particular case of the conditions above. a fully correlated network will have |r| = 1. we obtained r = 1 for texts by darwin and wodehouse and for the network of airports in the usa. the word association network was obtained by representing each distinct word as a node, while the edges were established by the sequence of immediately adjacent words in the text after the removal of stopwords and lemmatization. more specifically, the fact that word u has been followed by word v, m times during the text, is represented as w(v, u) = m. zipf's law is known to apply to this type of network. the airport network presents a link between two airports if there exists at least one flight between them. the number of flights performed in one month was used as the strength of the edges. we obtained r for various real networks (table i), including the fully correlated networks mentioned above. to interpret these data, we recall that a small r means that a hub (large in or out strength) in topology is not necessarily a center of activity. notably, in all cases considered r is greater for the in strength than for the out strength. this may be understood with a trivial example of a node from which a high number of links emerge (implying large out strength) but which has only very few inbound links. this node, in a random walk model, will be rarely occupied and thus cannot be a center of activity, though it will strongly affect the rest of the network by sending activation to many other targets. understanding why a hub in terms of in strength may fail to be very active is more subtle. consider a central node receiving links from many other nodes arranged in a circle, i.e., the central node has a large in strength but with the surrounding nodes possessing small in strength. in other words, if a node i receives several links from nodes with low activity, this node i will likewise be fairly inactive. in order to further analyze the latter case, we may examine the correlations between the frequency of visits to each node i and the cumulative hierarchical in and out strengths of that node. the hierarchical degree of a network node provides a natural extension of the traditional concept of node degree.
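a quick numerical check of this sufficient condition (again illustrative, reusing steady_state_visits from the previous sketch): any network with symmetric weights has is(i) = os(i) at every node, so the measured correlations should then be (numerically) equal to 1.

```python
import numpy as np
from scipy.stats import pearsonr

def strength_correlations(W):
    """Pearson correlation between steady-state visit frequency and the
    in/out strengths, for the convention W[i, j] = weight of edge j -> i."""
    v = steady_state_visits(W)      # defined in the previous sketch
    in_strength = W.sum(axis=1)
    out_strength = W.sum(axis=0)
    return pearsonr(v, in_strength)[0], pearsonr(v, out_strength)[0]

# example: symmetric weights guarantee is(i) = os(i) at every node
rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(8, 8))
np.fill_diagonal(A, 0.0)            # no self-connections
W = A + A.T
print(strength_correlations(W))     # both values are (numerically) 1
```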
(table i caption: number of nodes (no. nodes), number of edges (no. edges), means and standard deviations of the clustering coefficient (cc), cumulative hierarchical in strengths for levels - (is -is ), cumulative hierarchical out strengths for levels - (os -os ), and the pearson correlation coefficients between the activation and all cumulative hierarchical in strengths and out strengths (r is, r os) for the complex networks considered in the present work.) for the least correlated network analyzed, viz., that of the largest strongly connected cluster in the network of www links in the domain of ref. (massey university, new zealand) (refs. and ), activity could not be related to in strength at any hierarchical level. because the pearson coefficient corresponds to a single real value, it cannot adequately express the coexistence of the many relationships between activity and degrees present in this specific network as well as possibly heterogeneous topologies. very similar results were obtained for other www networks, which indicates that the reasons why topological hubs have not been highly active cannot be identified at the present moment (see, however, the discussion for more highly correlated networks below). however, for the two neuronal structures of table i that are not fully correlated (the network defined by the interconnectivity between cortical regions of the cat and the network of synaptic connections in c. elegans), activity was shown to increase with the cumulative first and second hierarchical in strengths. in the cat cortical network, each cortical region is represented as a node, and the interconnections are reflected by the network edges. significantly, in a previous paper, it was shown that when connections between cortex and thalamus were included, the correlation between activity and outdegree increased significantly. this could be interpreted as a result of increased efficiency, with the topological hubs becoming highly active. furthermore, for the fully correlated networks, such as the word associations obtained for texts by darwin and wodehouse, activity increased basically with the square of the cumulative second hierarchical in strength (see supplementary fig. in ref. ). in addition, the correlations obtained for these two authors are markedly distinct, as the work of wodehouse is characterized by a substantially steeper increase of the frequency of visits for large in strength values (see supplementary fig. in ref. ). therefore, the results considering higher cumulative hierarchical degrees may serve as a feature for authorship identification. in conclusion, we have established (i) a set of conditions for full correlation between topological and dynamical features of directed complex networks and demonstrated that (ii) zipf's law can be naturally derived for fully correlated networks. result (i) is of fundamental importance for studies relating the dynamics and connectivity in networks, with critical practical implications. for instance, it not only demonstrates that hubs of connectivity may not correspond to hubs of activity but also provides a sufficient condition for achieving full correlation. result (ii) is also of fundamental importance as it relates two of the most important concepts in complex systems, namely, zipf's law and scale-free networks. even though sharing the feature of power law, these two key concepts had been extensively studied on their own. the result reported in this work paves the way for important additional investigations, especially by showing that zipf's law may be a consequence of dynamics taking place in scale-free systems.
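the cumulative hierarchical strengths used above extend node strength to progressively larger neighborhoods. one plausible reading, not necessarily the exact definition adopted in the letter, is sketched below: the level-h cumulative hierarchical in strength of a node is taken here as the total weight entering the set of nodes reachable from it by following at most h incoming edges.

```python
import networkx as nx

def cumulative_hierarchical_in_strength(g, node, level):
    """Illustrative reading of the cumulative hierarchical in strength.

    g is a weighted nx.DiGraph; the level-h in-neighborhood of `node` is grown
    by repeatedly following incoming edges, and the function returns the total
    weight of edges entering that neighborhood from outside it.
    """
    ball, frontier = {node}, {node}
    for _ in range(level):
        frontier = {u for v in frontier for u in g.predecessors(v)} - ball
        ball |= frontier
    return sum(d.get("weight", 1.0)
               for u, _, d in g.in_edges(ball, data=True) if u not in ball)
```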
in the cases where the network is not fully correlated, the pearson coefficient may be used as a characterizing parameter. for a network with very small correlation, such as the www links between the pages in a new zealand domain analyzed here, the reasons for hubs failing to be active could not be identified, probably because of the substantially higher complexity and heterogeneity of this network, including varying levels of clustering coefficients, as compared to the neuronal networks. this work was financially supported by fapesp and cnpq (brazil). luciano da f. costa thanks grants / - (fapesp) and / - (cnpq). markov chains: gibbs fields, monte carlo simulation, and queues (springer); the formation of vegetable mould through the action of worms, with observations on their habits (murray); the pothunters (a & c black); bureau of transportation statistics: airline on-time performance data; modern information retrieval (addison-wesley); the oxford handbook of computational linguistics (oxford); human behaviour and the principle of least effort (addison-wesley). key: cord- -fhenhjvm authors: saha, debdatta; vasuprada, t. m. title: reconciling conflicting themes of traditionality and innovation: an application of research networks using author affiliation date: - - journal: adv tradit med (adtm) doi: . /s - - -w sha: doc_id: cord_uid: fhenhjvm innovation takes different forms: varying from path-breaking discoveries to adaptive changes that survive external shifts in the environment. our paper investigates the nature and process of innovation in the traditional knowledge system of ayurveda by tracing the footprints that innovation leaves in the academic research network of published papers from the pubmed database. traditional knowledge systems defy the application of standard measures of innovation such as patents and patent citations. however, the continuity in content of these knowledge systems, which are studied using modern publication standards prescribed by academic journals, indicates a kind of adaptive innovation that we track using an author-affiliation based measure of homophily. our investigation of this measure and its relationship with currently accepted standards of journal quality clearly shows how systems of knowledge can continue in an unbroken tradition without becoming extinct. rather than no innovation, traditional knowledge systems evolve by adapting to modern standards of knowledge dissemination without significant alteration in their content. one important platform for sharing knowledge, be it results of cutting-edge research or establishing old truths in a modern context, is journal publications (thyer ; edwards ; sandström and van den besselaar ). the medicinal sciences are of particular interest, as team collaboration is necessary to produce research outcomes (hall et al. ; gibbons ). of the existing data-sets providing details of academic collaborations and knowledge sharing in biosciences, pubmed is one of the foremost sources (falagas et al. b; mcentyre and lipman ; anders and evans ). with a collection of more than million citations on biomedical literature, pubmed (maintained by the us government funded us national library of medicine and national institutes of health) offers a panorama of publications of diverse qualities and topics. of great interest is the simultaneous co-existence of research papers not only from the current mainstream of bio-medicine, but also from other branches of medical knowledge, such as traditional medicine.
no two canons of knowledge can be as distinct from each other as bio-medicine and traditional medicine (baars and hamre ; mukharji ), and yet academic collaborations conform to similar standards of dissemination of knowledge and are available on a common platform like pubmed. in terms of the character of the discipline, bio-medicine displays masculinity and low power distance, whereas traditional medicine strives to retain content untouched. the importance of team collaboration for producing quality research has been documented for other disciplines and across countries; see adams ( ) for a general discussion on the impact of international collaborations on knowledge sharing. the world health organization's report on traditional medicine ( ) defines traditional medicine as "the sum total of the knowledge, skills and practices based on the theories, beliefs and experiences indigenous to different cultures, whether explicable or not, used in the maintenance of health, as well as in the prevention, diagnosis, improvement or treatment of physical and mental illnesses." this definition finds resonance in fokunang et al. ( ). the former is marked by schumpeterian upheavals and stark innovations from time to time (such as the development of vaccines and novel drugs for treating new disease conditions), whereas the latter prides itself on the continuity of knowledge handed down from generation to generation [see banerjee ( ), shukla and sinclair ( ) and mathur ( )]. the simultaneous existence of research papers from both disciplines in journals conforming to uniform standards of publication automatically raises questions about the true nature of innovation in traditional knowledge systems like ayurveda. it is possible that it is an innovative discipline because it shares the same kind of research output space as bio-medicine publications. on the other hand, the nature of collaborations within the traditional knowledge journals might be 'non-innovative', despite publications in standard format journals. when knowledge systems adopt the platform of journal publications, the structure of information disseminated becomes a function of the standards and rules set by them. there are specific structural restrictions, such as bibliographies of specific types (green ; masic ), journal rankings (gonzález-pereira et al. ), double-blind peer review systems (albers et al. ) etc., that are imposed when knowledge is shared through journal publications. this brings us to our central query: when a medicinal system which is considered 'traditional' uses modern publication standards to disseminate knowledge, what kind of collaborative structures will be observed? how does a system that conforms with these modern publication standards insulate itself from dilution in terms of content and practices? to what extent will traditional knowledge systems engage with academic collaborations as observed in other mainstream disciplines? we contextualize our query by studying the publication network in ayurveda, a rich traditional medicinal system prevalent in south asia, and largely limit ourselves to the first two questions. there are other branches of traditional medicine, such as the indigenous medicine of indians in the americas or tibetan/himalayan traditional medicine systems. in fact, in recent times, the coronavirus epidemic has shown the relevance of chinese traditional medicine. we have evidence of successful treatment of viral cases in wuhan, the centre of the outbreak.
the ministry of ayush, government of india, has announced a taskforce (in early april ) with members from the indian council of medical research, the council of scientific and industrial research, the department of biotechnology, the ayush ministry and the who (see a discussion in https ://scien ce.thewi re.in/the-scien ces/minis try-of-ayush -taskforce -clini cal-trial s-herbs -proph ylact ics/), to investigate the potential of ayurvedic cures for coronavirus symptoms. as a prophylactic cure for covid- , the taskforce has recommended clinical trial testing of some herbs, prominently ashwagandha (withania somnifera). this herb, which we research in detail in this paper, has been mentioned in recent times as a potential alternative to hydroxychloroquine. these efforts are in the initial stages, but the ayush ministry has established a clear protocol for registering ayurvedic formulations to establish efficacy in treating symptoms of covid- as well as warning alerts to all regarding unsubstantiated claims of efficacy of herbal cures. traditional knowledge systems exist in modern times due to its continued relevance, despite its continued and steady referencing to historical repositories of information. however, knowledge flows in a discipline are, by no means, only limited to journal publications, as books, project applications and grants (dahlander and mcfarland ) , web-and video logs and many other forms of online and open source platforms (yan ; chesbrough ; zucker et al. ) also contribute to its denouement. see the report available at http://www.xinhu anet.com/engli sh/ - / /c_ .htm, which mentions that % of the covid- patients were treated with chinese traditional medicine. this ministry was established by the government of india as recently as and is the regulatory authority for alternative medicine disciplines, such as ayurveda, siddha, unani and homeopathy. multiple medical blogs as well as new reports mention this: https ://www.expre sspha rma.in/covid -updat es/gover nment -to-condu ctrando mised -contr olled -clini cal-trial -of-ashwa gandh a/; https ://www. busin ess-stand ard.com/artic le/pti-stori es/covid - -govt-to-condu ctrando mised -contr olled -clini cal-trial -of-ashwa gandh a- _ .html; https ://times ofind ia.india times .com/life-style /healt h-fitne ss/home-remed ies/covid - -minis try-of-ayush -start s-clini cal-trial s-for-ashwa gandh a-and- -other -ayurv edic-herbs -here-is-what-youneed-to-know/photo story / .cms;https ://www.expre sspha rma.in/ayush /ashwa gandh a-can-be-effec tive-preve ntive -drug-again stcoron aviru s-iit-delhi -resea rch/. https ://www.ayush .gov.in/docs/clini cal-proto col-guide line.pdf. https ://www.ayush .gov.in/docs/ .pdf. within the space of journal publications, we have to pick the best measure to capture innovation. academic paper writing with multiple authors (as is generally the case in most disciplines) involves joint ventures between diverse researchers, who reflect on the research problem from different perspectives. we explore the nature of the interconnections between authors, as these reflect, in a reduced form, the simultaneous adaptation and continuity in the process of knowledge transmission using the platform of academic journals. we postulate that the nature of these interconnections, as captured by the notions of network density and homophily in a research network, have the potential to capture innovation in traditional knowledge systems. consider network density first. 
this measures the proportion of potential ties that are realized in an empirical network (newman ) . the more dense a network, the higher the number of potential ties that are actualized leading to larger flows of information. a sparse network leads to less information transmission as well as benefits and dangers of interconnections, as hearn et al. ( ) discusses. hence, in a densely connected network, with many cross-connections between researchers, while benefits of continuous knowledge is enhanced, the possibility of disruptive changes coming through the structure of the connections also become alive. this brings us to the issue of homophily in the research network and its relationship with adaptive innovation in networks with different densities. homophily, which is the literal equivalent for the idiom 'birds of a feather flock together', in a research network reveals the extent to which 'similar' researchers form collaborations. note that most of the literature on homophily relate to a study of different attributes of researchers, such as gender (shrum et al. ) , race or ethnicity (leszczensky and pink ) , language (pezzuti et al. ) etc. and interest (dahlander and mcfarland ) . the latter also differentiate between attribute and interest-based homophily of university researchers from their organizational foci (departments and research centers). note that their investigation revolves around a specific issue of tie formation versus continuations in collaborations for a particular university in the us. when it comes to traditional medicine, universities are not the optimal institutional foci for academic research, as most mainstream medical colleges teach only bio-medicine (patwardhan and patwardhan ). traditional medicine is practiced in dedicated research centers and some specific universities, as well as by independent researchers who publish in international peer-reviewed journals such as journal of ayurveda and integrative medicine (j-aim with a scimago rank of . ) or journal of ayurveda (published by the national institute of ayurveda, jaipur, india) or ayu (open access journal published by the institute for post graduate teaching & research in ayurveda, gujarat ayurved university, india) as well as others of less repute [see kotecha ( ) for concerns regarding quality of publications in ayurveda]. for our study, an appropriate measure of homophily in publications has to capture the homogeneity in the quality of information that is exchanged through academic research collaborations, as information transmission leads to the genesis of innovative ideas in the research space. the more homogeneous this exchange, the higher will be the self-referencing character of the transmitted knowledge. the challenge here is to understand how to measure similarity. we propose two ways for discussing similarity of connections in a research network: (i) a macro measure that tests for similarity in connections in the overall network and (ii) a micro measure that explores the presence of similarity in author connections for each academic paper in the overall research network. the latter measure is a marriage between organizational foci and homophily, which dahlander and mcfarland ( ) treat as two independent conditions for studying academic collaborations. our work is close to dunn et al. ( ) , who treat researchers in bio-medicine in terms of their relationship with the industry: either with industry affiliations or without these associations. 
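for reference, this proportion is immediate to compute for a simple co-authorship graph; the sketch below takes a hypothetical list of author pairs as input and is purely illustrative.

```python
import networkx as nx

def coauthor_density(author_pairs):
    """Share of potential author pairs actually linked by at least one joint paper."""
    g = nx.Graph()
    g.add_edges_from(author_pairs)
    n, m = g.number_of_nodes(), g.number_of_edges()
    return 2.0 * m / (n * (n - 1)) if n > 1 else 0.0   # same as nx.density(g)
```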
this kind of bifurcation limits the analysis to a study of dyadic ties or collaborations only. we use a more flexible definition for affiliation by institution in order to accommodate collaborations between more than two authors. note that there is a trade-off between the network density and homophily: knowledge perpetuation in a densely connected network requires some form of similarity among agents exchanging information such that the content of the knowledge is not subject to drastic change. this has to be the case for traditional knowledge systems that have not become extinct, but continue to co-exist with other forms of knowledge canons. we couple our measures of homophily with a measure of quality of publications (the scimago journal rankings). modern publication standards, which equate publication quality using scimago-type of journal rankings [see gonzález-pereira et al. ( ) , falagas et al. ( a) , cite], should yield a negative relationship between low innovation possibilities (as exhibited by high homophily) in research papers and the rank of the journal publishing such papers. put together, our query about appropriate measures for innovation within traditional knowledge systems indicate certain patterns in the empirical research network. we expect to see that tm/cam research networks would be marked by an integration into modern publication standards, while feld ( ) define these organizational foci as institutions which may be social or legal entities around which collaborative activity is organized. the ministry of ayush, government of india maintains a database of journal articles published in reputed journals at http://ayush porta l.nic.in/defau lt.aspx. retaining characteristics of continuity within connections between researchers in the network. more precisely, our prediction is that research networks in ayurveda would exhibit: i conformity with modern publication standards: negative relationship of low research potential in the research network (measured using homophily) and journal publication standard (measured using scimago rankings); j higher homophily in more densely connected networks: ensuring self-preservation of knowledge in the process of transmission and exchange. we study our predictions in two research networks specific to two specific natural herbs: withania somnifera or ashwagandha and emblica officinalis or amla. most of the papers investigate the properties and effects of these herbs in a stand-alone fashion, with hardly any evidence of academic research on the combined effects of these two common ayurvedic herbs. our results corroborate the pattern we predict that perpetuates knowledge through adaptation to modern standards in publication. the more densely connected research network (emblica officinalis or amla) shows a clear causal relationship between publication standard of a journal and the lack of homophily among author connections. there is clear evidence of overall homophily in the research network, when we investigate connections between pairs of authors using the q-measure of modularity. however, this macro measure does not indicate the mechanism through which homophily is likely to result in adaptive innovation in research networks. this is possible through our perpaper affiliation-based measure of homophily. the latter is our contribution to the literature on estimating measures of homophily that allows one to study supra-dyadic collaborations (research papers with more than two authors). 
as most papers in journals, particularly in the sciences, contain teams of more than three or four authors, our measure provides an alternative to existing measures which only study twoperson collaborations. the discussion in this paper is organized along the following lines: "innovation and traditional medicine: a framework for analysis" section discusses the theoretical framework for understanding adaptive innovation in ayurveda. "empirical methodology: measuring channels of adaptive innovation" section details the empirical methodology, including our proposed measures for capturing innovation in research networks in ayurveda, filtered by specific herbs. "empirical results" section discusses the data sources and the empirical results, while "conclusion" section concludes the paper with a discussion of our findings as well as limitations in the light of the theoretical perspective we propose. traditional medicine based on ayurveda deals with naturally occurring ingredients, mostly plant-based extracts (yuan et al. ; gangadharan ; samy et al. ). we provide a brief description of the knowledge system of ayurveda, before investigating its positioning in modern journal publications. ayurveda, which originated years ago in india, has adapted over the years and continues to be popularly accepted as a system for retaining health as well as curing diseases (jaiswal and williams ) . this popularity was not limited to india alone in earlier times. for instance, salema et al. ( ) , in his description of colonial pharmacies in the first global age between - ce, describes the widespread application of ayurvedic herbs as medicine in many parts of the world, starting with portuguese india. he mentions that medicines originating in india, with the agency of jesuit missionaries engaging in medicinal trade, became very important in the state-sponsored health care institutions of the portuguese colonies around the world. not only medicines, research on indian medicines providing information about (i) the medicinal properties of substances from the indian sub-continent (ii) commercialization of these substances and (iii) market demand were published in the form of medical reports sponsored by the portuguese overseas council in lisbon. in fact, garcia de orta's colloquies on the samples and drugs of india, published in goa in ce, was the first printed publication on indian plants and medicines, as mentioned in salema et al. ( ) . garcia de orta was a pioneer in pharmacognosy and the first european writer on indian medicine. the outreach of this knowledge and the medicinal products covered a diverse set of regions: macau, timor, mozambique, brazil, sau tome and the continental portugal (to name a few as mentioned in salema et al. ( ) ). despite this spread, ayurvedic texts such as the charaka samhita ( - bce), the sushruta samhita ( - bce), the ashtanga hridayam ( - ce), ashtanga sangraha ( - ce) etc., are studied by practitioners till date in the original or abridged versions. till ce, traditional medicine, and particularly ayurveda, was the prevalent and respected system of medicine in countries like india and sri lanka. it was during the period of increasing british colonization, that is, from to ce, which saw various advances in western medicine and a consequent but slow loss of reputation of traditional medicine (saini ) . 
history has shown that the advent of western medicine has relegated traditional systems of cure such as ayurveda to a subaltern space (banerjee ; ravishankar and shukla ; saini ; salema et al. ; patwardhan ) . the slew of standards for proving efficacy of cure, safety of cures (for example, conduct of clinical trials) coupled with recent advances in biotechnology has been at the forefront of pharmaceutical innovation in western medicine. therefore, a natural conclusion about the decay of traditional medicine in the face of competition from its newer counterpart is attributable to its self-perpetuating standards of adaptions. as opposed to the slew of drastic innovations delivered through the institution of clinical trials and other enforceable standards in bio-medicine, ayurveda adapted to the niche branch of 'traditionality' that did not incorporate similar institutions and standards. an academic collaboration network can be modelled as a finite collection of nodes (representing individual researchers), who are connected through co-authorship edges to form a simple graph g: where e is the set of edges (co-author connections) and v is the set of nodes (authors). a few features of this definition are in order. first, an author with no connections proxies for single-authored papers. a paper with only two authors will be represented by a single edge connecting two nodes. a drawback of this representation is that there is no direct way of capturing a paper with more than two authors. one way around this is to break up the collaborations in the paper and treat them in a binary fashion: with three authors, consider first the link between the first and the second author, then the link between the second and the third author and at last, between the first and the third author. this loses out the ( ) g = ⟨e, v⟩ flavour of the combined effect of knowledge sharing through a team of more than two people. an effective representation here requires a modification of the simple graph to a more general network structure such as a hypergraph [see newman ( ) ]. the existing literature investigating collaborations limit the discussion to dyadic connections. our proposed micromeasure is closest to freeman and huang ( ) , who investigate homophily using author-ethnicity in . million scientific papers written in the us between and . they find that high homophily results in a lower potential for innovation. however, in order to work with simple graphs, freeman and huang ( ) restrict research alliances only to the first and last authors of scientific publications assuming that they have the maximum responsibility. while this filter on the space of authors allows the overall network to retain a simple graph structure, the loss of information in the process is likely to result in an inability to answer the research question of interest. this is particularly so for us, as we assume that the composition of the research team itself reveals innovative potential. a side issue with ethnicity as the defining characteristic for authors in the process of knowledge sharing. traditional knowledge is likely to circulate among limited ethnicities. what might matter more are constraints imposed by the institutional affiliation of the researcher. our measure of homophily is based on affiliations of co-authors, rather than ethnicity. 
similarity in institutional affiliation of authors results in homophily, as similar resources (research budget, institutional characteristics and knowledge depositories, like access to research databases) are involved in producing research output. dahlander and mcfarland ( ) mention five separate factors in their study of tie formation and continuation in academics: institutional foci, attribute and interest-based homophily, cumulative advantages from tie formation, triadic closure (third party reinforcement) and reinforcement of successful collaborations (tie inertia) as separate factors. however, their empirical investigation of these factors also limits itself to dyadic collaborations. in the context of citations in physics journals, bramoullé et al. ( ) notes the presence of homophily and biases, particularly in the formation of new ties, but in a dyadic setup. for studying integration of traditional knowledge systems with modern publication standards, there is no existing theory. we make a weak assumption about incentives that drive co-author incentives to form connections with heterogeneity in institutional affiliations: figures and in appendix depict the ashwagandha and amla research networks as simple graphs. similar ethnic identities of authors indicate high homophily in freeman and huang ( ) . assumption i successful publications in high quality journals drive collaboration incentives [tie inertia, as per dahlander and mcfarland ( ) ]. given a continuum of research journals in ayurveda, it is possible for a researcher to choose his/her research connection to publish papers in any journal in that continuum depending on his/her research grants. the less is the institutional support as well as lower are the benefits of publishing in high quality journals, the less will be the innovative potential in the overall research network. note here that there are no pressures or funding coming from the downstream commercial firms discovering drugs to support research incentives in this stage of research, unlike for bio-medicine [see dunn et al. ( ) on industrysponsored research in the latter]. it is the standards of research itself and an individual researcher's incentive constraints that determine the innovative potential of the research network. our first measure is network density of the herb-specific research network. this measure captures the proportion of potential connections that are actually present in the graph using the simple graph representation of the amla and ashwagandha research networks. network density varies between a maximum value of and a minimum of . second, we work out the micro and macro measures of homophily in the two networks. the micro measure is based on the by-paper homophily index defined by freeman and huang ( ) . for a given paper j, we define this measure h j as the sum of the squares of the shares of each affiliation group among the authors of the paper: where n = number of authors; s i = the share of the i th affiliation in the authors of paper j. this measure is akin to the herfindahl-hirschman index (hhi) used to measure concentration in markets, as mentioned by freeman and huang ( ) . note that freeman and huang ( ) define this index based on the ethnic concentration of authors writing a paper. this is straightforward, as an author can be mapped to his or her ethnicity uniquely. 
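a minimal sketch of this by-paper index, assuming that each author has already been reduced to a single relevant affiliation (the tie-breaking rule for authors with multiple affiliations is described below); the function and its inputs are illustrative rather than the authors' code.

```python
from collections import Counter

def paper_homophily(affiliations):
    """By-paper homophily index h_j = sum of squared affiliation shares (an HHI).

    `affiliations` lists one resolved affiliation per author of the paper,
    e.g. ["dept. of panchakarma, univ. A", "dept. of panchakarma, univ. B"].
    Returns 1.0 when all authors share one affiliation (maximum homophily).
    """
    n = len(affiliations)
    counts = Counter(affiliations)
    return sum((c / n) ** 2 for c in counts.values())
```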
we do not work with author ethnicity, as we feel the nature of information flows in collaborations are better captured using the resource constraints represented by institutional affiliations. the affiliation types we consider are university departments, research centres, government-sponsored think tanks etc. there is a variety of such institutions for each author; sometimes authors have multiple affiliations. due to this, we have to provide a tie-breaking rule for authors with multiple affiliations. as a baseline, we assume that in cases where authors have multiple affiliations in a paper, the relevant affiliation is the: unique affiliation of any author that is not shared with any of the other author as the relevant affiliation; if the earlier option is not possible (that is, there exists no unique affiliation for the author), then we select the first of the listed affiliations of the author. this tie-breaker assumption is, of course, a bit arbitrary. in a later section, we conduct a robustness check of our results by changing this assumption to see if the regression results hold. in either case, the least homophily is exhibited when all the authors have different institutional affiliations whereas the highest degree of homophily occurs when all the authors belong to the same department in the same institution. if all of the authors on a paper have the same affiliation (i.e., they belong to the same department in an institution), then h j equals . , which is the maximum value of the homophily measure. if the paper has authors of different affiliations, then h j takes different discrete values for papers depending on the number of affiliations and number of authors on a paper. next, we follow up the by-paper homophily measure with the homophily or assortative mixing in the overall herb networks of amla and ashwagandha. in this more macromeasure, we work with a simple graph characterization and therefore, and use coarser categories for affiliation. here, we illustrate the calculation of h j for the general case first and then for cases . and . mentioned above. for the general case: consider the paper titled 'clinical efficacy of amalaki rasayana in the management of pandu (iron deficiency anemia)' co-authored by s. layeeq (department of panchakarma, uttarakhand ayurved university) and a.b. thakar (department of panchakarma, gujarat ayurved university) in the amla research network. here, h j is the sum of ( ∕ ) + ( ∕ ) which is equal to / since s i for each affiliation is . . now, in the case of tie-breaker ., for the paper from the ashwagandha research network titled 'antihyperalgesic effects of ashwagandha (withania somnifera root extract) in rat models of postoperative and neuropathic pain', two out of the four authors have multiple affiliations. all the authors are affiliated to korea food research institute, but two are additionally affiliated to the korea university of science and technology. thus, we consider the unique affiliation of the last two authors and h j is calculated as the sum of ( ∕ ) + ( ∕ ) and is equal to . . in the case of tie-breaker ., for the paper titled 'effects of withania somnifera and tinospora cordifolia extracts on the side population phenotype of human epithelial cancer cells: toward targeting multi-drug resistance in cancer' has six authors: n. maliyakkal, a. appadath beeran, s.a. balaji, n. udupa, s. ranganath pai, a. rangarajan. 
the first author is affiliated to the indian institute of science (iisc) as well as manipal university, the second, fourth and fifth authors are affiliated to manipal university and the third and sixth authors are affiliated to iisc. here, since there are no unique affiliations, we take that the first author is affiliated to iisc (first of the listed affiliations). thus, we calculate h j as the sum of ( ∕ ) + ( ∕ ) which is equal to . . authors are divided into four categories: authors whose institutions are based in india, sri lanka, rest of the world (not india or sri lanka) and multiple institution/country affiliations. the separate categories for india and sri lanka is due to the fact that these countries have a cultural tradition of ayurveda historically. we calculate newman's specification [see newman ( ) ] of the measure of modularity, q, based on affiliations to ascertain the presence of homophily or assortative mixing in our networks as follows: here, a ij = element of the adjacency matrix between nodes i and j; k i = degree of node (author) i, i.e, the number of authors that are connected to node i; c i = type of node i, i.e, whether the node i has an indian, sri lankan, foreign (other than indian or sri lankan) institution affiliation or multiple affiliations; m = total no. of edges in the network; = kronecker delta which is when c i = c j , i.e, when nodes i and j are of the same type. this q measure has the advantage of comparing the presence of homophily relative to a counterfactual of what kind of connections would be present if, unlike our assumption , authors randomly chose co-authors for writing research papers. the deliberate strategic choice in collaborative connections, assuming that it increases the chance of publishing in high quality journals, is captured through this measure through its two terms: the first term in the formula of q represents the actual level of assortative mixing in the empirical network and the second term is the extent of this mixing that we are likely to see if all the links in the network were created randomly. a positive value of q indicates significant assortative mixing and hence homophily in the network, whereas a near-zero value of q is indicative of very little homophily in the network. the publication standard of academic journals, whose relationship we study next in relation to homophily, is measured using the scimago rank. our assumption is that a high scimago rank is indicative of high quality of innovation. we use the scimago ranking since it based on the idea that 'not all citations are equal'. the alternative measure, average impact factor is, in fact, highly correlated with the average scimago rank. the causal relationship we test predicts the manner in which the scimago rank of a journal (dependent variable) varies with our micro measure of homophily (independent variable) with additional controls. for this purpose, we conduct a quantile or percentile regression, since the distribution of our dependent variable (scimago ranking of journals) is skewed and not normally distributed. quantile regression is based on the estimation of conditional quantile functions as against the classical linear regression which is based on minimizing sums of squared residuals. linear regression helps in estimating models for conditional means whereas quantile regression estimates models for the conditional median as well as other conditional quantiles. 
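the q measure just described matches newman's standard specification, q = (1/2m) ∑_ij [a_ij − k_i k_j /(2m)] δ(c_i, c_j), with a_ij the adjacency matrix, k_i the node degree, m the number of edges and δ the kronecker delta over node types; assuming that form, a small illustrative computation over the four affiliation categories could look as follows (variable names are hypothetical).

```python
import networkx as nx

def affiliation_modularity(g, category):
    """Newman's modularity Q for a simple undirected graph g, with node types
    given by `category`, e.g. 'india', 'sri lanka', 'other' or 'multiple'.

    Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)
    """
    m = g.number_of_edges()
    degree = dict(g.degree())
    q = 0.0
    for i in g.nodes():
        for j in g.nodes():
            if category[i] != category[j]:
                continue
            a_ij = 1.0 if g.has_edge(i, j) else 0.0
            q += a_ij - degree[i] * degree[j] / (2.0 * m)
    return q / (2.0 * m)
```

for the same grouping of authors into categories, networkx's built-in community modularity routine (nx.community.modularity) should return the same value, which provides a convenient cross-check.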
further, the quantile regression treats outliers and non-normal errors more robustly than the ordinary least squares (ols) regression. we contrast our results against the standard ols regression results. we expect that less 'homophilous' are author affiliations in a paper, the higher will be the innovative potential of the paper. the likelihood of publication in a higher ranked journal therefore, higher will be the scimago ranking. hence, we expect a negative relationship between h j and average scimago ranking. we use data on research papers from pubmed database , which is maintained by the us national library of medicine and national institutes of health, for a five year period ( july to july ). it contains more than million citations for biomedical literature from medline, life science journals, and online books. search string matters for all bibliometric research. we found that research papers which appear with the string search 'withania somnifera + ayurveda' are contained in 'ashwagandha + ayurveda' but not vice versa. hence, we used the former search string. for amla, we combined the searches 'amla + ayurveda' and 'emblica officinalis + ayurveda', that is for both traditional/ local name and scientific name, because the union set represents more papers than individual searches, and the brief overview of abstract also shows that the herb has been used in the analysis for the paper. we list information on articles, authors, and the country of institution of the author as well as authors' institutional affiliations. note that if the papers are not available online, we mark the authors' affiliation as not available. also, when an author has co-authored more than one paper, where for one paper the affiliation is given while for others it is not mentioned, then we take the affiliation which has been mentioned as relevant. the number of observations in the ashwagandha network is almost twice that of the amla network, though on an average, an author in each of the networks has the same degree. the graph density of the amla research network is higher (it is . ) compared to the ashwagandha network (for which graph density is . ). these figures for graph density are extremely low, particularly in comparison with a complete graph (in which every pair of nodes is connected by a unique edge) with density equal to . however, relative to the amla network, ashwagangha has more research papers written over the five year period taken in consideration. continuity of knowledge, when many authors are involved in the overall research network, is ensured by: similar per-paper homophily among authors by affiliation ( h j ) in the less dense ashwagandha network (average homophily score is . ) compared to the more densely connected amla network (average homophily score is . ). higher variation in the quality of journals in the ashwagandha network (measured by the average sjr variable). its standard deviation in the ashwagandha network is relatively high at . compared to . for the amla network. higher per-paper homophily ( h j ) in achieving higher quality publications; the value of the average scimago journal rank (sjr) is significantly higher at . for the ashwagandha network compared to . for the amla network. this implies compliance of research alliances note that the average scimago rank (average sjr), as shown in the histograms in figs. and respectively for ashwagandha and amla, are significantly skewed. 
most of the journal papers are clustered in intervals, as the bars of the histograms show in these figures. not only is there an interval-specific clustering, the bulk of journals in both the networks have a low scimago rank. most of our observations (the highest density of journals) are bunched towards very low values on the x-axis, much below the mean average sjr. this can also be read off from the continuous line fitted to the histograms. though both the histograms look similar, the fitted line clearly shows that almost all the papers in the amla network are below an average rank of , whereas in the ashwagandha network, there is a small presence (less than %) of papers above the scimago rank of . this reveals that the overall quality of journals and therefore, papers and their innovative potential in the amla network is worse than in the ashwagandha network. this is, of course, beholden to our assumptions about inferences of innovative potential and high quality in papers as reflected by the average scimago (sjr) ranks. we point out here that the average sjr ranking is not a paper-level metric, that is, it will only change when the journal where the paper is published changes. we have papers that are published in the same journal and therefore, we have repeat values of the ranking scores. what we find is that for both networks, the median is less than the mean for all values of average scimago ranks and that low quality publications outnumber higher quality ones in our data. as mentioned earlier, there are multiple ways to measure homophily. other than our per-paper affiliation-based measure, we can comment on the extent of assortative mixing or modularity in the entire network (see the definition in "empirical methodology: measuring channels of adaptive innovation" section). we find existence of assortative mixing or homophily in the overall research networks we study, as the value of q ( . for the amla network and . for the ashwagandha network respectively) is higher than zero. the q measure shows a higher overall homophily in the ashwagandha network relative to the amla one. the simple graphs in figs. and in appendix show the connections in these networks dyadically. these graphs reveal an empirical regularity seen in most modern publication networks: authors in disciplines like traditional medicine mostly work in small connected sub-graphs (indicating that collaborations are deliberate and non-random [see newman ( ) regarding limited number of collaborators in theoretical disciplines like high energy theory]. indian authors rarely form collaborations with sri lankan and other foreign authors. the presence of the latter type of authors is in predominance in the ashwagandha network than in the amla one: it is interesting that despite ayurveda's historical origins in india, foreign institutions outside the south asian region engage with the discipline. however, the nature of these academic endeavours is limited within their own cliques, giving rise to a higher q measure for the ashwagandha network than the amla network. now, the two measures of homophily are not directly comparable, as their objectives are different. the q-measure works out whether connections formed in the network are strategic or random in the network as a whole (the first term in the formula for q works out the extent of strategic connections in the network relative to the second term capturing random connections). 
the micro measure works out homophily at the level of individual research papers in the network, whereas for the calculation of the q-measure, authors are classified in terms of four affiliations (indian, sri lankan, others and multiple affiliations). a positive value of q shows that research links in the networks we study are made strategically, which supports our hypothesis that adaptive innovation works out through some form of homophily in the network. the last point that deserves further exploration is the precise relationship between journal quality and homophily, which we work out in the next section. recall that the scimago rankings were highly skewed in both the networks. hence, we investigate the effect of homophily, after controlling for other network-specific features, on the scimago ranking of journals in each network using a quantile regression. the dependent variable is the average scimago journal ranking in the network and the independent regressor of interest is the homophily index. we control for the degree of the corresponding author in the respective network and the total number of references, and contrast our results with the ordinary least squares (ols) regression to see the effect of the skewness on the causal relationship between the journal quality ranking and the independent variables. if skewness matters in the regression, then the ols regression (which predicts the effect of the independent regressor on the mean value of the dependent variable, in the presence of other controls) would show a different pattern compared to the effect on the other quantiles. other than the th, the th (the median) and the th quantile, we consider a few other percentiles of the dependent variable to depict the nonlinearity. we present in table the results using h j , which is our micro measure of homophily based on freeman and huang ( )'s methodology. we find that the effect of homophily ( h j ) on the average scimago ranking works out differently for (i) different estimation techniques (ols as opposed to quantile regression) and (ii) different research networks, ashwagandha and amla. the goodness-of-fit measures for amla (as captured by the pseudo-r values) are much lower as compared to those for the ashwagandha network (see appendix for the results for the ashwagandha network). this could presumably be because of the relatively lower number of observations in the amla network. additionally, the homophily measure ( h j ) significantly lowers the quality of journals for the ols regression and the th, th, th and th percentiles. hence, the results of the ols average out the effect of homophily on the quality of publications, and the percentiles depict a comparatively accurate picture. for the ashwagandha network, we find that h j does not significantly impact the average scimago ranking in the ols regression as well as the different quantile regressions (refer to table in appendix ). the quantile regression, however, shows a non-linear impact of h j on the different quantiles of the dependent variable. for the th (or the th) percentile value of the dependent variable, if h j increases by one unit the average sjr decreases by . (or by . ). these results hold at % level of significance. however, at the th quantile, the effect of h j is no longer significant.
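a sketch of the estimation strategy described above, assuming a per-paper data frame with columns for the average sjr rank, the homophily index h j, the corresponding author's degree and the reference count (the file and column names are placeholders): ols fits the conditional mean, while quantile regressions at several percentiles show how the effect of h j varies across the distribution of journal rank.
```python
# minimal sketch: ols versus quantile regression of average sjr rank on
# per-paper homophily, with network controls (hypothetical column names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("papers.csv")  # columns: avg_sjr, h_j, corr_author_degree, n_references

ols = smf.ols("avg_sjr ~ h_j + corr_author_degree + n_references", data=df).fit()
print(ols.summary())

for q in (0.10, 0.25, 0.50, 0.75, 0.90):
    qr = smf.quantreg("avg_sjr ~ h_j + corr_author_degree + n_references", data=df).fit(q=q)
    print(f"q={q:.2f}: coef(h_j) = {qr.params['h_j']:.3f}, "
          f"pseudo R^2 = {qr.prsquared:.3f}")
```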
in "empirical methodology: measuring channels of adaptive innovation" section, we defined the affiliation-based micro homophily measure with two caveats for the case when authors of papers are affiliated to more than one department/institution. to check whether our results of the previous section hold, we change the two of our earlier assumptions regarding the tie-breaker on papers with multiple author affiliations as follows: i. instead of the unique affiliation for the author with multiple affiliations, we use the most common affiliation (that is, affiliation that is shared by at least one other co-author); ii. when there exists no common affiliation, then we take the last among the listed affiliations of the author with multiple affiliations, instead of the first one. we now redefine h j as h r and replicate our ols and quantile regressions. the results for the amla network are shown in table . comparing table with table , we find that our results have not changed in any significant way. thus, we conclude that results using our micro-measure of homophily are robust to changes in the definition of h j . our paper provides a method to understand the nature of innovation (we term this adaptive innovation) that allows a canon of knowledge to not become extinct while ensuring continuity in content. being traditional does not indicate rigidity. nijar ( ) studies customary law and its relationship with traditional knowledge and he observes that these systems are dynamic and exhibit flexibility through 'a process of natural indigenous resources management that embodies adaptive responses'. the presence of these adaptive responses allow for a specific type of dynamic pattern or innovation. hearn et al. ( ) discusses the role that innovation has in complex systems, which, we believe, carries over to traditional knowledge systems. their claim that it is, paradoxically, also true that innovation also requires some stability and security in the form of such things as organisational structure, discipline and focus. makes our research quest less blunt than whether innovation is possible within stable traditional systems of knowledge to a more nuanced search for how to understand the process of innovation in such systems and measure them. of the possible patterns that a complex system can exhibit, hearn et al. ( ) distinguishes four: i. self-referencing: a condition that leads to perpetuation and continuity in knowledge. ii. self-organization: which arises from exogenous changes resulting in adaptations to the existing body of knowledge. iii. self-transformation: that leads to drastic schumpeterian upheavals in established canons of knowledge, mostly through endogenous changes from within the system. iv. extinction: changes that result in complete demise of a system. of these four conditions, traditional medical systems display self-referencing, as processes and institutions that deal with these have resulted in preservation of knowledge for thousands of years. the fact that the last condition of extinction is not the case with traditional medicine, it must be the case that the institutional structures and interactions among practitioners over the years have adapted themselves [self-organization as per hearn et al. ( ) ], leading to selfperpetuation. the continuity of the structure of knowledge in disciplines like ayurveda also imply that schumpeterian innovations or drastic innovation, which would destroy channels for continuing embedded knowledge, are absent. 
this clearly shows that innovation is not antithetical to traditional knowledge systems, just that the processes of adaptation and change result in perpetuation in knowledge. while we do not expect to see drastic innovation that marks modern bio-medicine, a detailed study of these knowledge systems should reveal very nuanced forms of self-perpetuating adaptations. in the specific context of herb-specific academic paper networks in ayurveda, we find that a lower affiliation-based homophily is causally linked with higher publication ranking, as measured by the scimago ranks of journals publishing these papers. however, more diverse collaborations with low homophily are costly, as per our theory and the contentions of dahlander and mcfarland ( ) . simultaneously, low homophily breeds the possibility of content dilution in the knowledge system. therefore, as a natural response to retaining ties with low collaboration cost [as dahlander and mcfarland ( ) would argue], the research networks we study exhibit high levels of homophily, be it through the lens of assortative mixing or affiliation-based homophily measures. a resultant effect is that these ties allow continuity in the content and structure of knowledge itself, despite an adaptation to modern publication standards. this becomes an adaptation strategy for a traditional knowledge system that continues to persist at present with the retention of the basic structure of knowledge. our findings regarding institution-based homophily also resonate with the finding of dunn et al. ( ) that there is homophily among industry-affiliated researchers in bio-medicine. in comparison with non-industry-affiliated researchers, those with industry links publish more often and more so, with each other. this kind of perpetuation of connections seems to be the commonality of research themes that is necessary for research that has similar type of pharmaceutical industry-based funding. our result regarding similarity in institutional affiliations in publications, despite lowering of the quality of publications, indicates not only the ease of finding collaborators [as mentioned by dahlander and mcfarland ( ) ], but also the commonality of content that helps perpetuate knowledge. however, they find a continuation from the industry to research through collaboration links between industry-linked authors. in sharp contrast, the absence of institutions like clinical trials prevent any meaningful incentives for the drug manufacturing industry to invest in the research segment in ayurveda. note one problem with the publications space is its survival bias: we can only study successful collaboration, not the unsuccessful ones. this is a drawback of all studies that investigate collaborations through the space of academic publications [see dahlander and mcfarland ( ) ]. a different issue remains about our method of analysis: are research publications the appropriate space to look for adaptive innovation in traditional knowledge systems? undoubtedly, we use a modern standard and retrofit it to understand collaborative processes in traditional knowledge systems. these disciplines, which have survived many years of transitions are often best seen as lived traditions [see robbins and dewar ( ) for traditional indigenous medicine in the americas]. most practitioners of ayurveda still refer to the classic texts of charaka samhita as relevant texts in their practice. 
in sum, if traditional medicine adapts itself to modern publication standards, the path it takes is no different from other disciplines that publish in such platforms, such the existence of a small cluster of connected authors in an otherwise sparsely connected network (see figs. and in appendix . ) the modalities of the publication platform determine the quality of connections to a large extent when traditional knowledge finds these outlets for knowledge dissemination. what remains suspect is the overall engagement of traditional medicine in particular and traditional knowledge, in general, with modern publication standards. recent initiatives by the ministry of ayush, government of india, have resulted in the creation of a repository of modern journals with publications in ayurveda, just as pubmed is an international collection of such publications. however, the researcher and the practitioner is unlikely to be the same agent, as our survey of found. the rise of a culture of knowledge dissemination through journals gives rise to the possibility of a disconnect in traditional disciplines: those who publish in journals and those who practice the discipline. what the two sets of individuals believe about innovation within the discipline are likely to be very different. we have limited our analysis to the space of academic journals in this paper. the overall engagement of ayurveda with modern publication standards and what it does to the discipline is part of our future research agenda and has not been addressed in this paper. collaborations: the fourth age of research publication criteria and recommended areas of improvement within school psychology journals as reported by editors, journal board members, and manuscript authors comparison of pubmed and google scholar literature searches whole medical systems versus the system of conventional bio-medicine: a critical, narrative review of similarities, differences, and factors that promote the integration process growth rates of modern science: bibliometric analysis based on the number of publications and cited references homophily and long-run integration in social networks open innovation: a new paradigm for understanding industrial innovation ties that last: tie formation and persistence in research collaborations over time industry influenced evidence production in collaborative research communities: a network analysis dissemination of research results: on the path to practice change the direct and indirect impact of culture on innovation comparison of scimago journal rank indicator with journal impact factor comparison of pubmed, scopus, web of science, and google scholar: strengths and weaknesses the focused organization of social ties traditional medicine: past, present and future research and development prospects and integration in the national health system of cameroon collaborating with people like me: ethnic co-authorship within the united states quality of ingredients used in ayurvedic herbal preparations the new production of knowledge: the dynamics of science and research in contemporary societies locating sources in humanities scholarship: the efficacy of following bibliographic references moving the science of team science forward: collaboration and creativity phenomenological turbulence and innovation in knowledge systems a glimpse of ayurveda-the forgotten history and principles of indian traditional medicine ayurveda research publications: a serious concern keeping the doctor in the loop: ayurvedic pharmaceuticals in kerala 
ethnic segregation of friendship networks in school: testing a rational-choice argument of differences in ethnic homophily between classroom-and grade-level networks the importance of proper citation of references in biomedical articles who owns traditional knowledge? pubmed: bridging the information gap doctoring traditions: ayurveda, small technologies, and braided sciences scientific collaboration networks. i. network construction and fundamental results traditional knowledge systems, international law and national challenges: marginalization or emancipation? time for evidence-based ayurveda: a clarion call for action ayurveda education reforms in india does language homophily affect migrant consumers' service usage intentions? indian systems of medicine: a brief profile traditional indigenous approaches to healing and the modern welfare of traditional knowledge, spirituality and lands: a critical reflection on practices and policies taken from the canadian indigenous example physicians of colonial india ( - ) ayurveda at the crossroads of care and cure a compilation of bio-active compounds from ayurveda quantity and/or quality? the importance of publishing many papers friendship in school: gender and racial homophily becoming a traditional medicinal plant healer: divergent views of practicing and young healers on traditional medicinal plant knowledge skills in india covid- : combining antiviral and anti-inflammatory treatments preparing research articles finding knowledge paths among scientific disciplines the traditional medicine and modern medicine from natural products minerva unbound: knowledge stocks, knowledge flows and new knowledge production acknowledgements we thank rinni sharma (doctoral student at uppsala university, sweden) for her assistance in data collection for this paper. we also thank dr. binoy goswami at the faculty of economics, south asian university for giving us comments on the questions in our primary survey.funding no external funding was received for our paper. ethical statement this article does not contain any studies with human participants or animals performed by any of the authors. see table . key: cord- -kqfyasmu authors: tagore, somnath title: epidemic models: their spread, analysis and invasions in scale-free networks date: - - journal: propagation phenomena in real world networks doi: . / - - - - _ sha: doc_id: cord_uid: kqfyasmu the mission of this chapter is to introduce the concept of epidemic outbursts in network structures, especially in case of scale-free networks. the invasion phenomena of epidemics have been of tremendous interest among the scientific community over many years, due to its large scale implementation in real world networks. this chapter seeks to make readers understand the critical issues involved in epidemics such as propagation, spread and their combat which can be further used to design synthetic and robust network architectures. the primary concern in this chapter focuses on the concept of susceptible-infectious-recovered (sir) and susceptible-infectious-susceptible (sis) models with their implementation in scale-free networks, followed by developing strategies for identifying the damage caused in the network. the relevance of this chapter can be understood when methods discussed in this chapter could be related to contemporary networks for improving their performance in terms of robustness. 
the patterns by which epidemics spread through groups are determined by the properties of the pathogen carrying it, the length of its infectious period and its severity, as well as by the network structures within the population. thus, accurately modeling the underlying network is crucial to understanding the spread as well as the prevention of an epidemic. moreover, implementing immunization strategies helps control and terminate these epidemics. for instance, random networks and small worlds display less variation in terms of neighbourhood sizes, whereas spatial networks have poisson-like degree distributions. moreover, as highly connected individuals are of more importance for disease transmission, incorporating them into the current network is of utmost importance [ ] . this is essential for capturing the complexities of disease spread. architecturally, scale-free networks are heterogeneous in nature and can be dynamically constructed by adding new individuals to the current network structure one at a time. this strategy is similar to naturally forming links, especially in the case of social networks. moreover, the newly connected nodes or individuals link to the already existent ones (with larger connections) in a manner that is preferential in nature. this connectivity can be understood by a power-law plot of the number of contacts per individual, a property which is regularly observed in several other networks, such as power grids and the world-wide-web, to name a few [ ] . epidemiologists have worked hard on understanding the heterogeneity of scale-free networks for populations for a long time. highly connected individuals as well as hub participants have played essential roles in the spread and maintenance of infections and diseases. figure . illustrates the architecture of a system consisting of a population of individuals (fig. caption: a synthetic scale-free network and its characteristics). it has several essential components, namely, nodes, links, newly connected nodes, hubs and sub-groups respectively. here, nodes correspond to individuals and their relations are shown as links. similarly, newly connected nodes correspond to those which are recently added to the network, such as the initiation of new relations between already existing and unknown individuals [ ] . hubs are those nodes which are highly connected, such as individuals who are very popular among others and have many relations and/or friends. lastly, sub-groups correspond to certain sections of the population which have individuals with closely associated relationships, such as groups of nodes which are highly dense in nature, or have a high clustering coefficient. furthermore, having a large number of contacts is important, as such individuals are at greater risk of infection and, once infected, can transmit it to others. for instance, hubs of such high-risk individuals help in maintaining sexually transmitted diseases (stds) in populations where the majority belong to long-term monogamous relationships, whereas in the case of the sars epidemic, a significant proportion of all infections were due to highly connected, high-risk individuals. furthermore, the preferential attachment model proposed by barabási and albert [ ] implies that, when individuals with very large connectivity exist, random vaccination cannot be relied upon for preventing epidemics. moreover, if there is an upper limit on the connectivity of individuals, random immunization can be performed to control infection.
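a brief sketch of the preferential-attachment construction just described: networkx's barabási–albert generator grows the network by attaching each new node to m existing nodes with probability proportional to their degree, after which the highest-degree nodes can be read off as hubs. the parameter values are arbitrary illustrative choices.
```python
# minimal sketch: grow a scale-free network by preferential attachment and
# inspect its hubs and heavy-tailed degree distribution.
import networkx as nx
from collections import Counter

G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)

hubs = sorted(G.degree(), key=lambda nd: nd[1], reverse=True)[:5]
print("five largest hubs (node, degree):", hubs)

# heavy tail: fraction of nodes exceeding several degree thresholds
hist = Counter(d for _, d in G.degree())
for k in (3, 10, 30, 100):
    frac = sum(v for d, v in hist.items() if d >= k) / G.number_of_nodes()
    print(f"fraction of nodes with degree >= {k}: {frac:.4f}")
```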
likewise, the dynamics of infectious diseases have been extensively studied in scale-free as well as small-world and random networks. in small-world networks, most of the nodes may not be direct neighbors, but can be reached from all other nodes via a small number of hops, i.e., the number of intermediate nodes between the start and terminating nodes. also, in these networks the distance, dist, between two random nodes increases proportionally to the logarithm of the number of nodes, tot, in the network [ ] , i.e., dist ∝ log tot. watts and strogatz [ ] identified a class of small-world networks and categorized them as a class of random graphs. these were classified on the basis of two independent features, namely, average shortest path length and clustering coefficient. as per the erdős-rényi model, random graphs have a small average shortest path length and a small clustering coefficient. watts and strogatz, on the other hand, demonstrated that various real-world networks have a small average shortest path length along with a clustering coefficient greater than expected at random. it has been observed that it is difficult to block and/or terminate an epidemic in scale-free networks with slowly decaying tails. this is especially the case when correlations among infections and individuals in the network are absent. another reason for this effect is the presence of hubs, where infections can be sustained and can be reduced by target-specific selections [ ] . it is well known that real-world networks ranging from social to computer networks are scale-free in nature, whose degree distribution follows an asymptotic power law. these are characterized by a degree distribution following a power law, p(conn) ≈ conn^(−η), where conn is the number of connections of an individual and η is an exponent. barabási and albert [ ] analyzed the topology of a portion of the world-wide-web and identified 'hubs'. these terminals had a larger number of connections than others and the whole network followed a power-law distribution. they also found that these networks have heavy-tailed degree distributions and thus termed them 'scale-free'. likewise, models for epidemic spread in static heavy-tailed networks have illustrated that a degree distribution with finite moments results in lower prevalence and/or termination for smaller rates of infection [ ] . moreover, only beyond a particular threshold of the infection rate does this prevalence turn non-zero. similarly, it has been seen that for networks following a power law, such a threshold does not exist and the prevalence is non-zero for any infection rate. due to this reason, epidemics are difficult to handle and terminate in static networks having power-law degree distributions. likewise, in various instances, networks are not static but dynamic (i.e., they evolve in time) via some rewiring processes, in which edges are detached and reattached according to some dynamic rule. steady states of rewiring networks have been studied in the past. more often, it has been observed that, depending on the average connectivity and rewiring rates, networks reach a scale-free steady state, with an exponent η that can be expressed in terms of the dynamical rates [ ] . the study of epidemics has always been of interest in areas where biological applications coincide with social issues. for instance, epidemics like influenza, measles, and stds can pass through large groups of individuals and populations, and/or persist over longer timescales at low levels. these might even experience sudden changes of increasing and decreasing prevalence.
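the small-world property described above (short path lengths growing roughly like log tot together with high clustering) can be checked numerically. the sketch below compares a watts–strogatz graph with an erdős–rényi graph of similar size and density; the sizes and rewiring probability are arbitrary illustrative values.
```python
# minimal sketch: contrast small-world (watts-strogatz) and random
# (erdos-renyi) graphs on mean shortest path and clustering coefficient.
import networkx as nx

n, k, p_rewire = 1000, 10, 0.1
ws = nx.connected_watts_strogatz_graph(n, k, p_rewire, seed=1)
er = nx.erdos_renyi_graph(n, k / (n - 1), seed=1)

# use the giant component of the ER graph in case it is disconnected
er_giant = er.subgraph(max(nx.connected_components(er), key=len))

for name, g in (("watts-strogatz", ws), ("erdos-renyi", er_giant)):
    print(name,
          "avg shortest path:", round(nx.average_shortest_path_length(g), 2),
          "clustering:", round(nx.average_clustering(g), 3))
```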
furthermore, in some cases, single infection outbreaks may have significant effects on a complete population group [ ] . epidemic spreading can also occur on complex networks, with vertices representing individuals and the links representing interactions among individuals. thus, the spreading of diseases can occur over a network of individuals just as the spreading of computer viruses occurs over the world-wide-web. the underlying network in epidemic models is considered to be static, while the individual states vary between infected and non-infected according to certain probabilistic rules. furthermore, the evolution of an infected group of individuals in time can be studied by focusing on the average density of infected individuals in the steady state. lastly, the spread as well as growth of epidemics can also be monitored by studying the architecture of the network of individuals as well as its statistical properties [ ] . one of the essential properties of epidemic spread is its branching pattern, thereby infecting healthy individuals over a time period. this branching pattern of epidemic progression can be classified on the basis of infection initiation, spread and further spread ( fig. ) [ ] . infection initiation: if an infected individual comes in contact with a group of individuals, the infection is transmitted to each with a probability p, independently of one another. furthermore, if the same individual meets k others while being infected, these k individuals form the infected set. due to this random disease transmission from the initially infected individual, those directly connected to it get infected. if an infection in a branching process reaches an individual set and fails to infect healthy individuals, then termination of the infection occurs, which leads to no further progression and infection of other healthy individuals. thus, there may be two possibilities for an infection in a branching process model: either it reaches a site infecting no further and terminates out, or it continues to infect healthy individuals through contact processes. the quantity which can be used to identify whether an infection persists or fades out is defined as the basic reproductive number [ ] . this basic reproductive number, τ, is the expected number of newly infected individuals caused by a single already infected individual. in the case where every individual meets k new people and infects each with probability p, the basic reproductive number is represented as τ = pk. it is quite essential as it helps in identifying whether or not an infection can spread through a population of healthy individuals. the concept of τ was first proposed by alfred lotka, and applied in the area of epidemiology by macdonald [ ] . for non-complex population models, τ can be identified if information on the 'death rate' is present; thus, the death rate, d, and the birth rate, b, can be considered at the same time. moreover, τ can also be used to determine whether an infection will terminate, i.e., τ < 1, or becomes an epidemic, i.e., τ > 1. but it cannot be used for comparing different infections at the same time on the basis of multiple parameters. several methods, such as identifying eigenvalues, the jacobian matrix, birth rates, equilibrium states and population statistics, can be used to analyze and handle τ [ ] . there are some standard branching models that exist for analyzing the progress of infection in a healthy population or network.
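before the standard branching models are described below, a sketch of the generic branching process under the assumption that every infected individual meets exactly k others and infects each independently with probability p, so that τ = pk: repeated simulation shows the infection almost always dying out when τ < 1 and frequently persisting when τ > 1. the generation cap and trial counts are arbitrary.
```python
# minimal sketch: branching-process epidemic with tau = p * k; estimate how
# often the infection dies out within a fixed number of generations.
import numpy as np

def branching_dies_out(p, k, rng, max_generations=30, cap=1_000_000):
    infected = 1
    for _ in range(max_generations):
        if infected == 0:
            return True
        if infected > cap:
            return False          # effectively an established epidemic
        # each infected individual meets k others, infecting each with prob p
        infected = rng.binomial(infected * k, p)
    return infected == 0

rng = np.random.default_rng(0)
for p, k in ((0.04, 20), (0.08, 20)):      # tau = 0.8 and tau = 1.6
    extinct = np.mean([branching_dies_out(p, k, rng) for _ in range(1000)])
    print(f"tau = {p * k:.1f}: fraction of runs extinct = {extinct:.2f}")
```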
the first one, reed-frost model, considers a homogeneous close set consisting of total number of individuals, tot. let num designate the number of individuals susceptible to infection at time t = and m num the number of individuals infected by the infection at any time t [ ] . here, here, eq. . is in case of a smaller population. it is assumed that an individual x is infected at time t, whereas any individual y comes in contact with x with a probability a num , where a > . likewise, if y is susceptible to infection then it becomes infected at time t + and x is removed from the population ( fig. . a ). in this figure, x or v ( * ) represents the infection start site, y(v ), v are individuals that are susceptible to infection, num = , tot = , and m num = . the second one, -clique model constructs a -clique sub-network randomly by assigning a set of tot individuals. here, for individual/vertex pair (v i , v j ) with probability p , the pair is included along with vertices triples here, g , g are two independent graphs, where g is a bernoulli graph with edge probability p and g with all possible triangles existing independently with a probability p ( fig. . b ). in this figure, ) are the three -clique sub-networks with tot = , and g = g g g respectively [ ] . the third one, household model assumes that for a given a set of tot individuals or vertices, g is a bernoulli graph consisting of tot b disjoint b−cliques, where b tot with edge probability p . thus, the network g is formed as the superposition of the graphs g and g , i.e., g = g g . moreover, g fragments the population into mutually exclusive groups whereas g describes the relations among individuals in the population. thus, g does not allow any infection spread, as there are no connections between the groups. but, when the relationship structure g is added, the groups are linked together and the infection can now spread using relationship connections ( fig. . c ). in this figure, tot = where the individuals (v to v ) are linked on the basis of randomly assigned p and b = tot = . the fig. . b-d respectively [ ] . thus, it is essential to identify the conditions which results in an epidemic spread in one network, with the presence of minimal isolated infections on other network components. moreover, depending on the parameters of individual sub-networks and their internal connectivities, connecting them to one another creates marginal effect on the spread of epidemic. thus, identifying these conditions resulting in analyzing spread of epidemic process is very essential. in this case, two different interconnected network modules can be determined, namely, strongly and weakly coupled. in the strongly coupled one, all modules are simultaneously either infection free or part of an epidemic, whereas in the weakly coupled one a new mixed phase exists, where the infection is epidemic on only one module, and not in others [ ] . generally, epidemic models consider contact networks to be static in nature, where all links are existent throughout the infection course. moreover, a property of infection is that these are contagious and spread at a rate faster than the initially infected contact. but, in cases like hiv, which spreads through a population over longer time scales, the course of infection spread is heavily dependent on the properties of the contact individuals. the reason for this being, certain individuals may have lesser contacts at any single point in time and their identities can shift significantly with the infection progress [ ] . 
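returning to the reed-frost model described above, the sketch below runs the standard chain-binomial recursion in which each susceptible independently escapes infection from every current case; the per-contact probability and population size are illustrative, and this simplified homogeneous version ignores the 3-clique and household variants.
```python
# minimal sketch: reed-frost chain-binomial epidemic in a closed, homogeneous
# population. q is the probability that all current infectives fail to infect
# a given susceptible in one time step.
import random

def reed_frost(susceptible, infected, p_contact, steps=30, seed=0):
    rng = random.Random(seed)
    history = [(susceptible, infected)]
    for _ in range(steps):
        if infected == 0:
            break
        q = (1.0 - p_contact) ** infected            # escape all current infectives
        new_cases = sum(1 for _ in range(susceptible) if rng.random() > q)
        susceptible -= new_cases
        infected = new_cases                         # previous cases are removed
        history.append((susceptible, infected))
    return history

trajectory = reed_frost(susceptible=99, infected=1, p_contact=0.02)
print("(susceptible, infected) by generation:", trajectory)
```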
thus, for modeling the contact network in such infections, transient contacts are considered which may not last through the whole epidemic course, but only for particular amount of time. in such cases, it is assumed that the contact links are undirected. furthermore, different individual timings do not affect those having potential to spread an infection but the timing pattern also influences the severity of the overall epidemic spread. similarly, individuals may also be involved in concurrent partnerships having two or more actively involved ones that overlap in time. thus, the concurrent pattern causes the infection to circulate vigorously through the network [ ] . in the last decade, considerable amount of work has been done in characterizing as well as analyzing and understanding the topological properties of networks. it has been established that scale-free behavior is one of the most fundamental concepts for understanding the organization various real-world networks. this scale-free property has a resounding effect on all aspect of dynamic processes in the network, which includes percolation. likewise, for a wide range of scale-free networks, epidemic threshold is not existent, and infections with low spreading rate prevail over the entire population [ ] . furthermore, properties of networks such as topological fractality etc. correlate to many aspects of the network structure and function. also, some of the recent developments have shown that the correlation between degree and betweenness centrality of individuals is extremely weak in fractal network models in comparison with non-fractal models [ ] . likewise, it is seen that fractal scale-free networks are dis-assortative, making such scale-free networks more robust against targeted perturbations on hubs nodes. moreover, one can also relate fractality to infection dynamics in case of specifically designed deterministic networks. deterministic networks allow computing functional, structural as well as topological properties. similarly, in case of complex networks, determination of topological characteristics has shown that these are scale-free as well as highly clustered, but do not display small-world features. also, by mapping a standard susceptible, infected, recovered (sir) model to a percolation problem, one can also find that there exists certain finite epidemic threshold. in certain cases, the transmission rate needs to exceed a critical value for the infection to spread and prevail. this also specifies that the fractal networks are robust to infections [ ] . meanwhile, scale-free networks exhibit various essential characteristics such as power-law degree distribution, large clustering coefficient, large-world phenomenon, to name a few [ ] . network analysis can be used to describe the evolution and spread of information in the populations along with understanding their internal dynamics and architecture. specifically, importance should be given to the nature of connections, and whether a relationship between x and y individuals provide a relationship between y and x as well. likewise, this information could be further utilized for identifying transitivitybased measures of cohesion ( fig. . ). meanwhile, research in networks also provide some quantitative tools for describing and characterizing networks. degree of a vertex is the number of connectivities for each vertex in the form of links. for instance, degree(v ) = , degree(v ) = (for undirected graph (fig. . a) ). similarly for fig. 
likewise, shortest path is the minimum number of links that needs to be parsed for traveling between two vertices. for instance, in fig. diameter of network is the maximum distance between any two vertices or the longest of the shortest walks. thus, in fig. [ ] . radius of a network is the minimum eccentricity (eccentricity of a vertex v i is the greatest geodesic distance), i.e., distance between two vertices in a network is the number of edges in a shortest path connecting them between v i and any other vertex of any vertex. for instance, in fig. . b, radius of network = . betweenness centrality (g(v i )) is equal to the number of shortest paths from all vertices to all others that pass through vertex v i , i.e., is the number of those paths that pass through v i . thus, in fig. similarly, closeness centrality (c(v i )) of a vertex v i describes the total distance of v i to all other vertices in the network, i.e., sum the shortest paths of v i to all other vertices in the network. for instance, in fig. . b, c( lastly, stress centrality (s(v i )) is the simple accumulation of the number of shortest paths between all vertex pairs, sometimes interchangeable with betweenness centrality [ ] . use of 'adjacency matrix', a v i v j , describing the connections within a population is also persistent. likewise, various network quantities can be ascertained from the adjacency matrix. for instance, size of a population is defined as the average number of contacts per individual, i.e., the powers of adjacency matrix can be used to calculate measures of transitivity [ ] . one of the key pre-requisites of network analysis is initial data collection. for performing a complete mixing network analysis for individuals residing in a population, every relationship information is essential. this data provides great difficulty in handling the entire population, as well as handling complicated network evaluation issues. the reason being, individuals have contacts, and recall problems are quite probable. moreover, evaluation of contacts requires certain information which may not always be readily present. likewise, in case of epidemiological networks, connections are included if they explain relationships capable of permitting the transfer of infection. but, in most of the cases, clarity of defining such relations is absent. thus, various types of relationships bestow risks and judgments that needs to be sorted for understanding likely transmission routes. one can also consider weighted networks in which links are not merely present or absent but are given scores or weights according to their strength [ ] . furthermore, different infections are passed by different routes, and a mixing network is infection specific. for instance, a network used in hiv transmission is different from the one used to examine influenza. similarly, in case of airborne infections like influenza and measles, various networks need to be considered because differing levels of interaction are required to constitute a contact. the problems with network definition and measurement imply that any mixing networks that are obtained will depend on the assumptions and protocols of the data collection process. three main standard techniques can be employed to gather such information, namely, infection searching, complete contact searching and diary-based studies [ ] . after an epidemic spread, major emphasis is laid on determining the source and spread of infection. 
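the descriptive measures defined above (degree, shortest path, diameter, radius, betweenness and closeness centrality) and the adjacency-matrix quantities can be computed directly; the small example graph below is arbitrary and stands in for the figures referenced in the text.
```python
# minimal sketch: standard network metrics on a small illustrative graph.
import networkx as nx

G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)])

print("degrees:", dict(G.degree()))
print("shortest path 1 -> 6:", nx.shortest_path_length(G, 1, 6))
print("diameter:", nx.diameter(G), " radius:", nx.radius(G))
print("betweenness centrality:", nx.betweenness_centrality(G))
print("closeness centrality:", nx.closeness_centrality(G))

# adjacency-matrix view: average number of contacts per individual
A = nx.to_numpy_array(G)
print("average degree from adjacency matrix:", A.sum() / A.shape[0])
```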
thus, each infected individual is linked to one other from whom infection is spread as well as from whom the infection is transmitted. as all connections represent actual transmission events, infection searching methods do not suffer from problems with the link definition, but interactions not responsible for this infection transmission are removed. thus, the networks observed are of closed architecture, without any loops, walks, cliques and complete sub-graphs [ ] . infection searching is a preliminary method for infectious diseases with low prevalence. these can also be simulated using several mathematical techniques based on differential equations, control theories etc., assuming a homogeneous mixing of population. it can also be simulated in a manner so that infected individuals are identified and cured at a rate proportional to the number of neighbors it has, analogous to the infection process. but, it does not allow to compare various infection searching budgets and thus a discrete-event simulation need to be undertaken. moreover, a number of studies have shown that analyses based on realistic models of disease transmission in healthy networks yields significant projections of infection spread than projections created using compartmental models [ ] . furthermore, depending on the number of contacts for any infected individuals, their susceptible neighbors are traced and removed. this is followed by identifying infection searching techniques that yields different numbers of newly infected individuals on the spread of the disease. contact searching identifies potential transmission contacts from an initially infected individual by revealing some new individual set who are prone to infection and can be subject of further searching effort. nevertheless, it suffers from network definition issues; is time consuming and depends on complete information about individuals and their relationships. it has been used as a control strategy, in case of stds. its main objective of contact searching is identifying asymptomatically infected individuals who are either treated or quarantined. complete contact searching deals with identifying the susceptible and/or infected individuals of already infected ones and conducting simulations and/or testing them for degree of infection spread, treating them as well as searching their neighbors for immunization. for instance, stds have been found to be difficult for immunization. the reason being, these have specifically long asymptomatic periods, during which the virus can replicate and the infection is transmitted to healthy, closely related neighbors. this is rapidly followed by severe effects, ultimately leading to the termination of the affected individual. likewise, recognizing these infections as global epidemic has led to the development of treatments that allow them to be managed by suppressing the replication of the infection for as long as possible. thus, complete contact searching act as an essential strategy even in case when the infection seems incurable [ ] . diary-based studies consider individuals recording contacts as they occur and allow a larger number of individuals to be sampled in detail. thus, this variation from the population approach of other tracing methods to the individual-level scale is possible. but, this approach suffers from several disadvantages. 
for instance, the data collection is at the discretion of the subjects and is difficult for researchers to link this information into a comprehensive network, as the individual identifies contacts that are not uniquely recorded [ ] . diary-based studies require the individuals to be part of some coherent group, residing in small communities. also, it is quite probable that this kind of a study may result in a large number of disconnected sub-groups, with each of them representing some locally connected set of individuals. diary-based studies can be beneficial in case of identifying infected and susceptible individuals as well as the degree of infectivity. these also provide a comprehensive network for diseases that spread by point-to-point contact and can be used to investigate the patterns infection spread. robustness is an essential connectivity property of power-law graph. it defines that power-law graphs are robust under random attack, but vulnerable under targeted attack. recent studies have shown that the robustness of power-law graph under random and targeted attacks are simulated displaying that power-law graphs are very robust under random errors but vulnerable when a small fraction of high degree vertices or links are removed. furthermore, some studies have also shown that if vertices are deleted at random, then as long as any positive proportion remains, the graph induced on the remaining vertices has a component of order of the total number of vertices [ ] . many a times it can be observed that a network of individuals may be subject to sudden change in the internal and/or external environment, due to some perturbation events. for this reason, a balance needs to be maintained against perturbations while being adaptable in the presence of changes, a property known as robustness. studies on the topological and functional properties of such networks have achieved some progress, but still have limited understanding of their robustness. furthermore, more important a path is, higher is the chance to have a backup path. thus, removing a link or an individual from any sub-network may also lead to blocking the information flow within that sub-network. the robustness of a model can also be assessed by means of altering the various parameters and components associated with forming a particular link. robustness of a network can also be studied with respect to 'resilience', a method of analyzing the sensitivities of internal constituents under external perturbation, that may be random or targeted in nature [ ] . basic disease models discuss the number of individuals in a population that are susceptible, infected and/or recovered from a particular infection. for this purpose, various differential equation based models have been used to simulate the events of action during the infection spread. in this scenario, various details of the infection progression are neglected, along with the difference in response between individuals. models of infections can be categorized as sir and susceptible, infected, susceptible (sis) [ ] . the sir model considers individuals to have long-lasting immunity, and divides the population into those susceptible to the disease (s), infected (i) and recovered (r). thus, the total number of individuals (t ) considered in the population is the transition rate from s to i is κ and the recovery rate from i to r is ρ . 
thus, the sir model can be represented as ds/dt = −κ s i, di/dt = κ s i − ρ i, dr/dt = ρ i. likewise, the reproductivity (θ) of an infection can be identified as the average number of secondary instances a typical single infected instance will cause in a population with no immunity. it determines whether an infection spreads through a population; if θ < 1, the infection terminates in the long run; if θ > 1, the infection spreads in the population. the larger the value of θ, the more difficult it is to control the epidemic [ ] . furthermore, the proportion of the population that needs to be immunized can be calculated as 1 − 1/θ, from which the state known as endemic stability can be identified. depending upon these instances, immunization strategies can be initiated [ ] . although the contact network in a general sir model can be arbitrarily complex, the infection dynamics can still be studied as well as modeled in a simple fashion. contagion probabilities are set to a uniform value, i.e., p, and contagiousness has a kind of 'on-off' property, i.e., an individual is equally contagious for each of the t i steps while it has the infection. one can extend the idea that contagion is more likely between certain pairs of individuals or vertices by assigning a separate probability p(v i , v j ) to each pair of individuals or vertices v i and v j , for which v i is linked to v j in a directed contact network. likewise, other extensions of the contact model involve separating the i state into a sequence of early, middle, and late periods of the infection. for instance, it could be used to model an infection with a highly contagious incubation period, followed by a less contagious period while symptoms are being expressed [ ] . in most of the cases, sir epidemics are thought of as dynamic processes, in which the network state evolves step-by-step over time. this captures the temporal dynamics of the infection as it spreads through a population. the sir model has been found to be suitable for infections which provide lifelong immunity, like measles. in this case, a property termed the force of infection is existent, which is a function of the number of infectious individuals. it also contains information about the interactions between individuals that lead to the transmission of infection. one can also have a static view of the epidemics, i.e., the sir model for t i = 1. this means that, considering a point in an sir epidemic when a vertex v i has just become infectious, it has one chance to infect v j (since t i = 1), with probability p. one can visualize the outcome of this probabilistic process and also assume that for each edge in the contact network, a probability signifying the relationship is identified. the sis model can be represented as ds/dt = −κ s i + ρ i, di/dt = κ s i − ρ i; the removed state is absent in this case. moreover, after a vertex is over with the infectious state, it reverts back to the susceptible state and is ready to initiate the infection again. due to this alternation between the s and i states, the model is referred to as the sis model. the mechanics of the sis model can be discussed as follows [ ] : (1) at the initial stage, some vertices remain in the i state and all others are in the s state; (2) each vertex v i that enters the i state remains infected for a certain number of steps t i ; (3) during each of these t i steps, v i has a probability p of passing the infection to each of its susceptible directly linked neighbors; (4) after t i steps, v i no longer remains infected, and returns back to the s state.
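a sketch of the compartmental dynamics discussed above, written in the standard mass-action form ds/dt = −κ s i, di/dt = κ s i − ρ i, dr/dt = ρ i with s + i + r normalized to 1 (this particular form is an assumption); the reproductivity is then θ = κ/ρ, and 1 − 1/θ gives the immunization level needed for control. parameter values are illustrative.
```python
# minimal sketch: deterministic sir dynamics with transmission rate kappa and
# recovery rate rho, integrated with scipy (illustrative parameters).
from scipy.integrate import solve_ivp

kappa, rho = 0.3, 0.1          # theta = kappa / rho = 3

def sir(t, y):
    s, i, r = y
    return [-kappa * s * i, kappa * s * i - rho * i, rho * i]

sol = solve_ivp(sir, (0, 200), [0.99, 0.01, 0.0])
s_end, i_end, r_end = sol.y[:, -1]

theta = kappa / rho
print(f"theta = {theta:.1f}, critical immunization fraction = {1 - 1/theta:.2f}")
print(f"final epidemic size (fraction recovered): {r_end:.2f}")
```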
the sis model is predominantly used for simulating and understanding the progress of stds, where repeat infections are existent, like gonorrhoea. moreover, certain assumptions with regard to random mixing between individuals within each pair of sub-networks are present. in this scenario, the number of neighbors for each individual is considerably smaller than the total population size. such models generally avoid random-mixing assumptions by assigning each individual to a specific set of contacts that they can infect. an sis epidemic can run for a long time duration, as it can cycle through the vertices multiple times. if at any time during the sis epidemic all vertices are simultaneously free of the infection, then the epidemic terminates forever, the reason being that no infected individuals exist that can pass the infection to others. in case the network is finite in nature, a stage would arise when all attempts for further infection of healthy individuals would simultaneously fail for t i steps in a row. likewise, for contact networks where the structure is mathematically tractable, a particular critical value of the contagion probability p exists at which an sis epidemic undergoes a rapid shift from one that terminates quickly to one that persists for a long time. in this case, the critical value of the contagion probability depends on the structure of the problem set [ ] . the patterns by which epidemics spread through vertex groups are determined by the properties of the pathogen, the length of its infectious period, its severity and the network structures. the paths for an infection spread are given by a population state, with the existence of direct contacts between the individuals or vertices. the functioning of a network system depends on the nature of interaction between its individuals. this is essentially because of the effect of infection-causing individuals and the topology of networks. to analyze the complexity of epidemics, it is important to understand the underlying principles of its distribution in the history of its existence. in recent years it has been seen that the study of disease dynamics in social networks is relevant to the spread of viruses and the nature of diseases [ ] . moreover, the pathogen and the network are closely intertwined: even within the same group of individuals, the contact networks for two different infections have different structures. this depends on the respective modes of transmission of the infections. for instance, for a highly contagious infection involving airborne transmission, the contact network includes a huge number of links, including any pair of individuals that are in contact with one another. likewise, for an infection requiring close contact, the contact network is much sparser, with fewer pairs of individuals connected by links [ ] . immunization is a site percolation problem where each immunized individual is considered to be a site which is removed from the infected network. its aim is to shift the percolation threshold in a way that leads to minimization of the number of infected individuals. the model of sir with immunization is regarded as a site-bond percolation model, and immunization is considered successful if the infected fraction of the network is below a predefined percolation threshold. furthermore, immunizing randomly selected individuals requires targeting a large fraction, frac, of the entire population. for instance, some infections require - % immunization.
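the discrete sis mechanics listed above (a vertex stays infectious for t i steps, infects each susceptible neighbour with probability p per step, then returns to the susceptible state) can be simulated directly on a contact graph; the graph, p and t i below are arbitrary illustrative choices.
```python
# minimal sketch: discrete-time sis epidemic on a network, following the
# four-step mechanics described in the text (illustrative parameters).
import random
import networkx as nx

def sis_simulation(G, p=0.05, t_i=2, steps=200, seed=0):
    rng = random.Random(seed)
    # remaining[v] > 0 means v stays infected for that many more steps
    remaining = {v: 0 for v in G}
    remaining[rng.choice(list(G))] = t_i          # a single initial infective
    prevalence = []
    for _ in range(steps):
        newly_infected = set()
        for v in G:
            if remaining[v] > 0:                   # v is infectious this step
                for u in G.neighbors(v):
                    if remaining[u] == 0 and rng.random() < p:
                        newly_infected.add(u)
        for v in G:
            if remaining[v] > 0:
                remaining[v] -= 1                  # progress towards recovery to S
        for u in newly_infected:
            remaining[u] = t_i
        prevalence.append(sum(1 for v in G if remaining[v] > 0) / G.number_of_nodes())
    return prevalence

G = nx.barabasi_albert_graph(2000, 3, seed=1)
print("late-time prevalence:", round(sis_simulation(G)[-1], 3))
```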
meanwhile, targetbased immunization of the hubs requires global information about the network in question, rendering it impractical in many cases, which is very difficult in certain cases [ ] . likewise, social networks possess a broad distribution of the number of links, conn, connecting individuals and analyzing them illustrate that that a large fraction, frac, of the individuals need to be immunized before the integrity of the infected network is compromised. this is essentially true for scale-free networks, where p(conn) ≈ conn − η , < η < , where the network remains connected even after removal of most of its individuals or vertices. in this scenario, a random immunization strategy requires that most of the individuals need to be immunized before an epidemic is terminated [ ] . for various infections, it may be difficult to reach a critical level of immunization for terminating the infection. in this case, each individual that is immunized is given immunity against the infection, but also provides protection to other healthy individuals within the population. based on the sir model, one can only achieve half of the critical immunization level which reduces the level of infection in the population by half. a crucial property of immunization is that these strategies are not perfect and being immunized does not always confer immunity. in this case, the critical threshold applies to a portion of the total population that needs to be immunized. for instance, if the immunization fails to generate immunity in a portion, por, of those immunized, then to achieve immunity one needs to immunize a portion here, im denotes immunity strength. thus, in case if por is huge it is difficult to remove infection using this strategy or provides partial immunity. it may also invoke in various manners: the immunization reduces the susceptibility of an individual to a particular infection, may reduce subsequent transmission if the individual becomes infected, or it may increase recovery. such immunization strategies require the immunized individuals to become infected and shift into a separate infected group, after which the critical immunization threshold (s i ) needs to be established. thus, if cil is the number of secondary infected individuals affected by an initial infectious individual, then thus, s i needs to be less than one, else it is not possible to remove the infection. but, one also needs to note that an immunization works equally efficiently if it reduces the transmission or susceptibility and increases the recovery rate. moreover, when the immunization strategy fails to generate any protection in a proportion por of those immunized, the rest −por are fully protected. in this scenario, it can be not possible to remove the infection using random immunization. thus, targeted immunization provides better protection than random-based [ ] . in case of homogenous networks, the average degree, conn, fluctuates less and can assume conn conn, i.e., the number of links are approximately equal to average degree. however, networks can also be heterogeneous. likewise, in a homogeneous network such as a random graph, p(conn) decays faster exponentially whereas for heterogenous networks it decays as a power law for large conn. the effect of heterogeneity on epidemic behavior studied in details for many years for scale-free networks. these studies are mainly concerned with the stationary limit and existence of an endemic phase. 
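the contrast between random and targeted (hub-first) immunization can be treated as a site-percolation experiment: immunized vertices are removed and the size of the largest remaining connected component indicates whether the network can still sustain a large outbreak. the sketch below does this on a synthetic scale-free graph; the immunized fraction is an arbitrary illustration.
```python
# minimal sketch: random versus hub-targeted immunization viewed as site
# percolation (remove immunized nodes, measure the giant component).
import random
import networkx as nx

def giant_fraction(G, removed):
    H = G.copy()
    H.remove_nodes_from(removed)
    if H.number_of_nodes() == 0:
        return 0.0
    giant = max(nx.connected_components(H), key=len)
    return len(giant) / G.number_of_nodes()

random.seed(0)
G = nx.barabasi_albert_graph(5000, 3, seed=1)
n_remove = int(0.20 * G.number_of_nodes())        # immunize 20% of individuals

random_set = random.sample(list(G), n_remove)
hub_set = [v for v, _ in sorted(G.degree(), key=lambda nd: nd[1], reverse=True)][:n_remove]

print("giant component after random immunization :", round(giant_fraction(G, random_set), 3))
print("giant component after targeted immunization:", round(giant_fraction(G, hub_set), 3))
```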
an essential result of this analysis is the expression of the basic reproductive number, which in this case is τ ∝ ⟨conn²⟩/⟨conn⟩. here, τ is proportional to the second moment of the degree, which diverges for increasing network sizes [ ] . it has been noticed that the degree of interconnection between individuals for all forms of networks is quite unprecedented. whereas interconnection increases the spread of information in social networks, it also contributes, as another exhaustively studied area shows, to the spread of infection throughout a healthy network. this rapid spreading is due to the low stringency of its passage through the network. moreover, the initial nature of the sickness and the time of infection are unavailable most of the time, and the only available information is related to the evolution of the sick-reporting process. thus, given complete knowledge of the network topology, the objective is to determine if the infection is an epidemic, or if individuals have become infected via an independent infection mechanism that is external to the network, and not propagated through the connected links. if one considers a computer network undergoing cascading failures due to worm propagation as well as random failures due to misconfiguration, independent of infected nodes, there are two possible causes of the sickness, namely, random and infectious spread. in the case of random sickness, the infection spreads randomly and uniformly over the network, where the network plays no role in spreading the infection; in infectious spread, the infection is caused through a contagion that spreads through the network, with individual nodes being infected by direct neighbors with a certain probability [ ] . in random damage, each individual becomes infected with an independent probability ψ . at time t, each infected individual reports damage with an independent probability ψ . thus, on an average, a fraction ψ of the network reports being infected, where it is already known that social networks possess a broad distribution of the number of links, k, originating from an individual. computer networks, both physical and logical, are also known to possess wide, scale-free distributions. studies of percolation on broad-scale networks display that a large fraction, fc, of the individuals need to be immunized before the integrity of the network is compromised. this is particularly true for scale-free networks, where the percolation threshold tends to 1, and the network remains contagious even after removal of most of its infected individuals [ ] . when the hub individuals are targeted first, removal of just a fraction of these results in the breakdown of the network. this has led to the suggestion of targeted immunization of hubs. to implement this approach, the number of connections of each individual needs to be known. during infection spread, at time 0, a randomly selected individual in the network becomes infected. when a healthy individual becomes infected, a time is set for each outgoing link to an adjacent individual that is not infected, with the expiration time exponentially distributed with unit average. upon expiration of a link's time, the corresponding individual becomes infected, and in turn begins infecting its neighbors [ ] . in general, for an epidemic to occur in a susceptible population the basic reproductive rate must be greater than 1. in many circumstances not all contacts will be susceptible to infection.
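the heterogeneity ratio ⟨conn²⟩/⟨conn⟩ that controls τ in the mean-field expression above can be computed for a scale-free graph and compared with a homogeneous random graph of similar average degree; its growth with network size in the scale-free case reflects the diverging second moment. sizes are illustrative.
```python
# minimal sketch: second-moment-to-mean degree ratio <k^2>/<k> for
# heterogeneous (barabasi-albert) versus homogeneous (erdos-renyi) graphs.
import networkx as nx
import numpy as np

def heterogeneity_ratio(G):
    k = np.array([d for _, d in G.degree()], dtype=float)
    return (k ** 2).mean() / k.mean()

for n in (1_000, 10_000, 50_000):
    ba = nx.barabasi_albert_graph(n, 3, seed=1)
    er = nx.fast_gnp_random_graph(n, 6 / (n - 1), seed=1)
    print(f"n={n}: BA ratio={heterogeneity_ratio(ba):.1f}, "
          f"ER ratio={heterogeneity_ratio(er):.1f}")
```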
in such circumstances, some contacts remain immune, due to prior infection which may have conferred life-long immunity, or due to some previous immunization. therefore, not all individuals are infected and the average number of secondary infections decreases. the epidemic threshold in this case is the number of susceptible individuals within a population that is required for an epidemic to occur. similarly, herd immunity is the proportion of the population immune to a particular infection. if this proportion is achieved through immunization, then each case leads to fewer than one new case and the infection cannot remain established within the population [ ] . one of the simplest immunization procedures consists of the random introduction of immune individuals into the population to achieve a uniform immunization density. in this case, for a fixed spreading rate, ξ, the relevant control parameter is the density of immune individuals present in the network, the immunity, imm. at the mean-field level, the presence of a uniform immunity reduces ξ by a factor 1 − imm, i.e., the probability of identifying and infecting a susceptible and non-immune individual becomes ξ(1 − imm). for homogeneous networks, one observes that, for a constant ξ, the stationary prevalence vanishes for imm > imm_c and remains finite for imm ≤ imm_c; here imm_c is the critical immunization value, above which the density of infected individuals in the stationary state is null, and it depends on ξ. thus, for a uniform immunization level larger than imm_c, the network is completely protected and no large epidemic outbreaks are possible. on the contrary, uniform immunization strategies on scale-free heterogeneous networks are totally ineffective: the presence of uniform immunization only locally depresses the infection's prevalence for any value of ξ, and it is difficult to identify any critical fraction of immunized individuals that ensures the eradication of the infection [ ] . cascading, or epidemic, processes are those where the actions, infections, or failures of certain individuals increase the susceptibility of others. this results in the successive spread of infections from a small set of initially infected individuals to a larger set. initially developed as a way to study human disease propagation, cascades are useful models in a wide range of applications. the vast majority of work on cascading processes has focused on understanding how the graph structure of the network affects the spread of cascades. one can also focus on several critical issues for understanding the cascading features in a network, for which studying the architecture of the network is crucial [ ] . the standard independent cascade epidemic model assumes that the network is a directed graph g = (v, e); for every directed edge between v_i and v_j, we say v_i is a parent and v_j is a child of the corresponding other vertex. a parent may infect a child along an edge, but the reverse cannot happen. let v(v_i) denote the set of parents of vertex v_i, which for convenience includes v_i itself. epidemics proceed in discrete time, where all vertices are initially in the susceptible state. initially, each vertex independently becomes active with probability p_init. this set of initially active vertices is called the 'seeds'. in each time step, each active vertex probabilistically infects its susceptible children; if vertex v_i is active at time t, it infects each susceptible child v_j with probability p_{v_i v_j}, independently.
correspondingly, a vertex v_j that is susceptible at time t becomes active in the next time step, i.e., at time t + 1, if any one of its parents infects it. finally, a vertex remains active for only one time slot, after which it becomes inactive: it does not spread the infection further and cannot be infected again [ ] . thus, this is a kind of sir epidemic, in which some vertices remain forever susceptible because the epidemic never reaches them, while others transition susceptible → active for one time step → inactive. in this chapter, we discussed some critical issues regarding epidemics and their outbursts in static as well as dynamic network structures. we mainly focused on sir and sis models as well as key strategies for identifying the damage caused in networks. we also discussed the various modeling techniques for studying cascading failures. epidemics pass through populations and persist over long time periods. thus, efficient modeling of the underlying network plays a crucial role in understanding the spread and prevention of an epidemic. social, biological, and communication systems can be described as complex networks whose degree distribution follows a power law, p(conn) ≈ conn^(−η), for the number of connections, conn, of individuals, representing scale-free (sf) networks. we also discussed certain issues of epidemic spreading in sf networks characterized by complex topologies, with basic epidemic models describing the proportion of individuals susceptible to, infected with, and recovered from a particular disease. likewise, we explained the significance of the basic reproduction rate of an infection, which can be identified as the average number of secondary instances a typical single infected instance will cause in a population with no immunity. we also explained how determining the complete nature of a network requires knowledge of every individual in a population and their relationships, as the problems with network definition and measurement depend on the assumptions of the data collection processes. nevertheless, we also illustrated the importance of invasion resistance methods, with temporary immunity generating oscillations in localized parts of the network, with certain patches following large numbers of infections in concentrated areas. similarly, we explained the significance of the two kinds of damage: random, where the damage spreads randomly and uniformly over the network and the network plays no role in spreading the damage; and infectious spread, where the damage spreads through the network, with one node infecting others with some probability.
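a minimal sketch of the independent cascade dynamics just described is given below: seeds activate with probability p_init, each active vertex gets a single time step to infect its susceptible children along outgoing edges, and then becomes inactive. the uniform per-edge probability used here is an assumption for illustration; the model allows a distinct p_{v_i v_j} on every edge, and the graph is just a random directed graph rather than any network from the text.

```python
# Minimal sketch of the independent cascade model: discrete time, seeds chosen
# with probability p_init, each active vertex infects susceptible children
# with probability p_edge for one time step, then becomes permanently inactive.
import random
import networkx as nx

def independent_cascade(g, p_init=0.01, p_edge=0.1, rng=None):
    rng = rng or random.Random(0)
    susceptible, active, inactive = set(g.nodes), set(), set()
    for v in list(susceptible):                      # seed selection
        if rng.random() < p_init:
            susceptible.discard(v)
            active.add(v)
    while active:
        newly_active = set()
        for v in active:
            for child in g.successors(v):            # a parent may infect a child only
                if child in susceptible and rng.random() < p_edge:
                    newly_active.add(child)
        inactive |= active                            # active for exactly one time slot
        susceptible -= newly_active
        active = newly_active
    return inactive                                   # every vertex the cascade reached

g = nx.gnp_random_graph(2000, 0.002, directed=True, seed=1)
reached = independent_cascade(g)
print(f"cascade reached {len(reached)} of {g.number_of_nodes()} vertices")
```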
infectious diseases of humans: dynamics and control
the mathematical theory of infectious diseases and its applications
a forest-fire model and some thoughts on turbulence
emergence of scaling in random networks
mathematical models used in the study of infectious diseases
spread of epidemic disease on networks
networks and epidemic models
network-based analysis of stochastic sir epidemic models with random and proportionate mixing
elements of mathematical ecology
intelligent information and database systems
propagation phenomenon in complex networks: theory and practice
relation between birth rates and death rates
the analysis of malaria epidemics
graph theory and networks in biology
mathematical biology
spread of epidemic disease on networks
the use of mathematical models in the epidemiology study of infectious diseases and in the design of mass vaccination programmes
forest-fire as a model for the dynamics of disease epidemics
on the critical behaviour of simple epidemics
sensitivity estimates for nonlinear mathematical models
ensemble modeling of metabolic networks
on analytical approaches to epidemics on networks
computational modeling in systems biology
collective dynamics of 'small-world' networks
unifying wildfire models from ecology and statistical physics

key: cord- -cq z jt authors: han, jungmin; cresswell-clay, evan c; periwal, vipul title: statistical physics of epidemic on network predictions for sars-cov- parameters date: - - journal: nan doi: nan sha: doc_id: cord_uid: cq z jt the sars-cov- pandemic has necessitated mitigation efforts around the world. we use only reported deaths in the two weeks after the first death to determine infection parameters, in order to make predictions of hidden variables such as the time dependence of the number of infections. early deaths are sporadic and discrete so the use of network models of epidemic spread is imperative, with the network itself a crucial random variable. location-specific population age distributions and population densities must be taken into account when attempting to fit these events with parametrized models. these characteristics render naive bayesian model comparison impractical as the networks have to be large enough to avoid finite-size effects. we reformulated this problem as the statistical physics of independent location-specific `balls' attached to every model in a six-dimensional lattice of parametrized models by elastic springs, with model-specific `spring constants' determined by the stochasticity of network epidemic simulations for that model. the distribution of balls then determines all bayes posterior expectations. important characteristics of the contagion are determinable: the fraction of infected patients that die ($ . ± . $), the expected period an infected person is contagious ($ ± $ days) and the expected time between the first infection and the first death ($ ± $ days) in the us. the rate of exponential increase in the number of infected individuals is $ . ± . $ per day, corresponding to million infected individuals in one hundred days from a single initial infection, which fell to with even imperfect social distancing effectuated two weeks after the first recorded death. the fraction of compliant socially-distancing individuals matters less than their fraction of social contact reduction for altering the cumulative number of infections. the pandemic caused by the sars-cov- virus has swept across the globe with remarkable rapidity.
the parameters of the infection produced by the virus, such as the infection rate from person-to-person contact, the mortality rate upon infection and the duration of the infectivity period, are still controversial. parameters such as the duration of infectivity, and predictions such as the number of undiagnosed infections, could be useful for shaping public health responses, as the predictive aspects of model simulations are possible guides to pandemic mitigation [ , , ] . in particular, the possible importance of superspreaders should be understood [ ] . the authors of [ ] had the insight that the early deaths in this pandemic could be used to find some characteristics of the contagion that are not directly observable, such as the number of infected individuals. this number is, of course, crucial for public health measures. the problem is that standard epidemic models with differential equations are unable to determine such hidden variables, as explained clearly in [ ] . the early deaths are sporadic and discrete events. these characteristics imply that simulating the epidemic must be done in the context of network models with discrete dynamics for infection spread and death. the first problem that one must contend with is that even rough estimates of the high infection transmission rate and a death rate with strong age dependence imply that one must use large networks for simulations, on the order of nodes, because one must avoid finite-size effects in order to accurately fit the early stochastic events. the second problem that arises is that the contact networks are obviously unknown, so one must treat the network itself as a stochastic random variable, multiplying the computational time by the number of distinct networks that must be simulated for every parameter combination considered. the third problem is that there are several characteristics of sars-cov- infections that must be incorporated in any credible analysis, and the credibility of the analysis requires an unbiased sample of parameter sets. these characteristics are the strong age dependence of mortality of sars-cov- infections and a possible dependence on population density, which should determine network connectivity in an unknown manner. thus the network nodes have to have location-specific population age distributions incorporated as node characteristics, and the network connectivity itself must be a free parameter. an important point in interpreting epidemics on networks is that the simplistic notion that there is a single rate at which an infection is propagated by contact is indefensible. in particular, for the sars-cov- virus, there are reports of infection propagation through a variety of mucosal interfaces, including the eyes. thus, while an infection rate must be included as a parameter in such simulations, there is a range of infection rates that we should consider. indeed, one cannot make sense of network connectivity without taking into account the modes of contact, for instance if an individual is infected during the course of travel on a public transit system or if an individual is infected while working in the emergency room of a hospital. one expects that network connectivity should be inversely correlated with infectivity in models that fit mortality data equally well, but this needs to be demonstrated with data to be credible, not imposed by fiat.
the effective network infectivity, which we define as the product of network connectivity and infection rate, is the parameter that needs to be reduced, either by social distancing measures such as stay-at-home orders or by lowering the infection rate with mask wearing and hand washing. a standard bayesian analysis with these features is computationally intransigent. we therefore adopted a statistical physics approach to the bayesian analysis. we imagined a six-dimensional lattice of models with balls attached to each model with springs. each ball represents a location for which data is available, and each parameter set determines a lattice point. the balls are, obviously, all independent, but they face competing attractions to each lattice point. the spring constants for each model are determined by the variation we find in stochastic simulations of that specific model. one of the dimensions in the lattice of models corresponds to a median age parameter in the model. each location ball is attracted to the point in the median age parameter dimension that best matches that location's median age, and we only have to check that the posterior expectation of the median age parameter for that location's ball is close to the location's actual median age. thus we can decouple the models and the data simulations without having to simulate each model with the characteristics of each location, making the bayesian model comparison amenable to computation. finally, the distribution of location balls over the lattice determines the posterior expectation values of each parameter. we matched the outcomes of our simulations with data on the two-week cumulative death counts after the first death, using bayes' theorem to obtain parameter estimates for the infection dynamics. we used the bayesian model comparison to determine posterior expectation values for parameters for three distinct datasets. finally, we simulated the effects of various partially effective social-distancing measures on random networks and parameter sets given by the posterior expectation values of our bayes model comparison. we used data for the sars-cov- pandemic as compiled by [ ] from the original data. we generated random g(n, p = l/(n − 1)) networks, at either of two values of n, with an average of l links per node, using the python package networkx [ ] . scalepopdens ≡ l is one of the parameters that we varied. we compared the posterior expectation for this parameter for a location with the actual population density in an attempt to predict the appropriate way to incorporate measurable population densities in epidemic-on-network models [ , ] . we used the python epidemics on networks package [ , ] to simulate networks with specific parameter sets. we defined nodes to have status susceptible, infected, recovered or dead. we started each simulation with exactly one infected node, chosen at random. the simulation has two sorts of events: (1) an infected node connected to a susceptible node can change the status of the susceptible node to infected with an infection rate, infrate. this event is network-connectivity dependent; therefore we expect to see a negative or inverse correlation between infrate and scalepopdens. (2) an infected node can transition to recovered status with a recovery rate, recrate, or transition to a dead status with a death rate, deathrate. both these rates are entirely node-autonomous. the reciprocal of the recrate parameter (recdays in the following) is the number of days an individual is contagious.
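the node statuses and the two event types just listed can be sketched in a few lines of python. the paper used the epidemics on networks package with continuous-time rates; the sketch below instead treats the rates as per-day probabilities in a simple discrete-time loop so that it stays self-contained, and all parameter values are illustrative assumptions rather than the posterior estimates reported in the text.

```python
# Discrete-time sketch of susceptible/infected/recovered/dead dynamics on a
# G(n, p = L/(n-1)) network. Rates are approximated as per-day probabilities;
# every numeric value here is an illustrative assumption.
import random
import networkx as nx

def simulate_sird(n=10_000, links_per_node=4, infrate=0.05,
                  recrate=1 / 14, deathrate=0.002, days=100, seed=0):
    rng = random.Random(seed)
    g = nx.gnp_random_graph(n, links_per_node / (n - 1), seed=seed)
    status = {v: "S" for v in g}
    status[rng.randrange(n)] = "I"                   # exactly one initial infection
    cumulative_deaths = []
    for _ in range(days):
        new_status = dict(status)
        for v, s in status.items():
            if s == "I":
                # event 1: infection along edges to susceptible neighbours
                for u in g.neighbors(v):
                    if status[u] == "S" and rng.random() < infrate:
                        new_status[u] = "I"
                # event 2: node-autonomous death or recovery
                r = rng.random()
                if r < deathrate:
                    new_status[v] = "D"
                elif r < deathrate + recrate:
                    new_status[v] = "R"
        status = new_status
        cumulative_deaths.append(sum(1 for s in status.values() if s == "D"))
    return cumulative_deaths

deaths = simulate_sird()
print("cumulative deaths, first 20 days:", deaths[:20])
```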
we assigned an age to each node according to a probability distribution parametrized by the median age of each data set (country or state). as is well known, there is a wide disparity in median ages in different countries. the probability distribution approximately models the triangular shape of the population pyramids observed in demographic studies. we parametrized it as a function of age a in terms of medianage, the median age of a specific country, and a global maximum age, maxage. it is computationally impossible to perform model simulations for the exact age distribution for each location. we circumvented this problem, as detailed in the next subsection (bayes setup), by incorporating a scalemedage parameter in the model, scaled so that scalemedage = . corresponds to a median age of years. the node age is used to make the deathrate of any node age-specific through an age-dependent weight w(a), where a[n] is the age of node n and ageincrease = . is an age-dependence exponent. w(a) is normalized so that ∑_a w(a | ageincrease) p(a | medianage) = 1, using the median age of new york state's population; the value of ageincrease given above was approximately determined by fitting to the observed age-specific mortality statistics of new york state [ ] . however, we included ageincrease as a model parameter, since the strong age dependence of sars-cov- mortality is not well understood, with the normalization adjusted appropriately as a function of ageincrease. note that a decrease in the median age, with all rates and the age-dependence exponent held constant, will lead to a lower number of deaths. we use simulations to find the number of dead nodes as a function of time. the first time at which a death occurs following the initial infection in the network is labeled timefirstdeath. [figure: the posterior expectation of the median age parameter for each location lies close to its actual median age.] we implemented bayes' theorem as usual for the probability of a model, m, given a set of observations; varying the number of days of data used after the first death did not affect our results. as alluded to in the previous subsection, the posterior expectation of the scalemedage parameter (× y) for each location should turn out to be close to the actual median age for each location in our setup, and this was achieved (right column, figure ). we simulated our grid of models on the nih biowulf cluster. our grid comprised × parametrized models, each simulated with random networks, with parameters taken in all possible combinations from lists of values for the six parameters. in particular, note that the network infectivity (infcontacts) has a factor of two smaller uncertainty than either of its factors, as these two parameters (infrate and scalepopdens) cooperate in the propagation of the contagion and therefore turn out to have a negative posterior weighted correlation coefficient (table i ). the concordance of posterior expectation values (table i) is notable; this goes along with the period of approximately days between the first infection and the first death seen for a few outlier trajectories. however, it is also clear from the histograms in figure and the mean timefirstdeath given in table i that the likely value of this duration is considerably shorter. finally, we evaluated a possible correlation between the actual population density and the scalepopdens parameter governing network connectivity. we found a significant correlation when we added additional countries to the european union countries in this regression; we obtained (p < . , r = . ) scalepopdens(us&eu+) = . ln(population per km²) + . .
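the 'balls on springs' model comparison described earlier reduces, in practice, to weighting each model in the grid by a gaussian likelihood of a location's observed early death counts, with a per-model variance estimated from that model's own stochastic simulations. the sketch below shows this weighting with made-up arrays; the variable names, shapes, and numbers are assumptions, not the authors' code.

```python
# Hedged sketch of the Bayes model comparison: each model in the grid has a
# mean simulated death trajectory and a simulation variance (its 'spring
# constant'); a location's posterior weight for a model is a Gaussian
# likelihood of its observed early death counts under a uniform model prior.
import numpy as np

def posterior_weights(observed, model_means, model_vars):
    """observed: (days,); model_means and model_vars: (n_models, days)."""
    resid2 = (model_means - observed) ** 2
    loglik = -0.5 * np.sum(resid2 / model_vars + np.log(2 * np.pi * model_vars), axis=1)
    w = np.exp(loglik - loglik.max())                 # uniform prior over models
    return w / w.sum()

def posterior_expectation(weights, param_values):
    """Posterior mean of one parameter over the model grid."""
    return float(np.dot(weights, param_values))

# toy example with made-up numbers
rng = np.random.default_rng(0)
n_models, days = 50, 14
model_means = rng.uniform(0, 30, (n_models, days)).cumsum(axis=1)
model_vars = np.full((n_models, days), 25.0)          # per-model simulation variance
observed = model_means[17] + rng.normal(0, 5, days)   # a 'location ball' near model 17
w = posterior_weights(observed, model_means, model_vars)
infrate_grid = rng.uniform(0.01, 0.1, n_models)       # hypothetical parameter axis
print("posterior expectation of infrate:", posterior_expectation(w, infrate_grid))
```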
while epidemiology is not the standard stomping ground of statistical physics, bayesian model comparison is naturally interpreted in a statistical physics context. we showed that taking this interpretation seriously leads to enormous reductions in computational effort. given the complexity of translating the observed manifestations of the pandemic into an understanding of the virus's spread and the course of the infection, we opted for a simple data-driven approach, taking into account population age distributions and the age dependence of the death rate. while the conceptual basis of our approach is simple, there were computational difficulties we had to overcome to make the implementation amenable to computability with finite computational resources. our results were checked to not depend on the size of the networks we simulated, the number of stochastic runs we used for each model, or the number of days that we used for the linear regression. all the values we report in table i are well within most estimated ranges in the literature, but with the benefit of uncertainty estimates performed with a uniform model prior. while each location ball may range over a broad distribution of models, the consensus posterior distribution (table i) shows remarkable concordance across datasets. we can predict the posterior distribution of the time of initial infection, timefirstdeath, as shown in table i . the dynamic model can predict the number of people infected after the first infection (right panel, figure ) and relative to the time of first death (left panel, figure ) because we made no use of infection or recovery statistics in our analysis [ ] . note the enormous variation in the number of infections for the same parameter set, only partly due to stochasticity of the networks themselves, as can be seen by comparing the upper and lower rows of figure . with parameters intrinsic to the infection held fixed, we can predict the effect of various degrees of social distancing by varying network connectivity. we assumed that a certain fraction of nodes in the network would comply with social distancing and only these compliant nodes would reduce their connections at random by a certain fraction (a minimal sketch of this edge-removal procedure is given below). figure shows the effects of four such combinations of compliant node fraction and fraction of contact reduction. comparing these results (table ii) with the posterior expectations of parameters (table i) shows that the bayes entropy of the model posterior distribution is an important factor to consider, validating our initial intuition that optimization of model parameters would be inappropriate in this analysis. the regression we found (eq.'s , , ) with respect to population density must be considered in light of the fact that many outbreaks are occurring in urban areas, so they are not necessarily reflective of the true population density dependence. furthermore, we did not find a significant regression for the countries of the european union by themselves, perhaps because they have a smaller range of population densities, though the addition of these countries into the us states data further reduced the regression p-value of the null hypothesis without materially altering regression parameters. detailed epidemiological data could be used to clarify its significance. several reports [ ] have suggested the importance of super-spreader events, but we did not encounter any difficulty in modeling the available data with garden-variety g(n, p) networks.
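for reference, the edge-removal form of social distancing used in these experiments can be sketched as follows: a compliant fraction of nodes is chosen at random and each compliant node drops a fraction of its links. the two fractions and the base network in this sketch are illustrative assumptions, not the combinations reported in the paper's figures.

```python
# Sketch of imperfect social distancing on a network: a compliant fraction of
# nodes each drop a fraction of their links at random. All numbers here are
# illustrative assumptions.
import random
import networkx as nx

def apply_social_distancing(g, compliant_frac=0.5, drop_frac=0.75, seed=0):
    rng = random.Random(seed)
    h = g.copy()
    compliant = rng.sample(list(h.nodes), int(compliant_frac * h.number_of_nodes()))
    for v in compliant:
        edges = list(h.edges(v))
        for e in rng.sample(edges, int(drop_frac * len(edges))):
            h.remove_edge(*e)
    return h

g = nx.gnp_random_graph(10_000, 4 / 9_999, seed=1)
h = apply_social_distancing(g)
print("mean degree before:", 2 * g.number_of_edges() / g.number_of_nodes())
print("mean degree after: ", 2 * h.number_of_edges() / h.number_of_nodes())
```

the resulting reduced-connectivity network can then be fed back into the epidemic simulation, which is how the effect on cumulative infections was assessed in the text.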
certainly if the network has clusters of older nodes, there will be abrupt jumps in the cumulative death count as the infection spreads through the network. furthermore, it would be interesting to consider how to make the basic model useful for more heterogeneous datasets, such as all countries of the world with vastly different reporting of death statistics. using the posterior distribution we derived as a starting point for more complicated models may be an approach worth investigating. infectious disease modeling is a deep field with many sophisticated approaches in use [ ] and, clearly, our analysis is only scratching the surface of the problem at hand. network structure, in particular, is a topic that has received much attention in social network research [ ] . bayesian approaches have been used in epidemics-on-networks modeling [ ] and have also been used in the present pandemic context in [ , , ] . to our knowledge, there is no work in the published literature that has taken the approach adopted in this paper. there are many caveats to any modeling attempt with data this heterogeneous and complex. first of all, any model is only as good as the data incorporated, and unreported sars-cov- deaths would impact the validity of our results. secondly, if the initial deaths occur in specific locations such as old-age care centers, our modeling will over-estimate the death rate. a safeguard against this is that the diversity of locations we used may compensate to a limited extent. thirdly, while we ensured that our results did not depend on our model ranges as far as practicable, we cannot guarantee that a model with parameters outside our ranges could not be a more accurate model. the transparency of our analysis and the simplicity of our assumptions may be helpful in this regard. all code is available.
an seir infectious disease model with testing and conditional quarantine
the lancet infectious diseases
the lancet infectious diseases
the lancet infectious diseases
proceedings of the th python in science conference
winter simulation conference (wsc)
agent-based modeling and network dynamics
infectious disease modeling
charting the next pandemic: modeling infectious disease spreading in the data science age
we are grateful to arthur sherman for helpful comments and questions and to carson chow for prepublication access to his group's work [ ] . this work was supported by the

key: cord- -cql t r authors: mcmillin, stephen edward title: quality improvement innovation in a maternal and child health network: negotiating course corrections in mid-implementation date: - - journal: j of pol practice & research doi: . /s - - -z sha: doc_id: cord_uid: cql t r this article analyzes mid-implementation course corrections in a quality improvement innovation for a maternal and child health network working in a large midwestern metropolitan area. participating organizations received restrictive funding from this network to screen pregnant women and new mothers for depression, make appropriate referrals, and log screening and referral data into a project-wide data system over a one-year pilot program. this paper asked three research questions: (1) what problems emerged by mid-implementation of this program that required course correction?
(2) how were advocacy targets developed to influence network and agency responses to these mid-course problems? (3) what specific course corrections were identified and implemented to get implementation back on track? this ethnographic case study employs qualitative methods including participant observation and interviews. data were analyzed using the analytic method of qualitative description, in which the goal of data analysis is to summarize and report an event using the ordinary, everyday terms for that event and the unique descriptions of those present. three key findings are noted. first, network participants quickly responded to the emerging problem of under-performing screening and referral completion statistics. second, they shifted advocacy targets away from executive appeals and toward the line staff actually providing screening. third, participants endorsed two specific course corrections, using "opt out, not opt in" choice architecture at intake and implementing visual incentives for workers to track progress. opt-out choice architecture and visual incentives served as useful means of focusing organizational collaboration and correcting mid-implementation problems. this study examines inter-organizational collaboration among human service organizations serving a specific population of pregnant women and mothers at risk for perinatal depression. these organizations received restrictive funding from a local community network to screen this population for risk for depression, make appropriate referrals as indicated, and log screening and referral data into a project-wide data system for a one-year pilot program. this paper asked three specific research questions: (1) what problems emerged by mid-implementation of the screening and referral program that required course correction?
the funded partners faced a very tight timeline that anticipated regular screening and enrollment of an estimated number of clients in case management and depression treatment for every month of the fiscal program year. a slow start in screening and enrolling patients meant that funded partners would likely be in violation of their grant contract with the network while facing a rapidly closing window of time in which they would be able to catch up and provide enough contracted services to meet the contractual numbers for their catchment area, which could jeopardize funding for a second year. this paper covers the months in the middle of the pilot program year when network staff realized that funded partners were seriously behind schedule in the amount of screens and referrals for perinatal mood and depression these agencies were contracted to make at this point in the fiscal year. although challenging and complex for many human service organizations, collaboration with competitors in the form of "co-opetive relationships" has been linked to greater innovation and efficiency (bunger et al. , p. ) . but grant cycle funding can add to this complexity in the form of the "capacity paradox," in which small human service organizations working with specific populations face funding restrictions because they are framed as too small or lacking capacity for larger scale grants and initiatives (terrana and wells , p. ) . finally, once new initiatives are implemented in a funded cycle, human service organizations are increasingly expected to engage in extensive, timely, and often very specific data collection to generate evidence of effectiveness for a particular program (benjamin et al. ) . mid-course corrections during implementation of prevention programs targeted to families with young children have long been seen as important ways to refine and modify the roles of program staff working with these populations and add formal and informal supports to ongoing implementation and service delivery prior to final evaluation (lynch et al. ; wandersman et al. ) . mid-course corrections can help implementers in current interventions or programs adopt a more facilitative and individualized approach to participants that can improve implementation fidelity and cohesion (lynch et al. ; sobeck et al. ) . comprehensive reviews of implementation of programs for families with young children have consistently found that well-measured implementation improves program outcomes significantly, especially when dose or service duration is also assessed (durlak and dupre ; fixsen et al. ) . numerous studies have emphasized capturing implementation data at a low enough level to be able to use it to improve service data quickly and hit the right balance of implementation fidelity and thoughtful, intentional implementation adaptation (durlak and dupre ; schoenwald et al. ; schoenwald et al. ; schoenwald and hoagwood ; tucker et al. ). inter-organizational networks serving families with young children face special challenges in making mid-course corrections while maintaining implementation fidelity across member organizations (aarons et al. ; hanf and o'toole ) . implementation through inter-organizational networks is never merely a result of clear success or clear failure; rather, it is an ongoing assessment of how organizational actors are cooperating or not across organizational boundaries (cline ) . 
frambach and schillewaert ( ) echo this point by noting that intra-organizational and individual cooperation, consistency, and variance also have strong effects on the eventual level of implementation cohesion and fidelity that a given project is able to reach. moreover, recent research suggests that while funders and networks may emphasize and prefer inter-organizational collaboration, individual agency managers in collaborating organizations may see risks and consequences of collaboration and may face dilemmas in complying with network or funder expectations (bunger et al. ) . similar organizations providing similar services with overlapping client bases may fear opportunism or poaching from collaborators, and interpersonal trust as well as contracts or memoranda of understanding might be needed to assuage these concerns (bunger ; bunger et al. ) . even successful collaboration may expedite mergers between collaborating organizations that are undesired or disruptive to stakeholders and sectors (bunger ) . while funders may often prefer to fund larger and more comprehensive merged organizations, smaller specialized community organizations founded by and for marginalized populations may struggle to maintain their community connection and focus as subordinate components of larger firms (bunger ) . organizational policy practice and advocacy for mid-course corrections in a pilot program likely looks different from the type of advocacy and persuasion efforts that might seek to gain buy-in for initial implementation of the program. fischhoff ( ) notes that in the public health workforce, individual workers rarely know how to organize their work to anticipate the possibility or likelihood of mid-course corrections because most work is habituated and routinized to the point that it is rarely intentionally changed, and when it is changed, it is due to larger issues on which workers expect extensive further guidance. when a need for even a relatively minor mid-course correction is identified, it can result in everyone concerned "looking bad," from the workers adapting their implementation to the organizations requesting the changes (fischhoff , p. ) . there is also some evidence that health workers have surprisingly stable, pre-existing beliefs about their work situations and experiences, and requests to make mid-course corrections in work situations may have to contend with workers' pre-existing, stable beliefs about the program they are implementing no matter how well-reasoned the proposed course corrections are (harris and daniels ) . given a new emphasis in social work that organizational policy advocacy should be re-conceptualized as part of everyday organizational practice (mosley ) , a special focus on strategies that contribute to the success of professional networks and organizations that can leverage influence beyond that of a single agency becomes increasingly important. given the above problems with inter-organizational collaboration, increased attention has turned to automated methods of implementation that reduce burden on practitioners without unduly reducing freedom of choice and action. behavioral economics and behavioral science approaches have been suggested as ways to assist direct practitioners to follow policies and procedures that they are unlikely to intend to violate. 
evidence suggests that behavior in many contexts is easy to predict with high accuracy, and behavioral economics seeks to alter people's behavior in predictable, honest, and ethical ways without forbidding any options or significantly adding too-costly incentives, so that the best or healthiest choice is the easiest choice (thaler and sunstein ) . following mosley's ( ) recommendation, this paper examines in detail how a heavily advocated quality improvement pilot program for a maternal and child health network working in a large midwestern metropolitan area attempted to make mid-implementation course corrections for a universal screening and referral program for perinatal mood and anxiety disorders conducted by its member agencies. this paper answers the call of recent policy practice and advocacy research to examine how "openness, networking, sharing of tasks," and building and maintaining positive relationships are operative within organizational practice across multiple organizations (ruggiano et al. , p. ). additionally, this paper focuses on extending recent research to understand how mandated screening for perinatal mood and anxiety disorders can be implemented well (yawn et al. ) . this study used an ethnographic case study method because treating the network and this pilot program as a case study makes it possible to examine unique local data while also locating and investigating counter-examples to what was expected locally (stake ). this method makes it possible to inform and modify grand generalizations about the case before such generalizations become widely accepted (stake ). this study also used ethnographic methods such as participant observation and informal, unstructured interview conversations at regularly scheduled meetings. adding these ethnographic approaches to a case study which is tightly time-limited can help answer research questions fully and efficiently (fusch et al. ). data were collected at regular network meetings, which are - h long and held twice a month. one meeting is a large group of about participants who supervise or perform screening and case management for perinatal mood and anxiety disorders as well as local practitioners in quality improvement and workforce training and development. a second executive meeting was held with - participants, typically network staff and the two co-chairs of each of the three organized groups, a screening and referral group, a workforce training group, and a quality improvement group, to debrief and discuss major issues reported at the large group meeting. for this study, the author served as a consultant to the quality improvement group and attended and took notes on network meetings in the middle of the program year (november through february) to investigate how mid-course corrections in the middle of the contract year were unfolding. these network meetings generally used a world café focus group method, in which participants move from a large group to several small groups discussing screening and referral, training, and quality improvement specifically, then moved back to report small group findings to the large group (fouché and light ) . the author typed extensive notes on an ipad, and notetaking during small group breakouts could only intermittently capture content due to the roving nature of the world café model. note-taking was typically unobtrusive because virtually all participants in both small and large group meetings took notes on discussion.
note-taking and note-sharing were also a frequent and iterative process, in which the author commonly forwarded the notes taken at each meeting to network staff and group participants after each meeting to gain their insights and help construct the agenda of the next meeting. by the middle of the program year, network participants had gotten to know each other and the author quite well, so the author was typically able to easily arrange additional conversations for purposes of member checking. these informal meetings supplemented the two regular monthly meetings of the network and allowed for specific follow-up in which participants were asked about specific comments and reactions they had shared at previous meetings. brief powerpoint presentations were also used at the beginning of successive meetings during the program year to summarize announcements and ideas from the last meeting and encourage new discussion. often, powerpoints were used to remind participants of dates, deadlines, statistics, and refined and summarized concepts. because so many other meeting participants took their own notes and shared them during meetings, a large amount of feedback on meeting topics and their meaning were able to be obtained. the author then coded the author's meeting notes in an iterative and sequenced process guided by principles of qualitative description, in which the goal of data analysis is to summarize and report an event using the ordinary, everyday terms for that event and the unique descriptions of those present (sandelowski and leeman ) . this analytic method was chosen because it is especially useful when interviewing health professionals about a specific topic, in that interpretation stays very close to the data presented while leveraging all of the methodological strengths of qualitative research, such as multiple, iterative coding, member checking, and data triangulation (neergaard et al. ). in this way, qualitative organizational research remains rigorous, while the significance of findings is easily translated to wider audiences for rapid action in intervention and implementation (sandelowski and leeman ) . by the middle of the program year, network meeting participants explicitly recognized that mid-course corrections were needed in the implementation of the new quality improvement and data-sharing program for universal screening and referral of perinatal mood and anxiety disorders. after iterative analysis of shared meeting notes, three key challenges were salient as themes from network meetings in the middle of the program year. regarding the first research question, concerning what problems emerged by midimplementation that required course correction, data showed that the numbers of clients screened and referred were a fraction of what was contractually anticipated by midway through the program year. this problem was two-fold, in that fewer screenings than expected were reported, but also data showed that those clients who screened as at risk for a perinatal mood and anxiety disorder were not consistently being given the referrals to further treatment indicated by the protocol. this was the first time the network had seen "real numbers" from the collected data for the program year that could be compared with estimated and predicted numbers for each part of the program year, both in terms of numbers anticipated to be screened and especially in terms of numbers expected to be referred to the case management services being funded by the network. 
however, the numbers were starkly disappointing: only about half of those whose screening scores were high enough to trigger referrals were actually offered referrals, and only a minority of those who received referrals actually accepted the referral and followed up for further care. by the middle of the program year, only a small percentage of the expected, estimated referrals had been issued, and no network partner had reached the percentage expected. in responding to this data presentation, participants offered several possible patient-level explanations. first, several noted that patients commonly experience inconsistent providers during perinatal care and may have little incentive to follow up on referrals after such a fragmented experience. one participant noted a patient who had been diagnosed with preeclampsia (a pregnancy complication marked by high blood pressure) by her first provider, but the diagnostician gave no further information beyond stating the diagnosis, and then the numerous other providers this patient saw never mentioned it again. this patient moved through care knowing nothing about her diagnosis and with little incentive to accept or follow up with other referrals. other participants noted that the typical approach to discharge planning and care transitions by providers was a poor match for clean, universal screening and referral, and that satisfaction surveys had captured patient concerns about how they were given information, which was typically on paper and presented to the patient as she leaves the hospital or medical office. as one participant noted, "we flood them the day mom leaves the hospital and we're lucky if the paper ever gets out of the back seat of the car." others noted that patients may not follow up on referrals simply because they are feeling overwhelmed with other areas of life or are feeling emotionally better without further treatment. however, while these explanations may shed light on why referred patients did not follow up on or keep referrals, they do nothing to explain why no referral or follow-up was offered for screens that were above the referral cutoff. two further explanations were salient. one explanation centered on the idea that some positive screens were potentially being ignored because staff may be reluctant to engage or feared working with clients in crisis-described as an "if i don't ask the question, i don't have to deal with it" mindset. all screening tools used numeric scores, so that triggered referrals were not dependent on staff having to decide independently to make a referral, but conveying the difficult news that a client had scored high enough on a depression scale to warrant a follow-up referral may have been daunting to some staff. an alternative explanation suggested that staff were not ignoring positive screens but were not understanding the intricacies and expectations of the screening process. of the community agencies partnering with the network to provide screening, many were also able to provide case management as well, but staff did not realize that internal referrals to a different part of their agency still needed to be documented. in this case, a number of missed referrals could have been provided but never documented in the network data-sharing system.
regarding the second research question, concerning how advocacy targets needed to change based on the identification of the problem, participants agreed that the previous plan to reinforce the importance of the screening program to senior executives in current and potential partner agencies (mcmillin ) needed to be updated to reflect a much tighter focus on the line staff actually doing the work (or alternatively not doing the work in the ways expected) in the months remaining in the funded program year. one participant noted that the elusive warm handoff-a discharge and referral where the patient was warmly engaged, clearly understood the expected next steps, and was motivated to visit the recommended provider for further treatment-was also challenging for staff who might default to a "just hand them the paper" mindset, especially for those staff who were overwhelmed and understaffed. the network was funding additional case managers to link patients to treatment, but partner agencies were expected to screen using usual staff, who had been trained but not increased or otherwise compensated to do the screening. additional concerns mentioned the knowledge and preparation of staff to make good referrals, with an example noted of one staff member seemingly unaware of how to make a domestic violence referral even though a locally well-known agency specializing in interpersonal violence treatment and prevention has been working with the network for some time. meeting participants agreed that in the time remaining for the screening and referral pilot, advocacy efforts would have to be diverted away from senior executives and toward line staff if there was to be any chance of meeting enrollment targets and justifying further funding for the screening and referral program. participants also noted that while the operational assumption was that agencies that were network partners this pilot year would remain as network partners for future years of the universal screening and referral program, there was no guarantee about this. partner agencies that struggled to complete the pilot year of the program, with disappointing numbers, may decline to participate next year, especially if they lost network funding based on their challenged performance in the current program year. this suggested that additional advocacy at the executive level might still be needed, as executives could lead their agencies out of the network screening system after june , but that for the remainder of the program year, the line staff themselves who were performing screening needed to be heavily engaged and lobbied to have any hope of fully completing the pilot program on time. regarding the third research question, concerning specific course corrections identified and implemented to get implementation back on track, a prolonged brainstorming session was held after the disappointing data were shared.
this effort produced a list of eight suggested "best practices" to help engage staff performing screening duties to excel in the work: (1) making enrollment targets more visible to staff, perhaps by using visual charts and graphs common in fundraising drives; (2) using "opt-out" choice architecture that would automatically enroll patients who screened above the cutoff score unless the patient objected; (3) sequencing screens with other paperwork and assessments in ways that make sure screens are completed and acted upon in a timely way; (4) offering patients incentives for completing screens; (5) educating staff on reflective practice and compassion fatigue to avoid or reduce feelings of being overwhelmed about screening; (6) using checklists that document work that is done and undone; (7) maintaining intermittent contact and follow-up with patients to check on whether they have accepted and followed up on referrals; and (8) using techniques of prolonged engagement so that by the time staff are screening patients for perinatal mood and anxiety disorders, patients are more likely to be engaged and willing to follow up. further discussion of these best practices noted that there was no available funding to compensate either patients for participating in screening or staff for conducting screening. long-term contact or prolonged engagement also seemed to be difficult to implement rapidly in the remaining months of the program year. low-cost, rapid implementation strategies were seen as most needed, and it was noted that strategies from behavioral economics were the practices most likely to be rapidly implemented at low cost. visual charts and graphs displaying successful screenings and enrollments while also emphasizing the remaining screenings and enrollments needed to be on schedule were chosen for further training because these tactics would involve virtually no additional cost to partner agencies and could be implemented immediately. likewise, shifting to "opt-out" enrollment procedures was encouraged, where referred patients would be automatically enrolled in case management unless they specifically objected. in addition, the network quickly scheduled a workshop on how to facilitate meetings so that supervisors in partner agencies would be assisted in immediately discussing and implementing the above course corrections and behavioral strategies with their staff. training on using visual incentives emphasized three important components of using this technique. first, it was important to make sure that enrollment goals were always visually displayed in the work area of staff performing screening and enrollment work. this could be something as simple as a hand-drawn sign in the work area noting how many patients had been enrolled compared with what that week's enrollment target was. ideally this technique would transition to an infographic that was connected to an electronic dashboard in real time-where results would be transparently displayed for all to see in an automatic way that did not take additional staff time to maintain. second, the visual incentive needed to be displayed vividly enough to alter or motivate new worker behavior, but not so vividly as to compete with, distract, or delay new worker behavior. in many social work settings, participants agreed that weekly updates are intuitive for most staff. without regular check-ins and updates of the target numbers, it could be easy for workers to lose their sense of urgency about meeting these time-constrained goals.
third, training emphasized teaching staff how behavioral biases could reduce their effectiveness. many staff are used to (and often good at) cramming and working just-in-time, but this is not possible when staff cannot control all aspects of work. screeners cannot control the flow of enrollees-rather they must be ready to enroll new clients intermittently as soon as they see a screening is positive-so re-learning not to cram or work just-in-time suggested a change in workplace routines for many staff. training on "opt-out" choice architecture for network enrollment procedures emphasized using behavioral economics and behavioral science to empower direct practitioners to practice better. evidence suggests that behavior in many contexts is easy to predict with high accuracy, and behavioral economics seeks to alter people's behavior in predictable, honest, and ethical ways without forbidding any options or significantly adding too-costly incentives, so that the best or healthiest choice is the easiest choice (thaler and sunstein ) . training here also emphasized meeting thaler and sunstein's ( ) two standards for good choice architecture: (1) the choice had to be transparent, not hidden, and (2) it had to be cheap and easy to opt out. good examples of such choice architecture were highlighted, such as email and social media group enrollments, where one can unsubscribe and leave such a group with just one click. bad or false examples of choice architecture were also highlighted, such as auto-renewal for magazines or memberships where due dates by which to opt out are often hidden and there is always one more financial charge before one is free of the costly enrollment or subscription. training concluded by advising network participants to use opt-in choice architecture when the services in question are highly likely to be spam, not meaningful, or only relevant to a fraction of those approached. attendees were advised to use opt-out choice architecture when the services in question are highly likely to be meaningful, not spam, and relevant to most of those approached. since clients were not approached for enrollment unless they had scored high on a depression screening, automatic enrollment in a case management service where clients would receive at least one more contact from social services was highly relevant to the population served in this pilot program and was encouraged, with clients always having the right to refuse. to jump-start changing the behavior of the staff in partner agencies actually doing the screenings, making the referrals, and enrolling patients in the case management program, the network quickly scheduled a facilitation training so that supervisors and all who led staff or chaired meetings could be prepared and empowered to discuss enrollment and teach topics like opt-out enrollment to staff. this training emphasized the importance of creating spaces for staff to express doubt or confusion about what was being asked of them. one technique that resonated with participants was doing check-ins with staff during a group training by asking staff to make "fists to fives," a zero-to-five hand-signal scale of how comfortable they were with the discussion, where holding a fist in the air signals discomfort, disagreement, or confusion and waving all five fingers of one hand in the air means total comfort or agreement with a query or topic.
training also emphasized that facilitators and trainers should acknowledge that conflict and disagreement typically come from really caring, so it was important to "normalize discomfort," call it out when people in the room seem uncomfortable, and reiterate that the partner agency is able and willing to have "the tough conversations" about the nature of the work. mid-course corrections attempted during implementation of a quality improvement system in a maternal and child health network offered several insights into how organizational policy practice and advocacy techniques may rapidly change on the ground. specifically, findings highlighted the importance of checking outcome data early enough to be able to respond to implementation concerns immediately. participants broadly endorsed organizational adoption of behavioral economic techniques to rapidly influence the work behavior of line staff engaged in screening and referral before lobbying senior executives to extend the program to future years. these findings invite further exploration of two areas: (1) the workplace experiences of line staff tasked with mid-course implementation corrections, and (2) the organizational and practice experiences of behavioral economic ("nudge") techniques. this network's approach to universal screening and referral was very clearly meant to be truly neutral or even biased in the client's favor. staff were allowed and even encouraged to use their own individual judgment and discretion to refer clients for case management, even if the client did not score above the clinical cutoff of the screening instrument. mindful of the dangers of rigid, top-down bureaucracy, the network explicitly sought to empower line staff to work in clients' favor, yet still experienced disappointing results. this outcome suggests several possibilities. first, it is possible that, as participants implied, line staff were sometimes demoralized workers or nervous non-clinicians who were not eager to convey difficult news regarding high depression scores to clients who may have already been difficult to serve. as hasenfeld's classic work (hasenfeld ) has explicated, the organizational pull toward people-processing in lieu of true people-changing is powerful in many human service organizations. tummers ( ) also recently showcases the tendency of workers to prioritize service delivery to motivated rather than unmotivated clients. smith ( ) suggests that regulatory and contractual requirements can ameliorate disparities in who gets prioritized for what kind of human service, but the variability of human service practice makes this problem hard to eliminate altogether. however, it is also possible that line staff did not see referral as clearly in a client's best interest but rather as additional paperwork and paper-pushing within their own workplaces, additional work that line staff were given neither extra time nor compensation to complete. given that ultimately the number of internal referrals that were undercounted or undocumented was seen as an important cause of disappointing project outcomes, staff reluctance to engage in extra bureaucratic sorting tasks is a distinct possibility. the line staff here providing screening may have seen their work as less of a clinical assessment process and more of a tedious, internal bureaucracy geared toward internal compliance and payment rather than getting needy clients to worthwhile treatment.
further research on the experience of line staff members performing time-sensitive sorting tasks is needed to understand how, even in environments explicitly trying to be empowering and supportive of worker discretion, worker discretion may have negative impacts on desired implementation outcomes. in addition to the experience of line staff in screening clients, the interest and embrace of agency supervisors in choosing behavioral economic techniques for staff training and screening processes also deserves further study. grimmelikhuijsen et al. ( ) advocate broadly for further study and understanding of behavioral public administration, which integrates behavioral economic principles and psychology, noting that whether one agrees or disagrees with the "nudge movement" (p. ) in public administration, it is important to understand its growing influence. ho and sherman ( ) pointedly critique nudging and behavioral economic approaches, noting that they may hold promise for improving implementation and service delivery but do not focus on front-line workers or on the quality and consistency of organizational and bureaucratic services, in which arbitrariness remains a consistent problem. finally, more research is needed on links between organizational policy implementation and state policy. in this case, state policy primarily set report deadlines and funding amounts with little discernible impact on ongoing organizational implementation. this gap also points to challenges in how policymakers can feasibly learn from implementation innovation in the community and how successful innovations can influence the policy process going forward. this article's findings and any inferences drawn from them must be understood in light of several study limitations. this study used a case study method and ethnographic approaches of participant observation, a process which always runs the risk of the personal bias of the researcher intruding into data collection as well as the potential for social desirability bias among those observed. moreover, a case study serves to elaborate a particular phenomenon and issue, which may limit its applicability to other cases or situations. a critical review of the use of the case study method in high-impact journals in health and social sciences found that case studies published in these journals used clear triangulation and member-checking strategies to strengthen findings and also used well-regarded case study approaches such as stake's and qualitative analytic methods such as sandelowski's (hyett et al. ). this study followed these recommended practices. continued research on health and human service program implementation that follows the criteria and standards analyzed by hyett et al. ( ) will contribute to the empirical base of this literature while ameliorating some of these limitations. research suggests that collaboration may be even more important for organizations than for individuals in the implementation of social innovations (berzin et al. ). the network studied here adopted behavioral economics as a primary means of focusing organizational collaboration. however, a managerial turn to nudging or behavioral economics must do more than achieve merely administrative compliance. "opt-out, not-in" organizational approaches could positively affect implementation of social programs in two ways.
first, it could eliminate unnecessary implementation impediments (such as the difficult conversations about depression referrals resisted by staff in this case) by using tools such as automatic enrollment to push these conversations to more specialized staff who could better advise affected clients. second, such approaches could reduce the potential workplace dissatisfaction of line staff, including any potential discipline they could face for incorrectly following more complicated procedures. thaler and sunstein ( ) explicitly endorse worker welfare as a rationale and site for behavioral economic approaches. they note that every system, as a system, has been planned with an array of choice decisions already made, and given that there is always a starting default, it should be set to predict desired best outcomes. this study supports considering behavioral economic approaches for social program implementation as a way to reset maladaptive default settings and provide services in ways that can be more just and more effective for both workers and clients.

advancing a conceptual model of evidence-based practice implementation in public service sectors. administration and policy in mental health and mental health services research
policy fields, data systems, and the performance of nonprofit human service organizations
defining our own future: human service leaders on social innovation
administrative coordination in nonprofit human service delivery networks: the role of competition and trust. nonprofit and voluntary sector quarterly
institutional and market pressures on interorganizational collaboration and competition among private human service organizations
defining the implementation problem: organizational management versus cooperation
implementation matters: a review of research on the influence of implementation on program outcomes and the factors affecting implementation
helping the public make health risk decisions
implementation research: a synthesis of the literature
"the world café" in social work research
organizational innovation adoption: a multi-level framework of determinants and opportunities for future research
how to conduct a mini-ethnographic case study: a guide for novice researchers
behavioral public administration: combining insights from public administration and psychology
revisiting old friends: networks, implementation structures and the management of inter-organizational relations
daily affect and daily beliefs
human services as complex organizations
managing street-level arbitrariness: the evidence base for public sector quality improvement
methodology or method? a critical review of qualitative case study reports
successful program development using implementation evaluation
organizational policy advocacy for a quality improvement innovation in a maternal and child health network: lessons learned in early implementation
recognizing new opportunities: reconceptualizing policy advocacy in everyday organizational practice
qualitative description-the poor cousin of health research?
identifying attributes of relationship management in nonprofit policy advocacy
writing usable qualitative health research findings
effectiveness, transportability, and dissemination of interventions: what matters when?
workforce development and the organization of work: the science we need. administration and policy in mental health and mental health services research
clinical supervision in effectiveness and implementation research
the future of nonprofit human services
lessons learned from implementing school-based substance abuse prevention curriculums
financial struggles of a small community-based organization: a teaching case of the capacity paradox
libertarian paternalism is not an oxymoron. university of chicago public law & legal theory working paper
nudge: improving decisions about health, wealth, and happiness
lessons learned in translating research evidence on early intervention programs into clinical care. mcn
the relationship between coping and job performance
comprehensive quality programming and accountability: eight essential strategies for implementing successful prevention programs
identifying perinatal depression and anxiety: evidence-based practice in screening, psychosocial assessment and management

the author declares that this work complied with appropriate ethical standards. the author declares that they have no conflict of interest.

key: cord- -ewkche r title: multiscale statistical physics of the human-sars-cov-2 interactome date: - - journal: nan doi: nan sha: doc_id: cord_uid: ewkche r

protein-protein interaction (ppi) networks have been used to investigate the influence of sars-cov-2 viral proteins on the function of human cells, laying out a deeper understanding of covid-19 and providing ground for drug repurposing strategies. however, our knowledge of (dis)similarities between this one and other viral agents is still very limited. here we compare the novel coronavirus ppi network against known viruses, from the perspective of statistical physics. our results show that classic analysis such as percolation is not sensitive to the distinguishing features of viruses, whereas the analysis of biochemical spreading patterns allows us to meaningfully categorize the viruses and quantitatively compare their impact on human proteins. remarkably, when gibbsian-like density matrices are used to represent each system's state, the corresponding macroscopic statistical properties measured by the spectral entropy reveal the existence of clusters of viruses at multiple scales. overall, our results indicate that sars-cov-2 exhibits similarities to viruses like sars-cov and influenza a at small scales, while at larger scales it exhibits more similarities to viruses such as hiv-1 and htlv-1.

the covid-19 pandemic, with global impact on multiple crucial aspects of human life, is still a public health threat in most areas of the world. despite the ongoing investigations aiming to find a viable cure, our knowledge of the nature of the disease is still limited, especially regarding the similarities and differences it has with other viral infections. on the one hand, sars-cov-2 shows high genetic similarity to sars-cov. with the rise of network medicine [ ] [ ] [ ] [ ] [ ] [ ] , methods developed for complex networks analysis have been widely adopted to efficiently investigate the interdependence among genes, proteins, biological processes, diseases and drugs.
similarly, they have been used for characterizing the interactions between viral and human proteins in the case of sars-cov-2 [ ] [ ] [ ] , providing insights into the structure and function of the virus and identifying drug repurposing strategies. however, a comprehensive comparison of sars-cov-2 against other viruses, from the perspective of network science, is still missing. here, we use statistical physics to analyze viruses, including sars-cov-2. we consider the virus-human protein-protein interactions (ppi) as an interdependent system with two parts: the human ppi network and the viral proteins that target it. in fact, due to the large size of the human ppi network, its structural properties barely change after being merged with viral components. consequently, we show that percolation analysis of such interdependent systems provides no information about the distinguishing features of viruses. instead, we model the propagation of perturbations from viral nodes through the whole system, using bio-chemical and regulatory dynamics, to obtain the spreading patterns and compare the average impact of viruses on human proteins. finally, we exploit gibbsian-like density matrices, recently introduced to map network states, to quantify the impact of viruses on the macroscopic functions of the human ppi network, such as von neumann entropy. the inverse temperature β is used as a resolution parameter to perform a multiscale analysis. we use the above information to cluster together viruses and our findings indicate that sars-cov-2 groups with a number of pathogens associated with respiratory infections, including sars-cov, influenza a and human adenovirus (hadv), at the smallest scales, more influenced by local topological features. interestingly, at larger scales, it exhibits more similarity with viruses from distant families such as hiv-1 and human t-cell leukemia virus type 1 (htlv-1). our results shed light on the unexplored aspects of sars-cov-2, from the perspective of statistical physics of complex networks, and the presented framework opens the doors for further theoretical developments aiming to characterize structure and dynamics of virus-host interactions, as well as grounds for further experimental investigation and potentially novel clinical treatments. here, we use data regarding the viral proteins and their interactions with human proteins for viruses (see methods and fig. ) . to obtain the virus-human interactomes, we link the data to the biostr human ppi network.

percolation of the interactomes. arguably, the simplest conceptual framework to assess how and why a networked system loses its functionality is via the process of percolation. here, the structure of interconnected systems is modeled by a network g with n nodes, which can be fully represented by an adjacency matrix a (a_ij = 1 if nodes i and j are connected, and a_ij = 0 otherwise). this point of view assumes that, as a first approximation, there is an intrinsic relation between connectivity and functionality: when the node removal occurs, the more capable of remaining assembled a system is, the better it will perform its tasks. hence, we have a quantitative way to assess the robustness of the system. if one wants to single out the role played by a certain property of the system, instead of selecting the nodes randomly, they can be sequentially removed following that criterion. for instance, if we want to find out the relevance of the most connected elements for the functionality, we can remove a fraction of the nodes with largest degree.
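As a concrete illustration of this percolation protocol, the following minimal Python sketch (assuming the networkx library and a toy stand-in graph; the real interactomes would be built from the PPI edge lists) removes nodes in a chosen order and tracks the normalized size of the largest connected component.

```python
import random
import networkx as nx

def attack_curve(g, order, steps=20):
    """Sequentially remove nodes in the given order and record s, the size of
    the largest connected component normalized by the original number of nodes."""
    h = g.copy()
    n0 = h.number_of_nodes()
    batch = max(1, n0 // steps)
    curve = []
    for start in range(0, n0, batch):
        h.remove_nodes_from(order[start:start + batch])
        if h.number_of_nodes() == 0:
            curve.append(0.0)
        else:
            curve.append(len(max(nx.connected_components(h), key=len)) / n0)
    return curve

# toy stand-in for a virus-host interactome
g = nx.barabasi_albert_graph(2000, 3, seed=1)

random_order = random.Random(1).sample(list(g.nodes()), g.number_of_nodes())
degree_order = sorted(g.nodes(), key=g.degree, reverse=True)
betweenness = nx.betweenness_centrality(g)
betweenness_order = sorted(g.nodes(), key=betweenness.get, reverse=True)

curves = {name: attack_curve(g, order)
          for name, order in [("random", random_order),
                              ("degree", degree_order),
                              ("betweenness", betweenness_order)]}
```

Plotting each curve against the fraction of removed nodes reproduces the kind of comparison described in the text, with one curve per attack protocol.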
technically, the criterion can be any metric that allows us to rank nodes, although in practical terms topologically-oriented protocols are the most frequently used due to their accessibility, such as degree, betweenness, etc. therefore percolation is, in effect, a topological analysis, since its input and output are based on structural information. in the past, the usage of percolation has proved useful to shed light on several aspects of protein-related networks, such as in the identification of functional clusters and protein complexes, the verification of the quality of functional annotations or the critical properties as a function of mutation and duplication rates, to name but a few. following this research line, we perform the percolation analysis on all the ppi networks to understand if this technique brings any information that allows us to differentiate among viruses. the considered protocols are the random selection of nodes, the targeting of nodes by degree -i.e., the number of connections they have -and their removal by betweenness centrality -i.e., a measure of the likelihood of a node to be in the information flow exchanged through the system by means of shortest paths. we apply these attack strategies and compute the resulting (normalized) size of the largest connected component s in the network, which serves as a proxy for the remaining functional part, as commented above. this way, when s is close to unity the function of the network has been scarcely impacted by the intervention, while when s is close to zero the network can no longer be operative. the results are shown in fig. . surprisingly, for each attacking protocol, we observe that the curves of the size of the largest connected component neatly collapse onto a common curve. in other words, percolation analysis completely fails at finding virus-specific discriminators. viruses do respond differently depending on the ranking used, but this is somehow expected due to the correlation between the metrics employed and the position of the nodes in the network. we can shed some light on the similar virus-wise response to percolation by looking at the topological structure of the interactomes. despite being viruses of diverse nature and causing such different symptomatology, their overall structure shows a high level of similarity when it comes to the protein-protein interaction. indeed, for every pair of viruses we find the fraction of nodes f n and fraction of links f l that simultaneously participate in both. averaging over all pairs, we obtain that f n = . ± . and f l = . ± . . that means that the interactomes are structurally very similar, so the dismantling ranks are largely the same across viruses. if purely topological analysis is not able to differentiate between viruses, then we need more convoluted, non-standard techniques to tackle this problem. in the next sections we will employ these alternative approaches.

analysis of perturbation propagation. ppi networks represent the large scale set of interacting proteins. in the context of regulatory networks, edges encode dependencies for activation/inhibition with transcription factors. ppi edges can also represent the propensity for pairwise binding and the formation of complexes. the analytical treatment of these processes is described via bio-chemical dynamics and regulatory dynamics. in bio-chemical (bio-chem) dynamics, these interactions are proportional to the product of concentrations of reactants, thus resulting in a second-order interaction, forming dimers.
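The pairwise node and link overlaps quoted above can be computed directly from the interactome node and edge sets. A minimal sketch follows; the normalization by the union of the two sets is one plausible choice of denominator and is an assumption here, since the text does not spell out the exact convention.

```python
from itertools import combinations

def overlap(nodes_a, edges_a, nodes_b, edges_b):
    """Fraction of shared nodes and shared links between two interactomes,
    normalized by the union (illustrative choice of denominator)."""
    f_nodes = len(nodes_a & nodes_b) / len(nodes_a | nodes_b)
    f_links = len(edges_a & edges_b) / len(edges_a | edges_b)
    return f_nodes, f_links

def average_pairwise_overlap(interactomes):
    """interactomes: dict mapping virus name -> (node set, edge set),
    where each edge is a frozenset of two protein identifiers."""
    f_n, f_l = [], []
    for a, b in combinations(interactomes, 2):
        fn, fl = overlap(*interactomes[a], *interactomes[b])
        f_n.append(fn)
        f_l.append(fl)
    return sum(f_n) / len(f_n), sum(f_l) / len(f_l)

# toy example with two viruses sharing most of the human PPI backbone
shared = {frozenset({"p1", "p2"}), frozenset({"p2", "p3"}), frozenset({"p3", "p4"})}
interactomes = {
    "virus_a": ({"p1", "p2", "p3", "p4", "va"}, shared | {frozenset({"va", "p1"})}),
    "virus_b": ({"p1", "p2", "p3", "p4", "vb"}, shared | {frozenset({"vb", "p3"})}),
}
print(average_pairwise_overlap(interactomes))
```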
the protein concentration x_i (i = 1, 2, ..., n) also depends on its degradation rate b_i and on the amount of protein synthesized at a rate f_i. the resulting law of mass action, dx_i/dt = f_i - b_i x_i - sum_j a_ij x_i x_j, summarizes the formation of complexes and the degradation/synthesis processes that occur in a ppi. regulatory dynamics can instead be characterized by an interaction with neighbors described by a hill function that saturates at unity, dx_i/dt = -b_i x_i + sum_j a_ij x_j^h / (1 + x_j^h). in the context of the study of signal propagation, recent works have introduced the definition of the network global correlation function g_ij. ultimately, the idea is that a constant perturbation brings the system to a new steady state x_i → x_i + dx_i, and dx_i/x_i quantifies the magnitude of the response of node i to the perturbation in j. this also allows the definition of measures such as the impact of a node, i_i = sum_j a_ij g_ji, describing the response of i's neighbors to its perturbation. interestingly, it was found that these measures can be described with power laws of the degrees (i_i ≈ k_i^φ), via universal exponents that depend on the underlying odes, allowing one to effectively describe the interplay between topology and dynamics. in our case, φ = 0 for both processes; therefore the perturbation from i has the same impact on its neighbors, regardless of its degree. we exploit the definition of g_ij to define the vector g^v of perturbations of concentrations induced by the interaction with the virus v, where the k-th entry quantifies the relative concentration change dx_k/x_k of the k-th human protein. the steps we follow to assess the impact of the viral nodes in the human interactome via the microscopic dynamics are described next. we first obtain the equilibrium state of the human interactome by numerical integration of the equations. then, for each virus, we compute the system response to perturbations starting from all nodes i ∈ v, which is eventually encoded in g^v. finally, we repeat these steps for both the bio-chem and m-m models. the amount of correlation generated is a measure of the impact of the virus on the interactome equilibrium state. we estimate it as the 1-norm of the correlation vector, sum_i |g^v_i|, which we refer to as the cumulative correlation. the results are presented in fig. . by allowing for multiple sources of perturbation, the biggest responses in magnitude will come from direct neighbors of these sources, making them the dominant contributors to g^v. with i_i not depending on the source degree, these results support the idea that, with these specific forms of dynamical processes on top of the interactome, the overall impact of a perturbation generated by a virus is proportional to the number of human proteins it interacts with. results shown in fig. highlight that propagation patterns strongly depend on the sources (i.e., the affected nodes v), and strong similarities will generally be found within the same family and for viruses that share common impacted proteins in the interactome. conversely, families and viruses with small (or null) overlap in the sources exhibit low similarity and are not sharply distinguishable. to cope with this, we adopt a rather macroscopic view of the interactomes in the next section.

analysis of spectral information. we have shown that the structural properties of the human ppi network do not significantly change after being targeted by viruses.
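A minimal numerical sketch of this perturbation-propagation analysis is given below, using the mass-action form reconstructed above, a toy random adjacency matrix, simple Euler integration, and illustrative parameter values; the source nodes stand in for human proteins targeted by a virus. This is a sketch of the general procedure, not the published pipeline, whose implementation details may differ.

```python
import numpy as np

def relax(a, f, b, x0, clamped=(), dt=0.01, steps=50000, tol=1e-10):
    """Euler-integrate dx_i/dt = f_i - b_i*x_i - x_i * sum_j a_ij x_j to steady
    state. Nodes listed in `clamped` are held fixed at their value in x0."""
    x = x0.copy()
    clamped = list(clamped)
    for _ in range(steps):
        dx = f - b * x - x * (a @ x)
        if clamped:
            dx[clamped] = 0.0
        x = np.clip(x + dt * dx, 0.0, None)
        if np.max(np.abs(dx)) < tol:
            break
    return x

def correlation_vector(a, f, b, sources, eps=0.05):
    """Relative response (x_new - x_eq)/x_eq of every protein when the
    concentration of the virus-targeted proteins is perturbed and held there."""
    x_eq = relax(a, f, b, np.ones(a.shape[0]))
    x0 = x_eq.copy()
    x0[sources] *= 1.0 + eps              # constant perturbation at the sources
    x_new = relax(a, f, b, x0, clamped=sources)
    return (x_new - x_eq) / x_eq

# toy example: 50-protein random interactome, 3 proteins targeted by a virus
rng = np.random.default_rng(0)
a = (rng.random((50, 50)) < 0.08).astype(float)
a = np.triu(a, 1)
a = a + a.T                                # symmetric, unweighted adjacency
f = np.full(50, 1.0)                       # synthesis rates (illustrative)
b = np.full(50, 1.0)                       # degradation rates (illustrative)

g_v = correlation_vector(a, f, b, sources=[0, 1, 2])
cumulative_correlation = np.abs(g_v).sum()   # the 1-norm used to rank viruses
```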
percolation analysis seems ineffective in distinguishing the specific characteristics of virus-host interactomes while, in contrast, the propagation of biochemical signals from viral components into the human ppi network has been shown successful in assessing the viruses in terms of their average impact on human proteins. remarkably, the propagation patterns can be used to hierarchically cluster the viruses, although some of them are highly dependent on the choice of threshold (fig. ). in this section, we exploit a gibbsian-like density matrix ρ(β, g) = e^(-βl) / z(β, g), which is defined in terms of the propagator of a diffusion process on top of the network, normalized by the partition function z(β, g) = tr[e^(-βl)], and which has an elegant physical meaning in terms of dynamical trapping for diffusive flows. consequently, the counterpart of the massieu function -also known as free entropy -in statistical physics can be defined for networks as Φ(β, g) = ln z(β, g). note that a low value of the massieu function indicates high information flow between the nodes. the von neumann entropy s(β, g) = -tr[ρ(β, g) ln ρ(β, g)] can be directly derived from the massieu function and encodes the information content of graph g. finally, the difference between the von neumann entropy and the massieu function follows s(β, g) - Φ(β, g) = βu(β, g), where u(β, g) is the counterpart of internal energy in statistical physics. in the following, we use the above quantities to compare the interactomes corresponding to different virus-host systems. in fact, as the number of viral nodes is much smaller than the number of human proteins, we model each virus-human interdependent system as a perturbation of the large human ppi network g (see fig. ). after considering the viral perturbations, the von neumann entropy, massieu function and the energy of the human ppi network change slightly. the magnitude of such perturbations can be calculated as explained in fig. for the von neumann entropy and the massieu function, while the perturbation in internal energy follows their difference, βδu(β, g) = δs(β, g) - δΦ(β, g), according to eq. . the parameter β encodes the propagation time in diffusion dynamics, or equivalently an inverse temperature from a thermodynamic perspective, and is used as a resolution parameter tuned to characterize macroscopic perturbations due to node-node interactions at different scales, from short to long range. based on the perturbation values and using the k-means algorithm, a widely adopted clustering technique, we group the viruses together (see fig. , tab. and tab. ). at small scales, sars-cov-2 appears in a cluster with a number of other viruses causing respiratory illness, including sars-cov, influenza a and hadv. however, at larger scales, it exhibits more similarity with hiv-1, htlv-1 and one hpv type. table : the summary of clustering results at small scales (β ≈ from fig. ) is presented. remarkably, at this scale, sars-cov-2 groups with a number of respiratory diseases including sars-cov, influenza a and hadv. table : the summary of clustering results at larger scales (from fig. ) is presented. here, sars-cov-2 shows higher similarity to hiv-1, htlv-1 and one hpv type. comparing covid-19 against other viral infections is still a challenge. in fact, various approaches can be adopted to characterize and categorize the complex nature of viruses and their impact on human cells. in this study, we used an approach based on statistical physics to analyze virus-human interactomes.

overview of the data set. it is worth noting that to build the covid-19 virus-host interactions, a different procedure had to be used.
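The spectral quantities just described can be computed directly from the combinatorial Laplacian. A minimal sketch follows, assuming numpy, scipy and networkx and a toy stand-in graph; how the perturbations are normalized and swept over β follows the paper's procedure, which is not reproduced in detail here.

```python
import numpy as np
import networkx as nx
from scipy.linalg import expm

def spectral_quantities(g, beta):
    """Von Neumann entropy S, Massieu function Phi = ln Z, and internal energy
    U = (S - Phi)/beta for the density matrix rho = exp(-beta L) / Z."""
    lap = nx.laplacian_matrix(g).toarray().astype(float)
    rho = expm(-beta * lap)
    z = np.trace(rho)
    rho /= z
    eigs = np.linalg.eigvalsh(rho)
    eigs = eigs[eigs > 1e-15]
    entropy = -np.sum(eigs * np.log(eigs))
    massieu = np.log(z)
    energy = (entropy - massieu) / beta
    return entropy, massieu, energy

# perturbation of the human network by one virus, as described in the text
human = nx.barabasi_albert_graph(300, 2, seed=0)      # stand-in for the PPI
perturbed = human.copy()
viral_targets = [0, 5, 10]                            # hypothetical target proteins
for k, target in enumerate(viral_targets):            # attach viral proteins
    perturbed.add_edge(f"viral_{k}", target)

beta = 1.0
s0, phi0, _ = spectral_quantities(human, beta)
s1, phi1, _ = spectral_quantities(perturbed, beta)
delta_s, delta_phi = s1 - s0, phi1 - phi0             # inputs to the clustering
```

Repeating the last step for every virus and for several values of β yields the feature vectors that the k-means clustering operates on.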
in fact, since sars-cov-2 is too novel, we could not find its ppi in the string repository and we have considered, instead, the targets experimentally observed in gordon et al., consisting of human proteins. the remainder of the procedure used to build the virus-host ppi is the same as before. see fig. for summary information about each virus. a key enzyme involved in the process of prostaglandin biosynthesis; ifih1 (interferon induced with helicase c domain 1, ncbi gene id: ), encoding mda5, an intracellular sensor of viral rna responsible for triggering the innate immune response: it is fundamental for activating the process of pro-inflammatory response that includes interferons; for this reason it is targeted by several virus families which are able to hinder the innate immune response by evading its specific interferon response.

contributions. ag, oa and sb performed numerical experiments and data analysis. mdd conceived and designed the study. all authors wrote the manuscript.

the proximal origin of sars-cov-2
the genetic landscape of a cell
epidemiologic features and clinical course of patients infected with sars-cov-2 in singapore
a trial of lopinavir-ritonavir in adults hospitalized with severe covid-19
remdesivir, lopinavir, emetine, and homoharringtonine inhibit sars-cov-2 replication in vitro
network medicine: a network-based approach to human disease
focus on the emerging new fields of network physiology and network medicine
human symptoms-disease network
network medicine approaches to the genetics of complex diseases
the human disease network
the multiplex network of human diseases
network medicine in the age of biomedical big data
a sars-cov-2 protein interaction map reveals targets for drug repurposing
structural genomics and interactomics of wuhan novel coronavirus, 2019-ncov, indicate evolutionary conserved functional regions of viral proteins
structural analysis of sars-cov-2 and prediction of the human interactome
fractional diffusion on the human proteome as an alternative to the multi-organ damage of sars-cov-2
network medicine framework for identifying drug repurposing opportunities for covid-19
predicting potential drug targets and repurposable drugs for covid-19 via a deep generative model for graphs
network robustness and fragility: percolation on random graphs
introduction to percolation theory
error and attack tolerance of complex networks
breakdown of the internet under intentional attack
identification of functional modules in a ppi network by clique percolation clustering
identifying protein complexes from interaction networks based on clique percolation and distance restriction
percolation of annotation errors through hierarchically structured protein sequence databases
infinite-order percolation and giant fluctuations in a protein interaction network
computational analysis of biochemical systems: a practical guide for biochemists and molecular biologists
propagation of large concentration changes in reversible protein-binding networks
an introduction to systems biology
quantifying the connectivity of a network: the network correlation function method
universality in network dynamics
the statistical physics of real-world networks
classical information theory of networks
the von neumann entropy of networks
structural reducibility of multilayer networks
spectral entropies as information-theoretic tools for complex network comparison
complex networks from classical to quantum
enhancing transport properties in interconnected systems without altering their structure
scale-resolved analysis of brain functional connectivity networks with spectral entropy
unraveling the effects of multiscale network entanglement on disintegration of empirical systems under revision
string v : protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
biogrid: a general repository for interaction datasets
the biogrid interaction database: update
gene help: integrated access to genes of genomes in the reference sequence collection

competing financial interests. the authors declare no competing financial interests. acknowledgements. the authors thank vera pancaldi for useful discussions.

key: cord- - a gf title: processing of crowdsourced observations of aircraft in a high performance computing environment date: - - journal: nan doi: nan sha: doc_id: cord_uid: a gf

as unmanned aircraft systems (uass) continue to integrate into the u.s. national airspace system (nas), there is a need to quantify the risk of airborne collisions between unmanned and manned aircraft to support regulation and standards development. both regulators and standards developing organizations have made extensive use of monte carlo collision risk analysis simulations using probabilistic models of aircraft flight. we've previously determined that the observations of manned aircraft by the opensky network, a community network of ground-based sensors, are appropriate to develop models of the low altitude environment. this work overviews the high performance computing workflow designed and deployed on the lincoln laboratory supercomputing center to process . billion observations of aircraft. we then trained the aircraft models using more than , flight hours at , feet above ground level or below. a key feature of the workflow is that all the aircraft observations and supporting datasets are available as open source technologies or have been released to the public domain.

the continuing integration of unmanned aircraft system (uas) operations into the national airspace system (nas) requires new or updated regulations, policies, and technologies to maintain safe and efficient use of the airspace. to help achieve this, regulatory organizations such as the federal aviation administration (faa) and the international civil aviation organization (icao) mandate the use of collision avoidance systems to minimize the risk of a midair collision (mac) between most manned aircraft (e.g. cfr § . ). monte carlo safety simulations and statistical encounter models of aircraft behavior [ ] have enabled the faa to develop, assess, and certify systems to mitigate the risk of airborne collisions. these simulations and models are based on observed aircraft behavior and have been used to design, evaluate, and validate collision avoidance systems deployed on manned aircraft worldwide [ ] . for assessing the safety of uas operations, the monte carlo simulations need to determine if the uas would be a hazard to manned aircraft. therefore there is an inherent need for models that represent how manned aircraft behave. while various models have been developed for decades, many of these models were not designed to model manned aircraft behavior where uas are likely to operate [ ] . in response, new models designed to characterize the low altitude environment are required.
in response, we previously identified and determined that the opensky network [ ] , a community network of ground-based sensors that observe aircraft equipped with automatic dependent surveillance-broadcast (ads-b) out, would provide sufficient and appropriate data to develop new models [ ] . ads-b was initially developed and standardized to enable aircraft to leverage satellite signals for precise tracking and navigation [ , ] . however, the previous work did not train any models. this work considered only aircraft observed by the opensky network within the united states and flying between and , feet above ground level (agl). thus this work does not consider all aircraft, as not all aircraft are equipped with ads-b. the scope of this work was informed by the needs of the faa uas integration office, along with the activities of the standards development organizations of astm f , rtca sc- , and rtca sc- . initial scoping discussions were also informed by the uas excom science and research panel (sarp), an organization chartered under the excom senior steering group; however the sarp did not provide a final review of the research. we focused on two objectives identified by the aviation community to support integration of uas into the nas: first, to train a generative statistical model of manned aircraft behavior at low altitudes; and second, to estimate the relative frequency with which a uas would encounter a specific type of aircraft. these contributions are intended to support current and expected uas safety system development and evaluation and facilitate stakeholder engagement to refine our contributions for policy-related activities. the primary contribution of this paper is the design and evaluation of the high performance computing (hpc) workflow to train models and complete analyses that support the community's objectives. refer to previous work [ , ] to use the results from this workflow. this paper focuses primarily on the use of the lincoln laboratory supercomputing center (llsc) [ ] to process billions of aircraft observations in a scalable and efficient manner. we first briefly overview the storage and compute infrastructure of the llsc. the llsc and its predecessors have been widely used to process aircraft tracks and support aviation research for more than a decade. the llsc high-performance computing (hpc) systems have two forms of storage: distributed and central. distributed storage comprises the local storage on each of the compute nodes and this storage is typically used for running database applications. central storage is implemented using the open-source lustre parallel file system on a commercial storage array. lustre provides high performance data access to all the compute nodes, while maintaining the appearance of a single filesystem to the user. the lustre filesystem is used in most of the largest supercomputers in the world. specifically, the block size of lustre is mb, thus any file created on the llsc will take at least mb of space. the processing described in this paper was conducted on the llsc hpc system [ ] . the system consists of a variety of hardware platforms, but we specifically developed, executed, and evaluated our software using compute nodes based on dual socket haswell (intel xeon e - v @ . ghz) processors. each haswell processor has cores and can run two threads per core with the intel hyper-threading technology. the haswell node has gb of memory.
this section describes the high performance computing workflow and the results for each step. a shell script was used to download the raw data archives for a given monday from the opensky network. data was organized by day and hour. both the opensky network and our architecture will create a dedicated directory for a given day, such as - - . after extracting the raw data archives, up to comma separated value (csv) files will populate the directory; each hour in utc time corresponds to a specific file. however, there are a few cases where not every hour of the day was available. the files contain all the abstracted observations of all aircraft for that given hour. for a specific aircraft, observations are updated at least every ten seconds. for this paper, we downloaded mondays spanning february to june , totaling hours. the size of each hourly file was dependent upon the number of active sensors that hour, the time of day, the quantity of aircraft operations, and the diversity of the operations. across a given day, the hourly files can range in size by hundreds of megabytes with the maximum file size between and megabytes. together all the hourly files for a given day currently require about - gigabytes of storage. we observed that on average the daily storage requirement for was greater than for . parsing, organizing, and aggregating the raw data for a specific aircraft required high performance computing resources, especially when organizing the data at scale. many aviation use cases require organizing data and building a track corpus for each specific aircraft. yet it was unknown how many unique aircraft were observed in a given hour and if a given hourly file has any observations for a specific aircraft. to efficiently organize the raw data, we need to address these unknowns. we identified unique aircraft by parsing and aggregating the national aircraft registries of the united states, canada, the netherlands, and ireland. registries were processed for each individual year for - . all registries specified the registered aircraft's type (e.g. rotorcraft, fixed wing singleengine, etc.), the registration expiration date, and a global unique hex identifier of the transponder equipped on the aircraft. this identifier is known as the icao -bit address [ ] , with ( - ) unique addresses available worldwide. some of the registries also specified the maximum number of seats for each aircraft. using the registries, we created a four tier directory structure to organize the data. the highest level directory corresponds to the year, such as . the next level was organized by twelve general aircraft type, such as fixed wing single-engine, glider, or rotorcraft. the third directory level was based on the number of seats, with each directory representing a range of seats. a dedicated directory was created for aircraft with an unknown number of seats. the lowest level directory was based on the sorted unique icao -bit addresses. for each seat-based directory, up to icao -bit address directories are created. additionally to address that the four aircraft registries do not contain all registered aircraft globally, a second level directory titled "unknown" was created and populated with directories corresponding to each hour of data. the top and bottom level directories remained the same as the known aircraft types. the bottom directories for unknown aircraft are generated at runtime. 
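A small sketch of how an organized output path could be generated for one aircraft under the four-tier hierarchy just described. The seat-range bins, directory names, ICAO address bin size, and registry record layout below are illustrative assumptions, not the exact conventions used on the LLSC.

```python
def seat_directory(seats):
    """Map a seat count (or None) to a seat-range directory name (illustrative bins)."""
    if seats is None:
        return "seats_unknown"
    for low, high in [(1, 6), (7, 20), (21, 100), (101, 1000)]:
        if low <= seats <= high:
            return f"seats_{low}_{high}"
    return "seats_unknown"

def icao_directory(icao24, bin_size=1000):
    """Group ICAO 24-bit addresses into fixed-size sorted hex ranges."""
    value = int(icao24, 16)
    low = (value // bin_size) * bin_size
    return f"{low:06x}_{low + bin_size:06x}"

def output_path(year, registry, icao24):
    """Build the four-tier directory for one aircraft; aircraft missing from the
    registries fall back to the 'unknown' branch organized per hour elsewhere."""
    record = registry.get(icao24.lower())
    if record is None:
        return f"{year}/unknown"
    return "/".join([str(year),
                     record["type"],                  # e.g. "rotorcraft"
                     seat_directory(record.get("seats")),
                     icao_directory(icao24)])

# hypothetical registry record keyed by ICAO 24-bit address
registry = {"a1b2c3": {"type": "rotorcraft", "seats": 4}}
print(output_path(2019, registry, "A1B2C3"))
```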
this hierarchy ensures that there are no more than directories per level, as recommended by the llsc, while organizing the data to easily enable comparative analysis between years or different types of aircraft. the hierarchy was also sufficiently deep and wide to support efficient parallel process i/o operations across the entire structure. for example, a full directory path for the first three tiers of the directory hierarchy could be: " /rotorcraft/seats_ _ /." the directory would contain all the known unique icao -bit addresses for rotorcraft with - seats in . within this directory would be up to directories, such as "a c _a d " or "a d _a ecf" this lowest level directory would be used to store all the organized raw data for aircraft with an icao -bit address. the first hex value was inclusive, but the second hex value was not inclusive. with a directory structure established, each hourly file was then loaded into memory, parsed, and lightly processed. observations with incomplete or missing position reports were removed, along with any observations outside a user-defined geographic polygon. the default polygon, illustrated by figure , was a convex hull with a buffer of nautical mile around approximately north america, central america, the caribbean, and hawaii. units were also converted to u.s. aviation units. the country polygons were sourced from natural earth, a public domain map dataset [ ] . specifically for the mondays across the three years, directories were generated across the first three tiers of the hierarchy and , directories were created in total across the entire hierarchy. of these, , directories were nonempty. the majority of the directories were created within the unknown aircraft type directories. as overviewed by tables and , about . billion raw observations were organized, with about . billion observations available after filtering. there was a % annual percent increase in observations per hour from to . however, a % percent decrease in the average number of observations per hour was observed when comparing to ; this could be attributed to the covid- pandemic. this worldwide incident sharply curtailed travel, especially travel between countries. this reduction in travel was reflected in the amount of data filtered using the geospatial polygon. in and , about - % of observations were filtered based on their location. however, only % of observations were filtered for observations from march to june . conversely, the amount of observations removed due to quality control did not significantly vary in , as %, %, and % were removed for , , and . these results were generated using cpus across tasks, where each task corresponded to a specific hourly file. tasks were uniformly distributed across cpus, a dynamic selfscheduling parallelization approach was not implemented. each task required on average seconds to execute, with a median time of seconds. the maximum and minimum times to complete a task were and seconds. across all tasks, about hours of total compute time was required to parse and filter the days of data. it is expected that if the geospatial filtering was relaxed and observations from europe were not removed, that the compute time would increase due to increase demands on creating and writing to hourly files for each aircraft. since files were created for every hour for each unique aircraft, tens of millions of small files less than megabyte in size were created. 
this was problematic as small files typically use a single object storage target, thus serializing access to the data. additionally, in a cluster environment, hundreds or thousands of concurrent, parallel processes accessing small files can lead to significantly large random i/o patterns for file access and generates massive amounts of networks traffic. this results in increased latency for file access, higher network traffic and significantly slows down i/o and consequently causes degradation in overall application performance. while this approach to data organization may provide acceptable performance on a laptop or desktop computer, it was unsuitable for use in a shared, distributed hpc system. in response, we created zip archives for each of the bottom directories. in a new parent directory, we replicated the first three tiers of the directory hierarchy from the previous step. then instead of creating directories based on the icao -bit addresses, we archiving each directory with the hourly csv files from the previous organization step. we then removed the hourly csv files from storage. this was achieved using llmapreduce [ ] , with a task created for each of the , non-empty bottom level directories. similar to the previous organization step, all tasks were completed in a few hours but with no optimization for load balancing. the performance of this step could be improved by distributing tasks based on the number of files in the directories or the estimated size the output archive. a key advantage to archiving the organized data, is that the archives can be updated with new data as it becomes available. if the geospatial filtering parameters and aircraft registry data doesn't change, only new open sky data needs to be organized. once organized into individual csv files, llmapreduce can be used again to update the existing archives. this substantially reduces the computational and storage requirements to process new data. the archived data can now be segmented, have outliers removed, and interpolated. additionally above ground level altitude was calculated, airspace class was identified, and dynamic rates (e.g. vertical rate) were calculated. we also split the raw data into track segments based on unique position updates and time between updates. this ensures that each segment does not include significantly interpolated or extrapolated observations. track segments without ten points are removed. figure illustrates the track segments for a faa registered fixed wing multi-engine aircraft from march to june . note that segment length can vary from tens to hundreds of nautical miles long. track segment length was dependent upon the aircraft type, availability of active opensky network sensors, and nearby terrain. however, the ability to generate track segments that span multiple states represents a substantial improvement over previous processing approaches for development of aircraft behavior models. then for each segment we detect altitude outliers using a . scaled median absolute deviations approach and smooth the track using a gaussian-weight average filter with a -second time window. dynamic rates, such as acceleration, are calculated using a numerical gradient. outliers are then detected and removed based on these rates. outlier thresholds were based on aircraft type. for example, the speeds greater than knots were considered outliers for rotorcraft, but fixed wing multiengine aircraft had a threshold of knots. the tracks were then interpolated to a regular one second interval. 
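A compact sketch of the per-segment cleaning steps just described: a scaled median-absolute-deviation altitude outlier test, a Gaussian-weighted moving average, numerical gradients for the dynamic rates, and re-interpolation to a regular one-second time base. The window length, MAD scale, and speed threshold are placeholders for the aircraft-type-specific values used in the actual workflow.

```python
import numpy as np

def mad_outliers(values, scale=3.0):
    """Boolean mask of points farther than `scale` scaled MADs from the median."""
    med = np.median(values)
    mad = 1.4826 * np.median(np.abs(values - med))
    if mad == 0:
        return np.zeros_like(values, dtype=bool)
    return np.abs(values - med) > scale * mad

def gaussian_smooth(t, values, sigma_s=10.0):
    """Gaussian-weighted moving average over possibly irregular time stamps."""
    out = np.empty_like(values, dtype=float)
    for i, ti in enumerate(t):
        w = np.exp(-0.5 * ((t - ti) / sigma_s) ** 2)
        out[i] = np.sum(w * values) / np.sum(w)
    return out

def clean_segment(t, alt, speed, max_speed_kt=250.0):
    """Clean one track segment (numpy arrays) and resample to one-second spacing."""
    keep = ~mad_outliers(alt)
    t, alt, speed = t[keep], alt[keep], speed[keep]

    alt = gaussian_smooth(t, alt)
    vrate = np.gradient(alt, t)               # vertical rate via numerical gradient
    keep = speed <= max_speed_kt              # type-specific speed threshold
    t, alt, vrate = t[keep], alt[keep], vrate[keep]

    t_reg = np.arange(t[0], t[-1] + 1.0, 1.0)  # regular one-second time base
    return t_reg, np.interp(t_reg, t, alt), np.interp(t_reg, t, vrate)
```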
lastly, we estimated the above ground level altitude using digital elevation models. this altitude estimation was the most computationally intensive component of the entire workflow. it consists of loading into memory and interpolating srtm or noaa globe [ ] digital elevation models (dems) to determine the elevation for each interpolated track segment position. to reduce the computational load prior to processing the terrain data, a c++ based polygon test was used to identify which track segment positions are over the ocean, as defined by natural earth data. points over the ocean are assumed to have an elevation of feet mean sea level and their elevations are not estimated using the dems. for the days of organized data, approximately , , interpolated track segments were generated. for each aircraft in a given year, a single csv was generated containing all the computed segments. in total across the three years, , files were generated. as these files contained significantly more rows and columns than when organizing the raw data, the majority of these final files were greater than mb in size. the output of this step did not face any significant storage block size challenges. similar to the previous step, tasks were created based on the bottom tier of the directory hierarchy. specifically for processing, parallel tasks were created for each archive. during processing, archives were extracted to a temporary directory while the final output was stored in standard memory. given the processed data, this section overviews two applications on how to exploit and disseminate the data to inform and support the aviation safety community. as the aircraft type was identified when organizing the raw data, it was a straightforward task to estimate the observed distribution of aircraft types per hour. these distributions are not reflective of all aircraft operations in the united states, as not all aircraft are observed by the opensky network. the distributions were also calculated independently for each aircraft type, so the yearly (row) percentages may not sum to %. furthermore the relatively low percentage of unknown aircraft was due to the geospatial filtering when organizing the raw data. if the same aircraft registries were used but the filtering was changed to only include tracks in europe, the percentage of unknown aircraft would likely significantly rise. this analysis can be extended by identifying specific aircraft manufacturers and models, such as boeing . however, the manufacturer and model information are not consistent within an aircraft registry nor across different registries. for example, entries of "cessna ," "textron cessna ," and "textron c " all refer to the same aircraft model. one possible explanation for the differences between entries is that cessna used to be an independent aircraft manufacturer and then eventually was acquired by textron. depending on the year of registration, the name of the aircraft may differ but the size and performance of the aircraft remain constant. since over , aircraft with unique icao -bit addresses were identified annually across the aircraft registries, parsing and organizing the aircraft models can be formulated as a traditional natural language processing problem. parsing the aircraft registries differs from the common problem of parsing aviation incident or safety reports [ , , ] due to the reduced word count of the registries and the structured format of the registries.
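One simple way such registry harmonization could be prototyped is with string normalization followed by fuzzy similarity scoring, using only the Python standard library. The normalization rules, token list, and threshold below are illustrative assumptions; abbreviated forms (such as a bare "c" model prefix) would need additional rules.

```python
import re
from difflib import SequenceMatcher

def normalize(entry):
    """Lowercase, strip punctuation, and drop parent-company tokens that do not
    change the airframe (illustrative rule list)."""
    entry = re.sub(r"[^a-z0-9 ]", " ", entry.lower())
    tokens = [t for t in entry.split() if t not in {"textron", "inc", "co"}]
    return " ".join(tokens)

def similarity(a, b):
    """Fuzzy similarity between two registry entries after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def group_models(entries, threshold=0.8):
    """Greedily group registry entries whose similarity exceeds the threshold."""
    groups = []
    for entry in entries:
        for group in groups:
            if similarity(entry, group[0]) >= threshold:
                group.append(entry)
                break
        else:
            groups.append([entry])
    return groups

print(group_models(["cessna 172", "textron cessna 172", "boeing 737"]))
```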
future work will focus on using fuzzy string matching to identify similar aircraft. for many aviation safety studies, manned aircraft behavior is represented using mit lincoln laboratory encounter models. each encounter model is a bayesian network, a generative statistical model that mathematically represents aircraft behavior during close or safety critical encounters, such as near midair collisions. the development of the modern models started in [ ] , with significant updates in [ ] and [ ] . all the models were trained using the llsc [ ] or its predecessors. the most widely used of these models were trained using observations collected by ground-based secondary surveillance radars from the th radar evaluation squadron (rades) network. aircraft observations by the rades network are based on mode a/c, an identification friend or foe technology that provides less metadata than ads-b. notably, aircraft type or model cannot be explicitly correlated or identified with specific aircraft tracks. instead, we filtered the rades observations based on the flying rules reported by the aircraft. however, this type of filtering is not unique to the rades data; it is also supported by the opensky network data. additionally, due to the performance of the rades sensors, we filtered out any observations below feet agl due to position uncertainties associated with radar time of arrival measurements. observations of ads-b equipped aircraft by the opensky network differ because ads-b enables aircraft to broadcast the aircraft's estimate of their own location, which is often based on precise gnss measurements. the improved position reporting of ads-b enabled the new opensky network-based models to be trained with an altitude floor of feet agl, instead of . specifically, three new statistical models of aircraft behavior were trained, one each for fixed wing multi-engine, fixed wing single-engine, and rotorcraft aircraft. a key advantage to these models is the data reduction and dimensionality reduction. a model was created for each of the three aircraft types and stored as a human readable text file. each file requires approximately just . megabytes. this is a significant reduction from the hundreds of gigabytes used to store the original days of data. table iv reports the quantity of data used to train each model. for example, the rotorcraft model was trained from about , flight hours over days. however, like the rades-based model, these models do not represent the geospatial or temporal distribution of the training data. for example, a limitation of these models is that they do not inform if more aircraft were observed in new york city than in los angeles [ ] . these figures illustrate how different aircraft behave, such as rotorcraft flying relatively lower and slower than fixed wing multi-engine aircraft. also note that the rades-based model has no altitude observations below feet agl, whereas % of the approximately , rotorcraft flight hours were observed at - feet agl. it has not been assessed if the opensky network-based models can be used as surrogates for other aircraft types or operations. additionally the new models do not fully supersede the existing rades-based models, as each model represents different varieties of aircraft behavior. on github.com, please refer to the mit lincoln laboratory (@mit-ll) and airspace encounter models (@airspace-encounter-models) organizations.

airspace encounter models for estimating collision risk
safety analysis of upgrading to tcas version . using the u.s. correlated encounter model
well-clear recommendation for small unmanned aircraft systems based on unmitigated collision risk
bringing up opensky: a large-scale ads-b sensor network for research
developing a low altitude manned encounter model using ads-b observations
vision on aviation surveillance systems
ads-mode s: initial system description
representative small uas trajectories for encounter modeling
interactive supercomputing on , cores for machine learning and data analysis
mode s: an introduction and overview (secondary surveillance radar)
introducing natural earth data - naturalearthdata.com
llmapreduce: multi-level map-reduce for high performance data analysis
the global land one-kilometer base elevation (globe) digital elevation model, version .
using structural topic modeling to identify latent topics and trends in aviation incident reports
temporal topic modeling applied to aviation safety reports: a subject matter expert review
ontologies for aviation data management. ieee/aiaa th digital avionics systems conference (dasc)
uncorrelated encounter model of the national airspace system, version .
correlated encounter model for cooperative aircraft in the national airspace system version .

we greatly appreciate the support and assistance provided by sabrina saunders-hodge, richard lin, and adam hendrickson from the federal aviation administration. we also would like to thank fellow colleagues dr. rodney cole, matt edwards, and wes olson.

key: cord- -oaqqh e title: a disconnected policy network: the uk's response to the sierra leone ebola epidemic date: - - journal: soc sci med doi: . /j.socscimed. . sha: doc_id: cord_uid: oaqqh e

this paper investigates whether the inclusion of social scientists in the uk policy network that responded to the ebola crisis in sierra leone ( – ) was a transformational moment in the use of interdisciplinary research. in contrast to the existing literature, that relies heavily on qualitative accounts of the epidemic and ethnography, this study tests the dynamics of the connections between critical actors with quantitative network analysis. this novel approach explores how individuals are embedded in social relationships and how this may affect the production and use of evidence. the meso-level analysis, conducted between march and june , is based on the traces of individuals' engagement found in secondary sources. source material includes policy and strategy documents, committee papers, meeting minutes and personal correspondence. social network analysis software, ucinet, was used to analyse the data and netdraw for the visualisation of the network. far from being one cohesive community of experts and government officials, the network of people was weakly held together by a handful of super-connectors. social scientists' poor connections to the government embedded biomedical community may explain why they were most successful when they framed their expertise in terms of widely accepted concepts. the whole network was geographically and racially almost entirely isolated from those affected by or directly responding to the crisis in west africa. nonetheless, the case was made for interdisciplinarity and the value of social science in emergency preparedness and response. the challenge now is moving from the rhetoric to action on complex infectious disease outbreaks in ways that value all perspectives equally.
global health governance is increasingly focused on epidemic and pandemic health emergencies that require an interdisciplinary approach to accessing scientific knowledge to guide preparedness and crisis response. of acute concern is zoonotic disease, which can spread from animals to humans and easily cross borders. the "grave situation" of the chinese coronavirus (covid- ) outbreak seems to have justified these fears and is currently the focus of an international mobilisation of scientific and state resources (wood, ) . covid- started in wuhan, the capital of china's hubei province and has been declared a public health emergency of international concern (pheic) by the world health organisation (who). the interactions currently taking place, nationally and internationally, between evidence, policy and politics are complex and relate to theories around the role of the researcher as broker or advocate and the form and function of research policy networks (pielke, ) and (ward et al., ) and (georgalakis and rose, ) . in this paper i seek to explore these areas further through the lens of the uk's response to ebola in west africa. this policy context has been selected in relation to the division of the affected countries between key donors. the british government assumed responsibility for sierra leone and sought guidance from health officials, academics, humanitarian agencies and clinicians. the ebola epidemic that struck west africa in has been described as a "transformative moment for global health" (kennedy and nisbett, , p. ) , particularly in relation to the creation of a transdisciplinary response that was meant to take into account cultural practices and the needs of communities. the mobilisation of anthropological perspectives towards enhancing the humanitarian intervention was celebrated as an example of research impact by the uk's economic and social research council (esrc) and department for international development (dfid) (esrc, ). an eminent group of social scientists called for future global emergency health interventions to learn from this critical moment of interdisciplinary cooperation and mutual understanding (s. a. abramowitz et al., ) . however, there has been much criticism of this narrative, ranging from the serious doubts of some anthropologists themselves about their impact (martineau et al., ) , to denouncements of largely european and north american anthropologists' legitimacy and the utility of their advice (benton, ) . there are two questions i hope to address through a critical commentary on the events that unfolded and with social network analysis of the uk-based research and policy network that emerged: i) how transformational was the uk policy response to ebola in relation to changes in evidence use patterns and behaviours? ii) how does the form and function of the uk policy network relate to epistemic community theory? the first question will explore the degree to which social scientists, and specifically anthropologists and medical anthropologists, were incorporated into the uk policy network.
the second question seeks to locate the dynamics of this network in the literature on network theory and the role of epistemic communities in influencing policy during emergencies. the paper does not attempt to evidence the impact of anthropology in the field or take sides in hotly debated issues such as support for home care. instead, it looks at how individuals are embedded in social relationships and how this may affect the production and use of evidence (victor et al., ) . the emerging field of network analysis around the generation and uptake of evidence in policy, recommends this critical realist constructivist methodology. it utilises interactive theories of evidence use, the study of whole networks and the analysis of the connections between individuals in policy and research communities (nightingale and cromby, ; oliver and faul, ) . although ebola related academic networks have been mapped, this methodological approach has never previously been applied to the policy networks that coalesced around the international response. hagel et al. show how research on the ebola virus rapidly increased during the crisis in west africa and identified a network of institutions affiliated through co-authorship. unfortunately, their data tell us very little about the type of research being published and how it was connected into policy processes (hagel et al., ) . in contrast, this paper seeks to inform the ongoing movements promoting interdisciplinarity as key to addressing global health challenges. zoonotic disease has been the subject of particular concerns around the, "connections and disconnections between social, political and ecological worlds" (bardosh, , p. ) . with the outbreak of covid- in china at the end of , its rapid spread overseas and predictions of more frequent and more deadly pandemics and epidemics in the future, the importance of breaking down barriers between policy actors, humanitarians, social scientists, doctors and medical scientists can only increase with time. before we look at detailed accounts of events relating to the uk policy network, first we must consider what the key policy issues were relating to an anthropological response versus a purely clinical one. anthropological literature exists, from previous outbreaks, documenting the cultural practices that affected the spread of ebola (hewlett and hewlett, ) . the main concerns relate to how local practices may accelerate the spread of the virus and the need to address these in order to lower infection rates. ebola is highly contagious, particularly from contamination by bodily fluids. in west africa, many local customs exist around burial practices that clinicians believe heighten the risk to communities. common characteristics of these are, the washing of bodies by family members, passing clothing belonging to the deceased to family and the touching of the body (richards, ) . another concern, as the crisis unfolded, was people attempting to provide home care to victims of the virus. the clinical response was to create isolation units or ebola treatment units (etus) in which to assess and treat suspected cases (west & von saint andré-von arnim, ) . community based care centres were championed by the uk government but their deployment came late and opinion was divided around their effectiveness. 
clinicians regarded etus as an essential part of the response and wanted to educate people to discourage them from engaging in what they regarded as deeply unsafe practices, including home care (walsh and johnson, ) and (msf, ) . anthropologists with expertise in the region focused instead on engaging communities more constructively, managing stigma and understanding local behaviours and customs (fairhead, ) , (richards, b) and (berghs, ) . anthropologist, paul richards, argues that agencies' and clinicians' lack of understanding of local customs worsened the crisis (richards, ) and that far from being ignorant and needing rescuing from themselves, communities had coping strategies of their own. his studies from sierra leone and liberia relate how some villages isolated themselves, created their own burial teams and successfully protected those who came in contact with suspected cases with makeshift protective garments (richards, a) . anthropologists working in west africa during the epidemic prioritised studies of social mobilisation and community engagement and worked with communities directly on ebola transmission. sharon abramowitz, in her review of the anthropological response across guinea, liberia and sierra leone, provides examples from the field work of chiekh niang (fleck, ) , sylvain faye, juliene anoko, almudena mari saez, fernanda falero, patricia omidian, several medicine sans frontiers (msf) anthropologists and others (s. abramowitz, ) . however, abramowitz argues that learning generated by these ethnographic studies was largely ignored by the mainstream response. however, not everyone has welcomed the intervention of the international anthropological community. some critics have argued that social scientists in mostly european and north american universities were poorly suited to providing sound advice given their lack of familiarity with field-based operations. adia benton suggests that predominantly white northern anthropologists have an "inflated sense of importance" that led them to exaggerate the relevance of their research. this in turn helped reinforce concepts of "superior northern knowledge" (benton, , p. ). this racial optic seems to contradict the portrayal of plucky anthropologists being the victims of knowledge hierarchies that favour other knowledges over their own. our focus here, on the mobilisation of knowledge from an international community of experts, recommends that we consider how this can be understood in relation to group dynamics as well as individual relationships. particularly relevant is peter haas' theory of epistemic communities. haas helped define epistemic communities and how they differ from other policy communities, such as interest groups and advocacy coalitions (haas, ) . they share common principles and analytical and normative beliefs (causal beliefs). they have an authoritative claim to policy relevant expertise in a particular domain and haas claims that policy actors will routinely place them above other interest groups in their level of expertise. he believes that epistemic communities and smaller more temporary collaborations within them, can influence policy. he observes that in times of crisis and acute uncertainty, policy actors often turn to them for advice. the emergence of an epistemic community focused on the uk policy response was framed by the division of the affected countries between key donors along historic colonial lines. 
namely, the uk was to lead in sierra leone, the united states in liberia and the french in guinea. this seems to have focused social scientists in the uk on engaging effectively with a government and wider scientific community who seemed to want to draw on their expertise. this was a relatively close-knit community of scholars who already worked together, co-published and cited each other's work and in many cases worked in the same academic institutions. crucially, their ranks were swelled by a small number of epidemiologists and medical anthropologists who shared their concerns. from the time msf first warned the international community of an unprecedented outbreak of ebola in guinea at the end of march , it was six months before an identifiable and organised movement of social scientists emerged (msf, ) . things began to happen quickly when the who announced in early september of that year that conventional biomedical responses to the outbreak were failing (who, a) . this acted like a siren call to social scientists incensed by the reported treatment of local communities and the way in which a narrative had emerged blaming local customs and ignorance for the rapid spread of the virus. british anthropologist, james fairhead, hastily organised a special panel on ebola at the african studies association (asa) annual conference, that was taking place at the university of sussex (uos) on the september , . amongst the panellists were: anthropologist melissa leach, director of the institute of development studies (ids); audrey gazepo, university of ghana, medical anthropologist melissa parker from the london school of hygiene and tropical medicine (lshtm); anthropologist and public health specialist, anne kelly from kings college london and stefan elbe, uos. informally, after the conference, this group discussed the idea of an online repository or platform for the supply of regionally relevant social science (f. martineau et al., ) . this would later become the ebola response anthropology platform (erap). in the days and weeks that followed it was the personal and professional connections of these individuals that shaped the network engaging with the uk's intervention. just two days after the emergency panel at the asa, jeremy farrar, director of the wellcome trust, convened a meeting of around public health specialists and researchers, including leach, on the uk's response to the epidemic. discussions took place on the funding and organisation of the anthropological response. the government was already drawing on the expertise and capacity of public health england (phe), the ministry of defence (mod) and the department of health (doh), to drive its response but social scientists had no seat at the table. the government's chief medical officer (cmo) sally davies called a meeting of the ebola scientific assessment and response group (esarg), on the th september, focused on issues which included community transmission of ebola. leach's inclusion as the sole anthropologist was largely thanks to farrar and chris whitty, dfid's chief scientific advisor (m leach, ). there was already broad acceptance of the need for the response to focus on community engagement and the who had been issuing guidance on how to engage and what kind of messaging to use for those living in the worst affected areas (who, c) . 
in their account of these events three of the central actors from the uk's anthropological community describe how momentum gathered quickly and that: "it felt as if we were pushing at an open door" (f. martineau et al., , ) . by the following month, the uk's coalition government was embracing its role as the leading bilateral donor in sierra leone and wanted to raise awareness and funds from other governments and foundations. a high level conference: defeating ebola in sierra leone, had been quickly organised, in partnership with the sierra leone government, at which an international call for assistance was issued (dfid, ) . it was shortly after this that the cmo, at the behest of the government's cabinet office briefing room (cobra), formed the scientific advisory group for emergencies on ebola (sage). by its first meeting on the october , , british troops were on the ground along with volunteers from the uk national health service (nhs) (stc, ). leach was pulled into this group along with most of the members of esarg that had met the previous month. it was decided in this initial meeting to set up a social science sub-group including whitty, leach and the entire steering group of the newly established erap (sage, a). this included not just british-based anthropologists but also paul richards and esther mokuwa from njala university, sierra leone. from this point anthropologists appeared plugged into the government's architecture for guiding their response. there were several modes for the interaction between social scientists and policy actors that focused on the uk led response. firstly, there were the formal meetings of committees or other bodies that were set up to directly advise the uk government in london. secondly, there were the multitude of ad-hoc interactions, conversations, meetings and briefings, some of which were supported with written reports. then, there was the distribution of briefings, reports and previously published works by erap which included use of the pre-existing health, education advice and resource team (heart) platform, which already provided bespoke services to dfid in the form of a helpdesk (heart, ). erap was up and running by the th october and during the crisis the platform published around open access reports which were accessed by over , users (erap, ). there were also a series of webinars and workshops and an online course (lshtm, ) . according to ids and lshtm's application to the esrc's celebrating impact awards (m. leach et al., ) , the policy actors that participated in these interactions included: uk government officials in dfid's london head quarters and its sierra leone country office, in the mod and the government's office for science (go-science). closest of all to the prime minister and the cabinet office was sage. they also communicated with international non-governmental organisations (ingos) like help aged international and christian aid who requested briefings or meetings. erap members advised the who via three core committees, as well as the united nations mission for ebola emergency response (unmeer) and the united nations food and agricultural organisation (unfao). by the end of the crisis members of erap had given written and oral evidence to three separate uk parliamentary inquiries. these interactions were not entirely limited to policy audiences. 
erap members also contributed to the design of training sessions and a handbook on psychosocial impact of ebola delivered to all the clinical volunteers from the nhs prior to their deployment from december onwards (redruk, ). the way in which anthropologists engaged in policy and practice seemed to reflect an underlying assumption that they would work remotely to the response and engage primarily with the uk government, multilaterals and ingos. a strength of this approach, apart from the obvious personal safety and logistical implications, was that anthropologists enjoyed a proximity to key actors in london. face to face meetings could be held and committees joined in person (f. martineau et al., ) . a good example of a close working relationship that required a personal interaction were the links built with two policy analysts working in the mod. not even dfid staff had made this connection and it was thanks to a member of the erap steering committee that one of these officials was able to join the sage social science subcommittee and provide a valuable connection back into the ministry (martineau et al., ) . with proximity to the uk government in london came distance from the policy professionals and humanitarians in sierra leone. just % of erap's initial funding was focused on field work. although, this later went up and comparative analysis on resistance in guinea and sierra leone and between ebola and lassa fever was undertaken (wilkinson and fairhead, ) , as well as a review of the disaster emergency committee (dec) crisis appeal response (oosterhoff, ) . there was also an evaluation of the community care centres and additional funding from dfid supported village-level fieldwork by erap researchers from njala university, leading to advice to social mobilisation teams. nonetheless, the network's priority was on giving advice to donors and multilaterals, albeit at a great distance from the action. this type of intervention has not escaped accusations of "armchair anthropology" (delpla and fassin, ) in (s. abramowitz, , p. ). rather than relying solely on this qualitative account, drawn largely from those directly involved in these events, social network analysis (sna) produces empirical data for exploring the connections between individuals and within groups (crossley and edwards, ) . it is a quantitative approach rooted in graph theory and covers a range of methods which are frequently combined with qualitative methods (s. p. borgatti et al., ) . in this case, the network comprises of nodes who are the individuals identified as being directly involved in some of the key events just described. a second set of nodes are the events or interactions themselves. content analysis of secondary sources linked to these events provides an unobtrusive method for identifying in some detail the actors who will have left traces of their involvement. sna allows us to establish these actors' ties to common nodes (they were part of the same committee or event or contributed to the same reports.) furthermore, we can assign non-network related attributes to each of our nodes such as gender, location, role and organisation affiliation type. not only does this approach provide a quantitative assessment of who was involved and through which channels but the mathematical foundations of sna allow for whole network analysis of cohesion across certain groups. 
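as an illustration of the two-mode (person-by-event) data structure described above, the short python sketch below builds a small bipartite network with node attributes using networkx. it is only a schematic aid for the reader: the individuals, events and attribute values are invented placeholders, not data from this study, and networkx is used here simply as a stand-in for the ucinet workflow.

```python
import networkx as nx

# two-mode (affiliation) network: people are tied to the events/committees
# they took part in, not directly to each other
G = nx.Graph()
people = {
    "researcher_a": {"gender": "f", "location": "north", "role": "social_scientist"},
    "official_b": {"gender": "m", "location": "north", "role": "policy_actor"},
    "clinician_c": {"gender": "f", "location": "south", "role": "scientist_other"},
}
events = ["sage_subgroup", "asa_panel", "erap_report"]

G.add_nodes_from(people.keys(), bipartite="person")
for name, attrs in people.items():
    G.nodes[name].update(attrs)          # attach gender, location, role attributes
G.add_nodes_from(events, bipartite="event")

# participation ties recovered from committee papers, minutes, reports, etc.
G.add_edges_from([
    ("researcher_a", "sage_subgroup"),
    ("researcher_a", "asa_panel"),
    ("official_b", "sage_subgroup"),
    ("clinician_c", "erap_report"),
])

assert nx.is_bipartite(G)
print(dict(G.degree(people)))  # how many focal events each person appears in
```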
you may calculate levels of homophily (the tendency of individuals to associate with similar others) between genders, disciplines and organisational type and identify sub-networks and the super-connectors that bridge them (s. p. borgatti et al., ) . the descriptive and statistical examination of graphs provides a means with which to test a hypothesis and associated network theory that is concerned with the entirety of the social relations in a network and how these affect the behaviour of individuals within them (stovel and shaw, ) and (ward et al., ) . the quantitative analysis of secondary sources was conducted between march and june , utilising content analysis of artefacts which included reports, committee papers, public statements, policy documentation and correspondence. sna software, ucinet, was used to analyse nodes and ties and netdraw for the visualisation of the network (s. and (s. . the source material is limited to artefacts relating to the uk government's response to the ebola outbreak in ii) the apparent prominence or influence of these groups on the uk's response to the crisis, iii) the remit of these groups to focus on the social response, as opposed to the purely clinical one. tracing the events and policy moments which reveal how individual social scientists engaged with the ebola crisis from mid- requires one to look well beyond academic literature. whilst some of this material is openly available, a degree of insider knowledge is required to identify who the key actors were and the modes of their engagement. this is partly a reflection of a sociological approach to policy research that treats networks, only partially visible in the public domain, as a social phenomenon (gaventa, ) . the calculation of network homogeneity (how interconnected the network is), the identification of cliques or sub-networks and the centrality of particular nodes, can be mathematically stable measures of the function of the network. however, the reliability of this study mainly resides on its validity. the assignment of attributes is in some cases fairly subjective. whereas gender and location are verifiable, the choice of whether an individual is an international policy actor or a national policy actor must be inferred from their official role during the crisis period. sometimes this can be based on the identity of their home institution. given dfid's central focus on overseas development assistance, its officials have been classified as internationals, rather than nationals. in some cases, individuals may be qualified clinicians or epidemiologists but their role in the crisis may have been primarily policy related and not medical or scientific. therefore, they are classified as policy actors not scientists. other demographic attributes could have been identified such as race and age which would have enabled more options for data analysis. a key factor here is the use of a two mode matrix that identifies connections via people participating in the same events or forums, rather than direct social relationships such as friendship. therefore, measurement validity is largely determined by whether connections of this type can be used to determine how knowledge and expertise flow between individuals. to mitigate the risk that this measurement fails to capture knowledge exchange toward policy processes, particular care was taken with the sampling of focal events used to generate the network. the majority of errors in sna relate to the omission of nodes or ties. fig. 
sets out the advantages and disadvantages of each of the selected events and the data artefacts used to identify associated individuals. i am aware that some critics might take exception to my choice of network. it is sometimes suggested that by focusing on northern dominated networks or the actions of bilaterals and multilaterals, you simply reinforce coloniality and a racist framing of development and aid (richardson et al., ) and (richardson, ) . however, there is a valid, even essential, purpose here. only by seeking to understand the politics of knowledge and the social and political dynamics of global health and humanitarian networks can we challenge injustice and historically reinforced narratives that favour some perspectives over others. the secondary sources identify unique individuals, all but five of whom can be identified by name. four types of attribute are assigned to these nodes: gender, location (global north or south), organisation type and organisational role. attributes have been identified through an internet search of institutional websites, linkedin and related online resources. role and organisation type are recorded for the period of the crisis. the total number of nodes given at the bottom of fig. is slightly lower due to the anonymity of five individuals whose gender and role could not be established. looking at this distribution of attributes across the whole network one can make the following observations in relation to how prominently different characteristics are represented: i. females slightly outnumber males in the social science category but there are twice as many male 'scientists other' than female. they are a combination of clinicians, virologists, epidemiologists and other biomedical expertise. ii. there are just nine southern based nodes out of a total of and none of these are policy makers or practitioners. this is racially and geographically a northern network with just a sliver of west african perspectives. these included, yvonne aki-sawyerr, veteran ebola campaigner and current mayor of freetown, four academics from njala university and development professionals working in the sierra leone offices of agencies such as the unfao. iii. although 'scientists other' only just outnumber social scientists this is heavily skewed by one of the eight interaction nodesthe lessons for development conferencewhich was primarily a learning event and not part of the advisory processes around the response. many individuals who participated in this event are not active in any of the other seven interactions. if we remove these non-active nodes from the network, you get just social scientists compared to 'scientist other'. the remaining core policy network of individuals appears to be weighted towards the biomedical sciences. netdraw's standardised graph layout algorithm has been used in fig. to optimise distances between nodes which helps to visualise cohesive sub-groups or sub-networks and produces a less cluttered picture (s. . however, it should be noted that graph layout algorithms provide aesthetic benefits at the expense of attribute based or values based accuracy. the exact length of ties between nodes and their position do not correspond exactly to the quantitative data. we can drag one of these nodes into another position to make it stand out more clearly without changing its mathematical or sociological properties (s. p. borgatti et al., ) . 
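one common way to quantify the attribute-based clustering that the graph layout can only suggest is krackhardt and stern's e-i index (external ties minus internal ties, divided by all ties), computed separately for each attribute such as gender, discipline or organisation type. the sketch below is a generic illustration with hypothetical ties and labels, not a reproduction of the study's ucinet output.

```python
def ei_index(edges, attribute):
    """krackhardt-stern e-i index: (external - internal) / total ties.

    edges: iterable of (node_u, node_v) pairs from the one-mode network.
    attribute: dict mapping node -> group label (e.g. discipline).
    returns a value in [-1, 1]; -1 means perfect homophily (all ties internal).
    """
    internal = external = 0
    for u, v in edges:
        if attribute[u] == attribute[v]:
            internal += 1
        else:
            external += 1
    total = internal + external
    return (external - internal) / total if total else 0.0

# hypothetical example: ties among four actors and their disciplines
ties = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
discipline = {"a": "social_science", "b": "social_science",
              "c": "biomedical", "d": "biomedical"}
print(ei_index(ties, discipline))  # 0.0: equal numbers of internal and external ties here
```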
we can see in this graph layout the clustering of the eight interactive nodes or focal events and observe some patterns in the attributes of the nodes closest to them. the right-hand side is heavily populated with social scientists. as mentioned above, this is influenced by the lessons for development event. as you move to the left side fewer social scientists are represented and they are outnumbered by other disciplines. the state owned or driven interactions such as sage and parliamentary committees appear on this left side and the anthropological epistemic community driven or owned interactions, such as erap reports and lessons for development, appear on the right side. the apparent connectors or bridges are in the centre. these bridges can be conceptualised as both focal events, including the erap steering committee, the sage social science sub-committee and the asa ebola panel, or as the key individual nodes connected to these. we know that many informal interactions between researchers, officials and humanitarians are not captured here. we are only seeing a partial picture of the network, traces of which remain preserved in documents pertaining to the eight nodal events sampled. nonetheless, so far the quantitative data seem to correspond closely with the qualitative accounts of the crisis. also, of interest is the visual representation of organisation affiliation. all bar one of the social scientists (in the whole network - fig. ) are affiliated to a research organisation, whereas one third of the members of other scientific disciplines are attached to government institutions, donors or multilaterals. these are the public health officials and virologists working in the doh, phe and elsewhere. they appear predominantly on the left side with much stronger proximity to government led initiatives. however, it is also clear that whilst social scientists are a small minority in the government led events, the right side of the graph includes a significant number of practitioners, policy actors and clinicians. it is this part of the network that most closely resembles an inter-epistemic community. for the centrally located bridging nodes we can see a small number of social scientists and policy actors embedded in government. as accounts of the crisis have suggested these individuals appear to have been the super-connectors. a final point of clarification is that this is not a map showing the actual knowledge flow between actors during the crisis. each of the spider shaped sub-networks represent co-occurrence of individuals on committees, panels and other groups. we can infer from this some likelihood of knowledge exchange but we cannot measure this. one exception to these co-occurrence types of tie between nodes are the erap reports (bottom right), which reveals a cluster of nodes who contributed to reports along with those who requested them. even though this represents a knowledge flow of sorts we can still only record the interaction and make assumptions about the actual flow of knowledge. a variation of degree centrality, eigenvector centrality, counts the number of nodes adjacent to a given node and weighs each adjacent node by its centrality. the eigenvector equation, used by netdraw, calculates each node's centrality proportionally to the sum of centralities of the nodes it is adjacent to (s. . netdraw increases the size of nodes in relation to their popularity or eigenvector value. the better connected nodes are to others who are also well connected the larger the nodes appear (s. 
p. borgatti et al., ) . in order to focus on the key influencers or knowledge brokers in the network, we entirely remove nodes solely connected to the lessons for development conference. as mentioned earlier, this event is a poor proxy for research-policy interactions and unduly over-represents social scientists who were otherwise unconnected to advisory or knowledge exchange activities. this reduces the number of individuals in the network from to . we also utilise ucinet's transform function to convert the two-mode incidence matrix into a one mode adjacency matrix (s. . ties between nodes are now determined by connections through co-occurrence. we no longer need to see the events and committees themselves but can visualise the whole network as a social network of connected individuals. we can now observe and mathematically calculate, how inter-connected or homogeneous this research-policy network really is. we see in fig. a more exaggerated separation of social science and other sciences on the right and left of the graph than in fig. . we can also see three distinct sub-networks emerging, bridged by six key nodes with high centrality values. the highly interconnected sub-network on the right is shaped in part by erap and the production of briefings and their supply to a small number of policy actors. we can see here the visualisation of slightly higher centrality scores than for the government scientific advisors on the left. by treating this as a relational network we observe that interactions like the establishment of a sage sub-group for social scientists increased the homophily of the right side of the network and reduced its interconnectivity with the whole network. although, one must be cautious about assigning too much significance to the position of individual nodes in a whole network analysis, the central location of the two social scientists and a dfid official closely correspond to the accounts of the crisis. this heterogeneous brokerage demonstrates the tendency of certain types of actors to be the sole link between dissimilar nodes (hamilton et al., ) . likewise, some boundary nodes or outliers, such as one of the mod's advisors at the bottom of the network, are directly mentioned in the qualitative accounts. just four individuals in this whole network are based in africa, suggesting almost complete isolation from humanitarians operating on the ground and from african scholarship. both the qualitative accounts of the role of anthropologists in the crisis and the whole network analysis presented here largely, correspond with haas' definition of epistemic communities. the international community of anthropologists and medical anthropologists that mobilised in autumn do indeed share common principles and analytical and normative beliefs. debates around issues, such as the level to which communities could reduce transmission rates themselves, did not prevent this group from providing a coherent response to the key policy dilemmas. this community did indeed emerge or coalesce around the demand placed on their expertise by policy makers concerned with the community engagement dimensions of the response. in the area of burial practices, there does appear to be some indication of the knowledge of social scientists being incorporated into the response. various interactions between anthropologists, dfid and the who did provide the opportunity to raise the socio-political-economic significance of funerals. 
for example, it was explained that the funerals of high status individuals would be much more problematic in terms of the numbers of people exposed (f. martineau et al., ) . anthropologists contributed to the writing of the who's guidelines for safe and dignified burials (who, b). however, their advice was only partially incorporated into these guidelines and the wider policies of the who at the time. the suggestion for a radical decentralised approach to formal burial response that would require the creation of community-based burial teams was ignored until much later in the crisis and never fully implemented. as loblova and dunlop suggest in their critique of epistemic community theory, the extent to which anthropology could influence policy was bounded by the beliefs and understanding of policy communities themselves (löblová, ) and (dunlop, ) . olga loblova argues that there is a selection bias in the tendency to look at case studies where there has been a shift in policy along the lines of the experts' knowledge. likewise, claire dunlop suggests that haas' framework may exaggerate their influence on policy. she separates the power of experts to control the production of knowledge and engage with key policy actors from policy objectives themselves. she refers to adult education literature and its implications for what decision makers learn from epistemic communities, or to put it another way, the cognitive influence of research evidence (dunlop, ) . she argues that the more control that knowledge exchange processes place with the "learners" in terms of framing, content and the intended policy outcomes, the less influential epistemic communities will be (dunlop, ) . hence, in contested areas such as home care, it was the more embedded and credible clinical epistemic community that prevailed. from october , anthropologists were arguing that given limited access to etus, which were struggling at that time, home care was an inevitability and so should be supported. where they saw the provision of home care kits as an ethical necessity, many clinicians, humanitarians and global health professionals regarded home care as deeply unethical with the potential to lead to a two tier system of support (f. martineau et al., ) and (whitty et al., ) . in sierra leone, irish diplomat sinead walsh was baffled by what she saw as the blocking of the distribution of home care kits. an official from the us centres for disease control and protection (cdc) was quoted in an article in the new york times as saying that home care was: "admitting defeat" (nossiter, ) in (walsh and johnson, ) . home care was never prioritised in sierra leone whereas in liberia hundreds of thousands of kits were distributed (walsh and johnson, ) . in this area, clinicians, humanitarians and policy actors seemed to maintain a policy position directly opposed to anthropological based advice. network theory provides further evidence around why this may have been the case. in his study of uk think tanks, jordan tchilingirian suggests that policy think tanks operate on the periphery of more established networks and enjoy fluctuating levels of support and interest in their ideas. ideas and knowledge do not simply flow within the network, given that dominant paradigms and political, social and cultural norms privilege better established knowledge communities (tchilingirian, ) . this is reminiscent of meyer's work on the boundaries that exist between "amateurs" and "policy professionals" (meyer, ) . 
moira faul's research on global education policy networks proposes that far from being "flat," networks can augment existing power relations and knowledge hierarchies (faul, ) . this is worth considering when one observes how erap's supply of research knowledge and the sage sub-committee for anthropologists only increased the homophily of the social science sub-community, leaving it weakly connected to the core policy network (fig. .) . the positive influence of anthropological advice on the uk's response was cited by witnesses to the subsequent parliamentary committee inquiries in . however, there is some indication of different groups or networks favouring different narratives. the international development select committee (idc) was very clear in its final report that social science had been a force for good in the response and recommended that dfid grow its internal anthropological capacity (idc, a, b) . this contrasts with the report of the science and technology committee (stc), which despite including evidence from at least one anthropologist, does not make a direct reference to anthropology in its report (stc, ) . this is perhaps the public health officials, in their core domain of infectious disease outbreaks, reasserting their established authority. this sector has been described as the uk's "biomedical bubble", which benefits from much higher public support and funding than the social sciences (jones and wilsdon, ) . just the presence of anthropologists in an evidence session of the stc is a very rare event, in contrast to the idc, which regularly reaches out to social scientists. not everyone agrees that the threat of under-investing in social science was the primary issue. the stc's report highlights the view that there was a lack of front-line clinicians represented on committees advising the uk government, particularly from aid organisations (stc, ) . regardless of assessments of how successfully anthropological knowledge influenced policy and practice during the epidemic, there has been a subsequent elevation of social science in global health preparedness and humanitarian response programmes. writing on behalf of the wellcome trust in , joão rangel de almeida says: "epidemics are a social phenomenon as much as a biological one, so understanding people's behaviours and fears, their cultural norms and values, and their political and economic realities is essential too." (rangel de almeida, ). the social science in humanitarian action platform, which involves many of the same researchers who were part of the sierra leone response, has subsequently been supported by unicef, usaid and the joint initiative on epidemic preparedness (jiep) with funding from dfid and wellcome. its network of social science advisers has been producing briefings to assist with the ebola response in the democratic republic of congo (drc) (farrar, ) and has mobilised in response to the covid- respiratory illness epidemic. network theory provides a useful framework with which to explore the politics of knowledge in global health with its emphasis on individuals' social context. by analysing data pertaining to researchers' and policy professionals' participation in policy networks one can test assumptions around interdisciplinarity and identify powerful knowledge gatekeepers. detailed qualitative accounts of policy processes needn't be available, as they were in this case, to employ this methodology.
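as a concrete illustration of the methodology referred to here, the sketch below reproduces, outside ucinet, the two steps described earlier for the ebola network: projecting a two-mode incidence matrix (people by focal events) onto a one-mode co-occurrence network, and ranking actors by eigenvector centrality. the incidence matrix is a made-up toy used only to show the transformation; it is not the study's data.

```python
import numpy as np
import networkx as nx

# rows = individuals, columns = focal events (committees, panels, reports)
incidence = np.array([
    [1, 1, 0],   # person 0 sat on events 0 and 1
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
])

# one-mode projection: A[i, j] = number of events shared by persons i and j
A = incidence @ incidence.T
np.fill_diagonal(A, 0)   # ignore self-ties

G = nx.from_numpy_array(A)                       # weighted co-occurrence network
centrality = nx.eigenvector_centrality_numpy(G, weight="weight")
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(person, round(score, 3))               # most central (best connected) first
```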
assuming the researcher has access to meeting minutes and other records of who attended which events or who was a member of which committees and groups, similar analysis of network homophily and centrality will be possible. the greatest potential for learning, with significant policy and research implications, comes from mixed methods approaches. by combining qualitative research to populate your network with a further round of data gathering to understand it better, you can reveal the social and political dynamics truly driving evidence use and decision making (oliver and faul, ) . although this study lacked this scope, it has still successfully identified the shape of the research-policy network that emerged around the uk led response to ebola and the clustering of actors within it. the network was a diverse group of scientists, practitioners and policy professionals. however, it favoured the views of government scientists with their emphasis on epidemiology and the medical response. it was also almost entirely lacking in west african members. nonetheless, it was largely thanks to a strong political demand for anthropological knowledge, in response to perceived community violence and distrust, that social scientists got a seat at the table. this was brokered by a small group of individuals from both government and research organisations, who had prior relationships to build on. the emergent inter-epistemic community was only partially connected into the policy network and we should reject the description of the whole network as trans-disciplinary. social scientists were most successful in engaging when they framed their expertise in terms of already widely accepted concepts, such as the need for better communications with communities. they were least successful when their evidence countered strongly held beliefs in areas such as home care. their high level of homophily as a group, or sub network, only deepened the ability of decision makers to ignore them when it suited them to do so. the epistemic community's interactivity with uk policy did not significantly alter policy design or implementation and it did not challenge fundamentally eurocentric development knowledge hierarchies. it was transformative only in as much as it helped the epistemic community itself learn how to operate in this environment. the real achievement has been on influencing longer term evidence use behaviours. they made the case for interdisciplinarity and the value of social science in emergency preparedness and response. the challenge now is moving from the rhetoric to action on complex infectious disease outbreaks. as demonstrated by ebola in drc and covid- , every global health emergency we face will have its own unique social and political dimensions. we must remain cognisant of the learning arising from the international response to sierra leone's tragic ebola epidemic. it suggests that despite the increasing demand for interdisciplinarity, social science evidence is frequently contested and policy networks have a strong tendency to leave control over its production and use in the hands of others. credit authorship contribution statement james georgalakis: conceptualization, methodology, software, formal analysis, investigation, data curation, writing -original draft, visualization. 
epidemics (especially ebola) social science intelligence in the global ebola response one health : science, politics and zoonotic disease in africa ebola at a distance: a pathographic account of anthropology's relevance stigma and ebola: an anthropological approach to understanding and addressing stigma operationally in the ebola response ucinet for windows. analytic technologies cases, mechanisms and the real: the theory and methodology of mixed-method social network analysis une histoire morale du temps present policy transfer as learning: capturing variation in what decisionmakers learn from epistemic communities the irony of epistemic learning: epistemic communities, policy learning and the case of europe's hormones saga ebola response anthropology platform erap milestone achievements up until the global community must unite to intensify ebola response in the drc networks and power: why networks are hierarchical not flat and what can be done about it the human factor. world health organization finding the spaces for change: a power analysis introduction: identifying the qualities of research-policy partnerships in international development-a new analytical framework introduction: epistemic communities and international policy coordination analysing published global ebola virus disease research using social network analysis evaluating heterogeneous brokerage: new conceptual and methodological approaches and their application to multi-level environmental governance networks health. education advice and resource team ebola, culture and politics: the anthropology of an emerging disease responses to the ebola crisis ebola: responses to a public health emergency. house of commons the biomedical bubble: why uk research and innovation needs a greater diversity of priorities the ebola epidemic: a transformative moment for global health ebola: engaging long-term social science research to transform epidemic response when epistemic communities fail: exploring the mechanism of policy influence online course: ebola in context: understanding transmission, response and control epistemologies of ebola: reflections on the experience of the ebola response anthropology platform on the boundaries and partial connections between amateurs and professionals pushed to the limit and beyond social constructionism as ontology: exposition and example a hospital from hell networks and network analysis in evidence, policy and practice ebola crisis appeal response review social science research: a much-needed tool for epidemic control. wellcome. redruk, . pre-departure ebola response training burial/other cultural practices and risk of evd transmission in the mano river region burial/other cultural practices and risk of evd transmission in the mano river region ebola: how a people's science helped end an epidemic on the coloniality of global public health biosocial approaches to the - ebola pandemic. health hum. rights , . sage scientific advisory group for emergencies -ebola summary minute of nd meeting scientific advisory group for emergencies -ebola summary minute of rd meeting science in emergencies: uk lessons from ebola. house of commons. 
stovel producing knowledge, producing credibility: british think-tank researchers and the construction of policy reports the oxford handbook of political networks getting to zero: a doctor and a diplomat on the ebola frontline network analysis and political science clinical presentation and management of severe ebola virus disease infectious disease: tough choices to reduce ebola transmission key messages for social mobilization and community engagement in intense transmission areas comparison of social resistance to ebola response in sierra leone and guinea suggests explanations lie in political configurations not culture coronavirus: china president warns spread of disease 'accelerating', as canada confirms first case. the independent acknowledgements i thank dr jordan tchilingirian (university of western australia) for discussions and support on ucinet. i thank professor melissa leach and dr annie wilkinson (institute of development studies) for access to archival data. key: cord- -k wcibdk authors: pacheco, jorge m.; van segbroeck, sven; santos, francisco c. title: disease spreading in time-evolving networked communities date: - - journal: temporal network epidemiology doi: . / - - - - _ sha: doc_id: cord_uid: k wcibdk human communities are organized in complex webs of contacts that may be represented by a graph or network. in this graph, vertices identify individuals and edges establish the existence of some type of relations between them. in real communities, the possible edges may be active or not for variable periods of time. these so-called temporal networks typically result from an endogenous social dynamics, usually coupled to the process under study taking place in the community. for instance, disease spreading may be affected by local information that makes individuals aware of the health status of their social contacts, allowing them to reconsider maintaining or not their social contacts. here we investigate the impact of such a dynamical network structure on disease dynamics, where infection occurs along the edges of the network. to this end, we define an endogenous network dynamics coupled with disease spreading. we show that the effective infectiousness of a disease taking place along the edges of this temporal network depends on the population size, the number of infected individuals in the population and the capacity of healthy individuals to sever contacts with the infected, ultimately dictated by availability of information regarding each individual’s health status. importantly, we also show how dynamical networks strongly decrease the average time required to eradicate a disease. understanding disease spreading and evolution involves overcoming a multitude of complex, multi-scale challenges of mathematical and biological nature [ , ] . traditionally, the contact process between an infected individual and the susceptible ones was assumed to affect equally any susceptible in a population (mean-field approximation, well-mixed population approximation) or, alternatively, all those susceptible living in the physical neighborhood of the infected individual (spatial transmission). 
during recent years, however, it has become clear that disease spreading [ ] [ ] [ ] [ ] transcends geography: the contact process is no longer restricted to the immediate geographical neighbors, but exhibits the stereotypical small-world phenomenon [ ] [ ] [ ] [ ] , as testified by recent global pandemics (together with the impressive amount of research that has been carried out to investigate them) or, equally revealing, the dynamics associated with the spreading of computer viruses [ , [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . recent advances in the science of networks [ , , , , ] also provided compelling evidence of the role that the networks of contacts between individuals or computers play in the dynamics of infectious diseases [ , ] . in the majority of cases in which complex networks of disease spreading have been considered [ ] , they were taken to be a single, static entity. however, contact networks are intrinsically temporal entities and, in general, one expects the contact process to proceed along the lines of several networks simultaneously [ , - , , , , - ] . in fact, modern societies have developed rapid means of information dissemination, both at local and at centralized levels, which one naturally expects to alter individuals' response to vaccination policies, their behavior with respect to other individuals and their perception of likelihood and risk of infection [ ] . in some cases one may even witness the adoption of centralized measures, such as travel restrictions [ , ] or the imposition of quarantine spanning parts of the population [ ] , which may induce abrupt dynamical features onto the structure of the contact networks. in other cases, social media can play a determinant role in defining the contact network, providing crucial information on the dynamical patterns of disease spreading [ ] . furthermore, the knowledge an individual has (based on local and/or social media information) about the health status of acquaintances, partners, relatives, etc., combined with individual preventive strategies [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] (such as condoms, vaccination, the use of face masks or prophylactic drugs, avoidance of visiting specific web-pages, staying away from public places, etc.), also leads to changes in the structure and shape of the contact networks that naturally acquire a temporal dimension that one should not overlook. naturally, the temporal dimension and multitude of contact networks involved in the process of disease spreading render this problem intractable from an analytic standpoint. recently, sophisticated computational platforms have been developed to deal with disease prevention and forecast [ , , , , , [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . the computational complexity of these models reflects the intrinsic complexity of the problem at stake, and their success relies on careful calibration and validation procedures requiring biological and socio-geographic knowledge of the process at stake. our goal here, instead, will be to answer the following question: what is the impact of a temporal contact network structure in the overall dynamics of disease progression? does one expect that it will lead to a rigid shift of the critical parameters driving disease evolution, as one witnesses whenever one includes spatial transmission patterns? 
or even to an evanescence of their values whenever one models the contact network as a (static and infinite) scale-free network, such that the variance of the network degree distribution becomes arbitrarily large? or will the temporal nature of the contact network lead to new dynamical features? and, if so, which features will emerge from the inclusion of this temporal dimension? to answer this question computationally constitutes, in general, a formidable challenge. we shall attempt to address the problem analytically, and to this end some simplifications will be required. however, the simplifications we shall introduce become plausible taking into consideration recent results (i) in the evolutionary dynamics of social dilemmas of cooperation, (ii) in the dynamics of peer-influence, and even (iii) in the investigation of how individual behavior determines and is determined by the global, population wide behavior. all these recent studies point out to the fact that the impact of temporal networks in the population dynamics stems mostly from the temporal part itself, and not so much from the detailed shape and structure of the network [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . indeed, we now know that (i) different models of adaptive network dynamics lead to similar qualitative features regarding their impact in what concerns the evolution of cooperation [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] , (ii) the degree of peer-influence is robust to the structural patterns associated with the underlying social networks [ ] , and (iii) the impact of temporal networks in connecting individual to collective behavior in the evolution of cooperation is very robust and related to a problem of n-body coordination [ , ] . altogether, these features justify that we model the temporal nature of the contact network in terms of a simple, adaptive network, the dynamics of which can be approximately described in terms a coupled system of odes. this "adaptive-linking" dynamics, as it was coined [ , [ ] [ ] [ ] , leads to network snapshot structures that do not replicate what one observes in real-life, in the same sense that the small-world model of watts and strogatz does not lead to the heterogeneous and diverse patterns observed in data snapshots of social networks. notwithstanding, the active-linking dynamics allows us to include, analytically, the temporal dimension into the problem of disease dynamics. the results [ ] , as we elaborate in sects. and , prove rewarding, showing that the temporal dimension of a contact network leads to a shift of the critical parameters (defined below) which is no longer rigid but, instead, becomes dependent on the frequency of infected individuals in the population. this, we believe, constitutes a very strong message with a profound impact whenever one tries to incorporate the temporal dimension into computational models of disease forecast. this chapter is organized as follows. in the following sect. , we introduce the standard disease models we shall employ, as well as the details of the temporal contact network model. section is devoted to present and discuss the results, and sect. contains a summary of the main conclusions of this work. in this section, we introduce the disease models we shall employ which, although well-known and widely studied already, are here introduced in the context of stochastic dynamics in finite populations, a formulation that has received less attention than the standard continuous model formulation in terms of coupled ordinary differential equations (odes). 
furthermore, we introduce and discuss in detail the temporal contact network model. here we introduce three standard models of disease transmission that we shall employ throughout the manuscript, using this section at profit to introduce also the appropriate notation associated with stochastic dynamics of finite populations and the markov chain techniques that we shall also employ in the remainder of this chapter. we shall start by discussing the models in the context of well-mixed populations, which will serve as a reference scenario for the disease dynamics, leaving for the next section the coupling of these disease models with the temporal network model described below. we investigate the popular susceptible-infected-susceptible (sis) model [ , ] , the susceptible-infected (si) model [ ] used to study, e.g., aids [ , ] , and the susceptible-infected-recovered (sir) model [ , ] , more appropriate to model, for instance, single season flu outbreaks [ ] or computer virus spreading [ ] . it is also worth pointing out that variations of these models have been used to successfully model virus dynamics and the interplay between virus dynamics and the response of the immune system [ ] . in the sis model individuals can be in one of two epidemiological states: infected (i) or susceptible (s). each disease is characterized by a recovery rate (δ) and an infection rate (λ). in an infinite, well-mixed population, the fraction of infected individuals (x) changes in time according to the differential equation dx/dt = λ⟨k⟩xy − δx, where y = 1 − x is the fraction of susceptible individuals and ⟨k⟩ the average number of contacts of each individual [ ] . there are two possible equilibria (dx/dt = 0): x = 0 and x = 1 − 1/r_0, where r_0 = λ⟨k⟩/δ denotes the basic reproductive ratio. the value of r_0 determines the stability of these two equilibria: x = 1 − 1/r_0 is stable when r_0 > 1 and unstable when r_0 < 1. let us now move to finite populations, and consider the well-mixed case where the population size is fixed and equal to n. we define a discrete stochastic markov process describing the disease dynamics associated with the sis model. each configuration of the population, which is defined by the number of infected individuals i, corresponds to one state of the markov chain. time evolves in discrete steps and two types of events may occur which change the composition of the population: infection events and recovery events. this means that, similar to computer simulations of the sis model on networked populations, at most one infection or recovery event will take place in each (discrete) time step. thus, the dynamics can be represented as a markov chain m with n + 1 states [ , ] -as many as the number of possible configurations -illustrated in fig. . . in a finite, well-mixed population, the number i of infected will decrease at a rate t⁻(i) given by eq. ( . ), where τ_r denotes the recovery time scale, i/n the probability that a randomly selected individual is infected and δ the probability that this individual recovers. adopting τ_r as a reference, we assume that the higher the average number of contacts ⟨k⟩, the smaller the time scale τ_inf at which infection update events occur (τ_inf = τ_r/⟨k⟩) [ ] . consequently, the number of infected will also increase at a rate t⁺(i) given by eq. ( . ). equations ( . ) and ( . ) define the transitions between different states. this way, we obtain the transition matrix p for m, where each element p_kj of p represents the probability of moving from state k to state j during one time step.
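the transition structure just described can be made explicit. the sketch below (python) assembles the (n + 1) × (n + 1) transition matrix of the sis chain; since the exact normalisation of the rates is not recoverable from the text, the forms t⁺(i) = λ⟨k⟩(i/n)((n − i)/n) and t⁻(i) = δ(i/n) used here are assumptions, and all parameter values are illustrative.

```python
import numpy as np

def sis_transition_matrix(N, lam, delta, k_avg):
    """Build the (N+1)x(N+1) transition matrix of the SIS birth-death chain.

    Assumed rates (at most one event per discrete step):
      T_plus(i)  = lam * k_avg * (i/N) * ((N - i)/N)   # one more infected
      T_minus(i) = delta * (i/N)                       # one fewer infected
    Rates are rescaled, if needed, so every row sums to one.
    """
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        t_plus = lam * k_avg * (i / N) * ((N - i) / N)
        t_minus = delta * (i / N)
        scale = 1.0 / max(1.0, t_plus + t_minus)      # keep probabilities in [0, 1]
        if i < N:
            P[i, i + 1] = t_plus * scale
        if i > 0:
            P[i, i - 1] = t_minus * scale
        P[i, i] = 1.0 - P[i, :].sum()                 # remaining mass: nothing happens
    return P

P = sis_transition_matrix(N=100, lam=0.06, delta=0.8, k_avg=20)
print(P[1, :3])   # transition probabilities out of the state with a single infected
```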
the state without any infected individual (i = 0) is an absorbing state of m. in other words, the disease always dies out and will never re-appear, once this happens. at this level of approximation, it is possible to derive an analytical expression for the average time t_i it takes to reach the single absorbing state of the sis markov chain (i.e., the average time to absorption) starting from a configuration in which there are i infected individuals. denoting by p_i(t) the probability that the disease disappears at time t when starting with i infected individuals at time 0, we may write t_i as the expectation of t under p_i(t) [ ] . using the properties of p_i(t) we obtain a recurrence relation for t_i, whereas for t_n we may write a separate boundary equation. defining appropriate auxiliary variables, a little algebra allows us to write an explicit expression for t_1, such that t_i can be written as a function of t_1. the intrinsic stochasticity of the model, resulting from the finiteness of the population, makes the disease disappear from the population after a certain amount of time. as such, the population size plays an important role in the average time to absorption associated with a certain disease, a feature we shall return to below. equations ( . ) and ( . ) define the markov chain m just characterized. the fraction of time the population spends in each state is given by the stationary distribution of m, which is defined as the eigenvector associated with eigenvalue 1 of the transition matrix of m [ , ] . the fact that in the sis model the state without infected (i = 0) is an absorbing state of the markov chain implies that the standard stationary distribution will be completely dominated by this absorbing state, which precludes one from gathering information on the relative importance of other configurations. this makes the so-called quasi-stationary distribution of m [ ] the quantity of interest. this quantity allows us to estimate the relative prevalence of the population in configurations other than the absorbing state, by computing the stationary distribution of the markov chain obtained from m by excluding the absorbing state i = 0 [ ] . it provides information on the fraction of time the population spends in each state, assuming the disease does not go extinct. the markov process m defined before provides a finite population analogue of the well-known mean-field equations written at the beginning of sect. . . indeed, in the limit of large populations, g(i) = t⁺(i) − t⁻(i) provides the rate of change of infected individuals. for large n, replacing i/n by x and (n − i)/n by y, the gradient of infection, which characterizes the rate at which the number of infected changes in the population, is given by eq. ( . ). again, we obtain two roots: g(i) = 0 for i = 0 and for i_r = n(1 − 1/r_0). moreover, i_r becomes the finite population equivalent of an interior equilibrium for r_0 ≡ (λ/δ)⟨k⟩(n − 1)/n > 1 (note that, for large n, we have (n − 1)/n ≈ 1). the disease will most likely expand whenever i < i_r, the opposite happening otherwise. the si model is mathematically equivalent to the sis model with δ = 0, and has been employed to study for instance the dynamics of aids. the markov chain representing the disease dynamics is therefore defined by the transition matrix of eq. ( . ), with t⁻(i) = 0 for all i. the remaining transition probabilities t⁺(i) (0 < i < n) are exactly the same as for the sis model. since all t⁻(i) equal zero, the markov chain has two absorbing states: the canonical one without any infected (i = 0) and the one without any susceptible (i = n).
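both quantities discussed above, the average time to absorption and the quasi-stationary distribution, can also be obtained numerically from the transition matrix instead of through the closed-form recurrences. the sketch below uses the textbook fundamental-matrix and power-iteration routes (reusing the sis_transition_matrix helper from the previous sketch); these are generic numerical constructions, not the chapter's expressions.

```python
import numpy as np

def mean_absorption_times(P):
    """Average number of steps until the absorbing state i = 0 is reached,
    via the fundamental matrix (I - Q)^-1 of the transient sub-chain i >= 1."""
    Q = P[1:, 1:]
    fundamental = np.linalg.inv(np.eye(Q.shape[0]) - Q)
    return fundamental.sum(axis=1)              # entry i-1 holds t_i

def quasi_stationary_distribution(P, iters=20000):
    """Stationary distribution of the chain restricted to i >= 1, obtained by
    power iteration with renormalisation at every step."""
    Q = P[1:, 1:]
    q = np.ones(Q.shape[0]) / Q.shape[0]
    for _ in range(iters):
        q = q @ Q
        q /= q.sum()
    return q                                    # q[i-1] = weight of i infected

P = sis_transition_matrix(N=100, lam=0.06, delta=0.8, k_avg=20)   # previous sketch
print(mean_absorption_times(P)[0])              # t_1: extinction time from one infected
print(quasi_stationary_distribution(P).argmax() + 1)   # most likely number of infected
```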
the disease will expand monotonically as soon as one individual in the population gets infected, ultimately leading to a fully infected population. the average amount of time after which this happens, which we refer to as the average infection time, constitutes the main quantity of interest. this quantity can be calculated analytically [ ] : the average number of time steps needed to reach 100% infection, starting from i infected individuals, is given by eq. ( . ). with sir one models diseases in which individuals acquire immunity after recovering from infection. we distinguish three epidemiological states to model the dynamics of such diseases: susceptible (s), infected (i) and recovered (r), indicating those who have become immune to further infection. the sir model in infinite, well-mixed populations is defined by a recovery rate δ and an infection rate λ. the fraction of infected individuals x changes in time according to the differential equation dx/dt = λ⟨k⟩xy − δx ( . ), where y denotes the fraction of susceptible individuals, which in turn changes according to dy/dt = −λ⟨k⟩xy ( . ); finally, the fraction of individuals z in the recovered class changes according to dz/dt = δx ( . ). to address the sir model in finite, well-mixed populations, we proceed in a way similar to what we have done so far with the sis and si models. the markov chain describing the disease dynamics becomes slightly more complicated and has states (i, r), where i is the number of infected individuals in the population and r the number of recovered (and immune) individuals (i + r ≤ n). a schematic representation of the markov chain is given in fig. . . note that the states (0, r), with 0 ≤ r ≤ n, are absorbing states. each of these states corresponds to the number of individuals that are (or have become) immune at the time the disease goes extinct. consider a population of size n with average degree ⟨k⟩. the number of infected will increase at a rate t⁺(i, r) given by eq. ( . ), where τ_r denotes the recovery time scale. as before, the gradient of infection g(i, r) = t⁺(i, r) − t⁻(i, r) measures the likelihood for the disease to either expand or shrink in a given state, and is given by eq. ( . ); note that we recover eq. ( . ) in the limit n → ∞. for a fixed number of recovered individuals r₀, we have that g(i, r₀) = 0 for i = 0 and for a second root i_r, which becomes the finite population analogue of an interior equilibrium. furthermore, one can show that the partial derivative ∂g(i, r)/∂i has at most one root, located at some i*_r ≤ i_r. hence, g(i, r₀) reaches a local maximum at i*_r (given that at that point ∂²g(i, r)/∂i² < 0). the number of infected will therefore most likely increase for i < i_r (assuming r₀ immune individuals), and most likely decrease otherwise. the gradient of infection also determines the probability to end up in each of the different absorbing states of the markov chain. these probabilities can be calculated analytically [ ] . to this end, let us use y^a_{i,r} to denote the probability that the population ends up in the absorbing state with a recovered individuals, starting from a state with i infected and r recovered. we obtain the following recurrence relationship for y^a_{i,r}: y^a_{i,r} = t⁻(i, r) y^a_{i−1,r+1} + t⁺(i, r) y^a_{i+1,r} + (1 − t⁻(i, r) − t⁺(i, r)) y^a_{i,r} ( . ), which, together with the boundary conditions ( . ), allows us to compute y^a_{i,r} for every a, i and r. our network model explicitly considers a finite and constant population of n individuals.
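the recurrence for y^a_{i,r} lends itself to a direct dynamic-programming sweep: once the level r + 1 is known, each level r can be filled from the boundary s = 0 downwards. the sketch below does this for the sir chain, again under assumed rate forms (the same functional shapes as in the sis sketch, with recovery now moving an individual from i to r); all rate values are illustrative.

```python
import numpy as np

def sir_rates(i, r, N, lam, delta, k_avg):
    """Assumed SIR transition rates in a finite well-mixed population."""
    s = N - i - r
    t_plus = lam * k_avg * (i / N) * (s / N)   # infection: (i, r) -> (i+1, r)
    t_minus = delta * (i / N)                  # recovery:  (i, r) -> (i-1, r+1)
    return t_plus, t_minus

def absorption_probabilities(a, N, lam, delta, k_avg):
    """Probability y[i, r] of ending with exactly `a` recovered individuals,
    starting from i infected and r recovered, solved by sweeping r downwards
    and, within each r, i downwards from the boundary s = 0."""
    y = np.zeros((N + 2, N + 2))
    for r in range(N + 1):
        y[0, r] = 1.0 if r == a else 0.0       # disease extinct in state (0, r)
    for r in range(N, -1, -1):
        for i in range(N - r, 0, -1):
            t_plus, t_minus = sir_rates(i, r, N, lam, delta, k_avg)
            y[i, r] = (t_plus * y[i + 1, r] + t_minus * y[i - 1, r + 1]) / (t_plus + t_minus)
    return y

y = absorption_probabilities(a=30, N=100, lam=0.06, delta=0.8, k_avg=20)
print(y[1, 0])   # chance that an outbreak seeded by one infected ends with 30 recovered
```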
its temporal contact structure allows, however, for a variable number of overall links between individuals, which in turn will depend on the incidence of disease in the population. this way, infection proceeds along the links of a contact network whose structure may change based on each individual's health status and the availability of information regarding the health status of others. we shall assume the existence of some form of local information about the health status of social contacts. information is local, in the sense that individual behavior will rely on the nature of their links in the contact network. moreover, this will influence the way in which individuals may be more or less effective in avoiding contact with those infected while remaining in touch with the healthy. suppose all individuals seek to establish links at the same rate c. for simplicity, we assume that new links are established and removed randomly, a feature which does not always apply in real cases, where the limited social horizon of individuals or the nature of their social ties may constrain part of their neighborhood structure (see below). let us further assume that links may be broken off at different rates, based on the nature of the links and the information available about the individuals they connect: let us denote these rates by b_pq for links of type pq (p, q ∈ {s, i, r}). we assume that links are bidirectional, which means that the mixed link types si, sr, and ir need not be distinguished from is, rs, and ri. let l_pq denote the number of links of type pq and l^m_pq the maximum possible number of links of that type, given the number of individuals of type s, i and r in the population. this allows us to write down (at a mean-field level) a system of odes [ , ] for the time evolution of the number of links of pq-type (l_pq) [ , ] , which depends on the number of individuals in states p and q (l^m_pp = p(p − 1)/2 and l^m_pq = pq for p ≠ q) and thereby couples the network dynamics to the disease dynamics. in the steady state of the linking dynamics (dl_pq/dt = 0), the number of links of each type is given by l_pq = φ_pq l^m_pq, with φ_pq = c/(c + b_pq) the fraction of active pq-links, compared to the maximum possible number of links l^m_pq, for a given number of s, i and r. in the absence of disease only ss links exist, and hence φ_ss determines the average connectivity of the network under disease-free conditions, which one can use to characterize the type of the population under study. in the presence of i individuals, to the extent that s individuals manage to avoid contact with i, they succeed in escaping infection. thus, to the extent that individuals are capable of reshaping the contact network based on available information about the health status of other individuals, disease progression will be inhibited. in the extreme limit of perfect information and individual capacity to immediately break up contacts with infected, all infected are isolated, and disease progression is thereby contained. our goal here, however, is to understand how and in which way local information, leading to a temporal reshaping of the network structure, affects overall disease dynamics. we investigate the validity of the approximations made to derive analytical results, as well as their robustness, by means of computer simulations. all individual-based simulations start from a complete network of size n. disease spreading and network evolution proceed together under asynchronous updating.
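the mean-field link dynamics described above can be written, for each link type pq, as an ode of the form dl_pq/dt = c (l^m_pq − l_pq) − b_pq l_pq, whose fixed point reproduces the fraction φ_pq = c/(c + b_pq) of active links quoted in the text. this particular form is an assumption consistent with that steady state rather than a formula taken from the chapter; the sketch below integrates it numerically for a fixed population composition, with illustrative rate values.

```python
import numpy as np

def integrate_links(S, I, R, c, b, dt=0.01, steps=20000):
    """Euler integration of dL_pq/dt = c*(Lmax_pq - L_pq) - b_pq*L_pq for the
    six undirected link types; `b` maps a type such as 'si' to its break-up rate.
    This ODE is one simple choice consistent with the steady state phi = c/(c+b)."""
    counts = {'s': S, 'i': I, 'r': R}
    types = ['ss', 'si', 'sr', 'ii', 'ir', 'rr']
    lmax = {t: (counts[t[0]] * (counts[t[0]] - 1) / 2 if t[0] == t[1]
                else counts[t[0]] * counts[t[1]]) for t in types}
    L = {t: 0.0 for t in types}                    # start from an empty network
    for _ in range(steps):
        for t in types:
            L[t] += dt * (c * (lmax[t] - L[t]) - b[t] * L[t])
    return L, lmax

b = {'ss': 0.1, 'sr': 0.1, 'rr': 0.1, 'si': 0.9, 'ir': 0.9, 'ii': 0.9}
L, lmax = integrate_links(S=800, I=150, R=50, c=0.4, b=b)
print({t: round(L[t] / lmax[t], 3) for t in L})    # should approach c/(c + b_pq)
```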
disease update events take place with probability w/(1 + w), where w = τ_net/τ_dis. we define τ_dis as the time scale of disease progression, whereas τ_net is the time scale of network change. the parameter w = τ_net/τ_dis provides the relative time scale in terms of which we may interpolate between the limit in which network adaptation is much slower than disease progression (w → ∞) and the opposite limit in which network adaptation is much faster than disease progression (w → 0). since w = τ_net/τ_dis is the only relevant parameter, we can set, without loss of generality, τ_dis = 1. for network update events, we randomly draw two nodes from the population. if connected, then the link disappears with probability given by the respective b_pq. otherwise, a new link appears with probability c. when a disease update event occurs, a recovery event takes place with probability (1 + ⟨k⟩)⁻¹, an infection event otherwise. in both cases, an individual j is drawn randomly from the population. if j is infected and a recovery event has been selected then j will become susceptible (or recovered, model dependent) with probability δ. if j is susceptible and an infection event occurs, then j will get infected with probability λ if a randomly chosen neighbor of j is infected. the quasi-stationary distributions are computed (in the case of the sis model) as the fraction of time the population spends in each configuration (i.e., number of infected individuals) during disease event updates (over a fixed number of generations; under asynchronous updating, one generation corresponds to n update events, where n is the population size; this means that in one generation, every individual has one chance, on average, to update her epidemic state). the average number of infected ⟨i⟩ and the mean average degree of the network ⟨k⟩ observed during these generations are kept for further plotting. we have checked that the results reported are independent of the initial number of infected in the network. finally, for the sir and si models, the disease progression in time, shown in the following sections, is calculated from independent simulations, each simulation starting with one infected individual. the reported results correspond to the average amount of time at which i individuals become infected. in this section we start by (i) showing that a quickly adapting community induces profound changes in the dynamics of disease spreading, irrespective of the underlying epidemic model; then, (ii) we resort to computer simulations to study the robustness of these results for intermediate time scales of network adaptation; finally, (iii) we profit from the framework introduced above to analyze the impact of information on the average time for absorption and on disease progression in adaptive networks. empirically, it is well-known that often individuals prevent infection by avoiding contact with infected once they know the state of their contacts or are aware of the potential risks of such infection [ , , [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] : such is the case of many sexually transmitted diseases [ , [ ] [ ] [ ] , for example, and, more recently, the voluntary use of face masks and the associated campaigns adopted by local authorities in response to the sars outbreak [ , [ ] [ ] [ ] or even the choice of contacting or not other individuals based on information on their health status gathered from social media [ , , ] . in the present study, individual decision is based on available local information about the health state of one's contacts.
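the asynchronous updating described above can be condensed into a single stochastic step. the sketch below is a schematic sis implementation on a dense adjacency matrix, following the event probabilities given in the text (a network update with probability 1/(1 + w), recovery versus infection split as 1/(1 + ⟨k⟩)); the parameter values and the bookkeeping details are illustrative assumptions, not the authors' code.

```python
import numpy as np
rng = np.random.default_rng(0)

def async_step(adj, state, W, lam, delta, c, b):
    """One asynchronous update of the SIS model on an adaptive network.
    state[j] is 'i' or 's'; adj is a symmetric 0/1 adjacency matrix; b maps
    sorted link types ('ss', 'is', 'ii') to break-up rates."""
    N = len(state)
    if rng.random() < 1.0 / (1.0 + W):                 # network update event
        x, y = rng.choice(N, size=2, replace=False)
        key = ''.join(sorted(state[x] + state[y]))
        if adj[x, y]:
            if rng.random() < b[key]:
                adj[x, y] = adj[y, x] = 0              # break the link
        elif rng.random() < c:
            adj[x, y] = adj[y, x] = 1                  # create a new link
    else:                                              # disease update event
        k_avg = adj.sum() / N
        j = rng.integers(N)
        if rng.random() < 1.0 / (1.0 + k_avg):         # recovery event
            if state[j] == 'i' and rng.random() < delta:
                state[j] = 's'
        else:                                          # infection event
            neigh = np.flatnonzero(adj[j])
            if state[j] == 's' and len(neigh) and state[rng.choice(neigh)] == 'i':
                if rng.random() < lam:
                    state[j] = 'i'
    return adj, state

N = 200
adj = np.ones((N, N), dtype=int); np.fill_diagonal(adj, 0)
state = np.array(['s'] * N); state[0] = 'i'
b = {'ss': 0.1, 'is': 0.9, 'ii': 0.9}
for _ in range(50000):
    adj, state = async_step(adj, state, W=0.1, lam=0.06, delta=0.8, c=0.4, b=b)
print((state == 'i').sum(), adj.sum() / N)             # infected count, average degree
```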
thus, we can study analytically the limit in which the network dynamics -resulting from adaptation to the flow of local information -is much faster than disease dynamics, as in this case one may separate the time scales between network adaptation and contact (disease) dynamics: the network has time to reach a steady state before the next contact takes place. consequently, the probability of having an infected neighbor is modified by a neighborhood structure which will change in time depending on the impact of the disease in the population and the overall rates of severing links with infected. let us start with the sir model. the amount of information available translates into differences mostly between the break-up rates of links that may involve a potential risk for further infection (b_si, b_ir, b_ii), and those that do not (b_ss, b_sr, b_rr). therefore, we consider one particular rate b_i for links involving infected individuals (b_i ≡ b_si = b_ir = b_ii), and another one, b_h, for links connecting healthy individuals (b_h ≡ b_ss = b_sr = b_rr). in general, one expects b_i to be maximal when each individual has perfect information about the state of her neighbors and to be (minimal and) equal to b_h when no information is available, turning the ratio between these two rates into a quantitative measure of the efficiency with which links to infected are severed compared to other links. note that we reduce the model to two break-up rates in order to facilitate the discussion of the results. numerical simulations show that the general principles and conclusions remain valid when all break-up rates are incorporated explicitly. it is worth noticing that three out of these six rates are of particular importance for the overall disease dynamics: b_ss, b_sr and b_si. these three rates, combined with the rate c of creating new links, define the fraction of active ss, sr and si links, and subsequent correlations between individuals [ ] , and therefore determine the probability for a susceptible to become infected (see models and methods). this probability will increase when considering higher values of c (assuming b_i > b_h). in other words, when individuals create new links more often, therefore increasing the likelihood of establishing connections to infected individuals (when present), they need to be better informed about the health state of their contacts in order to escape infection. in the fast linking limit, the other three break-up rates (b_ii, b_ir and b_rr) will also influence disease progression since they contribute to changing the average degree of the network. when the time scale for network update (τ_net) is much smaller than the one for disease spreading (τ_dis), we can proceed analytically using at profit the separation of time scales. in practice, this means that the network has time to reach a steady state before the next disease event takes place. consequently, the probability of having an infected neighbor is modified by a neighborhood structure which will change in time depending on the impact of the disease in the population and the overall rates of severing links with infected individuals. for a given configuration (i, r) of the population, the stationary state of the network is characterized by the parameters φ_ss, φ_si and φ_sr. consequently, the number of infected increases at a rate given by eq. ( . ) [ ] , where we set the reference time scale to 1. the effect of the network dynamics becomes apparent in the third factor, which represents the probability that a randomly selected neighbor of a susceptible is infected. in addition, eq.
( . ) remains valid, as the linking dynamics does not affect the rate at which the number of infected decreases. it is noteworthy that we can write eq. ( . ) in a form which is formally equivalent to eq. ( . ) and shows that disease spreading in a temporal adaptive network is equivalent to that in a well-mixed population with (i) a frequency dependent average degree ⟨k⟩ and (ii) a transmission probability that is rescaled compared to the original λ according to eq. ( . ). note that this expression remains valid for the sir, sis (r = 0) and si (δ = 0, r = 0) models. since the lifetime of a link depends on its type, the average degree ⟨k⟩ of the network depends on the number of infected in the population, and hence becomes frequency (and time) dependent, as ⟨k⟩ depends on the number of infected (through l^m_pq) and changes in time. note that the rescaled transmission probability scales linearly with the frequency of infected in the population, decreasing as the number of infected increases (assuming φ_ss/φ_si > 1); moreover, it depends implicitly (via the ratio φ_ss/φ_si) on the amount of information available. it is important to stress the distinction between the description of the disease dynamics at the local level (in the vicinity of an infected individual) and that at the population wide level. strictly speaking, a dynamical network does not change the disease dynamics at the local level, meaning that infected individuals pass the disease to their neighbors with a probability intrinsic to the disease itself. at the population level, on the other hand, disease progression proceeds as if the infectiousness of the disease effectively changes, as a result of the network dynamics. consequently, analyzing a temporal network scenario at a population level can be achieved via a renormalization of the transmission probability, keeping the (mathematically more attractive) well-mixed scenario. in this sense, from a well-mixed perspective, dynamical networks contribute to changing the effective infectiousness of the disease, which becomes frequency and information dependent. note further that this information dependence is a consequence of using a single temporal network for spreading the disease and information. interestingly, adaptive networks have been shown to have a similar impact in social dilemmas [ ] . from a global, population-wide perspective, it is as if the social dilemma at stake differs from the one every individual actually plays. as in sect. , one can define a gradient of infection g, which measures the tendency of the disease to either expand or shrink in a population with a given configuration (defined by the number of individuals in each of the states s, i and r). to do so, we study the partial derivative ∂g(i, r)/∂i at i = 0; this quantity exceeds zero whenever the condition of eq. ( . ) holds. note that taking r = 0 yields the basic reproductive ratio r_a for both sir and sis (eq. ( . )). on the other hand, whenever r_a < 1, eradication of the disease is favored in the sis model (g(i) < 0), irrespective of the fraction of infected, indicating how the presence of information (b_h < b_i) changes the basic reproductive ratio. in fig. . we illustrate the role of information in the sis model by plotting g for different values of b_i (assuming b_h < b_i) and a fixed transmission probability λ. the corresponding quasi-stationary distributions are shown in the right panel and clearly reflect the sign of g. whenever g(i) is positive (negative), the dynamics will act to increase (decrease), on average, the number of infected.
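in the fast-linking limit, the whole effect of the temporal network enters through the steady-state link fractions φ_pq = c/(c + b_pq): they set both the frequency-dependent average degree and the probability that a randomly selected neighbor of a susceptible is infected, the factor highlighted above. the sketch below performs this bookkeeping directly from the link counts; it is our own straightforward reading of the construction, with illustrative rates, and it does not reproduce the chapter's closed-form rescaled transmission probability.

```python
def phi(c, b):
    """Fraction of active links of a given type in the steady state."""
    return c / (c + b)

def network_summary(S, I, R, c, b):
    """Average degree and the probability that a randomly chosen neighbour of a
    susceptible individual is infected, both from the steady-state link numbers
    L_pq = phi_pq * Lmax_pq; `b` maps link types to break-up rates."""
    counts = {'s': S, 'i': I, 'r': R}
    types = ['ss', 'si', 'sr', 'ii', 'ir', 'rr']
    lmax = {t: (counts[t[0]] * (counts[t[0]] - 1) / 2 if t[0] == t[1]
                else counts[t[0]] * counts[t[1]]) for t in types}
    L = {t: phi(c, b[t]) * lmax[t] for t in types}
    k_avg = 2 * sum(L.values()) / (S + I + R)
    s_stubs = 2 * L['ss'] + L['si'] + L['sr']          # link ends attached to s nodes
    p_neigh_inf = L['si'] / s_stubs if s_stubs else 0.0
    return k_avg, p_neigh_inf

b = {'ss': 0.1, 'sr': 0.1, 'rr': 0.1, 'si': 0.9, 'ir': 0.9, 'ii': 0.9}
for I in (10, 100, 300):
    print(I, network_summary(S=1000 - I, I=I, R=0, c=0.4, b=b))
```

running the loop shows both the average degree and the chance of meeting an infected neighbour falling as the number of infected grows, which is the frequency dependence discussed in the text.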
figure . , for a finite population, once again allows us to identify when disease expansion will be favored or not. figure . gives a complete picture of the gradient of infection, using the appropriate simplex structure in which all points satisfy the relation i + r + s = n. the dashed line indicates the boundary g(i, r) = 0 in case individuals do not have any information about the health status of their contacts, i.e., links that involve infected individuals disappear at the same rate as those that do not (b_i = b_h). disease expansion is more likely than disease contraction (g(i, r) > 0) when the population is in a configuration above the line, and less likely otherwise. similarly, the solid line indicates the boundary g(i, r) = 0 when individuals share information about their health status, and use it to avoid contact with infected. once again, the availability of information modifies the disease dynamics, inhibiting disease progression for a broad range of configurations. up to now we have assumed that the network dynamics proceeds much faster than disease spreading (the limit w → 0). this may not always be the case, and hence it is important to assess the domain of validity of this limit. in the following, we use computer simulations to verify to which extent these results, obtained analytically via time scale separation, remain valid for intermediate values of the relative time scale w of the linking dynamics. we start with a complete network of size n, in which initially one individual is infected, the rest being susceptible. as stated before, disease spreading and network evolution proceed simultaneously under asynchronous updating. network update events take place with probability (1 + w)⁻¹, whereas a disease model (si, sis or sir) state update event occurs otherwise. for each value of w, we run multiple independent simulations. for the si model, the quantity of interest to calculate is the average number of generations after which the population becomes completely infected. these values are depicted in fig. . . the lower dashed line indicates the analytical prediction of the infection time in the limit w → ∞ (the limit when networks remain static), which we already recover in the simulations for large w. when w becomes smaller, the average infection time significantly increases, and it already reaches the analytical prediction for the limit w → 0 (indicated by the upper dashed line) for sufficiently small w. hence, the validity of the time scale separation again extends well beyond the limits one might expect. for the sir model, we let the simulations run until the disease goes extinct, and computed the average final fraction of individuals that have been affected by the disease; the corresponding analytical prediction is given by eqs. ( . ) and ( . ) . one observes that the linking dynamics does not affect disease dynamics for large w. once w drops below ten, a significantly smaller fraction of individuals is affected by the disease. this fraction reaches the analytical prediction for w → 0 as soon as w becomes sufficiently small. hence, and again, results obtained via separation of time scales remain valid for a wide range of intermediate time scales. we finally investigate the role of intermediate time scales in the sis model. we performed computer simulations in the conditions discussed already, and computed several quantities that we plot in fig. . . figure . shows the average ⟨i⟩ of the quasi-stationary distributions obtained via computer simulations (circles) as a function of the relative time scale of network update. whenever w → ∞
, we can characterize the disease dynamics analytically, assuming a well-mixed population (complete graph), whereas for w → 0 we recover the analytical results obtained in the fast linking limit. at intermediate time scales, fig. . shows that as long as w is smaller than ten, network dynamics contributes to inhibit disease spreading by effectively increasing the critical infection rate. overall, the validity of the time scale separation extends well beyond the limits one might anticipate based solely on the time separation ansatz. as long as the time scale for network update is smaller than the one for disease spreading (w < 1), the analytical prediction for the limit w → 0, indicated by the lower dashed line in fig. . , remains valid. the analytical result in the extreme opposite limit (w → ∞), indicated by the upper dashed line in fig. . , holds for sufficiently large w. moreover, it is noteworthy that the network dynamics influences the disease dynamics both by reducing the frequency of interactions between susceptible and infected, and by reducing the average degree of the network. these complementary effects are disentangled in intermediate regimes, in which the network dynamics is too slow to warrant sustained protection of susceptible individuals from contacts with infected, despite managing to reduce the average degree (not shown). in fact, for larger w the disease dynamics is mostly controlled by the average degree, as shown by the solid lines in fig. . . here, the average stationary distribution was determined by replacing, in the analytic expression for static networks, ⟨k⟩ by the time-dependent average connectivity ⟨k⟩ computed numerically. this, in turn, results from the frequency dependence of ⟨k⟩. when b_i > b_h, the network will reshape into a configuration with smaller ⟨k⟩ as soon as disease expansion occurs. for small w, ⟨k⟩ reflects the lifetime of ss links, as there are hardly any infected in the population. for intermediate w, the network dynamics proceeds fast enough to reduce ⟨k⟩, but too slowly to reach its full potential in hindering disease progression. given the higher fraction of infected, and the fact that si and ii links have a shorter lifetime than ss links, the average degree drops when w increases from small to intermediate values. any further increase in w leads to a higher average degree, as the network approaches its static limit. contrary to the deterministic sis model, the stochastic nature of disease spreading in finite populations ensures that the disease disappears after some time. however, this result is of little relevance given the times required to reach the absorbing state (except, possibly, in very small communities). indeed, the characteristic time scale of the dynamics plays a determinant role in the overall epidemiological process and constitutes a central issue in disease spreading. figure . shows the average time to absorption t in adaptive networks for different levels of information, illustrating the spectacular effect brought about by the network dynamics on t. while on networks without information (b_i = b_h) t rapidly increases with the rate of infection λ, adding information moves the fraction of infected individuals rapidly to the absorbing state, and, therefore, to the disappearance of the disease. moreover, the size of the population can have a profound effect on t. with increasing population size, the population spends most of the time in the vicinity of the state associated with the interior root of g(i).
for large populations, this acts to reduce the intrinsic stochasticity of the dynamics, dictating a very slow extinction of the disease, as shown in fig. . . when recovery from the disease is impossible, a situation captured by the si model, the population will never become disease-free again once it acquires at least one infected individual. the time to reach the absorbing state in which all individuals are infected again depends on the presence of information. when information prevails, susceptible individuals manage to resist infection for a long time, thereby delaying the rapid progression of the disease, as shown in the inset of fig. . . naturally, the average number of generations needed to reach a fully infected population increases with the availability of information, as illustrated in the main panel of fig. . . making use of three standard models of epidemics involving a finite population in which infection takes place along the links of a temporal graph, the nodes of which are occupied by individuals, we have shown analytically that the bias introduced into the graph dynamics resulting from the availability of information about the health status of others in the population induces fundamental changes in the overall dynamics of disease progression. the network dynamics employed here differs from those used in most other studies [ , [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . we argue, however, that the differences obtained stem mostly from the temporal aspect of the network, and not so much from the detailed dynamics that is implemented. importantly, temporal network dynamics leads to additional changes in the basic reproductive ratio compared to those already obtained when moving from the well-mixed assumption to static networks [ ] . an important ingredient of our model, however, is that the average degree of the network results from the self-organization of the network structure, and co-evolves with the disease dynamics. a population suffering from high disease prevalence, where individuals avoid contact in order to escape infection, will therefore exhibit a lower average degree than a population with hardly any infected individuals. such a frequency-dependent average degree further prevents the containment of infected individuals from resulting in the formation of cliques of susceptible individuals, which are extremely vulnerable to future infection, as reported before [ , , ] . the description of disease spreading as a stochastic contact process embedded in a markov chain constitutes a second important ingredient of the present model. this approach allows for a direct comparison between analytical predictions and individual-based computer simulations, and for a detailed analysis of finite-size effects and convergence times, whose exponential growth will signal possible bistable disease scenarios. in such a framework, we were able to show that temporal adaptive networks in which individuals may be informed about the health status of others lead to a disease whose effective infectiousness depends on the overall number of infected in the population. in other words, disease propagation on temporal adaptive networks can be seen as mathematically equivalent to disease spreading on a well-mixed population, but with a rescaled effective infectiousness.
in accord with the intuition advanced in the introduction, as long as individuals react promptly and consistently to accurate available information on whether their acquaintances are infected or not, network dynamics effectively weakens the disease burden the population suffers. last but not least, if recovery from the disease is possible, the time for disease eradication drastically reduces whenever individuals have access to accurate information about the health state of their acquaintances and use it to avoid contact with those infected. if recovery or immunity is impossible, the average time needed for a disease to spread increases significantly when such information is being used. in both cases, our model clearly shows how availability of information hinders disease progression (by means of quick action on infected, e.g., their containment via link removal), which constitutes a crucial factor to control the development of global pandemics. finally, it is also worth mentioning that knowledge about the health state of others may not always be accurate or available in time. this is for instance the case for diseases where recently infected individuals remain asymptomatic for a substantial period. the longer the incubation period associated with the disease, the less successful individuals will be in escaping infection, which in our model translates into a lower effective rate of breaking si links, with the above mentioned consequences. moreover, different (social) networks through which awareness of the health status of others proceeds may lead to different rates of information spread. one may take these features into account by modeling explicitly the spread of information through a coupled dynamics between disease expansion and individuals' awareness of the disease [ , ] . creation and destruction of links may for instance not always occur randomly, as we assumed here, but in a way that is biased by a variety of factors such as social and genetic distance, geographical proximity, family ties, etc. the resulting contact network may therefore become organized in a specific way, promoting the formation of particular structures, such as networks characterized by long-tailed degree distributions or with strong topological correlations among nodes [ , [ ] [ ] [ ] which, in turn, may influence the disease dynamics. the impact of combining such effects, resulting from specific disease scenarios, with those reported here will depend on the prevalence of such additional effects when compared to linkrewiring dynamics. a small fraction of non-random links, or of ties which cannot be broken, will likely induce small modifications on the average connectivity of the contact network, which can be incorporated in our analytic expressions without compromising their validity regarding population wide dynamics. on the other hand, when the contact network is highly heterogeneous (e.g., exhibiting pervasive long-tail degree distributions), non-random events may have very distinct effects, from being almost irrelevant (and hence can be ignored) to inducing hierarchical cascades of infection [ ] , in which case our results will not apply. modeling infectious diseases in humans and animals infectious diseases in humans evolution of networks: from biological nets to the internet and www dynamical processes in complex networks epidemic processes in complex networks small worlds: the dynamics of networks between order and randomness epidemiology. 
how viruses spread among computers and people epidemic spreading and cooperation dynamics on homogeneous small-world networks network structure and the biology of populations how to estimate epidemic risk from incomplete contact diaries data? quantifying social contacts in a household setting of rural kenya using wearable proximity sensors epidemic risk from friendship network data: an equivalence with a non-uniform sampling of contact networks spatiotemporal spread of the outbreak of ebola virus disease in liberia and the effectiveness of non-pharmaceutical interventions: a computational modelling analysis the basic reproduction number as a predictor for epidemic outbreaks in temporal networks information content of contact-pattern representations and predictability of epidemic outbreaks birth and death of links control disease spreading in empirical contact networks influenza a (h n ) and the importance of digital epidemiology predicting and controlling infectious disease epidemics using temporal networks. f prime reports localization and spreading of diseases in complex networks the global obesity pandemic: shaped by global drivers and local environments a highresolution human contact network for infectious disease transmission dynamics and control of diseases in networks with community structure modelling the influence of human behaviour on the spread of infectious diseases: a review a guide to temporal networks temporal networks empirical temporal networks of face-to-face human interactions exploiting temporal network structures of human interaction to effectively immunize populations adaptive contact networks change effective disease infectiousness and dynamics rewiring for adaptation adaptive networks: coevolution of disease and topology endemic disease, awareness, and local behavioural response contact switching as a control strategy for epidemic outbreaks the spread of awareness and its impact on epidemic outbreaks infection spreading in a population with evolving contacts fluctuating epidemics on adaptive networks adaptive coevolutionary networks: a review long-standing influenza vaccination policy is in accord with individual self-interest but not with the utilitarian optimum modeling the worldwide spread of pandemic influenza: baseline case and containment interventions forecast and control of epidemics in a globalized world public health measures to control the spread of the severe acute respiratory syndrome during the outbreak in toronto digital epidemiology the responsiveness of the demand for condoms to the local prevalence of aids influenza pandemic: perception of risk and individual precautions in a general population impacts of sars on health-seeking behaviors in general population in hong kong capturing human behaviour knowledge of malaria, risk perception, and compliance with prophylaxis and personal and environmental preventinve measures in travelers exiting zimbabwe from harare and victoria falls international airport meta-analysis of the relationship between risk perception and health behavior: the example of vaccination risk compensation and vaccination: can getting vaccinated cause people to engage in risky behaviors? 
public perceptions, anxiety, and behaviour change in relation to the swine flu outbreak: cross sectional telephone survey early assessment of anxiety and behavioral response to novel swineorigin influenza a(h n ) epidemic dynamics on an adaptive network susceptible-infected-recovered epidemics in dynamic contact networks disease spreading with epidemic alert on small-world networks robust oscillations in sis epidemics on adaptive networks: coarse graining by automated moment closure coevolutionary cycling of host sociality and pathogen virulence in contact networks cooperation prevails when individuals adjust their social ties coevolution of strategy and structure in complex networks with dynamical linking active linking in evolutionary games repeated games and direct reciprocity under active linking reacting differently to adverse ties promotes cooperation in social networks selection pressure transforms the nature of social dilemmas in adaptive networks origin of peer influence in social networks linking individual and collective behavior in adaptive social networks uses and abuses of mathematics in biology a contribution to the mathematical theory of epidemics production of resistant hiv mutants during antiretroviral therapy a first course in stochastic processes stochastic processes in physics and chemistry fixation of strategies for an evolutionary game in finite populations on the quasi-stationary distribution of the stochastic logistic epidemic men's behavior change following infection with a sexually transmitted disease an examination of the social networks and social isolation in older and younger adults living with hiv/aids social stigmatization and hepatitis c virus infection assessing vaccination sentiments with online social media: implications for infectious disease dynamics and control social and news media enable estimation of epidemiological patterns early in the haitian cholera outbreak the effects of local spatial structure on epidemiological invasions epidemic processes over adaptive state-dependent networks classes of small-world networks statistical mechanics of complex networks the structure and function of complex networks velocity and hierarchical spread of epidemic outbreaks in scale-free networks key: cord- -uc j fyi authors: brandi, giuseppe; di matteo, tiziana title: a new multilayer network construction via tensor learning date: - - journal: computational science - iccs doi: . / - - - - _ sha: doc_id: cord_uid: uc j fyi multilayer networks proved to be suitable in extracting and providing dependency information of different complex systems. the construction of these networks is difficult and is mostly done with a static approach, neglecting time delayed interdependences. tensors are objects that naturally represent multilayer networks and in this paper, we propose a new methodology based on tucker tensor autoregression in order to build a multilayer network directly from data. this methodology captures within and between connections across layers and makes use of a filtering procedure to extract relevant information and improve visualization. we show the application of this methodology to different stationary fractionally differenced financial data. we argue that our result is useful to understand the dependencies across three different aspects of financial risk, namely market risk, liquidity risk, and volatility risk. 
indeed, we show how the resulting visualization is a useful tool for risk managers depicting dependency asymmetries between different risk factors and accounting for delayed cross dependencies. the constructed multilayer network shows a strong interconnection between the volumes and prices layers across all the stocks considered while a lower number of interconnections between the uncertainty measures is identified. network structures are present in different fields of research. multilayer networks represent a widely used tool for representing financial interconnections, both in industry and academia [ ] and has been shown that the complex structure of the financial system plays a crucial role in the risk assessment [ , ] . a complex network is a collection of connected objects. these objects, such as stocks, banks or institutions, are called nodes and the connections between the nodes are called edges, which represent their dependency structure. multilayer networks extend the standard networks by assembling multiple networks 'layers' that are connected to each other via interlayer edges [ ] and can be naturally represented by tensors [ ] . the interlayer edges form the dependency structure between different layers and in the context of this paper, across different risk factors. however, two issues arise: the construction of such networks is usually based on correlation matrices (or other symmetric dependence measures) calculated on financial asset returns. unfortunately, such matrices being symmetric, hide possible asymmetries between stocks. multilayer networks are usually constructed via contemporaneous interconnections, neglecting the possible delayed cause-effect relationship between and within layers. in this paper, we propose a method that relies on tensor autoregression which avoids these two issues. in particular, we use the tensor learning approach establish in [ ] to estimate the tensor coefficients, which are the building blocks of the multilayer network of the intra and inter dependencies in the analyzed financial data. in particular, we tackle three different aspects of financial risk, i.e. market risk, liquidity risk, and future volatility risk. these three risk factors are represented by prices, volumes and two measures of expected future uncertainty, i.e. implied volatility at days (iv ) and implied volatility at days (iv ) of each stock. in order to have stationary data but retain the maximum amount of memory, we computed the fractional difference for each time series [ ] . to improve visualization and to extract relevant information, the resulting multilayer is then filtered independently in each dimension with the recently proposed polya filter [ ] . the analysis shows a strong interconnection between the volumes and prices layers across all the stocks considered while a lower number of interconnection between the volatility at different maturity is identified. furthermore, a clear financial connection between risk factors can be recognized from the multilayer visualization and can be a useful tool for risk assessment. the paper is structured as follows. section is devoted to the tensor autoregression. section shows the empirical application while sect. concludes. tensor regression can be formulated in different ways: the tensor structure is only in the response or the regression variable or it can be on both. 
the literature related to the first specification is ample [ , ] whilst the fully tensor variate regression received attention only recently from the statistics and machine learning communities employing different approaches [ , ] . the tensor regression we are going to use is the tucker tensor regression proposed in [ ] . the model is formulated making use of the contracted product, the higher order counterpart of matrix product [ ] and can be expressed as: where x ∈ r n ×i ×···×in is the regressor tensor, y ∈ r n ×j ×···×jm is the response tensor, e ∈ r n ×j ×···×jm is the error tensor, a ∈ r ×j ×···×jm is the intercept tensor while the slope coefficient tensor, which represents the multilayer network we are interested to learn, is b ∈ r i ×···×in ×j ×···×jm . subscripts i x and j b are the modes over winch the product is carried out. in the context of this paper, x is a lagged version of y, hence b represents the multilinear interactions that the variables in x generate in y. these interactions are generally asymmetric and take into account lagged dependencies being b the mediator between two separate in time tensor datasets. therefore, b represents a perfect candidate to use for building a multilayer network. however, the b coefficient is high dimensional. in order to resolve the issue, a tucker structure is imposed on b such that it is possible to recover the original b with smaller objects. one of the advantages of the tucker structure is, contrarily to other tensor decompositions such as the parafac, that it can handle dimension asymmetric tensors since each factor matrix does not need to have the same number of components. tensor regression is prone to over-fitting when intra-mode collinearity is present. in this case, a shrinkage estimator is necessary for a stable solution. in fact, the presence of collinearity between the variables of the dataset degrades the forecasting capabilities of the regression model. in this work, we use the tikhonov regularization [ ] . known also as ridge regularization, it rewrites the standard least squares problem as where λ > is the regularization parameter and f is the squared frobenius norm. the greater the λ the stronger is the shrinkage effect on the parameters. however, high values of λ increase the bias of the tensor coefficient b. indeed, the shrinkage parameter is usually set via data driven procedures rather than input by the user. the tikhonov regularization can be computationally very expensive for big data problem. to solve this issue, [ ] proposed a decomposition of the tikhonov regularization. the learning of the model parameters is a nonlinear optimization problem that can be solved by iterative algorithms such as the alternating least squares (als) introduced by [ ] for the tucker decomposition. this methodology solves the optimization problem by dividing it into small least squares problems. recently, [ ] developed an als algorithm for the estimation of the tensor regression parameters with tucker structure in both the penalized and unpenalized settings. for the technical derivation refer to [ ] . in this section, we show the results of the construction of the multilayer network via the tensor regression proposed in eq. . the dataset used in this paper is composed of stocks listed in the dow jones (dj). these stocks time series are recorded on a daily basis from / / up to / / , i.e. trading days. we use over the listed stocks as they are the ones for which the entire time series is available. 
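to make the estimation step concrete, the sketch below fits a tensor autoregression of the form described above by ridge (tikhonov) regression on the matricized problem and reshapes the solution into the coefficient tensor b. it is a deliberately simplified stand-in: the paper parameterizes b with a tucker structure and learns the factors with an als scheme, whereas here the full-rank coefficient is obtained in closed form; all dimensions, the lag and the λ value are illustrative.

```python
import numpy as np

def tensor_ridge_autoregression(Y, lag=1, lam=1.0):
    """Estimate B in  Y_t = <Y_{t-lag}, B> + E_t  for a tensor-valued time series
    Y of shape (T, d1, ..., dK). The problem is solved on the flattened data, so
    the returned B has shape (d1, ..., dK, d1, ..., dK)."""
    T = Y.shape[0]
    dims = Y.shape[1:]
    p = int(np.prod(dims))
    X_mat = Y[:-lag].reshape(T - lag, p)          # lagged regressor, flattened
    Y_mat = Y[lag:].reshape(T - lag, p)           # response, flattened
    X_c = X_mat - X_mat.mean(0)                   # centring absorbs the intercept
    Y_c = Y_mat - Y_mat.mean(0)
    G = X_c.T @ X_c + lam * np.eye(p)             # Tikhonov-regularised normal equations
    B_mat = np.linalg.solve(G, X_c.T @ Y_c)       # closed-form ridge solution
    return B_mat.reshape(*dims, *dims)

rng = np.random.default_rng(1)
Y = rng.standard_normal((250, 29, 4))             # e.g. 29 stocks x 4 layers (illustrative)
B = tensor_ridge_autoregression(Y, lag=1, lam=1.0)
print(B.shape)                                    # (29, 4, 29, 4): within/between layer links
```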
for the purpose of our analysis, we use log-differentiated prices, volumes, implied volatility at days (iv ) and implied volatility at days (iv ). in particular, we use the fractional difference algorithm of [ ] to balance stationarity and residual memory in the data. in fact, the original time series have the full amount of memory but they are non-stationary, while integer log-differentiated data are stationary but have small residual memory due to the process of differentiation. in order to preserve the maximum amount of memory in the data, we use the fractional differentiation algorithm with different levels of fractional differentiation and then test for stationarity using the augmented dickey-fuller test [ ] . we find that all the data are stationary when the order of differentiation is α = . . this means that only a small amount of memory is lost in the process of differentiation. the tensor regression presented in eq. has some parameters to be set, i.e. the tucker rank and the shrinkage parameter λ for the penalized estimation of eq. as discussed in [ ] . regarding the tucker rank, we used the full rank specification since we do not want to reduce the number of independent links. in fact, using a reduced rank would imply common factors to be mapped together, an undesirable feature for this application. regarding the shrinkage parameter λ, we selected the value as follows. first, we split the data into a training set composed of % of the sample and a test set with the remaining %. we then estimated the regression coefficients for different values of λ on the training set and then computed the predicted r on the test set. we used a grid of λ = , , , , , . and the predicted r is maximized at λ = (no shrinkage). in this section, we show the results of the analysis carried out with the data presented in sect. . . the multilayer network built via the estimated tensor autoregression coefficient b represents the interconnections between and within each layer. in particular, b_{i,j,k,l} is the connection between stock i in layer j and stock k in layer l. it is important to notice that the estimated dependencies are in general not symmetric, i.e. b_{i,j,k,l} ≠ b_{k,j,i,l}. however, the multilayer network constructed using b is fully connected. for this reason, a method for filtering those networks is necessary. different methodologies are available for filtering information from complex networks [ , ] . in this paper, we use the polya filter of [ ] as it can handle directed weighted networks and it is both flexible and statistically driven. in fact, it employs a tuning parameter a that drives the strength of the filter and returns the p-values for the null hypotheses of random interactions. we filter every network independently (both intra and inter connections) using a parametrization such that % of the total links are removed. in order to assess the dependency across the layers, we analyze two standard multilayer network measures, i.e. inter-layer assortativity and edge overlapping. a standard way to quantify inter-layer assortativity is to calculate pearson's correlation coefficient over the degree sequences of two layers, and it represents a measure of association between layers. high positive (negative) values of such measure mean that the two risk factors act in the same (opposite) direction. instead, overlapping edges are the links between pairs of stocks present contemporaneously in two layers. high values of such measure mean that the stocks have common connection behaviour.
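the two diagnostics just defined can be computed directly from an estimated coefficient tensor, reading b[i, j, k, l] as a weighted directed link from stock i in layer j to stock k in layer l. the sketch below (reusing the tensor b from the previous sketch) is a straightforward reading of the definitions in the text; the absolute-value threshold is only a stand-in for the pólya filter, which is not reproduced here, and the layer indices in the usage lines are hypothetical.

```python
import numpy as np

def layer_adjacency(B, j, l, thr=0.0):
    """Binary adjacency between stocks, from layer j to layer l, keeping only
    links whose absolute weight exceeds thr (a stand-in for the filter)."""
    A = (np.abs(B[:, j, :, l]) > thr).astype(int)
    np.fill_diagonal(A, 0)
    return A

def interlayer_assortativity(B, j, l, thr=0.0):
    """Pearson correlation between the intralayer degree sequences of two layers."""
    dj = layer_adjacency(B, j, j, thr).sum(axis=1)
    dl = layer_adjacency(B, l, l, thr).sum(axis=1)
    return np.corrcoef(dj, dl)[0, 1]

def edge_overlap(B, j, l, thr=0.0):
    """Fraction of stock pairs linked simultaneously in layers j and l."""
    Aj = layer_adjacency(B, j, j, thr)
    Al = layer_adjacency(B, l, l, thr)
    both = np.logical_and(Aj, Al).sum()
    either = np.logical_or(Aj, Al).sum()
    return both / either if either else 0.0

print(interlayer_assortativity(B, j=0, l=2, thr=0.05))   # e.g. prices vs a volatility layer
print(edge_overlap(B, j=0, l=2, thr=0.05))
```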
as can be seen from fig. , prices and volatility have a huge portion of overlapping edges; still, these layers are disassortative, as the correlation between the degree sequences of the two layers is negative. this was an expected result, since the negative relationship between prices and volatility is a stylized fact in finance. not surprisingly, the two measures of volatility are highly assortative and have a huge fraction of overlapping edges. finally, we show in fig. the filtered multilayer network constructed via the tensor coefficient b estimated via the tensor autoregression of eq. . as can be noticed, the volumes layer has more interlayer connections than intralayer connections. since each link represents the effect that one variable has on itself and other variables in the future, this means that stocks' liquidity risk mostly influences future prices and expected uncertainty. the two volatility networks have a relatively small number of interlayer connections despite being assortative. this could be due to the fact that volatility risk tends to increase or decrease within a specific maturity rather than across maturities. it is also possible to notice that more central stocks, depicted as bigger nodes in fig. , have more connections, but this feature does not directly translate into a higher strength (depicted as darker colour of the nodes). this is a feature already emphasized in [ ] for financial networks.
fig. . estimated multilayer network. node colours: loglog scale; darker colour is associated to higher strength of the node. node size: loglog scale; bigger size is associated to higher k-coreness score. edge colour: uniform.
from a financial point of view, such a graphical representation puts together three different aspects of financial risk: market risk, liquidity risk (in terms of volumes exchanged) and forward-looking uncertainty measures, which account for expected volatility risk. in fact, the stocks in the volumes layer are not strongly interconnected but produce a huge amount of risk propagation through prices and volatility. understanding the dynamics of such a multilayer network representation would be a useful tool for risk managers in order to understand risk balances and propose risk mitigation techniques. in this paper, we proposed a methodology to build a multilayer network via the estimated coefficient of the tucker tensor autoregression of [ ] . this methodology, in combination with a filtering technique, has proven able to reproduce interconnections between different financial risk factors. these interconnections can be easily mapped to real financial mechanisms and can be a useful tool for monitoring risk, as the topology within and between layers can be strongly affected in distressed periods. in order to preserve the maximum memory information in the data while requiring stationarity, we made use of fractional differentiation and found that the variables analyzed are stationary with differentiation of order α = . . the model can be extended to a dynamic framework in order to analyze the dependency structures under different market conditions.
the multiplex dependency structure of financial markets risk diversification: a study of persistence with a filtered correlation-network approach systemic liquidity contagion in the european interbank market the structure and dynamics of multilayer networks unveil stock correlation via a new tensor-based decomposition method predicting multidimensional data via tensor learning a fast fractional difference algorithm a pólya urn approach to information filtering in complex networks tensor regression with applications in neuroimaging data analysis parsimonious tensor response regression tensor-on-tensor regression on the stability of inverse problems a decomposition of the tikhonov regularization functional oriented to exploit hybrid multilevel parallelism principal component analysis of three-mode data by means of alternating least squares algorithms introduction to statistical time series complex networks on hyperbolic surfaces key: cord- -mckqp v authors: ksieniewicz, paweł; goścień, róża; klinkowski, mirosław; walkowiak, krzysztof title: pattern recognition model to aid the optimization of dynamic spectrally-spatially flexible optical networks date: - - journal: computational science - iccs doi: . / - - - - _ sha: doc_id: cord_uid: mckqp v the following paper considers pattern recognition-aided optimization of complex and relevant problem related to optical networks. for that problem, we propose a four-step dedicated optimization approach that makes use, among others, of a regression method. the main focus of that study is put on the construction of efficient regression model and its application for the initial optimization problem. we therefore perform extensive experiments using realistic network assumptions and then draw conclusions regarding efficient approach configuration. according to the results, the approach performs best using multi-layer perceptron regressor, whose prediction ability was the highest among all tested methods. according to cisco forecasts, the global consumer traffic in the internet will grow on average with annual compound growth rate (cagr) of % in years - [ ] . the increase in the network traffic is a result of two main trends. firstly, the number of devices connected to the internet is growing due to the increasing popularity of new services including internet of things (iot ). the second important trend influencing the traffic in the internet is popularity of bandwidth demanding services such as video streaming (e.g., netflix ) and cloud computing. the internet consists of many single networks connected together, however, the backbone connecting these various networks are optical networks based on fiber connections. currently, the most popular technology in optical networks is wdm (wavelength division multiplexing), which is expected to be not efficient enough to support increasing traffic in the nearest future. in last few years, a new concept for optical networks has been deployed, i.e., architecture of elastic optical networks (eons). however, in the perspective on the next decade some new approaches must be developed to overcome the predicted "capacity crunch" of the internet. 
one of the most promising proposals is spectrally-spatially flexible optical network (ss-fon) that combines space division multiplexing (sdm) technology [ ] , enabling parallel transmission of co-propagating spatial modes in suitably designed optical fibers such as multi-core fibers (mcfs) [ ] , with flexible-grid eons [ ] that enable better utilization of the optical spectrum and distanceadaptive transmissions [ ] . in mcf-based ss-fons, a challenging issue is the inter-core crosstalk (xt) effect that impairs the quality of transmission (qot ) of optical signals and has a negative impact on overall network performance. in more detail, mcfs are susceptible to signal degradation as a result of the xt that happens between adjacent cores whenever optical signals are transmitted in an overlapping spectrum segment. addressing the xt constraints significantly complicates the optimization of ss-fons [ ] . besides numerous advantages, new network technologies bring also challenging optimization problems, which require efficient solution methods. since the technologies and related problems are new, there are no benchmark solution methods to be directly applied and hence many studies propose some dedicated optimization approaches. however, due to the problems high complexity, their performance still needs a lot of effort to be put [ , ] . we therefore observe a trend to use artificial intelligence techniques (with the high emphasis on pattern recognition tools) in the field of optimization of communication networks. according to the literature surveys in this field [ , , , ] , the researchers mostly focus on discrete labelled supervised and unsupervised learning problems, such as traffic classification. regression methods, which are in the scope of that paper, are mostly applied for traffic prediction and estimation of quality of transmission (qot ) parameters such as delay or bit error rate. this paper extends our study initiated in [ ] . we make use of pattern recognition models to aid optimization of dynamic mcf-based ss-fons in order to improve performance of the network in terms of minimizing bandwidth blocking probability (bbp), or in other words to maximize the amount of traffic that can be allocated in the network. in particular, an important topic in the considered optimization problem is selection of a modulation format (mf) for a particular demand, due to the fact that each mf provides a different tradeoff between required spectrum width and transmission distance. to solve that problem, we define applicable distances for each mf (i.e., minimum and maximum length of a routing path that is supported by each mf). to find values of these distances, which provide best allocation results, we construct a regression model and then combine it with monte carlo search. it is worth noting that this work does not address dynamic problems in the context of changing the concept over time, as is often the case with processing large sets, and assumes static distribution of the concept [ ] . the main novelty and contribution of the following work is an in-depth analysis of the basic regression methods stabilized by the structure of the estimator ensemble [ ] and assessment of their usefulness in the task of predicting the objective function for optimization purposes. in one of the previous works [ ] , we confirmed the effectiveness of this type of solution using a regression algorithm of the nearest weighted neighbors, focusing, however, much more on the network aspect of the problem being analyzed. 
in the present work, the main emphasis is on the construction of the prediction model. its main purpose is: -a proposal to interpret the optimization problem in the context of pattern recognition tasks. the rest of the paper is organized as follows. in sect. , we introduce studied network optimization problem. in sect. , we discuss out optimization approach for that problem. next, in sect. we evaluate efficiency of the proposed approach. eventually, sect. concludes the work. the optimization problem is known in the literature as dynamic routing, space and spectrum allocation (rssa) in ss-fons [ ] . we are given with an ss-fon topology realized using mcfs. the topology consists of nodes and physical link. each physical link comprises of a number of spatial cores. the spectrum width available on each core is divided into arrow and same-sized segments called slices. the network is in its operational state -we observe it in a particular time perspective given by a number of iterations. in each iteration (i.e., a time point), a set of demands arrives. each demand is given by a source node, destination node, duration (measured in the number of iterations) and bitrate (in gbps). to realize a demand, it is required to assign it with a light-path and reserve its resources for the time of the demand duration. when a demand expires, its resources are released. a light-path consists of a routing path (a set of links connecting demand source and destination nodes) and a channel (a set of adjacent slices selected on one core) allocated on the path links. the channel width (number of slices) required for a particular demand on a particular routing path depends on the demand bitrate, path length (in kilometres) and selected modulation format. each incoming demand has to be realized unless there is not enough free resources when it arrives. in such a case, a demand is rejected. please note that the selected light-paths in i -th iteration affect network state and allocation possibilities in the next iterations. the objective function is defined here as bandwidth blocking probability (bbp) calculated as a summed bitrate of all rejected demands divided by the summed bitrate of all offered demands. since we aim to support as much traffic as it is possible, the objective criterion should be minimized [ , ] . the light-paths' allocation process has to satisfy three basic rssa constraints. first, each channel has to consists of adjacent slices. second, the same channel (i.e., the same slices and the same core) has to be allocated on each link included in a light-path. third, in each time point each slice on a particular physical link and a particular core can be used by at most one demand [ ] . there are four modulation formats available for transmissions- -qam, -qam, qpsk and bpsk. each format is described by its spectral efficiency, which determines number of slices required to realize a particular bitrate using that modulation. however, each modulation format is also characterized by the maximum transmission distance (mtd) which provides acceptable value of optical signal to noise ratio (osnr) at the receiver side. more spectrally-efficient formats consume less spectrum, however, at the cost of shorter mtds. moreover, more spectrally-efficient formats are also vulnerable to xt effects which can additionally degrade qot and lead to demands' rejection [ , ] . therefore, the selection of the modulation format for each demand is a compromise between spectrum efficiency and qot. 
to answer that problem, we use the procedure introduced in [ ] to select a modulation format for a particular demand and routing path [ ] . let m = 1, 2, 3, 4 denote the modulation formats ordered in increasing mtds (and in decreasing spectral efficiency at the same time). it means that m = 1 denotes -qam and m = 4 denotes bpsk. let mtd = [mtd_1, mtd_2, mtd_3, mtd_4] be the vector of mtds for modulations -qam, -qam, qpsk, bpsk, respectively. moreover, let atd = [atd_1, atd_2, atd_3, atd_4] (where atd_i <= mtd_i, i = 1, 2, 3, 4) be the vector of applicable transmission distances. for a particular demand and a routing path, we select the most spectrally-efficient modulation format i for which atd_i is greater than or equal to the selected path length and the xt effect is on an acceptable level. for each candidate modulation format, we assess the xt level based on the adjacent resources' (i.e., slices and cores) availability using the procedure proposed in [ ] . it is important to note that we do not indicate atd_4 (for bpsk), since we assume that this modulation is able to support transmission on all candidate routing paths regardless of their length. please also note that when the xt level is too high for all modulation formats, the demand is rejected regardless of the light-paths' availability. in sect. we have studied the rssa problem and emphasised the importance of the efficient modulation selection task. for that task, we have proposed a solution method whose efficiency strongly depends on the applied atd vector. therefore, we aim to find the atd* vector that provides the best results. the vector elements have to be positive and have upper bounds given by the vector mtd. moreover, the following condition has to be satisfied: atd_i < atd_i+1. since solving rssa instances is a time consuming process, it is impossible to evaluate all possible atd vectors in a reasonable time. we therefore make use of regression methods and propose a scheme to find atd*, depicted in fig. . a representative set of different atd vectors is generated. then, for each of them we simulate the allocation of demands in the ss-fon (i.e., we solve dynamic rssa). for the purpose of demand allocation (i.e., selection of light-paths), we use a dedicated algorithm proposed in [ ] . for each considered atd vector we save the obtained bbp. based on that data, we construct a regression model, which predicts bbp based on an atd vector. having that model, we use the monte carlo method to find the atd* vector, which is recommended for further experiments. to solve an rssa instance for a particular atd vector, we use the heuristic algorithm proposed in [ ] . we work under the assumption that there are candidate routing paths for each traffic demand (generated using the dijkstra algorithm). since the paths are generated in advance and their lengths are known, we can use an atd vector and preselect modulation formats for these paths based on the procedure discussed in sect. . therefore, rssa is reduced to the selection of one of the candidate routing paths and a communication channel with respect to the resource availability and assessed xt levels. from the perspective of pattern recognition methods, the abstraction of the problem is not the key element of processing. the main focus here is the representation available to construct a proper decision model. for the purposes of considerations, we assume that both input parameters and the objective function take only quantitative and not qualitative values, so we may use probabilistic pattern recognition models to process them.
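the atd-based format selection rule described above admits a small sketch. the distances, the number of formats and the crosstalk check are placeholders here: the xt assessment is abstracted as a boolean callback, since the actual procedure from the cited work is not reproduced.

def select_modulation(path_length_km, atd, xt_acceptable):
    """pick the most spectrally-efficient format whose applicable distance covers the path.

    atd: applicable transmission distances ordered from the most to the least
         spectrally-efficient format; the last entry (bpsk) is treated as unlimited.
    xt_acceptable: callable m -> bool giving the assessed crosstalk check for format m.
    """
    n_formats = len(atd)
    for m in range(n_formats):
        covers = (m == n_formats - 1) or (path_length_km <= atd[m])
        if covers and xt_acceptable(m):
            return m                      # index of the selected format
    return None                           # demand rejected: xt too high for every format

# toy usage: four formats with assumed distances, crosstalk always acceptable here
print(select_modulation(700.0, atd=[600.0, 1200.0, 3000.0, float("inf")],
                        xt_acceptable=lambda m: True))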
if we interpret the optimization task as searching for the extreme function of many input parameters, each simulation performed for their combination may also be described as a label for the training set of supervised learning model. in this case, the set of parameters considered in a single simulation becomes a vector of object features (x n ), and the value of the objective function acquired around it may be interpreted as a continuous object label (y n ). repeated simulation for randomly generated parameters allows to generate a data set (x) supplemented with a label vector (y). a supervised machine learning algorithm can therefore gain, based on such a set, a generalization abilities that allows for precise estimation of the simulation result based on its earlier runs on the random input values. a typical pattern recognition experiment is based on the appropriate division of the dataset into training and testing sets, in a way that guarantees their separability (most often using cross-validation), avoiding the problem of data peeking and a sufficient number of repetitions of the validation process to allow proper statistical testing of mutual model dependencies hypotheses. for the needs of the proposal contained in this paper, the usual -fold cross validation was adopted, which calculates the value of the r metric for each loop of the experiment. having constructed regression model, we are able to predict bbp value for a sample atd vector. please note that the time required for a single prediction is significantly shorter that the time required to simulate a dynamic rssa. the last step of our optimization procedure is to find atd * -vector providing lowest estimated bbp values. to this end, we use monte carlo method with a number of guesses provided by the user. the rssa problem was solved for two network topologies-dt ( nodes, links) and euro ( nodes, links). they model deutsche telecom (german national network) and european network, respectively. each network physical link comprised of cores wherein each of the cores offers frequency slices of . ghz width. we use the same network physical assumptions and xt levels and assessments as in [ ] . traffic demands have randomly generated end nodes and birates uniformly distributed between gbps and tbps, with granularity of gbps. their arrival follow poisson process with an average arrival rate λ demands per time unit. the demand duration is generated according to a negative exponential distribution with an average of /μ. the traffic load offered is λ/μ normalized traffic units (ntus). for each testing scenario, we simulate arrival of demands. four modulations are available ( -qam, -qam, qpsk, bpsk) wherein we use the same modulation parameters as in [ ] . for each topology we have generated different datasets, each consists of samples of atd vector and corresponding bbp. the datasets differ with the xt coefficient (μ = · − indicated as "xt ", μ = · − indicated as "xt ", for more details we refer to [ ] ) and network links scaling factor (the multiplier used to scale lengths of links in order to evaluate if different lengths of routing paths influence performance of the proposed approach). for dt we use following scaling factors: . , . , . , . . . , . . for euro the values are as follows: . , . , . , . , . , . , . , . , . . we indicate them as "sx.xxx " where x.xxx refers to the scaling factor value. 
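the regression-plus-monte-carlo pipeline described above could look roughly as follows. this is a sketch under stated assumptions, not the authors' code: the mtd bounds, the use of three free atd components, and the random labels standing in for simulated bbp values are all placeholders, and the cross-validation and monte carlo steps use scikit-learn and numpy directly.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
mtd = np.array([600.0, 1200.0, 3000.0])            # assumed upper bounds for atd_1..atd_3
# rows are atd vectors; sorting each row enforces atd_1 <= atd_2 <= atd_3, and because the
# bounds are increasing the per-column bounds are preserved
x = np.sort(rng.uniform(0.0, mtd, size=(1000, 3)), axis=1)
y = rng.random(1000)                                # placeholder for bbp from the rssa simulator

model = MLPRegressor(hidden_layer_sizes=(100,), activation="relu", solver="adam", max_iter=2000)
print("cross-validated r2:", cross_val_score(model, x, y, cv=10, scoring="r2").mean())

# monte carlo search: sample candidate atd vectors and keep the one with the lowest predicted bbp
model.fit(x, y)
guesses = np.sort(rng.uniform(0.0, mtd, size=(1000, 3)), axis=1)
best = guesses[np.argmin(model.predict(guesses))]
print("recommended atd vector:", best)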
using these datasets we can evaluate whether xt coefficient (i.e., level of the vulnerability to xt effects) and/or average link length influence optimization approach performance. the experimental environment for the construction of predictive models, including the implementation of the proposed processing method, was implemented in python, following the guidelines of the state-of-art programming interface of the scikit-learn library [ ] . statistical dependency assessment metrics for paired tests were calculated according to the wilcoxon test, according to the implementation contained in scipy module. each of the individual experiments was evaluated by r score -a typical quality assessment metric for regression problems. the full source code, supplemented with employed datasets is publicly available in a git repository . five simple recognition models were selected as the base experimental estimators: knr-k-nearest neighbors regressor with five neighbors, leaf size of and euclidean metric approximated by minkowski distance, -dknr-knr regressor weighted by distance from closest patterns, mlp-a multilayer perceptron with one hidden layer of one hundred neurons, with the relu activation function and adam optimizer, dtr-cart tree with mse split criterion, lin-linear regression algorithm. in this section we evaluate performance of the proposed optimization approach. to this end, we conduct three experiments. experiment focuses on the number of patterns required to construct a reliable prediction model. experiment assesses the statistical dependence of built models. eventually, experiment verifies efficiency of the proposed approach as a function of number of guesses in the monte carlo search. the first experiment carried out as part of the approach evaluation is designed to verify how many patterns -and thus how many repetitions of simulations -must be passed to individual regression algorithms to allow the construction of a reliable prediction model. the tests were carried out on all five considered regressors in two stages. first, the range from to patterns was analyzed, and in the second, from to patterns per processing. it is important to note that due to the chosen approach to cross-validation, in each case the model is built on % of available objects. the analysis was carried out independently on all available data sets, and due to the non-deterministic nature of sampling of available patterns, its results were additionally stabilized by repeating a choice of the objects subset five times. in order to allow proper observations, the results were averaged for both topologies. plots for the range from to patterns were additionally supplemented by marking ranges of standard deviation of r metric acquired within the topology and presented in the range from the . value. the results achieved for averaging individual topologies are presented in figs. and . for dt topology, mlp and dtr algorithms are competitively the best models, both in terms of the dynamics of the relationship between the number of patterns and the overall regression quality. the linear regression clearly stands out from the rate. a clear observation is also the saturation of the models, understood by approaching the maximum predictive ability, as soon as around patterns in the data set. the best algorithms already achieve quality within . , and with patterns they stabilize around . . 
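the five base estimators listed above map directly onto scikit-learn classes; a minimal instantiation is sketched below. hyperparameters reflect the text where stated (five neighbours, one hidden layer of one hundred neurons, relu, adam, distance weighting for dknr); the knr leaf size is left at the library default because its value is not shown, and the decision tree uses the library's default squared-error split, which corresponds to the mse criterion mentioned.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

estimators = {
    # euclidean metric approximated by the minkowski distance with p = 2
    "knr": KNeighborsRegressor(n_neighbors=5, metric="minkowski", p=2),
    # same as knr but weighted by the distance from the closest patterns
    "dknr": KNeighborsRegressor(n_neighbors=5, metric="minkowski", p=2, weights="distance"),
    "mlp": MLPRegressor(hidden_layer_sizes=(100,), activation="relu", solver="adam"),
    "dtr": DecisionTreeRegressor(),     # cart tree, default squared-error (mse) split criterion
    "lin": LinearRegression(),
}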
the relationship between each of the recognition algorithms and the number of patterns takes the form of a logarithmic curve in which, after fast initial growth, each subsequent object gives less and less potential for improving the quality of prediction. this suggests that it is not necessary to carry out further simulations to extend the training set, because it will not significantly affect the predictive quality of the developed model. very similar observations may be made for euro topology, however, noting that it seems to be a simpler problem, allowing faster achievement of the maximum model predictive capacity. it is also worth noting here the fact that the standard deviation of results obtained by mlp is smaller, which may be equated with the potentially greater stability of the model achieved by such a solution. the second experiment extends the research contained in experiment by assessing the statistical dependence of models built on a full datasets consisting of a thousand samples for each case. the results achieved are summarized in tables a and b. as may be seen, for the dt topology, the lin algorithm clearly deviates negatively from the other methods, in absolutely every case being a worse solution than any of the others, which leads to the conclusion that we should completely reject it from considering as a base for a stable recognition model. algorithms based on neighborhood (knr and dknr) are in the middle of the rate, in most cases statistically giving way to mlp and dtr, which would also suggest departing from them in the construction of the final model. the statistically best solutions, almost equally, in this case are mlp and dtr. for euro topology, the results are similar when it comes to lin, knr and dknr approaches. a significant difference, however, may be seen for the achievements of dtr, which in one case turns out to be the worst in the rate, and in many is significantly worse than mlp. these observations suggest that in the final model for the purposes of optimization lean towards the application of neural networks. what is important, the highest quality prediction does not exactly mean the best optimization. it is one of the very important factors, but not the only one. it is also necessary to be aware of the shape of the decision function. for this purpose, the research was supplemented with visualizations contained in fig. . algorithms based on neighborhood (knn, dknn) and decision trees (dtr) are characterized by a discrete decision boundary, which in the case of visualization resembles a picture with a low level of quantization. in the case of an ensemble model, stabilized by cross-validation, actions are taken to reduce this property in order to develop as continuous a border as possible. as may be seen in the illustrations, compensation occurs, although in the case of knn and dknn leads to some disturbances in the decision boundary (interpreted as thresholding the predicted label value), and for the dtr case, despite the general correctness of the performed decisions, it generates image artifacts. such a model may still retain high predictive ability, but it has too much tendency to overfit and leads to insufficient continuity of the optimized function to perform effective optimization. clear decision boundaries are implemented by both the lin and mlp approaches. 
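the paired statistical comparison used in the second experiment (the wilcoxon test over cross-validation folds, as computed with scipy) can be sketched as follows. the data here are random stand-ins for the atd/bbp samples, so the printed p-value is only illustrative.

import numpy as np
from scipy.stats import wilcoxon
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.random((200, 3))
y = rng.random(200)                       # stand-ins for atd vectors and simulated bbp values

# r2 scores of two regressors over the same 10 cross-validation folds
scores_mlp = cross_val_score(MLPRegressor(hidden_layer_sizes=(100,), max_iter=2000),
                             x, y, cv=10, scoring="r2")
scores_dtr = cross_val_score(DecisionTreeRegressor(), x, y, cv=10, scoring="r2")

# paired wilcoxon signed-rank test on the fold-wise differences
stat, p_value = wilcoxon(scores_mlp, scores_dtr)
print(f"wilcoxon statistic = {stat:.3f}, p-value = {p_value:.3f}")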
however, it is necessary to reject lin from processing due to the linear nature of the prediction, which (i) in each optimization will lead to the selection of the extreme value of the analyzed range and (ii) is not compatible with the distribution of the explained variable and must have the largest error in each of the optima. summing up the observations of experiments and , the mlp algorithm was chosen as the base model for the optimization task. it is characterized by (i) the statistically best predictive ability among the methods analyzed and (ii) the clearest decision function from the perspective of the optimization task. the last experiment focuses on finding the best atd vector based on the constructed regression model. to this end, we use the monte carlo method with different numbers of guesses. tables and present the obtained results as a function of the number of guesses, which changes from up to . the quality of the results increases with the number of guesses up to some threshold value. then, the results do not change at all or change only a little. according to the presented values, the monte carlo method applied with guesses provides satisfactory results. we therefore recommend that value for further experiments. the following work has considered the topic of employing pattern recognition methods to support the ss-fon optimization process. for a wide pool of generated cases, analyzing two real network topologies, the effectiveness of solutions implemented by five different, typical regression methods was analyzed, starting from linear regression and ending with neural networks. the conducted experimental analysis shows, with high probability obtained by conducting proper statistical validation, that mlp is characterized by the greatest potential in this type of solution. even with a relatively small pool of input simulations constructing a data set for learning purposes, interpretable in both the space of optimization and machine learning problems, simple networks of this type achieve both high quality prediction measured by the r metric and a continuous decision space creating the potential for conducting optimization. basing the model on the stabilization realized by using an ensemble of estimators additionally allows one to reduce the influence of noise on the optimization, which, in state-of-the-art optimization methods, could show a tendency to select invalid optima, burdened by the nondeterministic character of the simulator. further research, developing the ideas presented in this article, will focus on the generalization of the presented model for a wider pool of network optimization problems.
high-capacity transmission over multi-core fibers a comprehensive survey on machine learning for networking: evolution, applications and research opportunities visual networking index: forecast and trends elastic optical networking: a new dawn for the optical layer on the efficient dynamic routing in spectrally-spatially flexible optical networks on the complexity of rssa of any cast demands in spectrally-spatially flexible optical networks machine learning assisted optimization of dynamic crosstalk-aware spectrallyspatially flexible optical networks survey of resource allocation schemes and algorithms in spectrally-spatially flexible optical networking data stream classification using active learned neural networks artificial intelligence (ai) methods in optical networks: a comprehensive survey an overview on application of machine learning techniques in optical networks scikit-learn: machine learning in python machine learning for network automation: overview, architecture, and applications survey and evaluation of space division multiplexing: from technologies to optical networks modeling and optimization of cloud-ready and content-oriented networks. ssdc classifier selection for highly imbalanced data streams with minority driven ensemble key: cord- -shauvo j authors: kruglov, vasiliy n. title: using open source libraries in the development of control systems based on machine vision date: - - journal: open source systems doi: . / - - - - _ sha: doc_id: cord_uid: shauvo j the possibility of the boundaries detection in the images of crushed ore particles using a convolutional neural network is analyzed. the structure of the neural network is given. the construction of training and test datasets of ore particle images is described. various modifications of the underlying neural network have been investigated. experimental results are presented. when processing crushed ore mass at ore mining and processing enterprises, one of the main indicators of the quality of work of both equipment and personnel is the assessment of the size of the crushed material at each stage of the technological process. this is due to the need to reduce material and energy costs for the production of a product unit manufactured by the plant: concentrate, sinter or pellets. the traditional approach to the problem of evaluating the size of crushed material is manual sampling with subsequent sieving with sieves of various sizes. the determination of the grain-size distribution of the crushed material in this way entails a number of negative factors: -the complexity of the measurement process; -the inability to conduct objective measurements with sufficient frequency; -the human error factor at the stages of both data collection and processing. these shortcomings do not allow you to quickly adjust the performance of crushing equipment. the need for obtaining data on the coarseness of crushed material in real time necessitated the creation of devices for in situ assessment of parameters such as the grain-size distribution of ore particles, weight-average ore particle and the percentage of the targeted class. the machine vision systems are able to provide such functionality. they have high reliability, performance and accuracy in determining the geometric dimensions of ore particles. at the moment, several vision systems have been developed and implemented for the operational control of the particle size distribution of crushed or granular material. 
in [ ] , a brief description and comparative analysis of such systems as: split, wipfrag, fragscan, cias, ipacs, tucips is given. common to the algorithmic part of these systems is the stage of dividing the entire image of the crushed ore mass into fragments corresponding to individual particles with the subsequent determination of their geometric sizes. such a segmentation procedure can be solved by different approaches, one of which is to highlight the boundaries between fragments of images of ore particles. classical methods for borders highlighting based on the assessment of changes in brightness of neighboring pixels, which implies the use of mathematical algorithms based on differentiation [ , ] . figure shows typical images of crushed ore moving on a conveyor belt. algorithms from the opencv library, the sobel and canny filters in particular, used to detect borders on the presented images, have identified many false boundaries and cannot be used in practice. this paper presents the results of recognizing the boundaries of images of stones based on a neural network. this approach has been less studied and described in the literature, however, it has recently acquired great significance in connection with its versatility and continues to actively develop with the increasing of a hardware performance [ , ] . to build a neural network and apply machine learning methods, a sample of images of crushed ore stones in gray scale was formed. the recognition of the boundaries of the ore particles must be performed for stones of arbitrary size and configuration on a video frame with ratio × pixels. to solve this problem with the help of neural networks, it is necessary to determine what type of neural network to use, what will be the input information and what result we want to get as the output of the neural network processing. analysis of literary sources showed that convolutional neural networks are the most promising when processing images [ , [ ] [ ] [ ] . convolutional neural network is a special architecture of artificial neural networks aimed at efficient pattern recognition. this architecture manages to recognize objects in images much more accurately, since, unlike the multilayer perceptron, two-dimensional image topology is considered. at the same time, convolutional networks are resistant to small displacements, zooming, and rotation of objects in the input images. it is this type of neural network that will be used in constructing a model for recognizing boundary points of fragments of stone images. algorithms for extracting the boundaries of regions as source data use image regions having sizes of × or × . if the algorithm provides for integration operations, then the window size increases. an analysis of the subject area for which this neural network is designed (a cascade of secondary and fine ore crushing) showed: for images of × pixels and visible images of ore pieces, it is preferable to analyze fragments with dimensions of × pixels. thus, the input data for constructing the boundaries of stones areas will be an array of images consisting of ( − )*( − ) = halftone fragments measuring × pixels. in each of these fragments, the central point either belongs to the boundary of the regions or not. based on this assumption, all images can be divided into two classes. to mark the images into classes on the source images, the borders of the stones were drawn using a red line with a width of pixels. this procedure was performed manually with the microsoft paint program. 
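the fragment labelling rule described above (central pixel on a red-marked boundary versus not) could look roughly like the sketch below; the script described in the following paragraph plays this role in the original work. the patch size, the red-colour threshold and the file handling are assumptions made for illustration only.

import numpy as np
from PIL import Image

PATCH = 25                     # assumed fragment size in pixels; the original value is not shown
HALF = PATCH // 2

def extract_labeled_patches(gray_path, marked_path):
    gray = np.asarray(Image.open(gray_path).convert("L"))
    marked = np.asarray(Image.open(marked_path).convert("RGB"))
    # a pixel belongs to the boundary class if it was painted red in the markup image
    r, g, b = marked[..., 0], marked[..., 1], marked[..., 2]
    is_boundary = (r > 200) & (g < 100) & (b < 100)
    patches, labels = [], []
    h, w = gray.shape
    # plain double loop for clarity; a strided or vectorized extraction would be faster
    for y in range(HALF, h - HALF):
        for x in range(HALF, w - HALF):
            patches.append(gray[y - HALF : y + HALF + 1, x - HALF : x + HALF + 1])
            labels.append(int(is_boundary[y, x]))
    return np.array(patches), np.array(labels)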
an example of the original and marked image is shown in fig. . a python script was then developed, which processed the original image to extract × pixel fragments and, based on the markup image, sorted the fragments into classes, saving them in different directories. to write the scripts, we used the python programming language and the jupyter notebook ide. thus, two data samples were obtained: a training dataset and a test dataset for the assessment of the network accuracy. as noted above, the architecture of the neural network was built on a convolutional principle. the structure of the basic network architecture is shown in fig. [ ] . the network includes an input layer in the format of the tensor × × . this is followed by several convolutional and pooling layers. after that, the network is flattened into one fully connected layer, the outputs of which converge into one neuron, to which the activation function, the sigmoid, is applied. at the output, we obtain the probability that the center point of the input fragment belongs to the "boundary point" class. the keras open source library was used to develop and train the convolutional neural network [ , , , ] . the basic convolutional neural network was trained with the following parameters:
- number of epochs: ;
- loss function: binary cross-entropy;
- quality metric: accuracy (percentage of correct answers);
- optimization algorithm: rmsprop.
the accuracy on the reference data set provided by the base model is . %. in order to improve the accuracy of predictions, a script was written that trains models on several configurations and also checks the quality of the model on a test dataset. to improve the accuracy of the predictions of the convolutional neural network, the following parameters were varied with respect to the base model:
- increasing the number of layers: + convolutional, + pooling;
- increasing the number of filters: + in each layer;
- increasing the size of the filter up to * ;
- increasing the number of epochs up to ;
- decreasing the number of layers.
these modifications of the base convolutional neural network did not lead to an improvement in its performance: all modified models had worse quality on the test sample (in the region of - % accuracy). the convolutional neural network model which showed the best quality was the base model. its quality on the training sample is estimated at . %, and on the test sample at %. none of the other models were able to surpass this figure. data on accuracy and error per epoch are shown in figs. and . if training continues for more than epochs, the effect of overfitting occurs: the error drops and accuracy increases only on the training samples, but not on the test ones. figure shows examples of images with the boundaries produced by the neural network. as can be seen from the images, not all the borders are closed. the boundary discontinuities are too large to be closed using morphological operations on binary masks; however, the use of the "watershed" algorithm [ ] will reduce the identification error of the boundary points. in this work, a convolutional neural network was developed and tested to recognize boundaries on images of crushed ore stones. for the task of constructing a convolutional neural network model, two data samples were generated: a training and a test dataset. when building the model, the basic version of the convolutional neural network structure was implemented. in order to improve the quality of model recognition, a configuration of various models was devised with deviations from the basic architecture.
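a keras sketch of a base model of this kind is given below. the number of convolutional blocks, the filter counts and the dense layer width are assumptions, since the original architecture is only described qualitatively; the loss, metric and optimizer follow the training parameters listed above.

from tensorflow import keras
from tensorflow.keras import layers

PATCH = 25       # assumed input fragment size; the original tensor dimensions are not shown

# minimal convolutional base model: conv/pooling blocks, a dense layer and a sigmoid output
model = keras.Sequential([
    keras.Input(shape=(PATCH, PATCH, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # probability that the centre pixel is a boundary point
])

# loss, metric and optimizer as described for the base network
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=..., validation_data=(x_test, y_test))   # training call sketch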
an algorithm for training and searching for the best model by enumerating configurations was implemented. in the course of the research, it was found that the basic model has the best quality for recognizing boundary points. it shows the accuracy of the predictions for the targeted class at %. based on the drawn borders on the test images, it can be concluded that the convolutional neural network is able to correctly identify the boundary points with a high probability. it rarely makes mistakes for cases when there is no boundary (false positive), but often makes mistakes when recognizing real boundary points (false negative). the boundary breaks are too large to be closed using morphological operations on binary masks, however, the use of the "watershed" algorithm will reduce the identification error for boundary points. funding. the work was performed under state contract Γc / , grant from the fasie. keras: the python deep learning library deep learning with python, st edn machine learning: the art and science of algorithms that make sense of data digital image processing hands-on machine learning with scikit-learn and tensorflow: concepts, tools, and techniques to build intelligent systems deep learning with keras: implement neural networks with keras on theano and tensorflow comprehensive guide to convolutional neural networks -the eli way image processing, analysis and machine vision identifying, visualizing, and comparing regions in irregularly spaced d surface data python data science handbook: essential tools for working with data, st edn key: cord- -zyjd rmp authors: peixoto, tiago p. title: network reconstruction and community detection from dynamics date: - - journal: nan doi: . /physrevlett. . sha: doc_id: cord_uid: zyjd rmp we present a scalable nonparametric bayesian method to perform network reconstruction from observed functional behavior that at the same time infers the communities present in the network. we show that the joint reconstruction with community detection has a synergistic effect, where the edge correlations used to inform the existence of communities are also inherently used to improve the accuracy of the reconstruction which, in turn, can better inform the uncovering of communities. we illustrate the use of our method with observations arising from epidemic models and the ising model, both on synthetic and empirical networks, as well as on data containing only functional information. the observed functional behavior of a wide variety largescale system is often the result of a network of pairwise interactions. however, in many cases, these interactions are hidden from us, either because they are impossible to measure directly, or because their measurement can be done only at significant experimental cost. examples include the mechanisms of gene and metabolic regulation [ ] , brain connectivity [ ] , the spread of epidemics [ ] , systemic risk in financial institutions [ ] , and influence in social media [ ] . in such situations, we are required to infer the network of interactions from the observed functional behavior. 
researchers have approached this reconstruction task from a variety of angles, resulting in many different methods, including thresholding the correlation between time series [ ] , inversion of deterministic dynamics [ ] [ ] [ ] , statistical inference of graphical models [ ] [ ] [ ] [ ] [ ] and of models of epidemic spreading [ ] [ ] [ ] [ ] [ ] [ ] , as well as approaches that avoid explicit modeling, such as those based on transfer entropy [ ] , granger causality [ ] , compressed sensing [ ] [ ] [ ] , generalized linearization [ ] , and matching of pairwise correlations [ , ] . in this letter, we approach the problem of network reconstruction in a manner that is different from the aforementioned methods in two important ways. first, we employ a nonparametric bayesian formulation of the problem, which yields a full posterior distribution of possible networks that are compatible with the observed dynamical behavior. second, we perform network reconstruction jointly with community detection [ ] , where, at the same time as we infer the edges of the underlying network, we also infer its modular structure [ ] . as we will show, while network reconstruction and community detection are desirable goals on their own, joining these two tasks has a synergistic effect, whereby the detection of communities significantly increases the accuracy of the reconstruction, which in turn improves the discovery of the communities, when compared to performing these tasks in isolation. some other approaches combine community detection with functional observation. berthet et al. [ ] derived necessary conditions for the exact recovery of group assignments for dense weighted networks generated with community structure given observed microstates of an ising model. hoffmann et al. [ ] proposed a method to infer community structure from time-series data that bypasses network reconstruction by employing a direct modeling of the dynamics given the group assignments, instead. however, neither of these approaches attempt to perform network reconstruction together with community detection. furthermore, they are tied down to one particular inverse problem, and as we will show, our general approach can be easily extended to an open-ended variety of functional models. bayesian network reconstruction.-we approach the network reconstruction task similarly to the situation where the network edges are measured directly, but via an uncertain process [ , ] : if d is the measurement of some process that takes place on a network, we can define a posterior distribution for the underlying adjacency matrix a via bayes' rule, p(a|d) = p(d|a)p(a)/p(d), where p(d|a) is an arbitrary forward model for the dynamics given the network, p(a) is the prior information on the network structure, and p(d) = Σ_a p(d|a)p(a) is a normalization constant comprising the total evidence for the data d. we can unite reconstruction with community detection via an, at first, seemingly minor, but ultimately consequential modification of the above equation, where we introduce a structured prior p(a|b), where b represents the partition of the network in communities, i.e. b = {b_i}, where b_i ∈ {1, …, b} is the group membership of node i. this partition is unknown, and is inferred together with the network itself, via the joint posterior distribution p(a, b|d) = p(d|a)p(a|b)p(b)/p(d). the prior p(a|b) is an assumed generative model for the network structure.
in our work, we will use the degree-corrected stochastic block model (dc-sbm) [ ] , which assumes that, besides differences in degree, nodes belonging to the same group have statistically equivalent connection patterns, according to a joint probability with λ_rs determining the average number of edges between groups r and s and κ_i the average degree of node i. the marginal prior is obtained by integrating over all remaining parameters weighted by their respective prior distributions, which can be computed exactly for standard prior choices, although it can be modified to include hierarchical priors that have an improved explanatory power [ ] (see supplemental material [ ] for a concise summary). the use of the dc-sbm as a prior probability in eq. ( ) is motivated by its ability to inform link prediction in networks where some fraction of edges have not been observed or have been observed erroneously [ , ] . the latent conditional probabilities of edges existing between groups of nodes are learned by the collective observation of many similar edges, and these correlations are leveraged to extrapolate the existence of missing or spurious ones. the same mechanism is expected to aid the reconstruction task, where edges are not observed directly, but the observed functional behavior yields a posterior distribution on them, allowing the same kind of correlations to be used as an additional source of evidence for the reconstruction, going beyond what the dynamics alone says. our reconstruction approach is finalized by defining an appropriate model for the functional behavior, determining p(d|a). here, we will consider two kinds of indirect data. the first comes from a susceptible-infected-susceptible (sis) epidemic spreading model [ ] , where σ_i(t) = 1 means node i is infected at time t, 0 otherwise. the likelihood for this model is a product over nodes and time steps of the transition probabilities for node i at time t, with f(p, σ) = (1 − p)^σ p^(1−σ), and where m_i(t) = Σ_j a_ij ln(1 − τ_ij) σ_j(t) is the contribution from all neighbors of node i to its infection probability at time t. in the equations above, the value τ_ij is the probability of an infection via an existing edge (i, j), and γ is the 1 → 0 recovery probability. with these additional parameters, the full posterior distribution for the reconstruction becomes p(a, b, τ|σ) = p(σ|a, τ)p(τ)p(a|b)p(b)/p(σ). since τ_ij ∈ [0, 1], we use the uniform prior p(τ) = 1. note, also, that the recovery probability γ plays no role in the reconstruction algorithm, since its term in the likelihood does not involve a [and, hence, gets cancelled out in the denominator p(σ|γ) = p(γ|σ)p(σ)/p(γ)]. this means that the above posterior only depends on the infection events 0 → 1 and, thus, is also valid without any modifications for all epidemic variants susceptible-infected (si), susceptible-infected-recovered (sir), susceptible-exposed-infected-recovered (seir), etc. [ ] , since the infection events occur with the same probability for all these models. the second functional model we consider is the ising model, where spin variables on the nodes s ∈ {−1, 1}^n are sampled according to a joint distribution in which β is the inverse temperature, j_ij is the coupling on edge (i, j), h_i is a local field on node i, and z(a, β, j, h) = Σ_s exp(β Σ_i<j a_ij j_ij s_i s_j + β Σ_i h_i s_i) is the normalization (partition function). as a simple baseline for comparison, one can threshold the pairwise correlations, setting a_ij = 1 if c_ij > c*, and 0 otherwise; the value of c* was chosen to maximize the posterior similarity, which represents the best possible reconstruction achievable with this method. nevertheless, the network thus obtained is severely distorted.
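the contribution m_i(t) defined above can be turned into a small log-likelihood sketch for the infection events. this is only an illustration under the usual assumption that each infected neighbour transmits independently with probability τ_ij, so a susceptible node escapes infection with probability exp(m_i(t)); the recovery terms in γ are omitted, as the text notes they do not involve the adjacency matrix, and the toy data are random stand-ins.

import numpy as np

def sis_infection_loglik(sigma, a, tau):
    # sigma: (T, N) array of 0/1 node states; a: (N, N) adjacency; tau: (N, N) infection probabilities
    logq = np.log(np.clip(1.0 - tau, 1e-12, 1.0))      # ln(1 - tau_ij)
    loglik = 0.0
    for t in range(sigma.shape[0] - 1):
        m = (a * logq) @ sigma[t]                      # m_i(t) = sum_j a_ij ln(1 - tau_ij) sigma_j(t)
        p_inf = 1.0 - np.exp(m)                        # infection probability for a susceptible node
        p = np.where(sigma[t + 1] == 1, p_inf, 1.0 - p_inf)
        susceptible = sigma[t] == 0
        loglik += np.sum(np.log(np.clip(p[susceptible], 1e-12, None)))
    return loglik

# toy usage on a random cascade
rng = np.random.default_rng(0)
n = 20
a = (rng.random((n, n)) < 0.1).astype(float)
tau = np.full((n, n), 0.3)
sigma = (rng.random((5, n)) < 0.2).astype(int)         # placeholder state sequence
print(sis_infection_loglik(sigma, a, tau))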
the inverse correlation method comes much closer to the true network, but is superseded by the joint inference with community detection. empirical dynamics.-we turn to the reconstruction from observed empirical dynamics with unknown underlying interactions. the first example is the sequence of m = votes of n = deputies in the to session of the lower chamber of the brazilian congress. each deputy voted yes, no, or abstained for each legislation, which we represent as {1, −1, 0}, respectively. since the temporal ordering of the voting sessions is likely to be of secondary importance to the voting outcomes, we assume the votes are sampled from an ising model [the addition of zero-valued spins changes eq. ( ) only slightly, by replacing cosh(x) → 1 + cosh(x)]. figure shows the result of the reconstruction, where the division of the nodes uncovers a cohesive government and a split opposition, as well as a marginal center group, which correlates very well with the known party memberships and can be used to predict unseen voting behavior (see supplemental material [ ] for more details). in fig. , we show the result of the reconstruction of the directed network of influence between n = twitter users from retweets [ ] using an si epidemic model (the act of "retweeting" is modeled as an infection event, using eqs. ( ) and ( ) with γ = 0) and the nested dc-sbm. the reconstruction uncovers isolated groups with varying propensities to retweet, as well as groups that tend to influence a large fraction of users. by inspecting the geolocation metadata on the users, we see that the inferred groups amount, to a large extent, to different countries, although clear subdivisions indicate that this is not the only factor governing the influence among users (see supplemental material [ ] for more details). conclusion.-we have presented a scalable bayesian method to reconstruct networks from functional observations that uses the sbm as a structured prior and, hence, performs community detection together with reconstruction. the method is nonparametric and, hence, requires no prior stipulation of aspects of the network and size of the model, such as the number of groups. by leveraging inferred correlations between edges, the sbm includes an additional source of evidence and, thereby, improves the reconstruction accuracy, which in turn also increases the accuracy of the inferred communities. the overall approach is general, requiring only appropriate functional model specifications, and can be coupled with an open-ended variety of such models other than those considered here. fig. : the edge colors indicate the infection probabilities τ_ij as shown in the legend, and the text labels show the dominating country membership for the users in each group (see [ , ] for details on the layout algorithm).
inferring gene regulatory networks from multiple microarray datasets dynamic models of large-scale brain activity estimating spatial coupling in epidemiological systems: a mechanistic approach bootstrapping topological properties and systemic risk of complex networks using the fitness model the role of social networks in information diffusion network inference with confidence from multivariate time series revealing network connectivity from response dynamics inferring network topology from complex dynamics revealing physical interaction networks from statistics of collective dynamics learning factor graphs in polynomial time and sample complexity reconstruction of markov random fields from samples: some observations and algorithms, in approximation, randomization and combinatorial optimization. algorithms and techniques which graphical models are difficult to learn estimation of sparse binary pairwise markov networks using pseudo-likelihoods inverse statistical problems: from the inverse ising problem to data science inferring networks of diffusion and influence on the convexity of latent social network inference learning the graph of epidemic cascades statistical inference approach to structural reconstruction of complex networks from binary time series maximum-likelihood network reconstruction for sis processes is np-hard network reconstruction from infection cascades escaping the curse of dimensionality in estimating multivariate transfer entropy causal network inference by optimal causation entropy reconstructing propagation networks with natural diversity and identifying hidden sources efficient reconstruction of heterogeneous networks from time series via compressed sensing robust reconstruction of complex networks from sparse data universal data-based method for reconstructing complex networks with binary-state dynamics reconstructing weighted networks from dynamics reconstructing network topology and coupling strengths in directed networks of discrete-time dynamics community detection in networks: a user guide bayesian stochastic blockmodeling exact recovery in the ising blockmodel community detection in networks with unobserved edges network structure from rich but noisy data reconstructing networks with unknown and heterogeneous errors stochastic blockmodels and community structure in networks nonparametric bayesian inference of the microcanonical stochastic block model for summary of the full generative model used, details of the inference algorithm and more information on the analysis of empirical data efficient monte carlo and greedy heuristic for the inference of stochastic block models missing and spurious interactions and the reconstruction of complex networks epidemic processes in complex networks spatial interaction and the statistical analysis of lattice systems equation of state calculations by fast computing machines monte carlo sampling methods using markov chains and their applications asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications artifacts or attributes? 
effects of resolution on the little rock lake food web note that, in this case, our method also exploits the heterogeneous degrees in the network via the dc-sbm, which can refinements of this approach including thouless-anderson-palmer (tap) and bethe-peierls (bp) corrections [ ] yield the same performance for this example pseudolikelihood decimation algorithm improving the inference of the interaction network in a general class of ising models the simple rules of social contagion hierarchical block structures and high-resolution model selection in large networks hierarchical edge bundles: visualization of adjacency relations in hierarchical data key: cord- -oqe gjcs authors: strano, emanuele; viana, matheus p.; sorichetta, alessandro; tatem, andrew j. title: mapping road network communities for guiding disease surveillance and control strategies date: - - journal: sci rep doi: . /s - - - sha: doc_id: cord_uid: oqe gjcs human mobility is increasing in its volume, speed and reach, leading to the movement and introduction of pathogens through infected travelers. an understanding of how areas are connected, the strength of these connections and how this translates into disease spread is valuable for planning surveillance and designing control and elimination strategies. while analyses have been undertaken to identify and map connectivity in global air, shipping and migration networks, such analyses have yet to be undertaken on the road networks that carry the vast majority of travellers in low and middle income settings. here we present methods for identifying road connectivity communities, as well as mapping bridge areas between communities and key linkage routes. we apply these to africa, and show how many highly-connected communities straddle national borders and when integrating malaria prevalence and population data as an example, the communities change, highlighting regions most strongly connected to areas of high burden. the approaches and results presented provide a flexible tool for supporting the design of disease surveillance and control strategies through mapping areas of high connectivity that form coherent units of intervention and key link routes between communities for targeting surveillance. networks, the regular and planar nature of road networks precludes the formation of clear communities, i.e. roads that cluster together shaping areas that are more connected within their boundaries than with external roads. highly connected regional communities can promote rapid disease spread within them, but can be afforded protection from recolonization by surrounding regions of reduced connectivity, making them potentially useful intervention or surveillance units , , . for isolated areas, a focused control or elimination program is likely to stand a better chance of success than those highly connected to high-transmission or outbreak regions. for example, reaching a required childhood vaccination coverage target in one district is substantially more likely to result in disease control and elimination success if that district is not strongly connected to neighbouring districts where the target has not been met. the identification of 'bridge' routes between highly connected regions could also be of value in targeting limited resources for surveillance . 
moreover, progressive elimination of malaria from a region needs to ensure that parasites are not reintroduced into areas that have been successfully cleared, necessitating a planned strategy for phasing that should be informed by connectivity and mobility patterns . here we develop methods for identifying and mapping road connectivity communities in a flexible, hierarchical way. moreover, we map 'bridge' areas of low connectivity between communities and apply these new methods to the african continent. finally, we show how these can be weighted by data on disease prevalence to better understand pathogen connectivity, using p. falciparum malaria as an example. african road network data. data on the african road network (arn) were obtained from gps navigation and cartography as described in a previous study . the dataset maps primary and secondary roads across the continent, and while it does have commercial restrictions, it is a more complete and consistent dataset than alternative open road datasets (e.g. openstreetmap , groads ). visual inspection and comparison between the arn and other spatial road inventories validated the improved accuracy and consistency of arn, however a quantitative validation analysis was not possible due to the lack of consistent ground-truth data at continental scales. figure a shows the african road network data used in this analysis. the road network dataset is a commercial restricted product and requests for it can be directly addressed to garmin . plasmodium falciparum malaria prevalence and population maps. to demonstrate how geographically referenced data on disease occurrence or prevalence can be integrated into the approaches outlined, gridded data on plasmodium falciparum malaria prevalence were obtained from the malaria atlas project (http:// www.map.ox.ac.uk/). these represent modelled estimates of the prevalence of p. falciparum parasites in per × km grid square across africa . additionally, gridded data on estimated population totals per × km grid square across africa in were obtained from the worldpop program (http://www.worldpop.org/). the population data were aggregated to the same × km gridding as the malaria data, and then multiplied together to obtain estimates of total numbers of p. falciparum infections per × km grid square. detecting communities in the african road network. we modeled the arn as a'primal' road network, where roads are links and road junctions are nodes . spatial road networks have, as any network embedded in two dimensions, physical spatial constraints that impose on them a grid-like structure. in fact, the arn primal network is composed of , road segments that account for a total length of , , km, with an average road length of . km ± . km. such large standard deviations, as already observed elsewhere , , , are due to the long tailed distribution of road lengths, as illustrated in fig. c . another property of road network structure is the frequency distribution of the degree of nodes, defined as the number of links connected to each node. most networks in nature and society have a long tail distribution of node degree, implying the existence of hubs (nodes that connect to a large amount of other nodes) , with the majority of nodes connecting to very few others. for road networks, however, the degree distribution strongly peaks around , indicating that most of the roads are connected with two other roads. 
the long tail distribution of the length of road segments, coupled with the peaked degree distribution, indicates the presence of a translationally invariant grid-like structure, in which road density smoothly varies among regions while their connectivity and structure does not. within such grid-like structures it is very difficult to identify clustered communities, i.e. groups of roads that are more connected within themselves than to other groups. this observation is confirmed by the spatial distribution of betweenness centrality (bc), which measures how often the shortest paths between each pair of nodes pass through a road. the probability distribution of bc is long tailed (fig. d) , while its spatial distribution spreads across the entire network, with a structural backbone form, as shown in fig. b. again, under such conditions and because of the absence of bottlenecks, any strategy to detect communities that employs pruning on bc values , will be minimally effective. to detect communities in road networks we follow the observation that human displacement in urban networks is guided by straight lines . therefore, geometry can be used to detect communities of roads by assuming that people tend to move more along streets than between streets. we developed a community detection pipeline that converts a primal road network, where roads are links and road junctions are nodes , to a dual network representation, where roads are nodes and street junctions are links between nodes , by means of the straightness and contiguity of roads. it is important to note here that the units of analysis are road segments, which here are typically short and straight between intersections, making the straightness assumption valid. community detection in the dual network is then performed using a modularity optimization algorithm . the communities found in the dual network are then mapped back to the original primal road network. these communities encode information about the geometry of the road pattern but can also incorporate weights associated with a particular disease to guide the process of community detection. nodes in the dual network represent lines in the primal network. the conversion from primal to dual is done by using a modified version of the algorithm known as continuity negotiation . in brief, we assume that a pair of adjacent edges belongs to the same street if the angle θ between these edges is smaller than θ c = °. we also assume that the angle between two adjacent edges (i, j) and (j, p) is given by the dot product cos(θ) = (r i,j · r j,p ) / (|r i,j | |r j,p |), where r i,j = r j − r i . under these assumptions, the angle between two edges belonging to a perfect straight line is zero, while it assumes a value of ° for perpendicular edges. our algorithm starts by searching for the edge that generates the longest road in the primal space, as can be seen in fig. a . then, a node is created in the dual space and assigned to this road. next, we search for the edge that generates the second longest road, and a new node is created in the dual space and assigned to this road. if there is at least one intersection between the new road and the previous one, we connect the respective nodes in the dual space. the algorithm continues until all the edges in the primal space are assigned to a node in the dual space, as shown in fig. b .
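a simplified python sketch of this primal-to-dual conversion is given below: the deflection angle between adjacent edges is computed from the dot product, nearly straight contiguous edges are merged into the same street, and streets that cross at a junction become linked dual nodes. this is an approximation of the published continuity-negotiation procedure, not the authors' implementation; in particular the angle threshold (60 degrees here) is an assumed illustrative value, and the published algorithm additionally seeds streets in order of decreasing length.

```python
# simplified sketch of the primal-to-dual ("continuity negotiation"-style) step.
# assumptions: node coordinates are known, theta_c is a free illustrative
# parameter, and streets are grown greedily rather than longest-first.
import math
import itertools
import networkx as nx

def angle_deg(p, q, r):
    """deflection angle at q between edges (p,q) and (q,r); 0 = perfectly straight."""
    v1 = (q[0] - p[0], q[1] - p[1])
    v2 = (r[0] - q[0], r[1] - q[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

def primal_to_dual(primal, pos, theta_c=60.0):
    street_of = {frozenset(e): i for i, e in enumerate(primal.edges())}
    # merge contiguous, nearly straight edges into the same street (naive union)
    for node in primal.nodes():
        for a, b in itertools.combinations(primal[node], 2):
            if angle_deg(pos[a], pos[node], pos[b]) < theta_c:
                keep = street_of[frozenset((a, node))]
                drop = street_of[frozenset((node, b))]
                for k, v in street_of.items():
                    if v == drop:
                        street_of[k] = keep
    dual = nx.Graph()
    dual.add_nodes_from(set(street_of.values()))
    for node in primal.nodes():  # streets crossing at a junction become linked dual nodes
        for a, b in itertools.combinations(primal[node], 2):
            s1 = street_of[frozenset((a, node))]
            s2 = street_of[frozenset((node, b))]
            if s1 != s2:
                dual.add_edge(s1, s2)
    return dual, street_of
```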
note that the conversion from primal to the dual road network has been used extensively to estimate human perception and movement along road networks (space syntax, see ) , which also supports our use of road geometry to detect communities. despite the regular structure of the network in the primal space, the topology of these networks in the dual space is very rich. for instance the degree distribution in dual space follows the power-law p(k) k −γ . this property has been previously identified in urban networks and it is strongly related to the long tailed distribution of road lengths in these networks (see fig. c ). since most of the roads are short, most of the nodes in dual space will have a small number of connections. on the other hand, there are a few long roads (fig. a ) that originate at hubs in the dual space (fig. b ). our approach for detecting communities in road networks consists then in performing classical community detection in the dual representation ( fig. c) and then bringing the result back to the primal representation, as shown in fig. d . the algorithm used to detect the communities is the modularity-based algorithm by clauset and newman . the hierarchical mapping of communities on the african road network, with outputs for , , and sets of communities, is shown in fig. . the maps highlight how connectivity rarely aligns with national borders, with the areas most strongly connected through dense road networks typically straddling two or more countries. the hierarchical nature of the approach is illustrated through the breakdown of the large regions in fig. a into further sub-regions in b, c and d, emphasizing the main structural divides within each region in mapped in a. some large regions appear consistently in each map, for example, a single community spans the entire north african coast, extending south into the sahara. south africa appears as wholly contained within a single community, while the horn of africa containing somalia and much of ethiopia and kenya in consistently mapped as one community. the four maps shown are example outputs, but any number of communities can be identified. the clustering that maximises modularity produces communities, and these are mapped in fig. . even with division into communities, the north africa region remains as a single community, strongly separated from sub-saharan africa by large bridge regions. south africa also remains as almost wholly within its own community, with somalia and namibia showing similar patterns. the countries with the largest numbers of communities tend to be those with the least dense infrastructure equating to poor connectivity, such as drc and angola, though west africa also shows many distinct clusters, especially within nigeria. apart from the sahara, the largest bridge regions of poor connectivity are located across the central belt of sub-saharan africa, where population densities are low and transport infrastructure is both sparse and often poor. the communities mapped in figs and align in many cases with recorded population and pathogen movements. for example, the broad southern and eastern community divides match well those seen in hiv- subtype analyses and community detection analyses based on migration data . at more regional scales, there also exist similarities with prior analyses based on human and pathogen movement patterns. for example, the western, coastal and northern communities within kenya in fig. b , identified previously through mobile phone and census derived movement data , . 
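a minimal sketch of the community step described above: modularity-based community detection (the clauset-newman-moore greedy algorithm, as implemented in networkx) is run on the dual graph, and each dual community is mapped back to the primal road segments that compose it. the `dual` and `street_of` names follow the previous sketch and are illustrative.

```python
# sketch: greedy modularity optimization on the dual graph, mapped back to the
# primal road segments; follows the sketch above, not the published code.
from networkx.algorithms.community import greedy_modularity_communities

def dual_communities_to_primal(dual, street_of):
    communities = greedy_modularity_communities(dual)
    street_to_comm = {s: c for c, comm in enumerate(communities) for s in comm}
    # primal edge -> community label of the street it belongs to
    return {edge: street_to_comm[street] for edge, street in street_of.items()}
```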
further, guinea, liberia and sierra leone typically remain mostly within a single community in fig. , with some divides evident in fig. c . this shows some strong similarities with the spread of ebola virus through genome analysis , particularly the multiple links between rural guinea and sierra leone, though fig. c highlights a divide between the regions containing conakry and freetown when africa is broken into the communities. figure highlights the connections between kinshasa in western drc and angola, with the recent yellow fever outbreak spreading within the communities mapped. figure d shows the'best' communities map for an area of southern africa, and the strong cross-border links between swaziland, southern mozambique and western south africa are mapped within a single community, as well as wider links highlighted in fig. , matching the travel patterns found from swaziland malaria surveillance data . integrating p. falciparum malaria prevalence and population data with road networks for weighted community detection. the previous section outlined methods for community detection on unweighted road networks. to integrate disease occurrence, prevalence or incidence data for the identification of areas of likely elevated movement of infections or for guiding the identification of operational control units, an adaptation to weighted networks is required. we demonstrate this through the integration of the data on estimated numbers of p. falciparum infections per × km grid square into the community detection pipeline. the final pipeline for community detection calculated a trade-off between form and function of roads in order to obtain a network partition. the form is related to the topology of the road network and is taken into account during the primal-dual conversion. the topological component guarantees that only neighbor and well connected locations could belong to the same community. the functional part, on the other hand, is calculated by the combination of estimated p. falciparum malaria prevalence multiplied by population to obtain estimated numbers of infections, as outlined above. the two factors were combined to form a weight to each edge of our primal network. the weight w i, j of edge (i, j) is defined as where m(r) is the p. falciparum malaria prevalence and p(r) is the population count, both at coordinate r. these values are obtained directly from the data. when the primal representation is converted into its dual version, the weights of primal edges, given by eq. , are converted into weights of dual nodes, which are defined as where i represents the i th dual node and Ω i represents the set of all the primal edges that were combined together to form the dual node i (see fig. a,b) . finally, weights for the dual edges are created from the weights of dual nodes, by simply assuming the dual network weighted by values of λ i,¯j was used as input for a weighted community detection algorithm. ultimately, when the communities detected in the dual space are translated back to primal space, we have that neighbor locations with similar values of estimated p. falciparum infections belong to the same communities. for the example of p. falciparum malaria used here, the max function was used, representing maximum numbers of infections on each road segment in . this was chosen to identify connectivity to the highest burden areas. areas with large numbers of infections are often 'sources' , with infected populations moving back and forward from them spreading parasites elsewhere , . 
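the disease-weighting step can be sketched as below. note that the exact aggregation formulas were lost from the text above, so only the maximum of prevalence times population over cells touched by a road segment follows the stated use of the max function; the aggregation of segment weights to dual nodes and the averaging of endpoint weights for dual edges are assumptions made solely to keep the example self-contained, and the helper names (`samples`, `prevalence`, `population`) are hypothetical.

```python
# hedged sketch of the weighted community-detection input; aggregation choices
# marked "assumption" are not taken from the paper.
from networkx.algorithms.community import greedy_modularity_communities

def weight_primal_edges(primal, samples, prevalence, population):
    """samples[(i, j)] -> grid cells touched by edge (i, j)."""
    for i, j in primal.edges():
        cells = samples[(i, j)]
        primal[i][j]["w"] = max(prevalence[c] * population[c] for c in cells)  # max function

def weight_dual(dual, street_of, primal):
    node_w = {}
    for edge, street in street_of.items():
        i, j = tuple(edge)
        node_w[street] = max(node_w.get(street, 0.0), primal[i][j]["w"])       # assumption
    for u, v in dual.edges():
        dual[u][v]["weight"] = 0.5 * (node_w[u] + node_w[v])                   # assumption
    return dual

# weighted modularity optimization then uses the dual edge weights, e.g.:
# communities = greedy_modularity_communities(weight_dual(dual, street_of, primal), weight="weight")
```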
therefore, mapping which regions are most strongly connected to them is of value. alternative metrics can be used however, depending on the aims of the analyses. the integration of p. falciparum malaria prevalence and population (fig. a ) through weighting road links by the maximum values across them produces a different pattern of communities (fig. b) to those based solely on network structure (fig. ) . the mapping of communities is shown here, as it identifies key regions of known malaria connectivity, as outlined below. the mapping shows areas of key interest in malaria elimination efforts connected across national borders, such as much of namibia linked to southern angola , but the zambezi region of namibia more strongly linked to the community encompassing neighbouring zambia, zimbabwe and botswana . in namibia, malaria movement communities identified through the integration of mobile phone-based movement data and case-based risk mapping show correspondence in mapping a northeast community. moreover, swaziland is shown as being central to a community covering, southern mozambique and the malaria endemic regions of south africa, matching closely the origin locations of the majority of internationally imported cases to swaziland and south africa , , . the movements of people and malaria between the highlands and southern and western regions of uganda, and into rwanda , also aligns with the community patterns shown in fig. b . finally, though quantifying different factors, the analyses show a similar east-west split to that found in analyses of malaria drug resistance mutations , and malaria movement community mapping . the emergence of new disease epidemics is becoming a regular occurrence, and drug and insecticide resistance are continuing to spread around the world. as global, regional and local efforts to eliminate a range of infectious diseases continue and are initiated, an improved understanding of how regions are connected through human transport can therefore be valuable. previous studies have shown how clusters of connectivity exist within the global air transport network , and shipping traffic network , but these represent primarily the sources of occasional long-distance disease or vector introductions , , rather than the mode of transport that the majority of the population uses regularly. the approaches presented here focused on road networks provide a tool for supporting the design of disease and resistance surveillance and control strategies through mapping (i) areas of high connectivity where pathogen circulation is likely to be high, forming coherent units of intervention; (ii) areas of low connectivity between communities that form likely natural borders of lower pathogen exchange; (iii) key link routes between communities for targetting surveillance efforts. the outputs of the analyses presented here highlight how highly connected areas consistently span national borders. with infectious disease control, surveillance, funding and strategies principally implemented country by country, this emphasises a mismatch in scales and the need for cross-border collaboration. such collaborations are being increasingly seen, for example with countries focused on malaria elimination (e.g. , ), but the outputs here show that the most efficient disease elimination strategies may need to reconsider units of intervention, moving beyond being constrained by national borders. results from the analysis of pathogen movements elsewhere confirm these international connections (e.g. 
, , , , building up additional evidence on how pathogen circulation can be substantially more prevalent in some regions than others. the approaches developed here provide a complement to other approaches for defining and mapping regional disease connectivity and mobility . previously, census-based migration data has been used to map blocks of countries of high and low connectivity , but these analyses are restricted to national-scales and cover only longer-term human mobility. efforts are being made to extend these to subnational scales , , but they remain limited to large administrative unit scales and the same long timescales. mobile phone call detail records (cdrs) have also been used to estimate and map pathogen connectivity , , but the nature of the data mean that they do not include cross-border movements, so remain limited to national-level studies. an increasing number of studies are uncovering patterns in human and pathogen movements and connectivity through travel history questionnaires (e.g. , , , ), resulting in valuable information, but typically limited to small areas and short time periods. there exist a number of limitations to the methods and outputs presented here that future work will aim to address. firstly, the hierarchies of road types are not currently taken into account in the network analyses, meaning that a major highway and small local roads contribute equally to community detection and epidemic spreading. the lack of reliable data on road typologies, and inconsistencies in classifications between countries, makes this challenging to incorporate however. moreover, the relative importance of a major road versus secondary, tertiary and tracks is exceptionally difficult to quantify within a country, let alone between countries and across africa. finally, data on seasonal variations in road access does not exist consistently across the continent. our focus has therefore been on connectivity, in terms of how well regions are connected based on existing road networks, irrespective of the ease of travel. a broader point that deserves future research is that while intuition suggests a correspondence in most places, connectivity may not always translate into human or pathogen movement. future directions for the work presented here include quantitative comparison and integration with other connectivity data, the integration of different pathogen weightings, and the extension to other regions of the world. qualitative comparisons outlined above show some good correspondence with analyses of alternative sources of connectivity and disease data. a future step will be to compare these different connections and communities quantitatively to examine the weight of evidence for delineating areas of strong and weak connectivity. this could potentially follow similar studies looking at community structure on weighted networks, such as in the us based on commuting data , or uk and belgium from mobile network data , . here, p. falciparum malaria was used to provide an example of the potential for weighting analyses by pathogen occurrence, prevalence, incidence or transmission suitability. moreover, future work will examine the integration of alternative pathogen weightings. the maximum difference method was used here to pick out regions well connected to areas high p. falciparum burden, but the potential exists to use different weighting methods depending on requirements, strategic needs, and the nature of the pathogen being studied. 
despite the rapid growth of air travel, shipping and rail in many parts of the world, roads continue to be the dominant route on which humans move on sub-national, national and regional scales. they form a powerful force in shaping the development of areas, facilitating trade and economic growth, but also bringing with them the exchange of pathogens. results here show that their connectivity is not equal however, with strong clusters of high connectivity separated by bridge regions of low network density. these structures can have a significant impact on how pathogens spread, and by mapping them, a valuable evidence base to guide disease surveillance as well as control and elimination planning can be built. results were produced through four main phases. phase : road network cleaning and weighted adjacency list production: the road cleaning operation aimed to produce a road network from the georeferenced vectorial network of roads infrastructure. this phase was conducted using esri arcmap . (http://desktop.arcgis.com/en/ arcmap/) through the use of the topological cleaning tool. the tool integrates contiguous roads, removes very short links and removes overlapping road segments. road junctions were created using the polyline to node conversion tool, while road-link association was computed using the spatial join tool. malaria prevalence values were assigned to each road using the spatial join tool. the adjacency matrix output, containing also the coordinates for each road junctions, was extracted in form of text file. phase : conversion from the primal to the dual network: the primal network created in phase was then used as input for a continuity negotiation-like algorithm. the goal of this algorithm was to translate the primal network into its dual representation (see fig. a,b) . the implementation of the negotiation-like algorithm used the igraph library in c++ (http://igraph.org/c/) on an octa-core imac. the conversion took around hours for a primal network with ~ k nodes running. the algorithm works by first identifying roads composed of many contiguous edges in the primal space. two primal-edges are assumed to be contiguous if the angle between them is not greater than ° degrees. because the dual representation generated by the algorithm strongly depends on the starting edge, we started by looking for the edge that produces the longest road. as soon as this edge was found, a dual-node was created to represent that road. next we proceeded to look for the edge that produced the second longest road and create a dual-node for that road. we continued this process until every primal-edge had been assigned to a road. finally, dual-nodes were connected to each other if their primal counterparts (roads) crossed each other in the primal space. phase : community detection: we used a traditional modularity optimization-based algorithm to identify communities in the dual representation of the road network. the modularity metrics were computed in r using the igraph library (http://igraph.org/r/). to incorporate the prevalence of malaria, we used the malaria prevalence values as edge weights for community detection. phase : mapping communities. detected communities were mapped back to the primal road network with the use of the spatial join tool in arcmap. all maps were produced in arcmap. 
global transport networks and infectious disease spread
severe acute respiratory syndrome
h n influenza - continuing evolution and spread
geographic dependence, surveillance, and origins of the influenza a (h n ) virus
the global tuberculosis situation and the inexorable rise of drug-resistant disease
the transit phase of migration: circulation of malaria and its multidrug-resistant forms in africa
population genomics studies identify signatures of global dispersal and drug resistance in plasmodium vivax
air travel and vector-borne disease movement
mapping population and pathogen movements
unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza h n
the blood dna virome in , humans
spatial accessibility and the spread of hiv- subtypes and recombinants
the early spread and epidemic ignition of hiv- in human populations
spread of yellow fever virus outbreak in angola and the democratic republic of the congo - : a modelling study
virus genomes reveal factors that spread and sustained the ebola epidemic
commentary: containing the ebola outbreak - the potential and challenge of mobile network data
world development report : reshaping economic geography
population distribution, settlement patterns and accessibility across africa in
the structure of transportation networks
elementary processes governing the evolution of road networks
urban street networks, a comparative analysis of ten european cities
the scaling structure of the global road network
street centrality and densities of retail and services in bologna
integrating rapid risk mapping and mobile phone call record data for strategic malaria elimination planning
international population movements and regional plasmodium falciparum malaria elimination strategies
cross-border malaria: a major obstacle for malaria elimination
information technology outreach services - itos - university of georgia. global roads open access data set, version (groadsv )
the effect of malaria control on plasmodium falciparum in africa between
the network analysis of urban streets: a primal approach
random planar graphs and the london street network
finding community structure in very large networks
networks and cities: an information perspective
the network analysis of urban streets: a dual approach
modularity and community structure in networks
the use of census migration data to approximate human movement patterns across temporal scales
quantifying the impact of human mobility on malaria
travel patterns and demographic characteristics of malaria cases in swaziland
human movement data for malaria control and elimination strategic planning
malaria risk in young male travellers but local transmission persists: a case-control study in low transmission namibia
the path towards elimination
reviewing south africa's malaria elimination strategy ( - ): progress, challenges and priorities
targeting imported malaria through social networks: a potential strategy for malaria elimination in swaziland
association between recent internal travel and malaria in ugandan highland and highland fringe areas
multiple origins and regional dispersal of resistant dhps in african plasmodium falciparum malaria
the worldwide air transportation network: anomalous centrality, community structure, and cities' global roles
the complex network of global cargo ship movements
asian pacific malaria elimination network
mapping internal connectivity through human migration in malaria endemic countries
census-derived migration data as a tool for informing malaria elimination policy
key traveller groups of relevance to spatial malaria transmission: a survey of movement patterns in four subsaharan african countries
infection importation: a key challenge to malaria elimination on bioko island, equatorial guinea
an economic geography of the united states: from commutes to megaregions
redrawing the map of great britain from a network of human interactions
uncovering space-independent communities in spatial networks

e.s., m.p.v. and a.j.t. conceived and designed the analyses. e.s. and m.p.v. designed the road network community mapping methods and undertook the analyses. all authors contributed to writing and reviewing the manuscript. competing interests: the authors declare no competing interests.

key: cord- -lkoyrv s authors: salathé, marcel; jones, james h. title: dynamics and control of diseases in networks with community structure date: - - journal: plos comput biol doi: . /journal.pcbi. sha: doc_id: cord_uid: lkoyrv s the dynamics of infectious diseases spread via direct person-to-person transmission (such as influenza, smallpox, hiv/aids, etc.) depends on the underlying host contact network.
human contact networks exhibit strong community structure. understanding how such community structure affects epidemics may provide insights for preventing the spread of disease between communities by changing the structure of the contact network through pharmaceutical or non-pharmaceutical interventions. we use empirical and simulated networks to investigate the spread of disease in networks with community structure. we find that community structure has a major impact on disease dynamics, and we show that in networks with strong community structure, immunization interventions targeted at individuals bridging communities are more effective than those simply targeting highly connected individuals. because the structure of relevant contact networks is generally not known, and vaccine supply is often limited, there is great need for efficient vaccination algorithms that do not require full knowledge of the network. we developed an algorithm that acts only on locally available network information and is able to quickly identify targets for successful immunization intervention. the algorithm generally outperforms existing algorithms when vaccine supply is limited, particularly in networks with strong community structure. understanding the spread of infectious diseases and designing optimal control strategies is a major goal of public health. social networks show marked patterns of community structure, and our results, based on empirical and simulated data, demonstrate that community structure strongly affects disease dynamics. these results have implications for the design of control strategies. mitigating or preventing the spread of infectious diseases is the ultimate goal of infectious disease epidemiology, and understanding the dynamics of epidemics is an important tool to achieve this goal. a rich body of research [ , , ] has provided major insights into the processes that drive epidemics, and has been instrumental in developing strategies for control and eradication. the structure of contact networks is crucial in explaining epidemiological patterns seen in the spread of directly transmissible diseases such as hiv/aids [ , , ] , sars [ , ] , influenza [ , , , ] etc. for example, the basic reproductive number r , a quantity central to developing intervention measures or immunization programs, depends crucially on the variance of the distribution of contacts [ , , ] , known as the network degree distribution. contact networks with fat-tailed degree distributions, for example, where a few individuals have an extraordinarily large number of contacts, result in a higher r than one would expect from contact networks with a uniform degree distribution, and the existence of highly connected individuals makes them an ideal target for control measures [ , ] . while degree distributions have been studied extensively to understand their effect on epidemic dynamics, the community structure of networks has generally been ignored. despite the demonstration that social networks show significant community structure [ , , , ] , and that social processes such as homophily and transitivity result in highly clustered and modular networks [ ] , the effect of such microstructures on epidemic dynamics has only recently started to be investigated. most initial work has focused on the effect of small cycles, predominantly in the context of clustering coefficients (i.e. the fraction of closed triplets in a contact network) [ , , , , ] . 
in this article, we aim to understand how community structure affects epidemic dynamics and control of infectious disease. community structure exists when connections between members of a group of nodes are more dense than connections between members of different groups of nodes [ ] . the terminology is relatively new in network analysis and recent algorithm development has greatly expanded our ability to detect sub-structuring in networks. while there has been a recent explosion in interest and methodological development, the concept is an old one in the study of social networks where it is typically referred to as a ''cohesive subgroups,'' groups of vertices in a graph that share connections with each other at a higher rate than with vertices outside the group [ ] . empirical data on social structure suggests that community structuring is extensive in epidemiological contacts [ , , ] relevant for infectious diseases transmitted by the respiratory or close-contact route (e.g. influenza-like illnesses), and in social groups more generally [ , , , , ] . similarly, the results of epidemic models of directly transmitted infections such as influenza are most consistent with the existence of such structure [ , , , , , ] . using both simulated and empirical social networks, we show how community structure affects the spread of diseases in networks, and specifically that these effects cannot be accounted for by the degree distribution alone. the main goal of this study is to demonstrate how community structure affects epidemic dynamics, and what strategies are best applied to control epidemics in networks with community structure. we generate networks computationally with community structure by creating small subnetworks of locally dense communities, which are then randomly connected to one another. a particular feature of such networks is that the variance of their degree distribution is relatively low, and thus the spread of a disease is only marginally affected by it [ ] . running standard susceptible-infected-resistant (sir) epidemic simulations (see methods) on these networks, we find that the average epidemic size, epidemic duration and the peak prevalence of the epidemic are strongly affected by a change in community structure connectivity that is independent of the overall degree distribution of the full network ( figure ). note that the value range of q shown in figure is in agreement with the value range of q found in the empirical networks used further below, and that lower values of q do not affect the results qualitatively (see suppl. mat. figure s ). epidemics in populations with community structure show a distinct dynamical pattern depending on the extent of community structure. in networks with strong community structure, an infected individual is more likely to infect members of the same community than members outside of the community. thus, in a network with strong community structure, local outbreaks may die out before spreading to other communities, or they may spread through various communities in an almost serial fashion, and large epidemics in populations with strong community structure may therefore last for a long time. correspondingly, the incidence rate can be very low, and the number of generations of infection transmission can be very high, compared to the explosive epidemics in populations with less community structure (figures a and b ). 
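a compact sketch of this setup is given below: dense sub-communities are generated and joined by a tunable number of random between-community links, modularity q is measured, and a discrete-time sir outbreak is run using the per-day infection and recovery probabilities given in the methods further below. this is an illustrative sketch, and all parameter values are placeholders rather than those used in the study.

```python
# illustrative sketch (not the authors' code): modular network generation,
# modularity q, and a discrete-time sir outbreak.
import math
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

def community_network(n_comms=20, comm_size=50, p_in=0.2, n_bridges=100, seed=1):
    rng = random.Random(seed)
    g = nx.Graph()
    blocks = [[c * comm_size + i for i in range(comm_size)] for c in range(n_comms)]
    for nodes in blocks:
        g.add_nodes_from(nodes)
        for idx, u in enumerate(nodes):        # dense connections inside a community
            for v in nodes[idx + 1:]:
                if rng.random() < p_in:
                    g.add_edge(u, v)
    for _ in range(n_bridges):                 # fewer bridges -> stronger community structure
        c1, c2 = rng.sample(range(n_comms), 2)
        g.add_edge(rng.choice(blocks[c1]), rng.choice(blocks[c2]))
    return g, blocks

def sir(g, beta=0.05, gamma=0.2, immune=(), seed=1):
    rng = random.Random(seed)
    immune = set(immune)
    state = {v: "R" if v in immune else "S" for v in g}
    state[rng.choice([v for v in g if state[v] == "S"])] = "I"   # patient zero
    while any(s == "I" for s in state.values()):
        nxt = dict(state)
        for v, s in state.items():
            if s == "S":
                i = sum(state[u] == "I" for u in g[v])
                if i and rng.random() < 1 - math.exp(-beta * i):
                    nxt[v] = "I"
            elif s == "I" and rng.random() < gamma:
                nxt[v] = "R"
        state = nxt
    return sum(s == "R" for s in state.values()) - len(immune)   # final outbreak size

g, _ = community_network()
print("modularity q =", round(modularity(g, greedy_modularity_communities(g)), 2))
print("final size =", sir(g))
```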
understanding the spread of infectious diseases in populations is key to controlling them. computational simulations of epidemics provide a valuable tool for the study of the dynamics of epidemics. in such simulations, populations are represented by networks, where hosts and their interactions among each other are represented by nodes and edges. in the past few years, it has become clear that many human social networks have a very remarkable property: they all exhibit strong community structure. a network with strong community structure consists of smaller sub-networks (the communities) that have many connections within them, but only few between them. here we use both data from social networking websites and computer generated networks to study the effect of community structure on epidemic spread. we find that community structure not only affects the dynamics of epidemics in networks, but that it also has implications for how networks can be protected from large-scale epidemics.

on average, epidemics in networks with strong community structure exhibit greater variance in final size (figures c and d) , a greater number of small, local outbreaks that do not develop into a full epidemic, and a higher variance in the duration of an epidemic. in order to halt or mitigate an epidemic, targeted immunization interventions or social distancing interventions aim to change the structure of the network of susceptible individuals in such a way as to make it harder for a pathogen to spread [ ] . in practice, the number of people to be removed from the susceptible class is often constrained for a number of reasons (e.g., due to limited vaccine supply or ethical concerns of social distancing measures). from a network perspective, targeted immunization methods translate into identifying which nodes should be removed from a network, a problem that has caught considerable attention (see for example [ ] and references therein). targeting highly connected individuals for immunization has been shown to be an effective strategy for epidemic control [ , ] . however, in networks with strong community structure, this strategy may not be the most effective: some individuals connect to multiple communities (so-called community bridges [ ] ) and may thus be more important in spreading the disease than individuals with fewer inter-community connections, but this importance is not necessarily reflected in the degree. identification of community bridges can be achieved using
the random walk centrality measure considers not only the shortest paths between pairs of nodes, but all paths between pairs of nodes, while still giving shorter paths more weight. while infections are most likely to spread along the shortest paths between any two nodes, the cumulative contribution of other paths can still be important [ ] : immunization strategies based on random walk centrality result in the lowest number of infected cases at low vaccination coverage (figure b and c ). to test the efficiency of targeted immunization strategies on real networks, we used interaction data of individuals at five different universities in the us taken from a social network website [ ] , and obtained the contact network relevant for directly transmissible diseases (see methods). we find again that the overall most successful targeted immunization strategy is the one that identifies the targets based on random walk centrality. limited immunization based on random walk centrality significantly outperforms immunization based on degree especially when vaccination coverage is low (figure a ). in practice, identifying immunization targets may be impossible using such algorithms, because the structure of the contact network relevant for the spread of a directly transmissible disease is generally not known. thus, algorithms that are agnostic about the full network structure are necessary to identify target individuals. the only algorithm we are aware of that is completely agnostic about the network structure network structure identifies target nodes by picking a random contact of a randomly chosen individual [ ] . once such an acquaintance has been picked n times, it is immunized. the acquaintance method has been shown to be able to identify some of the highly connected individuals, and thus approximates an immunization strategy that targets highly connected individuals. we propose an alternative algorithm (the so-called community bridge finder (cbf) algorithm, described in detail in the methods) that aims to identify community bridges connecting two groups of clustered nodes. briefly, starting from a random node, the algorithm follows a random path on the contact network, until it arrives at a node that does not connect back to more than one of the previously visited nodes on the random walk. the basic goal of the cbf algorithm is to find nodes that connect to multiple communities -it does so based on the notion that the first node that does not connect back to previously visited nodes of the current random walk is likely to be part of a different community. on all empirical and computationally generated networks tested, this algorithm performed mostly better, often equally well, and rarely worse than the alternative algorithm. it is important to note a crucial difference between algorithms such as cbf (henceforth called stochastic algorithms) and algorithms such as those that calculate, for example, the betweenness centrality of nodes (henceforth called deterministic algorithms). a deterministic algorithm always needs the complete information about each node (i.e. either the number or the identity of all connected nodes for each node in the network). a comparison between algorithms is therefore of limited use if they are not of the same type as they have to work with different inputs. clearly, a deterministic algorithm with information on the full network structure as input should outperform a stochastic algorithm that is agnostic about the full network structure. 
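the two stochastic, local-information strategies can be sketched as follows. the acquaintance rule immunizes a node once it has been named n times as a random contact of randomly chosen individuals, as described above; the community bridge finder follows the verbal description given here, walking randomly until it reaches a node that connects back to no more than one previously visited node. the walk-length cap and the choice to flag the newly reached node rather than its predecessor are assumptions made to keep the sketch runnable, and the published cbf method may include additional checks; a connected contact graph is assumed.

```python
# hedged sketch of two local-information immunization-target algorithms; details
# marked as assumptions are not taken from the paper.
import random

def acquaintance_targets(g, coverage, n=1, rng=random):
    """immunize a node once it has been named n times as a random acquaintance."""
    counts, targets = {}, set()
    goal = int(coverage * g.number_of_nodes())
    while len(targets) < goal:
        v = rng.choice(list(g[rng.choice(list(g))]))   # random contact of a random node
        counts[v] = counts.get(v, 0) + 1
        if counts[v] >= n:
            targets.add(v)
    return targets

def cbf_targets(g, coverage, max_walk=50, rng=random):
    """community bridge finder: walk randomly until a node connects back to no
    more than one previously visited node, then flag it as a likely bridge."""
    targets = set()
    goal = int(coverage * g.number_of_nodes())
    while len(targets) < goal:
        walk = [rng.choice(list(g))]
        walk.append(rng.choice(list(g[walk[0]])))      # ensure at least two visited nodes
        while len(walk) < max_walk:                    # walk-length cap is an assumption
            nxt = rng.choice(list(g[walk[-1]]))
            back_links = sum(1 for u in set(walk) if g.has_edge(nxt, u))
            if nxt not in walk and back_links <= 1:
                targets.add(nxt)                       # flagging walk[-1] instead is equally plausible
                break
            walk.append(nxt)
    return targets
```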
thus, we will restrict our comparison of cbf to the acquaintance method since this is the only stochastic algorithm we are aware of the takes as input the same limited amount of local information. in the computationally generated networks, cbf outperformed the acquaintance method in large areas of the parameter space ( figure d ). it may seem unintuitive at first that the acquaintance method outperforms cbf at very high values of modularity, but one should keep in mind that epidemic sizes are very small in those extremely modular networks (see figure a ) because local outbreaks only rarely jump the community borders. if outbreaks are mostly restricted to single communities, then cbf is not the optimal strategy because immunizing community bridges is useless; the acquaintance method may at least find some well connected nodes in each community and will thus perform slightly better in this extreme parameter space. in empirical networks, cbf did particularly well on the network with the strongest community structure (oklahoma), especially in comparison to the similarly effective acquaintance method with n = . (figure c ). as immunization strategies should be deployed as fast as possible, the speed at which a certain fraction of the . assessing the efficacy of targeted immunization strategies based on deterministic and stochastic algorithms in the computationally generated networks. color code denotes the difference in the average final size s m of disease outbreaks in networks that were immunized before the outbreak using method m. the top panel (a) shows the difference between the degree method and the betweenness centrality method, i.e. s degree s betweenness . a positive difference (colored red to light gray) indicates that the betweenness centrality method resulted in smaller final sizes than the degree method. a negative difference (colored blue to black) indicates that the betweenness centrality method resulted in bigger final sizes than the degree method. if the difference is not bigger than . % of the total population size, then no color is shown (white). panel (a) shows that the betweenness centrality method is more effective than the degree based method in networks with strong community structure (q is high). (b) and (c): like (a), but showing s degree s randomwalk (in (b)) and s betweenness s randomwalk (in (c)). panels (b) and (c) show that the random walk method is the most effective method overall. panel (d) shows that the community bridge finder (cbf) method generally outperforms the acquaintance method (with n = ) except when community structure is very strong (see main text). final epidemic sizes were obtained by running sir simulations per network, vaccination coverage and immunization method. doi: . /journal.pcbi. .g network can be immunized is an additional important aspect. we measured the speed of the algorithm as the number of nodes that the algorithm had to visit in order to achieve a certain vaccination coverage, and find that the cbf algorithm is faster than the similarly effective acquaintance method with n = at vaccination coverages , % (see figure ). a great number of infectious diseases of humans spread directly from one person to another person, and early work on the spread of such diseases has been based on the assumption that every infected individual is equally likely to transmit the disease to any susceptible individual in a population. 
one of the most important consequences of incorporating network structure into epidemic models was the demonstration that heterogeneity in the number of contacts (degree) can strongly affect how r is calculated [ , , ] . thus, the same disease can exhibit markedly different epidemic patterns simply due to differences in the degree distribution. our results extend this finding and show that even in networks with the same degree distribution, fundamentally different epidemic dynamics are expected to be observed due to different levels of community structure. this finding is important for various reasons: first, community structure has been shown to be a crucial feature of social networks [ , , , ] , and its effect on disease spread is thus relevant to infectious disease dynamics. furthermore, it corroborates earlier suggestions that community structure affects the spread of disease, and is the first to clearly isolate this effect from effects due to variance in the degree distribution [ ] . second, and consequently, data on the degree distribution of contact networks will not be sufficient to predict epidemic dynamics. third, the design of control strategies benefits from taking community structure into account. an important caveat to mention is that community structure in the sense used throughout this paper (i.e. measured as modularity q ) does not take into account explicitly the extent to which communities overlap. such overlap is likely to play an important role in infectious disease dynamics, because people are members of multiple, potentially overlapping communities (households, schools, workplaces etc.). a strong overlap would likely be reflected in lower overall values for q; however, the exact effect of community overlap on infectious disease dynamics remains to be investigated. identifying important nodes to affect diffusion on networks is a key question in network theory that pertains to a wide range of fields and is not limited to infectious disease dynamics only. there are however two major issues associated with this problem: (i) the structure of networks is often not known, and (ii) many networks are too large for measures such as centrality to be computed efficiently. stochastic algorithms like the proposed cbf algorithm or the acquaintance method address both problems at once. to what extent targeted immunization strategies can be implemented in an infectious disease/public health setting based on practical and ethical considerations remains an open question. this is true not only for the strategy based on the cbf algorithm, but for most strategies that are based on network properties. as mentioned above, the contact networks relevant for the spread of infectious diseases are generally not known. stochastic algorithms such as the cbf or the acquaintance method are at least in principle applicable when data on network structure is lacking. community structure in host networks is not limited to human networks: animal populations are often divided into subpopulations, connected by limited migration only [ , ] . targeted immunization of individuals connecting subpopulations has been shown to be an effective low-coverage immunization strategy for the conservation of endangered species [ ] . under the assumption of homogeneous mixing, the elimination of a disease requires an immunization coverage of at least 1 − 1/r0, but such coverage is often difficult or even impossible to achieve due to limited vaccine supply, logistical challenges or ethical concerns.
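a quick worked example of the homogeneous-mixing threshold quoted above; the r0 values below are illustrative only, not taken from the paper.

```python
# critical vaccination coverage under homogeneous mixing: 1 - 1/r0
def critical_coverage(r0):
    return 1.0 - 1.0 / r0

for r0 in (1.5, 2.0, 3.0):
    print(f"r0 = {r0}: immunize at least {critical_coverage(r0):.0%}")
```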
figure . assessing the efficacy of targeted immunization strategies in empirical networks based on deterministic and stochastic algorithms. the bars show the difference in the average final size s_m of disease outbreaks (n cases) in networks that were immunized before the outbreak using method m. the left panels show the difference between the degree method and the random walk centrality method, i.e. s_degree − s_randomwalk. if the difference is positive (red bars), then the random walk centrality method resulted in smaller final sizes than the degree method. a negative value (black bars) means that the opposite is true. shaded bars show non-significant differences (assessed at the % level using the mann-whitney test). the middle and right panels are generated using the same methodology, but measuring the difference between the acquaintance method (with n = in the middle column and n = in the right column, see methods) and the community bridge finder (cbf) method, i.e. s_acquaintance − s_cbf and s_acquaintance − s_cbf. again, positive red bars mean that the cbf method results in smaller final sizes, i.e. prevents more cases, than the acquaintance methods, whereas negative black bars mean the opposite. final epidemic sizes were obtained by running sir simulations per network, vaccination coverage and immunization method. doi: . /journal.pcbi. .g

in the case of wildlife animals, high vaccination coverage is also problematic as vaccination interventions can be associated with substantial risks. little is known about the contact network structure in humans, let alone in wildlife, and progress should therefore be made on the development of immunization strategies that can deal with the absence of such data. stochastic algorithms such as the acquaintance method and the cbf method are first important steps in addressing the problem, but the large difference in efficacy between stochastic and deterministic algorithms demonstrates that there is still a long way to go. to investigate the spread of an infectious disease on a contact network, we use the following methodology: individuals in a population are represented as nodes in a network, and the edges between the nodes represent the contacts along which an infection can spread. contact networks are abstracted by undirected, unweighted graphs (i.e. all contacts are reciprocal, and all contacts transmit an infection with the same probability). edges always link between two distinct nodes (i.e. no self loops), and there must be maximally one edge between any single pair of nodes (i.e. no parallel edges). each node can be in one of three possible states: (s)usceptible, (i)nfected, or (r)esistant/immune (as in standard sir models). initially, all nodes are susceptible. simulations with immunization strategies implement those strategies before the first infection occurs. targeted nodes are chosen according to a given immunization algorithm (see below) until a desired immunization coverage of the population is achieved, and then their state is set to resistant. after this initial set-up, a random susceptible node is chosen as patient zero, and its state is set to infected. then, during a number of time steps, the initial infection can spread through the network, and the simulation is halted once there are no further infected nodes. at each time step (the unit of time we use is one day, i.e. a
time step is one day), a susceptible node can become infected with probability 1 − exp(−b·i), where b is the transmission rate from an infected to a susceptible node, and i is the number of infected neighboring nodes. at each time step, infected nodes recover at rate c, i.e. the probability of recovery of an infected node per time step is c (unless noted otherwise, we use c = . ). if recovery occurs, the state of the recovered node is toggled from infected to resistant. unless mentioned otherwise, the transmission rate b is chosen such that r0 ≈ (b/c) · d ≈ , where d is the mean network degree, i.e. the average number of contacts per node. for the networks used here, this approximation is in line with the result from static network theory [ ] , r0 = t(⟨k²⟩/⟨k⟩ − 1), where ⟨k⟩ and ⟨k²⟩ are the mean degree and mean square degree, respectively, and where t is the average probability of disease transmission from a node to a neighboring node, i.e. t