key: cord-0045612-mtq89ja0
authors: Pazura, Patryk; Bortko, Kamil; Jankowski, Jarosław; Michalski, Radosław
title: A Dynamic Vote-Rank Based Approach for Effective Sequential Initialization of Information Spreading Processes Within Complex Networks
date: 2020-05-26
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50371-0_47
sha: eb72c994a3b32b6a171e7ae5528809c63cb9e025
doc_id: 45612
cord_uid: mtq89ja0

Seed selection is one of the key factors influencing information spread within networks. Whereas most solutions are based on single-stage seeding at the beginning of the process, performance increases when additional seeds are used. This enables the acquisition of knowledge about ongoing processes and activating new nodes for further influence maximisation. This paper describes an approach based on the Vote-Rank algorithm with dynamic rankings for sequential seed selection. The results prove the increased performance of dynamic rankings compared to the static version and show how the frequency of ranking updates affects both performance and computational costs.

Information spreading processes are observed in various aspects of social interaction and commercial activity. They are behind social movements [9], viral marketing [21], political campaigns [4], and spread of misleading information [2]. After gathering additional knowledge about their performance, further actions are often taken to change their dynamics and increase or decrease coverage [25]. Apart from solutions focused on seeding at the beginning of the process without any further actions, other solutions gather knowledge from the process and use additional seeds to improve the process including but not limited to sequential seeding [13], seeding scheduling [34], or adaptive seeding [33]. The approach proposed in this paper is based on the selection of sequential seeds with the use of the Vote-Rank algorithm based on adaptive rankings recomputed before additional seeds are selected.
Recomputation uses knowledge gathered about infections within the network, and Vote-Rank only considers inactive nodes, not the nodes already activated. This enables selecting as seeds only those nodes with higher potential for spreading, located within areas not covered by infections. In this study, we compared the results from static Vote-Rank with the proposed dynamic approach and investigated the influence of recomputation frequency on the final outcome. The remainder of the paper is organized as follows: Sect. 2 provides a literature review, the conceptual framework is presented in Sect. 3, followed by experimental results in Sect. 4. The results are summarized and the paper is concluded in Sect. 5.

Information spread within social networks has received attention from researchers from various disciplines. Studies related to information spread have focused on factors affecting its dynamics, the roles of social ties, network topology, and the roles of the links within the network [26]. Models derived from epidemiology research, like SIS, SIR, and their variants, were initially used for prediction and analysis [19]. Later, more dedicated approaches, like the independent cascade model [20] and the linear threshold model [5], considered network structures. They were verified with the use of agent-based simulations and the Monte Carlo method [6] or analytical solutions like mean-field models [32] or branching processes [15]. Research was initially performed mainly on single-layer static networks, but in recent years more attention has been focused on multilayer networks [32] and spreading processes within temporal networks [11, 16]. Apart from single processes, multiple processes were analysed with mechanisms related to competition, cooperation, and other forms of interaction [3]. Information spread processes are usually initialised by selected nodes, called seeds, with the use of dedicated seed selection methods [10].
The influence maximisation problem leads to several challenges and solutions for initial node selection [20]. The main goal is to select a set of seeds with high potential to initiate the spread and activate their neighbours. Early approaches were mainly based on heuristics using high degree or other centrality measures like closeness or eigenvector centrality [36]. Compared with simple heuristics, the greedy approach is much more effective, delivering results closer to the optimum [20]. Further attempts were made to improve its computational performance with possible applications within larger networks [7]. The number of seeds was analysed to find the minimal effective seed sets [27]. Other solutions considered costs in the form of budgeted solutions [29]. The negative impact of high-intensity seeding on users was identified as an over-exposure problem [1]. Another possible goal is limiting overlap between seeds and maximising the distance between them. This is the aim of the Vote-Rank algorithm, which ranks seed candidates by the votes they acquire from direct and indirect neighbours, favouring candidates located further from other seeds [38].

Most earlier solutions focused on the selection of seeds at the beginning of the process, without additional actions during the process. Another possibility is spreading seeding over time, with only a fraction of the seeds used at the beginning, in the form of sequential seeding [13], seeding scheduling [34], or adaptive seeding [33]. The main mechanics are based on avoiding the selection as seeds of nodes with a high potential to be naturally activated by their neighbours. Sequential seeding can be used to revive stopped processes or to add seeds when the processes are still ongoing [14]. This approach was proven to never deliver worse results than single-stage seeding [17]. Extensions showed how the performance of sequential seeding is affected by the topology of networks [26], entropy-based centrality [30], and effective degree [12].
Static rankings used for seed selection only at the beginning of the process create the risk that nodes selected as seeds would have been activated anyway by the natural diffusion process. Sequential seeding uses a sequence of seeds instead of using all of them in a single step, and delivers a better result by exploiting the potential of the natural diffusion process. This paper presents an approach that improves the Vote-Rank algorithm by recomputing rankings and by using only nodes that are not yet activated and are therefore still effective from the perspective of the diffusion process. An earlier study showed that seeds generated with Vote-Rank and used sequentially deliver better results than with a single stage [13], but only a static Vote-Rank ranking generated at the beginning of the process was analysed and used. The approach presented here avoids gathering votes from network nodes that are no longer effective for information spreading, meaning they were already activated and used for information spreading.

Although Vote-Rank is effective for seed selection, the static ranking generated only at the beginning of the process creates the possibility that initially good candidates with a high number of votes are no longer good candidates for seeding in the next stages. The proposed approach assumes the creation of new rankings based on Vote-Rank before seeding actions. During rank computation, voting only considers votes from nodes available for activation. Already active nodes are not considered. As a result, not yet activated seed candidates with a higher number of direct and indirect connections are preferred.

To demonstrate the potential performance of the proposed approach, a toy example is presented in Fig. 1. A small network of nine nodes was generated with the use of the Watts-Strogatz model [37]. A rewiring probability of 0.1 was used and a value of two was assigned to the neighbourhood within which the vertices of the lattice were connected.
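The recomputation rule described above, with votes gathered only from nodes that are not yet activated, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the commonly described Vote-Rank rules (every node starts with voting ability 1, an elected node stops voting, and its neighbours lose 1 divided by the average degree), breaks ties by the lowest node id, and uses an unrewired ring lattice instead of the rewired Watts-Strogatz network. The names `ring_lattice` and `dynamic_vote_rank` are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]  # undirected graph as an adjacency dictionary


def ring_lattice(n: int, nei: int) -> Graph:
    """Ring of n nodes, each linked to its nei nearest neighbours on both sides.

    Assumes n > 2 * nei; no rewiring is applied (simplification).
    """
    graph: Graph = {u: set() for u in range(n)}
    for u in range(n):
        for step in range(1, nei + 1):
            v = (u + step) % n
            graph[u].add(v)
            graph[v].add(u)
    return graph


def dynamic_vote_rank(graph: Graph, active: Set[int], n_seeds: int) -> List[int]:
    """Rank up to n_seeds inactive nodes using votes cast by inactive nodes only."""
    # Reduced network: drop activated nodes and every edge touching them.
    sub = {u: {v for v in nbrs if v not in active}
           for u, nbrs in graph.items() if u not in active}
    if not sub:
        return []
    avg_degree = sum(len(nbrs) for nbrs in sub.values()) / len(sub)
    decay = 1.0 / avg_degree if avg_degree > 0 else 1.0
    ability = {u: 1.0 for u in sub}  # voting ability of every remaining node
    elected: List[int] = []
    while len(elected) < min(n_seeds, len(sub)):
        # Score of a candidate = sum of the voting abilities of its neighbours.
        scores = {u: sum(ability[v] for v in sub[u])
                  for u in sub if u not in elected}
        # Ties are broken by the lowest node id (an arbitrary choice).
        best = max(sorted(scores), key=lambda u: scores[u])
        elected.append(best)
        ability[best] = 0.0  # an elected node no longer votes
        for v in sub[best]:  # its neighbours lose part of their voting ability
            ability[v] = max(0.0, ability[v] - decay)
    return elected


toy = ring_lattice(9, 2)
static_ranking = dynamic_vote_rank(toy, set(), 3)        # nothing activated yet
reduced_ranking = dynamic_vote_rank(toy, {0, 2, 4}, 1)   # nodes 0, 2 and 4 already active
```

Calling the function with an empty `active` set gives a static ranking, while passing the currently activated nodes before each seeding action gives the dynamic variant, so both can be compared on the same network.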
Simulation was performed with the use of the coordinated execution proposed in [17], where agent-based simulation processes are not based on randomly generated values in each run, but on values assigned to network edges A → B, representing the probability of passing information from node A to B and from B to A. If a propagation probability p is assigned to the whole process, then information is passed through all edges with weights ≤ p. The number of network versions with weights is equal to the planned number of runs. The main advantage of this process is the ability to compare methods under identical conditions. The same approach was used for all simulations here. Figure 1 shows the network used with weights assigned to the edges and simulated spreading according to the independent cascade model [20]. Figure 1 (A) shows the process based on sequential seeding with the use of static Vote-Rank and Fig. 1 (B) shows the process based on dynamic Vote-Rank. Descriptions are provided within the figure caption together with the mechanics behind the dynamic Vote-Rank approach and the advantage produced by a higher number of activations within the network.

The experimental study was planned in two stages, with different experimental plans, goals and assumptions. In the first stage, illustrated in Fig. 2 (Stage I), the main goal was to analyse differences between the coverage of processes with the use of static (computed only once at the beginning) and dynamic (recomputed before additional seeding takes place) Vote-Rank. The ranking of nodes is created in step 0, yielding the Vote-Rank ranking VR(0). In n subsequent steps additional seeds are selected from the ranking according to the sequential approach and new activations are performed. Seeding is conducted in revival mode, after the process dies out. While a node ranking based on Vote-Rank is effective at the beginning of the process, more and more nodes are activated as the process continues.
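The coordinated execution and revival-mode seeding described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the simulation code used in the study: one seed is used per step, the ranking is static, and the names `draw_weights`, `spread` and `sequential_seeding` are hypothetical. Each direction of an edge receives a pre-drawn weight, and information passes along A → B only if that weight is ≤ PP, so two methods run on the same weights face identical conditions.

```python
import random
from typing import Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]
Weights = Dict[Tuple[int, int], float]  # (source, target) -> pre-drawn weight


def draw_weights(graph: Graph, rng: random.Random) -> Weights:
    """Draw one weight per edge direction; one such draw corresponds to one run."""
    return {(u, v): rng.random() for u in sorted(graph) for v in sorted(graph[u])}


def spread(graph: Graph, weights: Weights, pp: float,
           active: Set[int], frontier: List[int]) -> None:
    """Independent cascade until the process dies out (modifies active in place)."""
    while frontier:
        newly_active: List[int] = []
        for u in frontier:  # each newly activated node gets a single attempt
            for v in sorted(graph[u]):
                if v not in active and weights[(u, v)] <= pp:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active


def sequential_seeding(graph: Graph, weights: Weights, pp: float,
                       ranking: List[int], n_seeds: int) -> Set[int]:
    """Revival mode: add the next seed from the ranking after the process dies out."""
    active: Set[int] = set()
    used = 0
    for node in ranking:
        if used == n_seeds:
            break
        if node in active:  # activated naturally, so the seed is saved
            continue
        active.add(node)
        used += 1
        spread(graph, weights, pp, active, [node])
    return active


# Hypothetical four-node path with hand-picked weights (illustration only).
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
fixed = {(0, 1): 0.05, (1, 0): 0.5, (1, 2): 0.5,
         (2, 1): 0.5, (2, 3): 0.05, (3, 2): 0.5}
covered = sequential_seeding(path, fixed, 0.1, [0, 1, 2, 3], 2)
```

In this small example node 1 is activated naturally by seed 0, so the second seed is spent on node 2 instead, which mirrors the skipping of already activated candidates in the toy example.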
The proposed approach focuses on better utilization of the knowledge about activations. Vote-Rank is recomputed before each seeding action. Recomputation is based on a reduced network without activated nodes. It is assumed that dynamic ranking will deliver better network coverage, which is illustrated in Fig. 2 (A2). At each seeding step i, Vote-Rank is computed and the new ranking VR(i) is used. In Stage II (Fig. 2) the study focuses on the effects of the frequency of rank recomputation on final coverage and on computational costs related to the time of rank calculation. Figure 2 (B1) shows the approach with recomputation in every seeding step. It is expected to be the most effective in terms of coverage, but with the highest computational time needed to create new rankings. Another possibility is to reduce computational time with medium intervals between recomputations and a medium coverage increase, Fig. 2 (B2). As the intervals between recomputations grow, the performance of selected seeds will drop. The lowest performance is expected for Fig. 2 (B3), but it should still be better than for processes based on static rankings.

Fig. 1. (A) Information spreading process based on sequential seeding and static Vote-Rank; the initially computed ranking is used during the whole process. (B) Information spreading process based on sequential seeding and dynamic Vote-Rank computed in every step when seeding takes place. Both sequential seeding processes use three seeds, with a single seed used per simulation step. The contagion process is based on the Independent Cascade Model with propagation probability PP = 0.1. Every edge is assigned two weight values; therefore, to activate a neighbour of node A, say node B, the first value on the edge between A and B must be smaller than or equal to the propagation probability. (A) Information spreading is initiated by three seeds used in sequence in the form of sequential seeding. Nodes in the network are ranked by the Vote-Rank algorithm, with the top four nodes presented in the table. The first seed is used at the beginning (AI), the second seed is used when the process dies out (AII), and the third seed in the same way (AIII). In step AIII we select the last seed: since node 3 from the Vote-Rank ranking was already activated in a natural way, node 7 is selected as a seed. This process ends with a total of 6 activated nodes. (B) During this process we compute Vote-Rank at each seeding step. Since only one seed is used in each step, only the node with the highest Vote-Rank value is needed. (BI) In the first step, node 0 is selected as a seed according to the ranking. It tries to activate its neighbours with PP = 0.1, and nodes 2 and 4 become active because the weights assigned to the edges allow transmission. (BII) In the next step, previously activated nodes spread the infection further, node 3 becomes active, and node 6 is activated as a seed. (BIII) In the last stage, node 1 is selected as a seed and it activates node 8. As a result, 7 nodes within the network are activated.

The experimental setup is based on two types of Vote-Rank rankings (static and dynamic) and 10 real networks N1-N10 from [8, 18, 22, 22, 23, 24, 28, 31, 35, 37] respectively, containing from 1133 to 16264 nodes and from 5451 to 146160 edges. The values used for all parameters are presented in Table 1. Simulations were performed ten times per configuration and results were averaged. For Stage I, together with other diffusion parameters, we obtained R × N × PP × SF × SP combinations for simulations in which seeding takes place each time the diffusion dies out. Propagation probability (PP) represents the propagation probability according to the Independent Cascade Model [20]. Each activated node contacts all of its inactive neighbours and activates each of them with the given probability, with only a single attempt possible. Seeds fraction (SF) represents the percentage of nodes selected as seeds. Number of seeds per step (SP) represents the number of seeds used in each step of sequential seeding.
These parameters result in 4,500 configurations. While the goal of Stage I was to compare the performance of Vote-Rank based on static and dynamic rankings, the main goal of Stage II was to analyse the impact of recomputation frequency on final coverage within the network. Five recomputation intervals (RI) were used. This creates an experimental space with R × N × PP × SF × SP × RI combinations for simulations with seeding at fixed intervals, which gives a total of 22,500 combinations.

The overall analysis compared the dynamic Vote-Rank based approach with static Vote-Rank for sequential seeding, with results presented in Fig. 3 (A). In all simulation cases the dynamic approach yields coverage that is not worse, and in most cases higher, than the static approach. The largest improvement obtained is at the level of 40%. According to the Wilcoxon test, there is a statistically significant (p < 0.05) difference between the spreading coverage for static and dynamic rankings. The value of the Hodges-Lehmann estimator, at the level of 27.564, indicates a significant improvement in the results. Figure 4 (A) and (B) show all simulation cases in terms of (A) seeds per step and (B) propagation probabilities. For propagation probability the most noticeable difference is for PP = 0.1. A further tendency is visible: the higher the propagation probability, the lower the increase in performance.

In the next stage the roles of propagation probability, number of seeds and network were analysed. Figure 5 (A) shows a systematic decline in performance as PP increases. The increase in coverage drops from over 8% for PP = 0.1 to nearly 2% for PP = 0.9. The Wilcoxon test was used to analyse results for propagation probability (PP). For PP = 0.1 the Hodges-Lehmann estimator was at the level of 11.69, while for PP = 0.9 it was at the lower level of 5.37. Statistical significance of the results was confirmed with p < 0.05. A two-fold drop in the difference is also visible here.
The values oscillate mostly in the range from 8.4 to 11.69. A similar tendency is visible in Fig. 5 (B), i.e. for the number of seeds per step. For one seed per step the increase is 8%, for two seeds about 7%, and it falls by almost half, to about 4%, for four seeds. A decrease to the level of 1% occurs for 16 seeds per step. Based on the Wilcoxon test, taking into account the number of seeds per step, differences between dynamic and static rankings are visible. A downward trend in the difference is observed, and the difference is statistically significant for each number of seeds per step. We see a two-fold difference in the Hodges-Lehmann estimator, from 17.11 to 8.8 (for SP = 2: 17.11; for SP = 4: 15.33; for SP = 8: 12.46; for SP = 16: 8.8). Figure 5 (C) shows differences in results between the networks used. For networks N1 and N2, the increase in coverage stays above 9%, followed by a decrease to about 1% for network N3. Network N4 maintains a level close to 10%. From network N5 to N10, we see a clear decline to 2-4%. Here the results fall into two groups: a large decrease for networks N5 to N10 and a higher increase for networks N1, N2 and N4. The Wilcoxon test was also used to compare results for pairs of dynamic and static rankings for all networks N1-N10. Statistical significance (p < 0.05) was obtained. The results range from 5.37 for network N3 up to 10.01 for network N2. Other results reach values close to 9.00, so they are closer to the network with the best result.

In this stage of the analysis we considered the pros and cons of the dynamic Vote-Rank based approach in terms of seeding percentage and recomputation interval, to find out how they affect coverage performance and computational time. Seeds per step (SP) denotes the number of seeds used in every stage of the sequential seeding process. Fig. 6 (A) shows how the number of seeds per step (SP) affected coverage performance for each recomputation interval.
The lowest coverage performance was observed for 16 seeds per step and an interval of 16, while the highest coverage performance was obtained for 1 seed per step with recomputation in each step. The mean coverage performance is 77.79 for the 1-seed group, 76.29 for 2 seeds, 75.75 for 4 seeds, 74.96 for 8 seeds, and 70.85 for 16 seeds. This means that coverage performance decreases as the number of seeds per step and the interval between recomputations grow. Fig. 6 (B) shows how the number of seeds per step and the interval affect computational time. The longest calculation time was observed for 16 seeds with the highest interval, and was equal to 75.25 s. The shortest calculation time was observed for 1 seed, and was 10.99 s. For 2 seeds it was 11.72 s, for 4 seeds 14.76 s, and for 8 seeds 22.09 s. As we can infer, adding more seeds per step is not profitable: both computational time and coverage performance are worse than with a lower number of seeds.

In terms of recomputation interval (RI), we carried out an analysis to find out how the recomputation interval affects efficiency. Fig. 7 (A) shows coverage performance, with each colour representing results for a different number of seeds per step, while Fig. 7 (B) shows computational time. In Fig. 7 (A) we can see the tendency that the greater the interval, the smaller the coverage obtained. We can also notice relationships concerning seeds per step similar to those in Fig. 6 (A) and (B). The most effective combination of recomputation interval and seeds per step is a small value of both. As for computational time in Fig. 7 (B), with a smaller recomputation interval there is no need to calculate a ranking for several steps ahead. Consequently a shorter ranking is calculated, which turned out to have a positive impact on computational time.
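The Hodges-Lehmann values reported alongside the Wilcoxon tests in this section can be read as a typical difference in coverage between two compared groups. A minimal sketch follows, assuming the paired (signed-rank) form of the estimator, i.e. the median of all pairwise Walsh averages of the differences; whether the study used exactly this variant is not stated in the text. The function name and the sample numbers are hypothetical and are not taken from the experiments.

```python
from statistics import median
from typing import Sequence


def hodges_lehmann_paired(first: Sequence[float], second: Sequence[float]) -> float:
    """Median of Walsh averages (d_i + d_j) / 2, i <= j, of paired differences."""
    if len(first) != len(second) or not first:
        raise ValueError("both samples must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(first, second)]
    walsh = [(diffs[i] + diffs[j]) / 2
             for i in range(len(diffs))
             for j in range(i, len(diffs))]
    return median(walsh)


# Hypothetical coverage values for three paired configurations.
dynamic_coverage = [12.0, 15.0, 20.0]
static_coverage = [10.0, 10.0, 10.0]
shift = hodges_lehmann_paired(dynamic_coverage, static_coverage)
```

A positive value indicates that the first sample tends to exceed the second, which is how the estimates for dynamic versus static rankings are interpreted here.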
Analyzing the intergroup comparison using Wilcoxon tests, we can see that the smallest differences between the intervals are shown when interval 1 is compared to 4, and interval 2 to 4. They are at the level of 14.61. The biggest differences appear when comparing intervals 8 and 16, where they reach about 35.24, i.e. nearly two and a half times the smallest values. For the most part, results lie in the range of 14.61 to 21.41. All comparison results are presented in Table 2.

The main goal of this study was to analyse the effects of seed selection for sequential seeding with the use of dynamic rankings generated with the Vote-Rank algorithm. In the typical approach, network nodes are ranked once at the beginning of the process and seeds are selected according to their rank. As the spreading process progresses and the state of the network changes, nodes with high potential for seeding at the beginning may no longer be effective. This occurs, for example, if a high fraction of their neighbours are already activated. In the proposed approach, nodes are ranked with the use of a network reduced by already activated nodes. Votes are gathered only from nodes able to be activated. The results demonstrated the performance of the proposed approach in revival mode, when additional seeding occurs after the process dies out. The results were dependent on network characteristics, and the increase in performance when compared to the static version was above 10%. The best results were observed for low propagation probabilities. High performance was observed for a low number of seeds used in each step, with the best result for one seed per step. A higher recomputation frequency increased the performance, with the best results obtained for recomputation in every step, but this results in higher computational costs. In many cases, larger intervals between recomputations still improved the performance with lower computational costs.
From the perspective of real applications, it is observed that recent marketing solutions focus on adaptive approaches with the use of knowledge gathered from earlier stages of campaigns. The same can be applied to viral marketing, with more natural strategies based on spreading budgets and seed allocation over time. This creates potential for the use of dynamic Vote-Rank, with the ability to cope with large networks, as was shown for its static version. The presented findings provide several future directions for adaptive seeding and the usage of knowledge from network states observed when information spreading occurs. Future work could extend the proposed approach toward a more adaptive version with the ability to estimate the time when the recomputation should be performed to maximise the outcome. Another direction is modification of the vote counting method with the use of information about activations within the network.

[1] Mitigating overexposure in viral marketing
[2] Spread of (mis)information in social networks
[3] Competitive influence maximization in social networks
[4] Automated diffusion? Bots and their influence during the 2016 U.S. presidential election
[5] Scalable influence maximization in social networks under the linear threshold model
[6] An agent-based model of epidemic spread using human mobility and social network information
[7] CELF++: optimizing the greedy algorithm for influence maximization in social networks
[8] Self-similar community structure in a network of human interactions
[9] It was a Facebook revolution: exploring the meme-like spread of narratives during the Egyptian protest
[10] Seeding strategies for viral marketing: an empirical comparison
[11] Temporal network structures controlling disease spreading
[12] Dynamic rankings for seed selection in complex networks: balancing costs and coverage
[13] Balancing speed and coverage by sequential seeding in complex networks
[14] Seeds buffering for information spreading processes
[15] The multidimensional study of viral campaigns as branching processes
[16] Compensatory seeding in networks with varying avaliability of nodes
[17] Probing limits of information spread with sequential seeding
[18] Reactome: a knowledgebase of biological pathways
[19] How to run a campaign: optimal control of SIS and SIR information epidemics
[20] Maximizing the spread of influence through a social network
[21] The dynamics of viral marketing
[22] Graph evolution: densification and shrinking diameters
[23] Learning to discover social circles in ego networks
[24] The DBLP computer science bibliography: evolution, research issues, perspectives
[25] Decomposing the value of word-of-mouth seeding programs: acceleration versus expansion
[26] Sequential seeding for spreading in complex networks: influence of the network topology
[27] Minimizing seed set for viral marketing
[28] Scientific collaboration networks. I. Network construction and fundamental results
[29] On budgeted influence maximization in social networks
[30] Sequential seeding strategy for social influence diffusion with improved entropy-based centrality
[31] Why anchorage is not (that) important: binary ties and sample selection
[32] Generalized epidemic mean-field model for spreading processes over multilayer complex networks
[33] Adaptive seeding in social networks
[34] Improving information spread through a scheduled seeding approach
[35] Software systems through complex networks science: review, analysis and applications
[36] Maximizing the spread of influence via generalized degree discount
[37] Collective dynamics of 'small-world' networks
[38] Identifying a set of influential spreaders in complex networks

Acknowledgments. This work was supported by the National Science Centre, Poland, grant no. 2016/21/B/HS4/01562.