key: cord-0137977-wqc49r9a
authors: Sahnan, Dhruv; Goel, Vasu; Masud, Sarah; Jain, Chhavi; Goyal, Vikram; Chakraborty, Tanmoy
title: DiVA: A Scalable, Interactive and Customizable Visual Analytics Platform for Information Diffusion on Large Networks
date: 2021-12-12
journal: nan
DOI: nan
sha: 39be6c9305844a23812f7ab9b42337ab73b80c49
doc_id: 137977
cord_uid: wqc49r9a

With an increasing outreach of digital platforms in our lives, researchers have taken a keen interest to study different facets of social interactions that seem to be evolving rapidly. Analysing the spread of information (aka diffusion) has brought forth multiple research areas such as modelling user engagement, determining emerging topics, forecasting virality of online posts and predicting information cascades. Despite such ever-increasing interest, there remains a vacuum among easy-to-use interfaces for large-scale visualisation of diffusion models. In this paper, we introduce DiVA -- Diffusion Visualisation and Analysis, a tool that provides a scalable web interface and extendable APIs to analyse various diffusion trends on networks. DiVA uniquely offers support for simultaneous comparison of two competing diffusion models and even the comparison with the ground-truth results, both of which help develop a coherent understanding of real-world scenarios. Along with performing an exhaustive feature comparison and system evaluation of DiVA against publicly-available web interfaces for information diffusion, we conducted a user study to understand the strengths and limitations of DiVA. We noticed that evaluators had a seamless user experience, especially when analysing diffusion on large networks.

Information diffusion is one of the core areas in the study of social networks, both online and offline. A diffusion starts from a sender/seed node that spreads the information to its neighbours, who further spread it to their neighbours, and so on. Thus, a diffusion model achieves its task by identifying a path or a tree (branches representing the transmission), capturing the evolution of information over time. On a larger scale, since communities are interlinked, this information can also jump from one community to another. While in the real world, epidemic models are being successfully applied to the study of dynamics of disease outbreaks [1] , only recently have they been utilized to study the spread of information on the digital platforms [2] .

Diffusion dynamics is used to understand the flow of information, diseases, content [3] , identifying behaviour patterns of influential users [4] , etc. Despite a growing body of work that focuses on mathematical modelling of information diffusion, limited work has been attempted to accommodate visual analysis of diffusion dynamics [5, 6] . Recently, there has been tremendous development in analysing epidemics and their spread due to the COVID-19 pandemic [7] [8] [9] [10] . However, due to their niche application and platform dependency, the tools as mentioned above cannot be easily exported to work for different social platforms and settings. In addition, while all major technology companies have access to an extensive network of users and resources, the visualisation supporting these networks is usually in-house 1 , and close-sourced 2 . Moreover, there is a limited number of publicly-available interfaces to simulate and visualise information diffusion on large networks.

To overcome the shortcomings as mentioned earlier, in this paper, we introduce DiVA (Diffusion Visualisation and Analysis) -an open-source, customised web interface for the study of information diffusion. The proposed tool allows users to study complex social networks visually, in the form of a clustered node-link structure, making it easier to study numerous diffusion patterns (spread/adoption) of posts, news articles, blogs, music, social trends, epidemic outbreaks, political campaigns and public opinion. DiVA presents the diffusion analysis in three different forms -(a) statistics, (b) plots, and (c) an interactive diffusion visualisation over the network. While the first two are macroscopic results that any network analysis library can reproduce, the last feature allows for the microscopic study of information diffusion at a community (or even node) level and at specific timestamps. Note that the focus of this paper is to provide a salable and easy-to-use system for studying the existing diffusion models rather than proposing another information diffusion model. The most unique and promising feature of DiVA is its ability to visually and statistically compare the simulations of two diffusion algorithms side-by-side. Hence by extension, it supports simultaneous comparison of the results of a diffusion algorithm against the real-world (a.k.a, ground-truth) diffusion patterns. This allows the users to analyze the changes in diffusion trends when models or parameters are modified and can help practitioners determine the best course of action under varying settings. Furthermore, keeping the needs of fellow researchers in mind, DiVA allows the user to simulate their custom algorithms on the platform and analyze how they fare in a real-world network as well. Finally, to provide the holistic experience of working with social networks, DiVA also supports the traditional statistical network measures such as PageRank, node clustering, degree distribution, etc.

We provide a detailed comparison of the features of DiVA with other available applications and point out how DiVA fares much better than any of these existing tools. Although there are a few tools and packages available to visualise diffusion, to the best of our knowledge, there are none with a comparable set of features and performance exhibited by DiVA.

Contributions. In short, our major contributions are as follows:

(1) DiVA is the first visualisation platform that provides users with the opportunity to visually compare two diffusion models or one diffusion model with the ground-truth diffusion pattern. (2) DiVA provides a highly customised environment to the users -it allows them to upload custom diffusion algorithms and networks for analysis. (3) DiVA provides several metrics to explain the diffusion pattern of a model with downloadable reports. (4) Unlike existing tools/platforms, DiVA is highly scalable. We perform both qualitative and quantitative evaluations to test the same.

To help the readers get a better sense of the complete workflow of DiVA, we have prepared a video to demonstrate the tool, which is publicly accessible at the anonymous link. 3 

Our primary motivation for developing this tool is a lack of an easy-to-use, scalable and interactive interface for diffusion visualisation. The results of diffusion algorithms are hard to explain, given the complex interactions underpinning them. Visualising these results on the network at different timestamps and comparing them against the aggregated metrics can help better understand how the information is gradually spreading. With DiVA, practitioners can effectively communicate their findings to a broader range of audiences. Combining different forms of information and presenting them in different visual formats enables the user to cherry-pick the level of granularity and reporting desired by them. For example, analysing the raw results of a diffusion model is not easy for a non-expert user. Meanwhile, manually visualising the complex sequence/traversal of information is untraceable even for expert users.

As a customised and open-source tool, our primary target audience is the research community that can quickly check the viability of existing or custom diffusion methods or perform ground-truth analysis. This analysis can range from studying computer viruses spread over a network to real viruses over a population. Practitioners within the research community and stakeholders like law enforcement and news media can use DiVA to study the dissipation of hate speech or fake news and test control strategies for the same. The dual diffusion visualisation acts as a visual source for A/B testing of two similar (but differently tuned) or two contrasting hypotheses/policies to compare their impact side-by-side. This can help policymakers decide on the best way forward for a range of applications from vaccination schema to content marketing. Additionally, researchers working at mathematical modelling of newer diffusion systems can dry run the new model's robustness against baselines even on large networks.

While for the initial user study of DiVA, we started with a pool of participants that more closely resemble researchers, we will be surveying other stakeholders for future iterations and feature integration.

In this section, we explain some of the challenges faced in performing background study and initial iterations of building DiVA. We also highlight how these challenges shaped our design decisions.

• Closed-sourced desktop tools: Network visualisation tools are often closed-source and licensed as desktop applications. Testing such tools for performance is difficult (and reverse engineering can even be illegal [12] and NDlib-Viz. LargeViz was build with the purpose of supporting visualisations for large graphs but is no longer maintained. The last commit on LargeViz happened in 2016 6 . A similar fate has befallen NDlib-Viz -a visualisation web tool designed to run and analyse diffusion models supported by its parent library NDlib [13] . While NDlib is still actively maintained, the visualisation library to attract non-experts and non-programming users fell short of its purpose by not providing active documentation support. The last commit on NDlib was in 2018 7 . While we included the NDlib-Viz in our comparative study, we found that the front-end could not support a random graph of more than 10 nodes. As discussed in detail in Section 6.2, we see the choice of embedding framework impacted NDlib-Viz's ability to load and visualise large graphs. We had to run a few iterations to reach the combination of WebGL and three.js that proved best for our use case (details of the system architecture are provided in Section 5). • Handling large graphs: In a user survey on challenges in graph processing [14] , scalability and visualisation were reported as top challenges when dealing with large graphs. Loading large graphs is resource-intensive; we look at ( 2 ) relationships to map in the worst-case scenario, where is the number of nodes. It also causes an issue with the readability of large graphs, as quickly adjusting layouts is not possible. Various layout mechanisms have been proposed and examined [15] to increase the readability of large graphs. In the case of DiVA, since we aim to look at the spread of information at both macro and microscopic levels, dimensionality reduction-based techniques were not considered. We initially started with layouts that supported node features. However, it quickly became apparent that visualising node features on the main canvas was not feasible for large graphs. We shifted this functionality under the node viewing component where the information of the currently clicked/selected node can be viewed, and this information dynamically changes upon clicking another node. More than the node feature, the underlying graph structure does the heavy lifting of determining how the information will flow with the network. Thus, by abstracting node-level information, we could map the network with edgelists which are faster to process. We use force-directed based layout optimisation. D3.js has out-of-box support for this layout. By default, it helps preserve the community structure and ensure that adjacent nodes are as close as possible.

Lastly, we realised that we could decrease the reloading time by having the layout coordinates precomputed for the next session by allowing users to save the loaded graph (details of the system architecture are provided in Section 5). • Level of modularity: R Epidemics Consortium (RECON) 8 provides a suite of packages covering various aspects of analysing and visualising epidemic models. However, to use different functionalities, different libraries need to be imported. Instead of opting for such a breakdown of our application, we decided to keep various aspects intact to provide both networks and diffusion level analytics and visualisation under one hood. Once running, the users are not required to set up or integrate any additional resources and stay within the browser for a consistent user experience. It should be noted that the code structure is modular, with API endpoints defined for different execution units.

• UI Design decisions: It took a few iterations to determine the placement of various components within DiVA's UI. While we did not have the resources to run exhaustive UI testing to determine the best placement of the components, the features were clubbed based on the functionality they provided. The default panels were set based on the core usage of simulating a diffusion model and visualising its output. Additionally, instead of keeping the interface as a single-page web application (like NDlib-Viz), we provide separate browser tabs for the single and dual visualisation modes. When users click on dual visualisation, they are taken to a new browser tab. It frees users to run multiple simulations on the same graph separately and download their results. In hindsight, our two-step user study (Section 6.3) highlights that user finds the design framework intuitive and easy to follow.

DiVA offers two main diffusion visualisation modes -Primary Visualisation mode, and Dual Diffusion mode. Additionally, the system offers a wide array of features to analyse the network provided by the user. This section explores these features for both the visualisation modes and their utilities. An end-to-end workflow of the interface is enlisted in Appendix C.

The home screen allows the user to input an initial network. The user can choose to generate a random graph using the Erdős-Rényi generator model [16] or load any graph representation supported by NetworkX ( Figure 1 ). In addition, users can also load a previously exported .diva file. 

Once a network is loaded, users can click on the graph's nodes to view the details. However, to view label or attribute information about all the nodes, we have a different view format -the Data View. It provides a columnar representation of all nodes at once. By default, it displays the following node attributes -node-id, total degree, in/out-degree and label. As and when the user runs more node level statistics, the respective columns appear in the Data View. Consequently, it provides the ability to search and sort by id and attributes. An example of this view is provided in Figure 2 . 

It contains the timeline that controls the visualisation. The user can either play it as an interactive video by dragging a marker for a time-lapse effect or view the frames step by step. In dual diffusion mode, the panel contains the timeline that simultaneously controls visualisation on both networks.

In both Primary and Dual Visualisation modes, the navigation bar houses (a) some download options -Download Network and Download Report, and (b) user information of the active user, with an option to end the current session.

By default, this mode is loaded by the system, as illustrated in Figure 3 . The Main Panel loads the central canvas, which contains the network, and all the visualisations takes place here. The panels on both sides of the main canvas contain the tools necessary to perform the diffusion analysis. The panels on top and bottom act as the navigation bar and media control panel, respectively. Utility of Primary Visualisation Mode: The platforms that support network-level statistics hardly support diffusion analysis and vice-versa. DiVA, on the other hand, offers a one of its kind set-up bringing the two together. It is only natural that while analysing the diffusion patterns of a network, the user may also be interested in running network-level statistics. For example, a user interested in simulating the information spread starting at the most influential nodes can get the Page Rank information of the nodes from the Data View (Section. 4.2) and provide a custom seed list accordingly.

At the core of DiVA lies its ability to visualise largescale networks on a web-based interface. Once a user sets the initial network (Section 4.1), DiVA displays it in the main panel where a node's colour is graded according to its degree. The darker the node's colour, the higher its degree of centrality. After submitting a diffusion query, the user can graphically view the spread of the diffusion model by clicking the Play Button at the bottom panel. Once the graph view is triggered, it dynamically updates the colour of a node as per the node status at the given iteration, running from 0 ≤ ≤ maximum_iterations. An intentional delay is added to help the user grasp the node colourisation change between two iterations. The main aim of dynamically visualising the diffusion model is to help the users better understand the depth and breadth of the spread of the contagion entity. Similar to the play backwards/forward functionality of a video clip, it allows the user to check out the diffusion snapshot at a particular timestamp and even pause it. As the slider is moved backwards/forward in timestamps, the graph appearance and diffusion statistics are updated accordingly. The user is also free to download (as a JSON file) iteration-wise results returned by the selected diffusion algorithm.

• The Diffusion Tab allows the user to select a diffusion model along with its configurations as described below.

-Diffusion Algorithm: Users can select from the already supported diffusion algorithms or upload their own algorithms to run as shown in Figure 4 . The code format for custom algorithm is provided in Appendix A. The set of diffusion models and their parameters currently supported by DiVA are listed in Table 7 of Appendix D. -Maximum Iterations: Each iteration is a discrete-time process where all the nodes are evaluated, and their state is updated based on the update rules of the model. The maxi-mum_iterations parameter specifies how many such iterations are to be run. -Reset Iterations: Completely resets the diffusion states of the network.

• The Initial Configuration Tab is used to specify the set of nodes that are considered infected at timestamp = 0. DiVA has provision for randomly setting a fraction of nodes to be initially infected. One can also upload a list of infected node ids and modify them once added ( Figure 4 ). • Under the Appearance and Layout Tab (Figure 4 ), a user can customise the colour for the nodes and edges of a graph. Once the diffusion query has been provided, the user can set node colours for each state provided by the selected algorithm. 

Analytic and Stats.

• The Context and Network Statistics Tab primarily displays the basic information about the network as well as the currently selected node. It also provides other network and node level algorithms that can be run by the users as described in Figure 5 . • Diffusion Statistics Tab lists down the node count per class per iteration, updating as the iterations proceed.

Toggling Views.

• The Graph View houses the entire interactive network that the user works upon, and all visualisations take place in this view. It is described as the Main panel in Section 4.5.1.

• The Report View contains a plot for the diffusion trends of the selected diffusion algorithm, after its iterations are complete. As depicted in Figure 6 , a diffusion trend of the SIR diffusion model displays the count per class among the three classes (infected, susceptible and removed) per iteration. 

In DiVA, we introduce an advanced diffusion visualisation view, wherein a user can compare two different diffusion algorithms in a split view on the Main Panel which is illustrated in Figure 7 . 4.6.1 Main Panel: Dual Diffusion Visualisation. As stated before, DiVA aims to visually understand the breadth and depth of a diffusion process. It becomes even more necessary in the dual diffusion mode as more than one moving component is involved. The same set of parameters of maximum_iterations and the initially infected nodes are used for both the models so that a one-to-one mapping of the results per iteration can be obtained. Users can visualise the diffusion in one of the formats:

• Split View: As the name suggests, here, the Main Panel is horizontally split into two, with the same graph appearing in the two splits. The two splits update the node colour of the respective algorithms, with the second diffusion algorithm treated as a ground truth. The main aim here is to view how a change in the model or its parameters causes a change in the rate of spread. • Single View: Here, the results of both the simulations are visualised on a single graph.

Each model updates the nodes as per the corresponding status for each iteration. In case if two models operate on the same node, then the second model's status and colourisation (treated as ground-truth by default) prevail. The main aim of the single view mode is to provide a visual understanding of the convergence and divergence of the results obtained from the two models/ground-truth. The user can switch between the two modes at any point in time, as both the views share the common bottom panel, i.e., the same iteration-number mapping. Here also, the time-lapsed functionality is provided. As the slider moves, the node colourisation in both the views and the diffusion statistics are updated per iteration. The user modification for a class's colour and its impact on the visualisation also are applied in the same manner as in the Graph View of the Primary Diffusion Mode. Figure  7 shows a snapshot of dual diffusion comparison in action for two different diffusion models in Split View format. • Split and Combined View provides the functionality of viewing the two diffusion models post simulation as described above (Section 4.6.1). • The Enhanced Report View contains three set of plots: (a) Diffusion trends one per algorithm- Figure 8 . (b) Plot of F1-score per iteration. (c) Plot of commonly infected nodes per iteration. If for a given iteration, a node's status is set as infected by both the models, then it is counted towards the commonly infected nodes for that iteration. • The Enhanced Diffusion Statistics Tab dynamically lists the number of nodes in each class for both the diffusion algorithms. As an extension of the confusion matrix, we also report Precision, Recall, and F1-score per iteration concerning the "Infected" class. An example is provided in Figure 9 . These metrics help map the effectiveness of the selected model against the ground-truth/benchmark (default assumed to be the second algorithm).

This section provides a breakdown of various technologies used to support different components of DiVA. Being a web application, DiVA is implemented using front-end technologies, namely JavaScript and HTML5 and back-end technologies including Flask 9 (a microservice framework written in Python) and SQLite 10 (a SQL database) for storage. The architectural overview of the web interface that supports DiVA is provided in Figure 10 . We use Google's authentication system and file system based session management, which let multiple users work on DiVA in parallel when used as a deployed application. As DiVA is meant to be an online deployed tool, secure authentication and session management are necessary for the system.

For the visualisations, we rely on the advancement of modern browsers and their increasing computational power to provide a cross-platform tool that is not limited by the user's operating system. The structure and layout of the tool are primarily managed with our custom CSS and JavaScript code. For the basic user-interface elements, we rely on UIkit 3 11 and Bootstrap 4 12 . The front-end and back-end communicate using AJAX and REST APIs.

Due to large file sizes, visualising and analysing large networks can tax the processor and memory. Creating a dynamic network visualisation requires more processing and memory than static 9 Flask: https://flask.palletsprojects.com/ 10 SQLite: https://sqlite.org/ 11 UIkit 3: https://getuikit.com/ 12 Bootstrap 4: https://getbootstrap.com/ visualisations. As such, the storage, file sharing and analytical processes must be done carefully to avoid bottlenecks and blocked processes.

Back-end. Within the Flask server, we use NetworkX for (a) modelling the network into an optimized data structure, (b) computing network statistics, and (c) making use of existing highperformance graph representation file formats. We chose NetworkX, given its ease of use and popularity. Although libraries such as graph-tool [17] and igraph [18] offer better performance in terms of computation, they require complicated setups and do not offer equal support for different operating systems. While we plan to bring out integration for igraph for users looking to work with even more extensive networks; however, for networks with up to 100k nodes, it has been observed that NetworkX is a viable solution.

We make use of a custom network representation file (a JSON based edge list format) and stream it to the front-end asynchronously in-chunks through a custom streaming API using JavaScript web workers. The size of the representation file can reach up to 46 MBs for a network of 75k nodes and 4.5M edges, which, if transferred in one chunk, hogs the server's resources. Additionally, we provide a mechanism for the users to reuse the computed locations of nodes and edges by saving a custom .diva file, which stores a compressed version of all the computations run by the user. It allows the user to load a previously saved graph quicker, as the saved file lets us skip some initial steps that the server performs for setting up the network (effectively reduces the load time by a factor of five for larger graphs. Results reported in Table 2 ). The streaming mechanism helps reduce the server load, essentially enhancing the multi-user functionality of the system.

Front-end. The representation so obtained is fed into d3-force's layout 13 generation algorithm, which calculates node positions by modelling the network as a physical system of interacting particles, and simulates physical forces between the particles, resulting in a clustered node-link structure of the network. Being able to highlight different communities clearly, provides an easy way to study the diffusion process. Large-scale networks are loaded within a few minutes (see Table 2 ) because of the method used to stream coordinates of nodes and edges from the server as computed by the force-directed algorithm. This system essentially results in a network topology that is quite similar to one that is generated by ForceAtlas2 [19] . Further, we use three.js 14 to plot these nodes and the corresponding edges onto a WebGL context, which is used for all network visualisations of DiVA. During the initial architectural analysis of existing web interfaces for diffusion, we found that systems using HTML5 Canvas or SVG contexts for network visualisations become unresponsive for larger graphs; case in point the NDlib-Viz. As depicted in Table 2 , we have been able to overcome this problem by employing a combination of WebGL and d3.js 15 .

Back-end. The server also maintains a user's network state and provides all the statistics related to the network and diffusion algorithms. All network statistics are stored in the database and can be reused. The server calls are executed asynchronously, and the respective visual components are updated on the front-end with the response as generated by the server. For running diffusion algorithms, we use NDlib -a network diffusion library. Additional, DiVA supports custom algorithms (details provided in Appendix A).

Front-end. The additional visualisations of charts and plots for reporting statistics (as depicted in Figure 6 ) are created using the JavaScript charting library, Chart.js 16 . The reports are generated entirely on the front-end and rely only on the server for pre-computed network statistics. Finally, 13 d3-force-layouthttps://github.com/d3/d3-force 14 three.js:https://threejs.org/ 15 d3.js: https://d3js.org/ 16 Chart.js: https://www.chartjs.org/ we also use DataTables 17 , a module built on top of jQuery 18 , which helps us to list out the complete statistical information for all nodes of the network in a tabular format, to allow users to perform a detailed analysis of all network entities (Section 4.2).

As discussed in Section 7, very few visual analytics tools are available for information diffusion. With a prime focus on web interface tools for diffusion visualisation, we compare DiVA with two other web interfaces built for the same purpose -Epinet and NDlib-Viz. Epinet is tool built upon the EpiModel [20] package and hosted through Shiny, an R package. NDlib-Viz is a visualisation module built on top of NDlib [13] . In this section, we perform a three-way evaluation of DiVA comparing feature, performance and user interaction level. The comparative study helps us provide an in-depth analysis of the supported functionalities and the easy-of-use of those functionalities from an end-user perspective for both DiVA and the competing web interfaces.

It is evident from the exhaustive feature comparison in Table 1 that NDlib-Viz presents a bare minimal set of features, which are provided by Epinet and DiVA as well. In addition, DiVA supports a wider array of features in terms of interactive visualisation, advanced network statistics and comparative diffusion analysis, as well as extendable back-end APIs. 

While native desktop applications are easier to scale, they are notorious for lack of cross-platform support and complicated setups. With a web-first approach, using three.js and D3.js, one can scale loading large networks within considerable time. We tested DiVA on some real-world networks going up to a 4.5 million edge count. The results can be seen in Table 2 . The largest network is loaded within eight minutes. Note that under similar system configuration, Epinet and NDlib-Viz became unresponsive once the number of input nodes exceeds 10, 000. Epinet is hosted via Shiny, and the issue around scalability of hosted Shiny Apps is still an active area of development 19 . Being a hosted app that does not support uploading of networks, it is not possible to directly capture the performance metric as for Epinet it was for DiVA. Meanwhile, NDlib-Viz is available as an open-source tool, but it too does not support the uploading of networks. Additionally, the library has not been maintained and updated since 2018, which means adding any code snippet to test its performance is likely to break the existing setup and create overheads that cannot be accounted. Thus, we restricted our manual inspection of these tools to simply testing the creation of random graphs of varying sizes in both these tools, and as reported earlier, we witnessed the systems becoming unresponsive for graphs with even 10 nodes.

In addition to becoming unresponsive on large networks, the technologies used for Epinet and NDlib-Viz restrict performance on smaller networks as well. For example, while NDlib-Viz uses D3.js and D3-force to render the network, it uses a Canvas component which limits the size of the network and causes a considerable drop in the frame rate while zooming and panning the network. Although DiVA also uses D3-force for setting the layout of the network, it does not suffer from these restraints because the final visualisations are plotted on the WebGL component using Three.js, making DiVA's network visualisation relatively interactive and allowing a high frame rate on large graphs as evident from Table 2 . It should also be noted that reloading networks using the .diva format provides a significant reduction in time to load.

We conducted user studies that spanned the development of the tool. In the first study, the participants got a chance to review DiVA and the two other competitor web-interface tools. The process helped us analyse the user's pain points and the efficiency of the system in use when compared to similar tools. In the second stage of the user study, DiVA was exclusively tested for its accessibility and usability.

6.3.1 Comparative User Study. As a part of our first user study, we reviewed the competing web interface based diffusion visualisation tools (Epinet, NDlib-Viz and DiVA). The participants (human subjects) reviewed all the services hosted for them and provided comparative feedback on the three systems. In order to reduce bias towards any single tool, the following steps were considered:

• As our tool is a web-based system; we perform comparative analysis with two other established diffusion visualisation and analytic tools that are also web interfaces by themselves. This makes sure the user experience is uniform. • While Epinet is already hosted, our framework and NDlib-Viz need to be locally set up first. Thus, we hosted both the services and let the users access all three services from an anonymous server. Additionally, using a hosted service further unifies the user experience and focuses on the tools, not the installation and dependencies. • The human subjects were not initially told that DiVAis a new tool; it was instead posed as another established tool. This helps reduce the participants' bias towards an existing system vs a new system. The study inducted 44 human evaluators/subjects. The age of the human subjects ranged from 24-40 years; 40% of them are female, and the remaining were male. Each participant took ∼ 30 minutes to fill the entire survey. At the beginning of the survey, we asked participants about their familiarity with the concept of information diffusion in general ('Yes', 'No', 'Somewhat'). During analysis of the survey, based on the above answer, we divided the participants into three groupsexperts, non-experts and enthusiasts, respectively. Out of the 44 participants, we had 11 experts, 14 non-experts and 19 enthusiasts.

While DiVA supports custom networks, Epinet and NDlib-Viz only provide support to set up a random network; so to begin with, all the subjects were asked to set a random network consisting of 1.5k nodes and 5k edges. Though small, these numbers ensured the competitive services were able to load the network smoothly (Section 6.2). Once the network was established, the subjects were asked to run the SI model of diffusion (for non-expert users, we provided a link to the model's documentation 20 ) and then to evaluate the reports generated for that diffusion. Once the participants completed the workflow, then they comparatively rated the three interfaces on a scale of 1-3 (3 being the best rating). The ratings concerned (a) easy of use and learning, (b) minimal actions to achieving analysis, and (c) overall system capabilities. A summary of these ratings is provided in Table 3 . While DiVA is given the highest rating in each category, when it comes to ease of learning, our tool and NDlib-Viz are seemly closer in rating. This puts DiVA in a favourable position as our framework is more feature-rich yet easy to follow. Notably, it is in providing the overall system capabilities that our tool outshines the others by a considerable margin. To reduce the barrier to entry in the field of information diffusion, we strike a challenging balance between providing a feature-rich yet easy-to-use tool. Our survey results point that while we still have a long way to go, DiVA is a step in the right direction.

6.3.2 Accessibility Study. For our second user study, we approached 58 users 21 (39 male and 19 female), where most of the users belonged to the age group of 24 to 40. While some of the human subjects had already participated in our Comparative User Study (Section 6.3.1), we also inducted new evaluators. It helps reduce the familiarity bias 22 of having interacted with DiVA before. At the same time, we did not replace all the previous evaluators to maintain the continuity of a long-range study for future user evaluations of the tool. Table 3 . Comparative rating of the interfaces -Epinet, NDlib-Viz and DiVA as perceived by the participants. The results are summarised as % of total user (44) agreeing upon a rating for the respective tool. Rating scales from 3 (best) to 1 (worst). Results are rounded off to one decimal.

Rating We again asked the users about their familiarity with the concepts of diffusion modelling. 18 were marked as "expert" users -who had prior experience with similar network diffusion visualisation tools. The 40 "novice" users did not have any experience with a similar tool. However, they explicitly mentioned that they were interested in information diffusion.

As a metric of evaluating usability, we employed the System Usability Scale (SUS) metric [21, 22] (see Appendix B for more details). Based on a ten-item questionnaire, the SUS score assigns a numerical value between 0 and 100 to the system under review. A score above 68 is considered to be average 23 Table 4 . Statistics of user responses for SUS evaluation. The values indicate the number of users who marked that option of the corresponding question. ("S. Disagree": Strongly Disagree, "S. Agree": Strongly Agree).

S. Disagree Disagree Neutral Agree S. Agree I think I would like to use this system frequently. 1 3 8 37 9 I found the system unnecessarily complex. 21 25 9 3 0 I thought the system was easy to use. 0 3 7 24 24 I think that I would need the support of a technical person to be able to use this system. The study proceeded as follows:

• Novice users were given a brief explanation about networks and information diffusion with the help of some epidemic models. • The users were tasked with analyzing the spread of a virus over a static dummy network, representing a population of over 1000. • The users were given a set of instructions to follow to run a diffusion algorithm along with tinkering with the customization and analytical features of DiVA enabling them to explore both the tool's technical and user experience aspects.

• The users were also asked to complete the same task by exploring the system and algorithm configurations independently. • Finally, the users were asked to fill out the survey for us to calculate the SUS score. Table 4 lists the questions and summarizes the user response. Note that the set of questionnaires used was the standard SUS survey template, and we did not tweak any part of it. This helps reduce any bias that can be introduced from our side. Additionally, before the survey, we deployed the tool and shared the link of the hosted application that the users then ran on their browsers. This created a more consistent experience for the users and further mitigated the user bias. Table 5 lists different statistical measures calculated on the SUS scores. The Upper Bound of the SUS score was 97.5 for all user types. From our evaluation, we obtained a mean score of 77.2 (among all users), which is quite above the average score of 68, rating DiVA as "Acceptable" (according to the SUS guideline). Individually for the expert and novice users, their mean scores were way above the threshold (83.1 and 74.6 respectively). As expected, the expert users had a higher mean SUS given their familiarity with diffusion terminology; they could adapt faster to the interface. Even though many of our participants were a novice, they all felt confident in using the system by themselves and did not seem to need additional technical assistance in exploring the system. This validates the design decisions made regarding where different tool components are placed. 

In the past decade, social media platforms have been on an exponential rise. Data availability on platforms like Twitter, Facebook, and Reddit makes it feasible to understand online user behaviour. While there are many studies and tools for visual analysis of user behaviour, little work has been done for visualising information diffusion, simulated or otherwise. To the best of our knowledge, DiVA is the first tool that provides a highly scalable and customisable platform for visualising information diffusion models. Analysing User Interaction. Some existing tools aid in the analysis of social interaction amongst users. This information can be beneficial in tasks like community detection, social recommendations, influential user detection, etc. These tools generally combine computational aspects like graph theory, network analysis, demographic analysis [23] , ethnography and epidemiology to provide users with important semantic and cluster information. Tools such as Vizster [24] , iOLAP [25] , and NodeXL [26] use edge grouping algorithms to create clusters of nodes. Some algorithms such as hierarchical edge bundles [27, 28] , geometric based edge clustering [29] , and motif clustering [30] , use network properties. Incorporating semantic information in clustering algorithms such as semantic substrates [31] , top-N node filtering [32] , and probabilistic topic modeling [33] helps analyse social interaction better.

Analysing Social Media Content. Many tools add to these features by providing content analysis. Tools like FireAnt [34] , Visual-VM [35] , SAVIZ [36] , EpiGrass [37] and Google+ Ripples [38] allow a content-based analysis of the network. Data like geo-tags, though sparse, are useful in analysing mobility patterns. Geographical data is also helpful in analysing user stance, behaviour and knowledge across the world. Other tools like FireAnt [34] , SocialFlow [39] , HistoryFlow [40] and Whisper [41] also provide topic analysis. Extracting the terms, hashtags and topics from the posts helps opinion, product and marketing analysis. This is also useful to study information diffusion in blogs and entertainment sites [42] [43] [44] . Chen et al. [45] introduced an extensive usercentral visualisation and analysis tool for social media data through graphs. On the one hand, some browser extensions [46] [47] [48] are made available to run on top of social media platforms, providing platform-specific analysis of users, content flow, and even content moderation [49] . Meanwhile, some web interfaces like Reddit-Network-Vis 24 that focuses on visualising the top comments of a nested Reddit post, or Topic-Flow [50] , and TweetViz [51] which focus on visualising the temporal evolution of a topic on Twitter, etc. have also been proposed. Changing the platform or the content format will lead to a significant change in these tools' capabilities. We, thus, need a tool that can help us visualise information spread irrespective of the underlying social media platforms.

Analyzing Information Diffusion. While many studies visualise social media data, very few tools support information diffusion algorithms. NDlib [13] is a Python package for analysing information diffusion algorithms. It also contains a basic module that visualises and compares multiple diffusion algorithms on a randomly generated graph. EpiModel [20] is an R package that provides epidemic diffusion models. Built on top of NDlib and EpiModel are the NDlib-Viz and Epinet respectively. While both these visualisation modules provide algorithm simulation, Epinet provides a more robust and customised tool than NDlib-Viz which has a bare minimum visualisation interface. DiVA provides a much more vast array of features and customisation than these tools. RECON 25 provides a collection of R packages for visualisation and analysis of outbreak data. In Section 6, we compare DiVA with these tools in more detail.

Visualizing Large Networks. Analysing large social networks with millions of nodes requires agile visualisation tools. Early works used C/C++ libraries to create fast visualisations [53] [54] [55] . With the development in computational hardware and other languages, many libraries are available for fast, clutter-free, and easy network visualisation. While DyNetVis [56] is one of the opensource, free and actively developed projects in the area of dynamic and large-scale visualisation, it is a stand-alone tool, unlike the web-based options this study explores. Even though DyNetVis provides an excellent set of tools for understanding the temporally evolving networks, it has limited support for diffusion visualisation algorithms. Table 6 lists some of the popular complex network visualisation tools/libraries. Some of them facilitate network analysis, but any of them support diffusion analysis hardy. Some works [57, 58] tried to visualise high dimensional data by reducing it to 2D similar to t-SNE [59] . Nevertheless, they create static graphs, which are not too helpful in analysing diffusion. For diffusion visualisation, a dynamic and customised visualisation library is required. In a web browser environment, we find that using WebGL 26 backed libraries renders the network faster and results in smoother animations than other libraries. Meanwhile, libraries like NepidemiX 27 (outdated now) and Outbreak2 28 (a part of RECON projects) only provide statistical analysis like the growth charts, unlike DiVA which provides an animation-like visualisation of the diffusion process. Epicontacts (RECON package) supports large graph visualisation (not diffusion visualisation), but overall, the projects are very distributed, do not provide an integrated analysis tool and lack many features introduced in DiVA. Other tools like Epigrass [37] , and GLEAMviz [60] focus on epidemic and outbreak simulations. Epigrams support custom simulations but fall short in the visualisation aspect. GLEAMviz is an advanced tool that supports multiple simulation algorithms and geographical visualisation. However, it cannot be employed for social networks. 24 

Type Features Tableau Tool Business analytics tool; supports input from excel and database systems; includes a large collection of visualisations and dynamic geographical data analysis. NodeXL [26] Tool Complex network visualisation and analysis tool; supports input through excel files. Gephi [11] Tool Open source network visualisation and analysis software; supports multiple input formats; extensive support for visual customisation. NetworkX [52] Open source collection of packages for complex network visualisation; supports data manipulation and exporting for easier integration with common data science libraries.

With the increasing availability of large-scale networks, interest in information diffusion has increased. To this effect, we present DiVA, an interactive web-interface based visualisation tool that scales well for large networks. Majorly, it provides three features -support for large networks (along with the ability to save their state for future use), simulation and visualisation of diffusion algorithms, and comparison of two diffusion algorithms in a split window. Additionally, DiVA allows high customizability for the users in terms of the layout and visuals of the network, provides diffusion analysis reports, and facilitates the users to visualise a ground-truth diffusion and even compare it to a simulation of another algorithm. Despite the large array of features supported by the system, we believe that the systematic workflow makes it easy for users of all experience levels to use the tool effectively. The current version of DiVA uses NetworkX for network modelling given its popularity, community support and its pure Python implementation. We plan to provide integration for igraph in further increments. We aim to add some more advanced diffusion algorithms. The world is ever-changing, and so are the networks. We hope to incorporate more tools and feature sets such as the introduction of geotagging information, user influence analysis, deeper community analysis and even dynamic networks into DiVA to make it more useful for both expert and novice users working in the field of network science.

browser. On the landing page, the user needs to authenticate via any google account and then click Start Tool. Note, that in case of using DiVA as a hosted service, these steps can be forgone.

(1) When prompted with the Setup Network window, the user can generate a random graph using Create Network button after setting the necessary parameters, or upload an existing network via the Upload Network. Then, the primary diffusion visualisation mode as described in Section 4.5 will appear. (2) To change the network's appearance, the user needs to go to the Left panel, Appearance ↩→ Nodes or Edges sub-tab. 

Once the user is familiar with the network, they may want to run a few diffusion models on it to see how information spread occur over this network. By default, the primary diffusion visualisation mode is loaded. To download this report as a PDF, they can click on Diffusion ↩→ Download Report. (7) At any point, the user can seamlessly toggle between the Report View and the Graph View modes.

In order to perform comparative analysis between two different diffusion setups, the user needs to use the Compare View tab on the top panel which will open a new window. The interface for the dual diffusion visualisation mode is outlined in Section 4.6.

(1) By default, the network will be loaded in the Split View.

(2) Once the networks are loaded, two diffusion algorithms can be set in the Diffusion tab on the upper Left panel. Here, the user will be able to select a ground-truth result for comparative evaluation. (3) Now the user can operate like Section C.3. The Play button, the Timeline slider and the Report View can be used in the same manner. Although, the enhanced Diffusion Statistics will now be available on the upper left panel. (4) Here, the user has the option to switch to the Single View to visualise both algorithms working on a single network simultaneously. (5) The user needs to switch over to the Report View on the top panel to see a plot of diffusion trends, along with plots for commonly infected nodes and F1-score per iterations.

As it is difficult to accommodate the vast variety of information diffusion models available, in the current version of DiVA, we provide the models (as supported in back-end by NDlib), listed below. For details regarding these diffusion models and their limitations, readers are referred to these surveys of diffusion models [61, 62] . 

The uses of epidemic models

Modeling epidemics spreading on social contact networks

Scoring users' privacy disclosure across multiple online social networks

Analysis of online social network connections for identification of influential users: Survey and open research issues

Socialwave: Visual analysis of spatio-temporal diffusion of information on social media

Opinionflow: Visual analysis of opinion diffusion on social media

Coronatracker: World-wide covid-19 outbreak data analysis and prediction

An interactive simulator for covid-19 trend analysis

An interactive platform to track global covid-19 epidemic

IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

Gephi: an open source software for exploring and manipulating networks

Visualizing large-scale and high-dimensional data

Ndlib: a python library to model and analyze diffusion processes over complex networks

The ubiquity of large graphs and surprising challenges of graph processing

What would a graph look like in this layout? a machine learning approach to large graph visualization

On the evolution of random graphs

The graph-tool python library

The igraph software package for complex network research

Forceatlas2, a continuous graph layout algorithm for handy network visualization designed for the gephi software

Epimodel: an r package for mathematical modeling of infectious disease over networks

SUS -a quick and dirty usability scale

The system usability scale: Beyond standard usability testing, Proceedings of the Human Factors and Ergonomics Society Annual Meeting

Visualizing geo-demographic urban data, CSCW '18

Vizster: Visualizing online social networks

iolap: a framework for analyzing the internet, social networks, and other networked data

Analyzing social media networks with NodeXL: Insights from a connected world

Hierarchical edge bundles: Visualization of adjacency relations in hierarchical data

Shareflow: A visualization tool for information diffusion in social media

Geometry-based edge clustering for graph visualization

Motif clustering and overlapping clustering for social network analysis

Network visualization by semantic substrates

Visual social network analytics for relationship discovery in the enterprise

Modeling flickr communities through probabilistic topic-based analysis

Introducing fireant: A freeware, multiplatform social media data-analysis tool

Visual-vm: A social network visualization tool for viral marketing

Saviz: Interactive exploration and visualization of situation labeling classifiers over crisis social media data

Epigrass: a tool to study disease spread in complex networks

Proceedings of the 22nd International World Wide Web Conference

Visual analysis of topic competition on social media

Studying cooperation and conflict between authors with history flow visualizations

Whisper: Tracing the spatiotemporal process of information diffusion in real time

Visualizing information diffusion and polarization with key statements

Quantitative study of music listening behavior in a social and affective context

Multi-source-driven asynchronous diffusion model for video-sharing in online social networks

D-map: Visual analysis of ego-centric information diffusion patterns in social media

Browser extension for fake news detection

Pop: Bursting news filter bubbles on twitter through diverse exposure, CSCW '19

Tweety holmes: A browser extension for abusive twitter profile detection

Feedreflect: A tool for nudging users to assess news credibility on twitter

DiVA: A Scalable, Interactive and Customizable Visual Analytics Platform for Information Diffusion on Large Networks 111:23 Social Computing, CSCW '18

Topicflow: Visualizing topic alignment of twitter data over time

Tweetviz: Visualizing tweets for business intelligence

Exploring network structure, dynamics, and function using networkx

Pajek-analysis and visualization of large networks

Ask-graphview: A large scale graph visualization system

The art and science of dynamic network visualization

Dynetvis: A system for visualization of dynamic networks

Visualizing large-scale and high-dimensional data

Effective visualization of information diffusion process over complex networks

Visualizing data using t-sne

The gleamviz computational tool, a publicly available software to explore realistic epidemic spreading scenarios at the global scale

A survey on information diffusion in online social networks: Models and methods

def __init__(self, G, seeds=None, fraction_infected=None, iterations=10): """ G -> NetworkX graph. seeds -> An array of actual labels of the seed nodes fraction_infected -> float[0,1] iterations -> Max iterations for a single simulation Returns -> null """ def run_model(self): """ Return the iterations in the same format as NDlib does. Returns -> List of Dictionary

The System Usability Scale was a metric created by John Brooke in 1996 as a "quick and dirty" way to measure the usability of products [21] and has been used extensively to test and evaluate numerous systems and applications. The scale consists of a ten-item questionnaire: (1) I think I

I found the system unnecessarily complex

I thought the system was easy to use

I think that I would need the support of a technical person to be able to use this system

I found the various functions in this system were well integrated

I thought there was too much inconsistency in this system

I would imagine that most people would learn to use this system quickly

I found the system very cumbersome to use

I felt very confident using the system

The SUS score presents users with five options for each question, where "Strongly Disagree" has a score of 1 and "Strongly Agree" has a score of 5, and the others are uniformly distributed between these two. Once the user fills up the SUS score form, the score is calculated as follows: (1) For every odd-numbered question

Sum up the new values and multiply them by 2.5. (4) The final value is between 0and 100 and gives us the SUS score for one user. Here, response towards the "Strongly Agree" area is considered suitable for the odd-numbered statements (meant to be positive for the system), while for the even-numbered statements (meant to be negative for the system)

Steps to Setup and Start DiVA Before setting up DiVA locally, the users are required to set up an app on the Google Developer Console 29 . The system requires a Python-3 environment. The user can install the required python modules through the command pip install -r requirements.txt, and run the tool through flask run. Once the server starts

The users can upload their algorithm in the form of a python script. The iterations that form the output of the template code should form a list of python dictionaries, where each dictionary corresponds to an iteration and indicates the change in status of each node after each iteration. The dictionary only needs to specify the states of those nodes that have undergone a state change since the previous iteration. Furthermore, the dictionaries must also report the total number of nodes in each state at each iteration. An example code and output is shown below (Code A). import networkx as nx import json import pandas as pd import app # Other imports needed ... class Model():