key: cord-0522147-uojupw29 authors: Schreiber, Andreas title: Visualization of Contributions to Open-Source Projects date: 2020-10-17 journal: nan DOI: nan sha: 0f234726dd982623be010bc1c68235f8ccb8a773 doc_id: 522147 cord_uid: uojupw29

We want to analyze visually to what extent team members and external developers contribute to open-source projects. This gives a high-level impression of the collaboration in those projects. We achieve this by recording the provenance of the development process and applying graph drawing to the resulting provenance graph. Our graph drawings show which developers jointly changed the same files, and to what extent, which we demonstrate for Germany's COVID-19 exposure notification app 'Corona-Warn-App'.

In open-source projects, team composition and the development process are transparent and traceable, which is one of the advantages of the open-source model [11]. Understanding the patterns and characteristics of open-source projects, where sometimes many developers with different roles [26] work together, is an important research question, especially for projects of high public interest.

The COVID-19 pandemic [3] raises challenges for scientists of many disciplines. Computer scientists and software developers help to fight the pandemic with software systems, which must be developed under time pressure [2]. For example, apps for mobile devices that support contact tracing of infected persons are useful to identify local COVID-19 hot spots and to find other persons who are potentially infected, too.

We focus on Germany's exposure notification app Corona-Warn-App (CWA; see Section 4). For the CWA, we want to analyze and see visually to what extent team members and external contributors contributed to the various sub-projects of CWA on GitHub. Our method is to record the provenance of software development processes [13, 28] and store it according to a standard provenance data model.
Technically, we do repository mining to extract provenance and store it as a labeled property graph in a graph database. We query the graph either for information that answers our research questions directly or for parts of the graph that we visualize with "standard" graph drawing.

We describe the contributions and the emerging results of our work as follows:

• A brief description of the provenance of software development processes, with a focus on open-source processes that use the version control system git (Section 2).
• An overview of how we draw graphs that visually show contributions by developers with different roles (Section 3).
• As an example towards a user study, graph drawings for the Corona-Warn-App (Section 4).

Provenance can be expressed in many formats. We focus on the W3C standard PROV [14], which defines the provenance data model PROV-DM [15]. The core structure of PROV-DM relies on the definition of the model class elements entities (Entity), activities (Activity), and agents (Agent) that are involved in producing a piece of data or an artifact, and on definitions of relations that relate these class elements, such as wasGeneratedBy, wasAssociatedWith, wasAttributedTo, and used. Each of the class elements and relations can have additional attributes.

The provenance of an entity (e.g., a software artifact) is a directed acyclic graph (DAG). Since all nodes and edges of this graph have defined semantics, the provenance graph is a specific knowledge graph. The provenance graph can be stored in graph databases as a labeled property graph.

To analyze software development processes, we extract retrospective provenance [10] from repositories and store it in a graph database for further analysis (Figure 2) [22]. To extract provenance from git-based projects, we use tools that crawl the git repositories and additional information, such as issues or pull requests (Git2PROV [4, 25] and GitHub2PROV [19]).
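To make the mapping from git history to PROV-DM more concrete, the following is a minimal sketch in Python. It is not the mapping implemented by Git2PROV or GitHub2PROV; the names `Node`, `ProvGraph`, and `add_commit`, the identifier scheme, and the simplification that every changed file already existed in the parent commit are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    label: str  # PROV class: "Entity", "Activity", or "Agent"
    ident: str  # hypothetical identifier scheme, e.g. "commit:<sha>"


@dataclass
class ProvGraph:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, relation, target)

    def relate(self, source, relation, target):
        """Add a directed, labeled edge and register both end nodes."""
        self.nodes.update((source, target))
        self.edges.add((source, relation, target))


def add_commit(graph, sha, author, changed_files, parent_sha=None):
    """Map one commit to PROV elements (simplified, assumed mapping)."""
    commit = Node("Activity", f"commit:{sha}")
    agent = Node("Agent", f"person:{author}")
    graph.relate(commit, "wasAssociatedWith", agent)
    for path in changed_files:
        version = Node("Entity", f"file:{path}@{sha}")
        graph.relate(version, "wasGeneratedBy", commit)
        graph.relate(version, "wasAttributedTo", agent)
        if parent_sha is not None:
            # Assumption: the file existed in the parent commit.
            previous = Node("Entity", f"file:{path}@{parent_sha}")
            graph.relate(commit, "used", previous)
    return graph


graph = ProvGraph()
add_commit(graph, "a1", "alice", ["README.md"])
add_commit(graph, "b2", "bob", ["README.md"], parent_sha="a1")
```

Because every node carries a label and every edge a relation name, such a structure corresponds directly to a labeled property graph that could be loaded into a graph database.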
To analyze provenance graphs, many visual and analytical methods exist, including graph summarization [12, 24] and visual exploration [27]. As an example, we illustrate querying and using the provenance graph to answer the question: "Which files have commits by team members as well as external contributors?"

We generate a CYPHER query that adds information about contributors' roles. We retrieve member information via the GitHub API and store it in Python lists of team members and external contributors, which we insert into a CYPHER template. This CYPHER query creates new directed relations between persons (Agent) and files (Entity); for example, team members are connected to files by a CONTRIBUTES_TO relation whose property role identifies them as team members.

We visualize parts of the property graph that is derived from the provenance graph. We use a graph visualization that is readable and faithful [16, 17]. Using a Python script, we export the relevant nodes and edges from NEO4J and store them in intermediate files, specifically CSV, JSON, and GraphML files, which we import into graph drawing software (Figure 3). In the following, we use GEPHI [1] to draw our graphs.

During querying and exporting for visualization, we map the property graph as follows:

• The PROV elements entities (Entity, i.e., files) and agents (Agent, i.e., contributors) become graph nodes with two distinct colors.
• The relations CONTRIBUTES_TO become edges, whose color depends on the property role.

For the coloring [9], we choose distinct colors from two different qualitative color schemes generated by COLORBREWER [7]. Nodes use colors from the "3-class Set2" scheme: files are green and contributors are orange. Edges use colors from the "3-class Set1" scheme: contributions from team members are blue and contributions from external contributors are red. While the chosen colors are 'print-friendly', they are not safe regarding color blindness. The size of a node is proportional to its degree.
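The step of inserting Python lists of contributors into a CYPHER template can be sketched as follows. The query text is hypothetical and not the query used in this work: the matched `wasAttributedTo` pattern, the `name` property, the role value `"team"`, and the names `CYPHER_TEMPLATE` and `build_query` are assumptions. Only the node labels Agent and Entity, the relation CONTRIBUTES_TO, and its property role are taken from the text.

```python
import json

# Hypothetical template: assumes that file entities are attributed to
# persons and that persons carry a `name` property.
CYPHER_TEMPLATE = """
MATCH (person:Agent)<-[:wasAttributedTo]-(file:Entity)
WHERE person.name IN {names}
MERGE (person)-[:CONTRIBUTES_TO {{role: {role}}}]->(file)
"""


def build_query(names, role):
    """Fill the template with a list of contributor names and a role.

    json.dumps renders the list and the role as double-quoted literals,
    which are also valid Cypher list and string literals.
    """
    return CYPHER_TEMPLATE.format(
        names=json.dumps(sorted(names)),
        role=json.dumps(role),
    ).strip()


team_query = build_query(["bob", "alice"], "team")
print(team_query)
```

When the query is sent through a Neo4j driver, passing the name list as a query parameter instead of formatting it into the query string would be a possible alternative that avoids quoting issues.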
In our current approach, we generate two drawings for each project: one where we scale the node sizes according to the in-degree of file nodes and a second one where we scale according to the out-degree of contributors. For the layout, we experimented with layout algorithms that are implemented in GEPHI, such as Fruchterman-Reingold [5], the algorithms that come with GRAPHVIZ [6], and ForceAtlas2 [8]. See Figure 4 for an example of a graph drawing for a relatively small project using the ForceAtlas2 layout algorithm.

The Corona-Warn-App (CWA) was developed in a short time frame: development started in April 2020 and the app was released on 16 June 2020 for Android and iOS. CWA is developed by SAP and Telekom using an open development process and is publicly available from 12 repositories. CWA has a decentralized architecture, accompanied by centrally managed Java-based server applications that distribute findings about infected users and store test results uploaded by the laboratories.

We selected four of the CWA projects for visualization, for which we stored the provenance in NEO4J. These projects differ in their project statistics regarding the number of files in the repository, the number of contributing developers, the number of commits, and the number of files where both team members and external developers made changes, all of which leads to different numbers of nodes and edges in the graph drawings (Table 1).

For each project, we generate two graph drawings with GEPHI as described in Section 3: one where we scale node sizes proportionally to the in-degree of file nodes (see Figures 5a and 6a) and a second one where we scale proportionally to the out-degree of contributors (see Figures 5b and 6b).

(a) Entity In-Degree: Size of nodes according to in-degree of nodes that represent files.
(b) Agent Out-Degree: Size of nodes according to out-degree of nodes that represent contributors.
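The two scaling variants and the export to an intermediate file can be illustrated with a small, self-contained Python sketch that uses only the standard library. It is not the export script used for the drawings; the sample edges, the scale factor, and the names `degree_sizes` and `to_graphml` are assumptions.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical CONTRIBUTES_TO edges: (contributor, file, role).
edges = [
    ("alice", "README.md", "team"),
    ("bob", "README.md", "external"),
    ("alice", "app.py", "team"),
]


def degree_sizes(edges, scale=4.0):
    """Return node sizes proportional to file in-degree and contributor out-degree."""
    in_degree = Counter(dst for _, dst, _ in edges)
    out_degree = Counter(src for src, _, _ in edges)
    file_sizes = {node: scale * d for node, d in in_degree.items()}
    contributor_sizes = {node: scale * d for node, d in out_degree.items()}
    return file_sizes, contributor_sizes


def to_graphml(edges):
    """Serialize the edges as a directed GraphML graph with a `role` edge attribute."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    ET.SubElement(root, "key", {
        "id": "role", "for": "edge",
        "attr.name": "role", "attr.type": "string",
    })
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for node_id in sorted({n for src, dst, _ in edges for n in (src, dst)}):
        ET.SubElement(graph, "node", id=node_id)
    for src, dst, role in edges:
        edge = ET.SubElement(graph, "edge", source=src, target=dst)
        ET.SubElement(edge, "data", key="role").text = role
    return ET.tostring(root, encoding="unicode")


file_sizes, contributor_sizes = degree_sizes(edges)
graphml = to_graphml(edges)
```

In this sketch, a file changed by two contributors is drawn twice as large as a file changed by one, and the `role` attribute remains available to the drawing software for edge coloring.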
In our graph drawings, typical patterns are visible: team members and external contributors work collaboratively on many files. Because the drawings are based on provenance data, the interpretation is that over the course of development many files were changed by developers with different roles, while a small number of developers made most of the changes. More detailed interpretations and studies of graph drawing metrics for faithfulness and readability are ongoing work.

(a) Entity In-Degree: Size of nodes according to in-degree of nodes that represent files.
(b) Agent Out-Degree: Size of nodes according to out-degree of nodes that represent contributors.

There are many tools for dynamic history visualization of repository changes over time. A widely used tool is GOURCE, which generates movies that show changed files and developer activities. This is different from our approach, since we visualize "condensed" information about the development history that is stored in the provenance data. Especially for visualizing social interaction in open-source software projects, Ogawa et al. [18] use an intuitive, time-series, interactive summary view of the social groups that form, evolve, and vanish during the entire lifetime of the project.

We presented graph drawings that show visually how team members and external contributors worked on the same files in open-source projects over the course of development. Since our goal is a better understanding of such development patterns, the foremost future work is to conduct user studies to evaluate readability and faithfulness. The graph drawings can surely be improved in many ways, for example with other layouts, color schemes (especially to support color blindness), transparency, or shapes.
[1] Gephi: An open source software for exploring and manipulating networks
[2] Normalising the "new normal": Changing tech-driven work practices under pandemic time pressure
[3] Keep up with the latest coronavirus research
[4] Git2PROV: Exposing version control system content as W3C PROV
[5] Graph drawing by force-directed placement. Software: Practice and Experience
[6] An open graph visualization system and its applications to software engineering. Software: Practice and Experience
[7] ColorBrewer.org: An online tool for selecting colour schemes for maps
[8] ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software
[9] A study of colormaps in network visualization
[10] Retrospective provenance without a runtime provenance recorder
[11] Factors affecting the success of open source software
[12] Aggregation by provenance types: A technique for summarising provenance graphs
[13] The provenance of electronic data
[14] Provenance: An Introduction to PROV
[15] PROV-DM: The PROV data model
[16] On the faithfulness of graph visualizations
[17] Towards faithful graph visualizations
[18] Visualizing social interaction in open source software projects
[19] GitHub2PROV: Provenance for supporting software project management
[20] Graph Visualizations of Corona-Warn-App-Repositories (Gephi)
[21] Neo4j Database Dump of Corona-Warn-App Repository Provenance Graphs
[22] Modelling knowledge about software processes using provenance graphs and its application to git-based version control systems
[23] Towards automated, provenance-driven security audit for git-based repositories: Applied to Germany's Corona-Warn-App
[24] Efficient aggregation for graph summarization
[25] onyame/git2prov: Improved error handling
[26] Unveiling elite developers' activities in open source projects
[27] Visual exploration of multivariate graphs
[28] Provenance of software development processes

We plan to apply our methods to projects other than CWA, especially to huge projects with a very long development history.
We plan to compare different projects in which the proportion of regular team members and external contributors differs. We are already working on using the provenance data for non-visual analytics of open-source projects. For example, to investigate whether vulnerabilities are introduced by external contributors (e.g., via pull requests), we apply static code analysis to revisions in the development history that are determined from the provenance data [23].