key: cord-0949127-utvu8ckg
authors: Braga, L.; Feingenbaun, D.
title: Assessing Global Covid-19 Cases Data through Compositional Data Analysis(CoDa)
date: 2020-12-19
journal: nan
DOI: 10.1101/2020.12.17.20248424
sha: 540e643df9caecfc9f86456c2f6e7e003a3a59e9
doc_id: 949127
cord_uid: utvu8ckg

Background: Covid-19 cases data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming the censorship with which some countries restrict publication. Methods: This work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach because Covid-19 cases data are compositional in nature. Under this methodology, three attributes were selected for each country: cumulative number of deaths (D); cumulative number of recovered patients (R); and present number of patients (A). Results: After the operation called closure, with c=1, a ternary diagram and log-ratio plots, as well as compositional statistics, are presented. Cluster analysis is then applied, splitting the countries into discrete groups. Conclusions: This methodology can also be applied to other data sets, such as countries, cities, provinces or districts, in order to help authorities and governmental agencies improve their actions against a pandemic.

The purpose of this work is to explore the relationship between the proportions of the attributes by applying Compositional Data Analysis (CoDa) 2 . Three attributes were selected for each country: cumulative number of deaths (D); cumulative number of recovered patients (R); and present number of patients (A). After the operation called closure, with c=1 (acomp scale), a ternary diagram and log-ratio plots, as well as basic compositional statistics, were obtained. Cluster analysis was then applied, splitting the countries into groups. The results must be understood as descriptive epidemiological assumptions; some associative patterns are also suggested 3 , and they have an interpretation coherent with the spread of the pandemic.

Compositional data carry information about relative magnitudes. Dependency among the variables of a composition can be examined in real space by analyzing the covariance structure of the log-ratios. The compositional approach consists of a change of representation from the original sample space, the simplex S^D, to a new sample space, namely the (D-1)-dimensional real space.

Components are the individual parts x_i of the composition x. The sum c of the amounts of all components is called the total; usually c = 1:

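$$S^D = \left\{ x = [x_1, x_2, \ldots, x_D] \;\middle|\; x_i > 0,\ i = 1, 2, \ldots, D;\ \sum_{i=1}^{D} x_i = c \right\}$$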

The closure of a set is defined by:
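In the usual CoDa notation, the closure of a vector of positive parts to the constant c can be written as $\mathcal{C}(x) = \left[ c\,x_1 / \sum_{i=1}^{D} x_i, \ldots, c\,x_D / \sum_{i=1}^{D} x_i \right]$. A minimal Python sketch of this operation is given below; the function name `closure` and the counts are illustrative assumptions and are not taken from the study's data or from the R package used in this work.

```python
def closure(parts, total=1.0):
    """Rescale strictly positive parts so that they sum to `total` (the constant c)."""
    if any(p <= 0 for p in parts):
        raise ValueError("closure expects strictly positive parts")
    s = sum(parts)
    return [total * p / s for p in parts]


# Hypothetical counts (deaths, recovered, active) for one country.
composition = closure([50, 900, 50])
print(composition)
```

With c=1 the closed vector can be read directly as the proportions of deaths, recovered and active cases among the three categories.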

It has been proved that the simplex, with the operations of perturbation, powering and the Aitchison inner product, has a (D-1)-dimensional Euclidean vector space structure. Therefore, through an isometric transformation, virtually anything can be translated from real vectors to compositions and vice versa 4 ; examples are the Centered Log-Ratio (clr) and the Isometric Log-Ratio (ilr) transformations. The ilr is an isometric transformation between the simplex S^D and the (D-1)-dimensional real space. A set of (D-1) orthonormal directions is a basis, called a balance basis. The ilr transformation of a composition is its projection onto this basis 5 .
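The two transformations can be illustrated for a three-part composition. The sketch below assumes the standard clr definition, $\mathrm{clr}(x)_i = \ln x_i - \frac{1}{D}\sum_{j=1}^{D} \ln x_j$, and one particular balance basis for the ilr (first part against second, then both against the third); the choice of basis and the function names are assumptions made for illustration, not the basis used in this study.

```python
import math


def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def ilr3(x):
    """ilr coordinates of a 3-part composition for one assumed balance basis."""
    x1, x2, x3 = x
    z1 = math.sqrt(1 / 2) * math.log(x1 / x2)
    z2 = math.sqrt(2 / 3) * math.log(math.sqrt(x1 * x2) / x3)
    return [z1, z2]


def norm(v):
    """Euclidean norm of a real vector."""
    return math.sqrt(sum(c * c for c in v))


x = [0.05, 0.9, 0.05]  # hypothetical closed composition
print(clr(x), ilr3(x))
```

The clr coordinates sum to zero, and the two coordinate vectors have the same Euclidean norm, which is what isometry means in practice.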

The arithmetic mean and the variance or standard deviation of the individual parts are not suitable as measures of central tendency and dispersion, because the parts of a composition are linked to each other and are multivariate by nature. Taking this into account, the center (cen), or compositional mean, of a data set X with N observations and D parts is itself a composition, namely the closed geometric mean. It is defined as:
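In the standard formulation the center is $\mathrm{cen}(X) = \mathcal{C}[g_1, \ldots, g_D]$, with $g_j = \left( \prod_{n=1}^{N} x_{nj} \right)^{1/N}$ the geometric mean of part j over the N observations. A short Python sketch under that assumption follows; the data rows are hypothetical.

```python
import math


def closure(parts, total=1.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]


def center(rows):
    """Closed vector of column-wise geometric means of a compositional data set."""
    n = len(rows)
    d = len(rows[0])
    geo_means = [
        math.exp(sum(math.log(row[j]) for row in rows) / n) for j in range(d)
    ]
    return closure(geo_means)


# Two hypothetical compositions [P_D, P_R, P_A].
rows = [[0.1, 0.8, 0.1], [0.4, 0.2, 0.4]]
cen = center(rows)
print(cen)
```

For these two rows the geometric means are proportional to 0.2, 0.4 and 0.2, so the center is [0.25, 0.5, 0.25] up to rounding, which differs from the arithmetic mean [0.25, 0.5, 0.25] only by coincidence of the symmetric example; in general the two do not agree.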

The variation matrix T describes the dispersion in a compositional data set. It has D^2 components, each defined as:
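In the usual definition each element is the variance of a pairwise log-ratio, $\tau_{ij} = \mathrm{var}\left( \ln (x_i / x_j) \right)$. The Python sketch below assumes that definition and uses the sample variance (divisor N-1), which is a choice made here for illustration; the data rows are hypothetical.

```python
import math
import statistics


def variation_matrix(rows):
    """Matrix of variances of the pairwise log-ratios ln(x_i / x_j)."""
    d = len(rows[0])
    return [
        [
            statistics.variance([math.log(row[i] / row[j]) for row in rows])
            for j in range(d)
        ]
        for i in range(d)
    ]


# Hypothetical data in which the first two parts are exactly proportional.
rows = [[1, 2, 7], [2, 4, 4], [3, 6, 1]]
T = variation_matrix(rows)
for line in T:
    print(line)
```

Because log-ratios do not depend on the total, the rows need not be closed beforehand. The element for the first two parts is zero here, since their ratio is constant across the observations.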

The smaller an element of the variation matrix is, the better the proportionality between the two corresponding components.

The grouping of samples presupposes the definition of two rules: the distance formula between two samples and an agglomeration criterion. This subject is well known in the Euclidean case, where several distances are possible: Euclidean, Manhattan, Mahalanobis, Minmax. Most distance measures for multivariate data sets can be generalized to the compositional formalism 6 .

The Aitchison distance formula is:
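One common way of writing it is $d_a(x, y) = \sqrt{ \frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \left( \ln \frac{x_i}{x_j} - \ln \frac{y_i}{y_j} \right)^2 }$, which equals the Euclidean distance between the clr-transformed compositions. The Python sketch below assumes this standard form and checks numerically that the two expressions agree; the compositions are hypothetical.

```python
import math


def clr(x):
    """Centered log-ratio transformation."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between clr coordinates."""
    return math.dist(clr(x), clr(y))


def aitchison_distance_pairwise(x, y):
    """Same distance written with all pairwise log-ratios."""
    d = len(x)
    total = sum(
        (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
        for i in range(d)
        for j in range(d)
    )
    return math.sqrt(total / (2 * d))


x = [0.05, 0.9, 0.05]
y = [0.2, 0.6, 0.2]
print(aitchison_distance(x, y), aitchison_distance_pairwise(x, y))
```

The distance depends only on ratios between parts, so multiplying a composition by a constant leaves it unchanged.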

In other cases, it is necessary to adapt the formulas by using the Centered Log-Ratio transformation (clr), as is done for the Euclidean distance in this work, or the Isometric Log-Ratio transformation (ilr).

Once the distance formula has been defined, the next step is the definition of the agglomeration criterion. In addition to Ward's criterion 7 there are, among others, the single, average and complete linkage criteria.

In Ward's method 7 , for a given group, the distance of each external object to the group average is calculated, and the object that causes the smallest increase in the sum of distances is incorporated. The linkage criteria make it possible to calculate the distance between elements and also between groups of elements. The single option considers that the distance between two groups (unitary or not) is the shortest possible distance between an element in one group and an element in the other. The average option calculates the average distance over all possible pairs. Furthermore, the complete option considers that the distance between groups corresponds to the greatest possible distance between an element of one group and an element of the other.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. This version posted December 19, 2020.
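The single, average and complete options described above can be sketched in a few lines of Python. The points stand for samples already expressed in real coordinates (for instance clr coordinates); the function names and the toy groups are assumptions made for illustration.

```python
import math
from itertools import product


def pair_distances(group_a, group_b):
    """Euclidean distances for every pair with one element from each group."""
    return [math.dist(p, q) for p, q in product(group_a, group_b)]


def single_linkage(group_a, group_b):
    """Shortest distance between an element of one group and one of the other."""
    return min(pair_distances(group_a, group_b))


def average_linkage(group_a, group_b):
    """Average distance over all possible pairs."""
    distances = pair_distances(group_a, group_b)
    return sum(distances) / len(distances)


def complete_linkage(group_a, group_b):
    """Greatest distance between an element of one group and one of the other."""
    return max(pair_distances(group_a, group_b))


group_a = [(0.0, 0.0), (1.0, 0.0)]
group_b = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(group_a, group_b))
print(average_linkage(group_a, group_b))
print(complete_linkage(group_a, group_b))
```

For these two toy groups the pairwise distances are 3, 5, 2 and 4, so the three criteria give 2, 3.5 and 5 respectively.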

Several packages implement the CoDa theory. The package "Compositions" 8 for "R" 9 was the one adopted.

In this study the data sources are WHO, CDC, ECDC, NHC, DXY, 1point3acres, Worldometers.info, BNO, the COVID Tracking Project (testing and hospitalizations), State and National Government Health Departments, and local media reports. A layer in the package ArcGis 10 was created and is maintained by the Center for Systems Science and Engineering (CSSE) at the Johns Hopkins University (CSSE 2020). This feature layer is supported by the ESRI Living Atlas team, JHU APL and JHU Data Services. The layer is open to the public and free to share. The cases dataset was downloaded from that repository on the 6th of September, 2020, and includes the following attributes: Country Name, Deaths (D), Recovered (R) and Active (A) patients. Note that the second and the third are cumulative figures up to that day and the last one is the value available on that day. The raw cases data are displayed in Appendix I and the closure (acomp scale) in Appendix II; both are available at http://dx.doi.org/10.17632/wt7nd5jv6s.1

For each frequency (Deaths, Recovered, Active), proportions of the population were calculated in the acomp scale; entries with null values were converted to Below Detection Limit (BDL). In the acomp scale the original attributes are named as follows:

P_D : cumulative relative frequency of deaths (8)
P_R : cumulative relative frequency of recovered (9)
P_A : relative frequency of active* (10)
P_C : frequency (cumulative and on the 6th of September) of confirmed** (11)

* People still being treated
** People who caught Covid-19

The composition [P_D, P_R, P_A] shows the internal relationships between the confirmed categories.

The results are divided into two categories: compositional descriptive and compositional associative.

In this case the closure is performed over the three components and the values represent relative proportions among them. The null counts are treated as "Below Detection Limit"; therefore, to input the data into the package "Compositions" they must be converted to -1. These cases are depicted as red lines in the ternary diagrams below.

Table 1 Center
P_D 0.03025174

This result shows that P_R is the main overall feature. The components P_D and P_R are more proportional to each other than P_D and P_A or P_R and P_A (formula (5)), which is understandable because Active patients will move into the category of either Recovered or Dead (A => R or D).

The following analysis shows the relationship between the proportions in the composition [P_D, P_R, P_A] through the sign of the log-ratio between them. The results are displayed in non-centered ternary diagrams in order to highlight the relationship between both the countries and the attributes. In the diagram, the closer a country is to a vertex, the greater the proportion of the attribute associated with that vertex in relation to the other attributes.

Fig. 1 Plots of samples against a log-ratio

The following countries are not depicted in this figure: Bhutan, Cambodia, Djibouti, Dominica, Eritrea, Grenada, Holy See, Laos, Mongolia, Saint Kitts and Nevis, Saint Lucia, Saint Vincent and the Grenadines, Serbia, Seychelles, Sweden and Timor-Leste. The reason is that they presented null values for either Deaths or Active patients, so that P_D, P_R or P_A = 0.

This plot (Fig. 1a and b) shows how ineffective each country has been at saving lives, measured as deaths against recoveries. At the date of the evaluation, the Netherlands and the United Kingdom had the greatest proportion of deaths over recoveries, as well as Qatar.

This plot (Fig. 1c and d) shows each country's loss of lives among infected people at the date of the evaluation. In the Netherlands the past cases are pushing the log-ratio up: the country had a total of 6277 deaths against 116 active patients.

This plot (Fig. 1e and f) shows how effective each country has been at recovering infected people at the date of the evaluation. The United Kingdom presented a very low proportion of recovered patients and the largest proportion of active patients.

The bisectors in the ternary diagrams determine the order relations of the samples. For instance:

P_D > P_R => samples lying between the bisector through P_A and the edge P_A P_D
P_D > P_A => samples lying between the bisector through P_R and the edge P_R P_D
P_R > P_A => samples lying between the bisector through P_D and the edge P_D P_R
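The same order relations can be read from the sign of a log-ratio, since ln(P_i / P_j) > 0 exactly when P_i > P_j. The ratio of two raw counts equals the ratio of the corresponding closed proportions, because closure divides both by the same total. The Python sketch below uses the Netherlands figures quoted earlier in the text; the function name and labels are assumptions made for illustration.

```python
import math


def order_relation(count_a, count_b, name_a, name_b):
    """Read the order between two parts from the sign of their log-ratio."""
    log_ratio = math.log(count_a / count_b)
    if log_ratio > 0:
        return f"{name_a} > {name_b}"
    if log_ratio < 0:
        return f"{name_a} < {name_b}"
    return f"{name_a} = {name_b}"


# Netherlands figures quoted in the text: 6277 deaths against 116 active patients.
netherlands = order_relation(6277, 116, "P_D", "P_A")
print(netherlands)
```

A positive log-ratio places the sample on the P_D side of the bisector through P_R, in agreement with the second relation listed above.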

The final step of the analysis provided insight into how the groups of countries are formed. From a purely mathematical criterion it was possible to uncover order relations between the countries. A hierarchical clustering algorithm following the Ward criterion was applied to the subcomposition [P_D, P_R, P_A], resulting in 3 different groups for height=10 in the hierarchical tree. The Euclidean distance was calculated after a clr transformation of the acomp-transformed data.
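The procedure can be sketched in plain Python as a naive agglomerative loop on clr coordinates. At each step the sketch merges the two groups whose union gives the smallest increase in the within-group sum of squares, which is one standard statement of Ward's criterion. It is an illustration under stated assumptions, not the R routine used in this work: the tree is cut at a requested number of groups rather than at a height, and the six compositions are hypothetical.

```python
import math


def clr(x):
    """Centered log-ratio transformation."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def ward_clusters(points, k):
    """Greedy Ward agglomeration of real vectors until k groups remain."""
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na = len(clusters[a]["members"])
                nb = len(clusters[b]["members"])
                gap = math.dist(clusters[a]["centroid"], clusters[b]["centroid"])
                # Increase in within-group sum of squares if a and b are merged.
                cost = na * nb / (na + nb) * gap ** 2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        na, nb = len(ca["members"]), len(cb["members"])
        merged = {
            "members": ca["members"] + cb["members"],
            "centroid": [
                (na * u + nb * v) / (na + nb)
                for u, v in zip(ca["centroid"], cb["centroid"])
            ],
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c["members"]) for c in clusters)


# Hypothetical compositions [P_D, P_R, P_A]; not taken from the study's data.
samples = [
    [0.02, 0.18, 0.80], [0.03, 0.17, 0.80],  # active dominant
    [0.02, 0.90, 0.08], [0.03, 0.89, 0.08],  # recovered dominant
    [0.10, 0.88, 0.02], [0.12, 0.86, 0.02],  # more deaths than active
]
groups = ward_clusters([clr(s) for s in samples], k=3)
print(groups)
```

Working on clr coordinates means that the Euclidean distances inside the loop are Aitchison distances between the original compositions.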

The groups are shown in different colors. Table 3 lists the countries in each group. The clustering is interpreted in terms of the order relations between the different categories.

The gray group is closer to P_A, the yellow group is closer to P_R, and the black group lies between P_D and P_R. The three groups reproduce the categories seen in the ternary diagram, with the additional information on the log-ratios. The gray group has lower proportions of dead and recovered. The yellow group has intermediate levels of these proportions, and the black group is composed of dominant proportions of dead or recovered. The lists in the table above simply show, vis-à-vis the past incidence, the potential stages of the pandemic.

Compositional Data Analysis (CoDa) theory is the natural choice for uncovering the hidden aspects of the pandemic cases. Developed and wealthy countries surprised the world, as their populations were heavily affected by the virus, with large numbers of deaths and sick people. However, there are restrictions on the data, as they come from different sources and, as with the Spanish flu, are frequently censored 11 . In spite of those drawbacks, CoDa was able to unveil a few trends in the pandemic.

The levels of the proportions of deaths, recovered and active cases revealed unexpected outliers, as well as common behaviour among countries. The ternary diagram, the log-ratio plots and the clustering suggested three categories of countries: (black) those that are recovery dominant but death subdominant, as on the 6th of September they had more deaths than active cases; (yellow) a very large group that is recovery dominant; and finally (gray) a group that is either recovery dominant and death subdominant or active dominant and death subdominant. No group is the good or the bad one; they simply show different characteristics, as observed in the log-ratio plots. The method can be applied to provinces in a country, or counties in a state, in order to support the planning of public action.

This study is a snapshot of the pandemic on the 6th of September; it does not make any inference about the rate or the aftermath of the first wave. The purpose was to show the ability of the CoDa theory to unveil the proportions between the three different types of cases: Deaths, Recoveries and Active. A dynamical study seems to be the natural sequel to this approach, as well as the production of inference models. This methodology can be applied to further data sets such as countries, cities, provinces or districts in order to help authorities and governmental agencies improve their actions to fight the pandemic 12 .

United States

Countries in this group have Deaths < Recovered < Active; see Appendix I for more details.

An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Disease

Compositional Data as a Methodological Concept

Biostatistics & epidemiology : a primer for health professionals

Isometric logratio transformations for compositional data analysis

Analyzing Compositional Data with R

Compositional data analysis with 'R' and the package 'compositions'. Geological Society

R: A language and environment for statistical computing. R Foundation for Statistical Computing

How the Horrific 1918 Flu Spread Across America. Smithsonian Magazine

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.