key: cord-0965014-ce6afoqv authors: van Lissa, Caspar J.; Stroebe, Wolfgang; vanDellen, Michelle R.; Leander, N. Pontus; Agostini, Maximilian; Draws, Tim; Grygoryshyn, Andrii; Gützgow, Ben; Kreienkamp, Jannis; Vetter, Clara S.; Abakoumkin, Georgios; Abdul Khaiyom, Jamilah Hanum; Ahmedi, Vjolica; Akkas, Handan; Almenara, Carlos A.; Atta, Mohsin; Bagci, Sabahat Cigdem; Basel, Sima; Kida, Edona Berisha; Bernardo, Allan B.I.; Buttrick, Nicholas R.; Chobthamkit, Phatthanakit; Choi, Hoon-Seok; Cristea, Mioara; Csaba, Sára; Damnjanović, Kaja; Danyliuk, Ivan; Dash, Arobindu; Di Santo, Daniela; Douglas, Karen M.; Enea, Violeta; Faller, Daiane Gracieli; Fitzsimons, Gavan J.; Gheorghiu, Alexandra; Gómez, Ángel; Hamaidia, Ali; Han, Qing; Helmy, Mai; Hudiyana, Joevarian; Jeronimus, Bertus F.; Jiang, Ding-Yu; Jovanović, Veljko; Kamenov, Željka; Kende, Anna; Keng, Shian-Ling; Thanh Kieu, Tra Thi; Koc, Yasin; Kovyazina, Kamila; Kozytska, Inna; Krause, Joshua; Kruglanksi, Arie W.; Kurapov, Anton; Kutlaca, Maja; Lantos, Nóra Anna; Lemay, Edward P.; Jaya Lesmana, Cokorda Bagus; Louis, Winnifred R.; Lueders, Adrian; Malik, Najma Iqbal; Martinez, Anton P.; McCabe, Kira O.; Mehulić, Jasmina; Milla, Mirra Noor; Mohammed, Idris; Molinario, Erica; Moyano, Manuel; Muhammad, Hayat; Mula, Silvana; Muluk, Hamdi; Myroniuk, Solomiia; Najafi, Reza; Nisa, Claudia F.; Nyúl, Boglárka; O’Keefe, Paul A.; Olivas Osuna, Jose Javier; Osin, Evgeny N.; Park, Joonha; Pica, Gennaro; Pierro, Antonio; Rees, Jonas H.; Reitsema, Anne Margit; Resta, Elena; Rullo, Marika; Ryan, Michelle K.; Samekin, Adil; Santtila, Pekka; Sasin, Edyta M.; Schumpe, Birga M.; Selim, Heyla A.; Stanton, Michael Vicente; Sultana, Samiah; Sutton, Robbie M.; Tseliou, Eleftheria; Utsugi, Akira; Anne van Breen, Jolien; Van Veen, Kees; Vázquez, Alexandra; Wollast, Robin; Wai-Lan Yeung, Victoria; Zand, Somayeh; Žeželj, Iris Lav; Zheng, Bang; Zick, Andreas; Zúñiga, Claudia; Bélanger, Jocelyn J. 
title: Using Machine Learning to Identify Important Predictors of COVID-19 Infection Prevention Behaviors During the Early Phase of the Pandemic date: 2022-03-09 journal: Patterns (N Y) DOI: 10.1016/j.patter.2022.100482 sha: 4429ba125f7e19ddc6ff8726387c497c8d7ab230 doc_id: 965014 cord_uid: ce6afoqv
Before vaccines for COVID-19 became available, a set of infection prevention behaviors constituted the primary means to mitigate the virus spread. Our study aimed to identify important predictors of this set of behaviors. Whereas social and health psychological theories suggest a limited set of predictors, machine learning analyses can identify correlates from a larger pool of candidate predictors. We used random forests to rank 115 candidate correlates of infection prevention behavior in a survey of 56,072 participants across 28 countries, administered in March-May 2020. The machine-learning model predicted 52% of the variance in infection prevention behavior in a separate test sample, exceeding the performance of psychological models of health behavior. Results indicated that the two most important predictors related to individual-level injunctive norms. Illustrating how data-driven methods can complement theory, some of the most important predictors were not derived from theories of health behavior, and some theoretically derived predictors were relatively unimportant.
In the absence of a vaccine or cure, virus containment depended on individual-level compliance with behaviors recommended by the World Health Organization. We used machine learning to identify the most important indicators of compliance, based on a large international psychological survey and country-level secondary data. The most important indicators were not the "usual suspects", such as personal threat of virus infection, but rather injunctive norms, namely the belief that one's community should engage in such behavior and that society should take restrictive virus containment measures.
People who tend to engage in infection prevention behaviors also appear to believe that general compliance is necessary to defeat the pandemic, which extends to endorsement of 'ought' norms and support for behavioral mandates. These results highlight the potential to intervene by shaping social norms and expectations.
Infection prevention behaviors were especially important in the early phase of the COVID-19 pandemic, between March and May 2020, when no vaccines were available. In this first phase of the pandemic, three infection prevention behaviors were recommended by most governments: frequent hand washing, social distancing, and self-quarantining [1]. The efficacy of these measures for curbing the virus depends on the extent to which individuals engage in these behaviors. The COVID-19 pandemic represented a public health emergency with rich social and system-level data available to evaluate compliance and to focus research and future policy interventions on the most important predictors of such behaviors. Although one approach might be to test whether a specific variable explains important variance in health behaviors, the present work applies machine learning to a large psychological dataset, which was assembled in the early phase of the pandemic and enriched with country-level societal data in order to consider a wider pool of candidate variables. Our primary aim was to identify the most important predictors of infection prevention behavior, given the available data; a secondary aim was to illustrate how inductive methods can help to inform crisis response.
Social and health psychology entered the pandemic with a large toolbox of personal, social, and societal-level theories that may all independently predict individual-level infection prevention behavior to some extent. These individual health theories each involve some overlapping and some distinct predictors.
However, when numerous disconnected studies use disparate research methods, levels of analysis, limited samples, and narrow contexts, it is difficult to compare the relative predictive utility of variables indicated by these theories. In other words, when any given study focuses only on the variables that fall within the scope of its theory, it is hard to tell how important the variables are, relative to other variables considered by other theories (or variables not considered at all). Machine learning is a more holistic methodology, as it can assess and compare a large number of potential predictors simultaneously, including theoretically relevant ones, and identify which predictors ultimately explain the most variance in the outcome measure of interest. The aim of this study is to use machine learning to identify the most important predictors of infection prevention behaviors during the early stages of the COVID-19 pandemic from a multinational, rapid-response survey. We combine multi-national survey data, country-level secondary database integration, and machine learning methods with the practical aim of identifying the most important predictors that could serve as targets for future research and behavioral interventions by governments and organizations such as the WHO. This method offers a holistic evaluation of numerous candidate predictor variables. The candidate variables cover different theoretical domains, so the results might speak to the relative importance of different theories as well as specific predictors. Moreover, the results of this inductive, exploratory approach might suggest promising avenues for future confirmatory research, to investigate direction of causality, and could support the allocation of scientific resources towards the most promising predictors of compliance in future crises that resemble the current pandemic. Results can also provide input for theory development or refinement [2].
Our study was conducted between March and May of 2020, that is, in the initial phase of the pandemic, several months before the first COVID-19 vaccine (Pfizer-BioNTech) received emergency use authorization from the US Food and Drug Administration in December of the same year. At the time, there was hope a future vaccine could bring an end to the pandemic, implying that behavioral measures were mainly an interim or short-term solution. However, by 2021, hopes surrounding vaccines had still not fully materialized, partly because the available vaccines waned in efficacy over time and across new virus strains, and because much of the global population remained unvaccinated (e.g., COVID-19 vaccine hesitancy has since become a major area of research [3, 4]). By winter 2022, with new virus strains, recurring lockdowns, and the return of behavioral restrictions, the infection prevention behaviors recommended during the initial period of our study remained highly relevant.
Machine learning can complement theory-driven approaches by identifying important determinants, or correlates, of a particular outcome, identifying blind spots in existing knowledge, and ranking predictors by their relative importance [2]. Machine learning also estimates predictive performance in new datasets and, thus, the generalizability of the results. Further, it includes checks and balances to prevent spurious findings (i.e., overfitting; see [5]). The random forests algorithm, in particular, is free from certain assumptions of regression/correlation analysis, namely the assumptions of linearity, absence of interactions, and normality of residuals. Random forests intrinsically capture non-linear associations and higher-order interaction effects, and can account for multilevel data: the clustering variable can be included as a predictor, which allows for relationships to differ across clusters (e.g., if measurement or associations differ between countries) [6].
Our approach incorporated both individual-level (psychological) predictors and country-level (societal) variables. To identify key individual-level predictors of infection prevention behaviors, at least during the initial phase of the pandemic, we launched a large-scale psychological survey in 28+ countries in the immediate weeks after the World Health Organization (WHO) declared COVID-19 a pandemic. The survey was designed with country-level database integration and machine learning in mind, and a separate team set out to perform machine learning analysis in isolation from any confirmatory analysis. The a priori objective was to recruit tens of thousands of survey respondents globally, to assess their attitudes towards society's prescriptions, and to examine how these factors relate to individual infection prevention behaviors. The survey provided individual-level variables, such as basic demographic characteristics (e.g., gender, age, education, religiousness), brief self-report measures of various psychological factors (e.g., subjective states and well-being, work and financial concerns, societal attitudes, COVID-relevant attitudes and beliefs), and individual infection prevention behaviors (e.g., hand washing, avoiding crowds).
Deductive research, or hypothesis-testing, is the predominant focus of contemporary behavioral research. It tends to focus on a relatively narrow set of theoretically derived variables, and the results revolve around statistical inference: whether the theoretical hypotheses are supported by significant or reliable effects. In deductive research, less emphasis is placed on comprehensiveness or breadth of candidate predictors. Relatedly, the relative importance of different predictors is often of secondary importance, as is the model's predictive performance.
Thus, although an advantage of deductive approaches is that they can be used to draw inferences about theoretical hypotheses, they also have specific limitations. These are particularly salient in the context of the COVID-19 pandemic. To allocate scientific resources effectively in a crisis, it is important to cast a wide net among potential predictors, across different theories, and even include under-theorized factors to unearth potential blind spots in the extant literature. Inductive research, that is, rigorous exploratory work that identifies reliable patterns in data, is more suited to these demands. In recent years, inductive research has been gaining traction as a technique to complement existing theories by identifying important omissions [2]. In particular, machine learning offers powerful new tools for systematic exploration that can identify relevant predictors and complex relationships that have eluded theoreticians [7].
Machine learning is an approach to data analysis that focuses on maximizing predictive performance. This involves the use of flexible models to find reliable patterns in data. Machine learning models can distill a large set of candidate variables down to the ones that are most important in predicting the outcome of interest, and also indicate the direction and shape of the marginal association between those predictors and the outcome. In a context where predictor variables are likely to be related to each other, machine learning is better suited to manage these complex relationships than, e.g., multiple regression. Moreover, it incorporates checks and balances to prevent spurious findings [5]. However, it is important to note that inductive and deductive approaches are interwoven, as the set of variables used as input for a machine learning analysis is typically based on theoretical considerations.
Thus, as we describe below, we included in our survey a large set of candidate individual- and societal-level indicators of infection prevention behavior that were of theoretical interest to our international group of psychology experts.
Infection control that relies on individual compliance with health recommendations constitutes a public good. The main characteristic of public goods (e.g., clean air) is that people can benefit from them even if they have not contributed to their production or purchase. This creates the temptation to free-ride on the contributions of others [8, 9]. The COVID-19 pandemic has some characteristics of a public goods dilemma, in that control of the virus can only be achieved if most members of society contribute to the effort [8, 9]. However, a pandemic also differs from many other public goods dilemmas, due to the immediate personal health threat of the virus: engaging in infection prevention behavior not only reduces the societal spread of the infection, it also lowers individual infection risk. Accordingly, individual-level psychological factors could predict infection prevention behavior even when individuals feel unobserved [10] [11] [12]. Thus, we might expect self-reported individual differences, such as perceived personal infection risk and vulnerability, to predict compliance.
Beyond its potential as a public goods dilemma, the COVID-19 pandemic is also a health emergency with profound social, economic and societal ramifications. In practical terms, millions of people were expected to lose their jobs, experience economic hardship, and suffer psychological strains as a result of the lockdowns or self-quarantining [13]. More generally, an international group of behavioral scientists proposed various other psychosocial factors that may predict responses to the COVID-19 pandemic [14], ranging from individuals' internal states to their societal attitudes and beliefs.
This necessitated research that comprehensively (re-)examined potential predictors of infection prevention behavior, with attention to the broad social, economic, and personal ramifications of the pandemic.
Our survey also included factors directly relevant to the domain of health behavior, such as those suggested by the Health Belief Model [15, 16]. According to the Health Belief Model, two conditions must be met to motivate people to engage in COVID-19 infection prevention behavior: They have to believe that they are at risk of contracting the virus, and that engaging in the recommended virus protection behaviors would be effective in reducing that risk [15]. A further assumption of this model is that the effect of perceived effectiveness of a health behavior will be moderated by the perceived costs of engaging in that behavior. If the behavior is too effortful, people might not adopt it, even if they think that doing so would be effective. A second relevant theory is the Theory of Planned Behavior (TPB) [17] [18] [19]. This more general psychological theory of behavior prediction posits that intentions to engage in a specific behavior would be predicted by three constructs: attitude towards the behavior (advantages and disadvantages), subjective norms (e.g., what is expected of me by important others), and perceived behavioral control (i.e., will I be able to do it).
Despite the potential relevance of health behavior theories, they illustrate the aforementioned tendency of deductive research to focus on a narrow set of theoretical constructs. Other potentially important predictors, not germane to the given theory, might be overlooked. In line with this narrow focus, models based on such theories typically explain limited variance in the outcome variable.
For example, a meta-analysis based on 185 independent tests of the TPB found that attitudes, subjective norms and perceived control explain 39% of the variance in intention, with intention accounting for 22% of variance in behavior [18]. Although this descriptive performance is perceived as relatively strong in the field of social science, it still leaves room for potential predictors from other research domains. Thus, rather than focus exclusively on variables that target the health behavior, the present analysis casts a wide net, by including psychological and societal factors that specifically pertain to the COVID-19 domain, as well as other factors whose relevance may generalize across domains.
We sought to distinguish important individual- and societal-level indicators of infection prevention behavior using random forests [6]. The analysis is based on data from a large-scale psychological survey enriched with publicly available country-level secondary data. Random forests were used for their relatively competitive performance, computational inexpensiveness, and ease of interpretation [20]. The expected results consist of an estimate of predictive performance, which indicates how well the final model predicts infection prevention behavior in a new sample; a ranking of predictors based on variable importance, which reflects their relative contribution to the model's predictive performance; and partial dependence plots, which reveal the direction and shape of each predictor's marginal association with the outcome.
The specific approach used in this paper maximized the reliability and generalizability of results in three ways. First, the data were split into a training sample, used to build the model, and a testing sample.
The testing (or "hold out") sample is never used in the initial analysis, but rather is used to estimate the generalizability of the final model after analyses on the training sample are complete (a priori splitting of the dataset can be verified via the project's public historical record). This procedure helps to determine the model's predictive performance: In a classic deductive analysis, performance is traditionally expressed in terms of R², which reflects a theoretical model's descriptive performance: the percentage of variance in the outcome explained by the model in the data. In the machine learning literature, by contrast, it is commonplace to estimate predictive performance by assessing R² in an independent test sample that was not used to estimate the model. Predictive performance reflects the generalizability of a model. Second, part of our global data collection efforts included the recruitment of paid subsamples from 20 countries that were representative of the population's age and gender distribution. Such sampling procedures can improve generalizability to the extent that they include persons who might otherwise not participate as self-selected volunteers. Third, random forests is a specific machine learning method that includes checks and balances to ensure reliability and generalizability of the results [6]. Random forests analysis accomplishes this by drawing 1000 bootstrap samples from the training data, and estimating a regression tree model on each of these bootstrap samples independently. Each regression tree in turn splits the sample recursively until the post-split groups reach a minimum size. A split is made by determining which predictor (out of a randomly selected subset of predictors) and value of that predictor maximizes the homogeneity of the post-split groups. Thus, a tree resembles a flowchart with relatively homogeneous end nodes.
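The split rule described above can be illustrated with a small sketch. It is written in plain Python for illustration only and is not the ranger-based code used in the analysis; the function names `sse` and `best_split`, the toy data, and the choice of the within-group sum of squared errors as the homogeneity criterion are all assumptions made for this example.

```python
import random

def sse(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, y, n_candidates, min_node_size, rng):
    """Return (sse_after_split, predictor_index, threshold) for the best split.

    Only a random subset of `n_candidates` predictors is examined, and splits
    that would create a group smaller than `min_node_size` are skipped.
    """
    candidates = rng.sample(range(len(rows[0])), n_candidates)
    best = None
    for j in candidates:
        for threshold in sorted(set(r[j] for r in rows)):
            left = [y[i] for i, r in enumerate(rows) if r[j] <= threshold]
            right = [y[i] for i, r in enumerate(rows) if r[j] > threshold]
            if len(left) < min_node_size or len(right) < min_node_size:
                continue
            total = sse(left) + sse(right)  # lower means more homogeneous groups
            if best is None or total < best[0]:
                best = (total, j, threshold)
    return best

# Toy data (hypothetical): the outcome is a step function of predictor 0,
# while predictor 1 is an unrelated, deterministic pattern.
rows = [[i / 10, (i * 7) % 10] for i in range(10)]
y = [0.0 if r[0] <= 0.5 else 1.0 for r in rows]
split = best_split(rows, y, n_candidates=2, min_node_size=2, rng=random.Random(1))
print(split)
```

In this toy case the chosen split is on predictor 0, because it is the only predictor that separates the two outcome levels cleanly. A full tree would apply the same rule recursively to each resulting group.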
Interactions are represented by subsequent splits on different variables; non-linear effects are represented by repeated splits on the same variable; random effects are represented by splits on the cluster variable (country) followed by splits on substantive variables. Naturally, each of these 1000 models will include some spurious findings (overfitting). However, when the predictions from the 1000 models are averaged, these spurious findings tend to balance out, thus leaving only the reliable patterns. Whether this approach is successful in identifying reliable and generalizable patterns can be objectively evaluated based on subsequent predictive performance on the hold-out (test) sample. For a complete archive of all analysis code and results, including fit tables and figures, see https://github.com/cjvanlissa/COVID19_metadata. Prior to analysis, we split our data by randomly assigning 70% of observations to a training set and 30% of observations to a test set [5] . The test set was reserved exclusively for unbiased evaluation of the final model's predictive performance, and was neither used nor examined during model building to prevent cross-contamination. Thus, all models were trained using the training set and evaluated using the test set. We applied a random forest model using the ranger R-package [53] . Random forests offer competitive predictive performance at a low computational cost, intrinsically capture non-linear effects and higher-order interactions, offer a single variable importance metric for multi-level categorical variables (such as country), and afford relatively straightforward interpretation of variable importance and marginal effects of the predictors [6] . With regard to the multilevel structure of the data, random forests inherently accommodate data nested within country, including cross-level interactions where a given predictor has a different effect in different countries. The forest included 1000 trees. 
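The logic of the a priori 70/30 split and of evaluating R² on held-out data can be sketched as follows. This is an illustrative plain-Python example under stated assumptions, not the project's R code: the seed value, the simulated data, and the simple one-parameter "model" are hypothetical stand-ins for the survey data and the random forest.

```python
import random

def train_test_split(n_rows, test_fraction, seed):
    """Return (train_idx, test_idx) from a seeded random permutation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

def r_squared(observed, predicted):
    """1 - SS_residual / SS_total, with the mean taken from `observed`."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Simulated data (hypothetical): a linear signal plus Gaussian noise.
rng = random.Random(0)
x = [i / 1000 for i in range(1000)]
y = [2.0 * xi + rng.gauss(0.0, 0.1) for xi in x]

# The seed is fixed before looking at the data, so the split is reproducible.
train_idx, test_idx = train_test_split(len(x), test_fraction=0.3, seed=2020)

# Stand-in model: least-squares slope through the origin, fitted on training rows only.
slope = sum(x[i] * y[i] for i in train_idx) / sum(x[i] ** 2 for i in train_idx)

r2_train = r_squared([y[i] for i in train_idx], [slope * x[i] for i in train_idx])
r2_test = r_squared([y[i] for i in test_idx], [slope * x[i] for i in test_idx])
print(len(train_idx), len(test_idx), round(r2_train, 3), round(r2_test, 3))
```

The training R² describes fit to the data at hand, whereas the test R² estimates performance on new observations. A large gap between the two would suggest overfitting.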
The model had two tuning parameters: the number of candidate variables to consider at each split of each tree in the forest, and the minimum node size. The optimal values for these parameters were selected by minimizing the out-of-bag mean squared error (MSE) using model-based optimization with the R-package tuneRanger [54]. The best model considered 31 candidate variables at each split, and required a minimum of six cases per terminal node.
The outcome metrics considered in the present study consist of 1) predictive performance, which reflects the model's ability to accurately predict new data; 2) variable importance, which reflects each predictor's relative role in accurately predicting the outcome measure; and 3) partial dependence plots, which indicate the direction and (non)linearity of a specific marginal effect [6]. Predictive performance is, essentially, a measure of explained variance (R²), except that in the machine learning context, predictive performance is evaluated on the test sample, which was not used to estimate the model. Estimates of R² on the training sample should be interpreted as a measure of descriptive performance (i.e., how well the model describes the data at hand), and can be (severely) positively biased when used as an estimate of predictive performance in new data. Given that we had recruited paid subsamples (age-gender representative) in 20 countries, we additionally computed predictive performance for the paid-only portion of the test sample, to better examine the generalizability of our findings to the target population. The relative importance of predictor variables is based on permutation importance: Each predictor variable is randomly shuffled in turn, thus losing any meaningful association with the outcome, and the mean decrease in the model's predictive performance after permutation, as compared to the unpermuted model, is taken to reflect the importance of that variable [6].
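Permutation importance as described above can be sketched in a few lines. The example is a generic plain-Python illustration and makes several assumptions: the names `mse` and `permutation_importance`, the fixed stand-in `predict` function, and the toy data are hypothetical, and importance is computed on an arbitrary evaluation set rather than on the out-of-bag observations that a random forest implementation would typically use.

```python
import random

def mse(y, predictions):
    """Mean squared error between outcomes and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, predictions)) / len(y)

def permutation_importance(predict, rows, y, n_repeats, seed):
    """Mean increase in MSE after shuffling each predictor column in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # breaks the link between predictor j and the outcome
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(y, [predict(r) for r in permuted]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy setup (hypothetical): the model uses predictor 0 and ignores predictor 1.
rows = [[i / 50, ((i * 7) % 50) / 50] for i in range(50)]
y = [3.0 * r[0] for r in rows]

def predict(row):
    return 3.0 * row[0]

importance = permutation_importance(predict, rows, y, n_repeats=5, seed=42)
print(importance)
```

Shuffling the ignored predictor leaves the predictions unchanged, so its importance is zero, whereas shuffling the predictor that the model relies on increases the error.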
The partial dependence plots are generated using the metaforest R-package [4]. Partial dependence plots display the marginal (bivariate) association between each predictor and the outcome [55]. They are derived by computing predictions of the dependent variable across a range of values for each individual predictor, while averaging across all other predictors using Monte Carlo integration.
The random forest model predicted a large proportion of the variance in self-reported infection prevention behaviors in the full test sample (R²_test = .523), as well as in the paid subsample (R²_rep = .586). As these samples had not been used in model estimation, this indicates that the results are robust. Notably, the high predictive performance on the paid subsample indicates the generalizability of the findings. The explained variance in the training sample was of approximately the same magnitude (R²_train = .518). This correspondence between training and testing R² indicates that the model successfully learned reliable patterns in the data, and was not overfit. The top 30 predictors, ranked by relative variable importance, are illustrated in Figure 1, along with an indication of whether the effect is generally positive, negative, or other (e.g., curvilinear). Table 1 serves as the legend for the variables illustrated in Figure 1. Table S3 provides full results of all 115 predictors, rank-ordered by variable importance.
The shape of the bivariate marginal association between each predictor and the outcome is displayed in the partial dependence plots (Figure 2). Recall that partial dependence plots display the marginal relationship between one predictor and the outcome, while averaging across all other predictors using Monte Carlo integration [55].
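The computation behind a partial dependence curve can be sketched as follows. This is a plain-Python illustration under stated assumptions rather than the metaforest implementation: the function name `partial_dependence`, the stand-in `predict` function, and the toy data are hypothetical, and the average is taken over all rows of the toy data, whereas a Monte Carlo approach on a large dataset would typically average over a random subsample.

```python
def partial_dependence(predict, rows, j, grid):
    """Average prediction when predictor j is fixed at each grid value.

    For every grid value, predictor j is overwritten in all rows while the
    other predictors keep their observed values; the mean prediction over
    the rows approximates the marginal effect of predictor j.
    """
    curve = []
    for value in grid:
        modified = [r[:j] + [value] + r[j + 1:] for r in rows]
        curve.append(sum(predict(r) for r in modified) / len(modified))
    return curve

# Toy setup (hypothetical): a curvilinear effect of predictor 0 plus a
# linear effect of predictor 1, whose observed mean is 1.
rows = [[i / 10, float(i % 3)] for i in range(30)]

def predict(row):
    return row[0] ** 2 + row[1]

curve = partial_dependence(predict, rows, j=0, grid=[0.0, 1.0, 2.0])
print(curve)
```

Plotting such a curve against the grid values shows the direction and shape of the association; here the curve rises quadratically, shifted upward by the average contribution of the other predictor.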
Note that the marginal predictions for the two levels of "leave for work" are identical; a denser Monte Carlo integration grid might show a small difference here, but would exceed our computational resources. Highly ranked predictors included the number of days that respondents reported leaving home (5th) and the number of days in the past week they had in-person (face-to-face) contact with people who live outside their home, including "…immigrants" (4th), "…other people in general" (6th), and "…friends and relatives" (20th).
The present study used machine learning to identify and rank predictors of infection prevention behavior among a wide set of potential candidates. After training on one sample, the resulting random forests model predicted over 50% of the variance in self-reported infection prevention behavior in a second (test) sample. This exceeds the standards for explained variance of social and health psychological theories, thus indicating that this data-driven approach can complement theoretical models. Moreover, whereas theoretical models typically focus on a narrow set of relevant variables, the present machine learning analysis identified additional, undertheorized predictors (e.g., temporal focus), thus offering complementary insights.
A coherent picture emerged from our analysis of the type of person that engages in infection prevention behavior. The descriptive normative belief that other people in the community do engage in social distancing and self-isolation also emerged as a relatively important predictor. It makes sense that individuals might be less motivated to comply if they were among a community of non-compliers. Furthermore, according to their self-reports about their own behavior, compliant individuals did not engage in behavior that would be inconsistent with self-protection, such as leaving their homes or having personal contacts with other people. If they had contacts with their family and friends, it was not in face-to-face meetings, but online.
The findings also point to the idea that people who comply with recommended infection prevention behaviors are forward-looking problem-solvers. That is, they tended to engage in a problem-focused coping style, focus on the present and the future (rather than dwell on the past), and maintain high hopes that the coronavirus situation would soon improve. This optimistic view is important because these individuals were likely aware of the costs of these infection prevention behaviors and perhaps needed psychological resources to alleviate these costs. In this vein, other important predictors were a pro-social willingness to self-sacrifice to protect vulnerable groups from the virus, to limit the economic consequences of the coronavirus on such groups, and to support collective interventions in the economy such as tax increases. These results might also help understand the tension between ...
It is interesting to consider some of the other 85 variables that were not among the top indicators. From a health psychological perspective, it is surprising that the perceived personal likelihood of getting infected was not among the important predictors, although the perceived personal consequence of infection was ranked 10th. According to the Health Belief Model [15], perceived vulnerability and severity are both central to health threat appraisal. The fact that the perceived severity of getting infected was a highly ranked predictor, but perceived infection risk was not, might suggest that people's behavior is more strongly driven by expected consequences than by probability. Alternatively, the link between compliance and infection risk might be smaller because people implicitly recognize that this risk is largely outside of their control, to the extent that the pandemic constitutes a public goods dilemma.
Several other theoretically relevant variables that were absent from the most important predictors included loneliness and boredom, emotional and affective states experienced during the last week, subjective well-being, various forms of psychological and financial strain, and job insecurity. It is important to note, however, that the present analysis does not rule out the importance of these personal factors for other outcomes, nor does it serve as evidence for a null effect.
No demographic variables emerged as especially important, even though several are associated with increased risk of complications from COVID-19. For instance, elderly people are at higher risk of dying from a COVID-19 infection and are therefore strongly advised to take great care [21]. Furthermore, there is reason to assume that social distancing and self-isolation present more of a dilemma to young people than to elderly people, especially those on a pension. For young people, the costs of social distancing and self-isolation are typically higher and, because they usually recover more easily from a COVID-19 infection, the rewards of those infection prevention behaviors are smaller. Consistent with this argument, the media have framed the pandemic as a potential "intergenerational conflict of interest", where the young bear the brunt of the cost of containment measures, whereas the elderly enjoy most of its benefits. It is therefore noteworthy that our analysis did not identify age as an important predictor. However, this finding is consistent with preregistered research that similarly found no support for the "intergenerational conflict of interest" hypothesis [57].
An important strength of this study is that the questionnaire used was designed by an interdisciplinary consortium of scientists from different countries. This resulted in a questionnaire with a broader scope than those guided by a singular theoretical perspective.
This makes the resulting data ideally suited for a machine learning analysis that can distill the most important predictors from many potential candidates. The analysis of this study uses inductive methods, which maximize predictive performance, typically explain more variance than purely deductive approaches, and, in the case of random forests, intrinsically capture non-linear effects and higher-order interactions, including between-country differences in effects. However, the results are harder to interpret than the parameters of a conventional statistical model (e.g., regression coefficients and p values). We should note that the variables included in the PsyCorona survey were guided by theory, and thus our approach combines inductive and deductive approaches. Thus, although our application of machine learning is useful for gaining preliminary insights, it also capitalizes on a rich history of theorizing about what drives engagement in health behavior. However, although our study includes potentially important variables and theoretical areas, it is neither exhaustive nor conclusive. Inductive analysis can complement theories or provide an impetus for the development of new hypotheses, but the output is not yet a comprehensive theory.

Although the data collected describe infection prevention behaviors during the beginning of the pandemic, they may be useful for understanding later patterns of behavior (e.g., low vaccination rates) or future crises that involve a combination of personal and societal risk. Health behavior theories tend to focus on the intrapersonal factors that predict behavior, possibly because these seem proximal to the health behaviors of interest. However, our data suggest these proximal factors may predict less variance in behavior than broader considerations of communal behavior. Future models may benefit from considering perceptions of norms in conjunction with personal risk when they are applied to other health behaviors as well.
We began with an assumption that control of the pandemic is analogous to a public goods dilemma, in that COVID-19 is a social challenge that, in the absence of a vaccine at the time of the study, could only be addressed if enough individuals engaged in infection prevention behavior. In accordance with this assumption, social beliefs and societal factors, rather than exclusively personal psychological states, emerged as the main predictors in our analysis.

Resource Availability

Lead Contact. The lead contact for this paper is Dr. Caspar van Lissa, who may be contacted at C.J.vanLissa@uu.nl.

The full survey is available in the supplemental material, as well as codebooks and translation procedures for all languages (tables S1 & S2). All analysis code is available in an online repository (https://github.com/cjvanlissa/COVID19_metadata), which also includes a full historical record since the start of the project. This can be used to verify that the analysis proceeded transparently and straightforwardly; the random seed used to select participants for the test sample was established before access to data was obtained, and testing data were never used for model training.

Data and Code Availability. The data and code used in this analysis are available at DOI: 10.5281/zenodo.5948816.

The survey was translated from English into 29 other languages by bilingual members of the international research team. It was distributed online during the early phase of the pandemic (March-May 2020), with most participants completing the survey in March and April (see figure S1 for daily frequencies). Parallel sampling strategies were employed: convenience sampling, snowball sampling, and paid sampling.
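The seeded hold-out split mentioned in the resource availability statement above can be illustrated with a minimal sketch. The original analysis was carried out in R; the Python function name `split_train_test`, the test fraction of 0.3, and the seed value 2020 are hypothetical choices made only for illustration and are not taken from the paper.

```python
import random


def split_train_test(ids, test_fraction, seed):
    """Split participant ids into train and test sets, reproducibly.

    The seed is fixed in advance, so the same participants always end up
    in the test set, regardless of when or how often the split is run.
    """
    ids = list(ids)
    rng = random.Random(seed)      # local generator; global state is untouched
    shuffled = ids[:]
    rng.shuffle(shuffled)
    n_test = round(len(ids) * test_fraction)
    test = set(shuffled[:n_test])
    train = [i for i in ids if i not in test]  # preserves the original order
    return train, sorted(test)


if __name__ == "__main__":
    participant_ids = list(range(1, 21))   # hypothetical ids
    train_ids, test_ids = split_train_test(participant_ids, 0.3, seed=2020)
    print("train:", train_ids)
    print("test: ", test_ids)
```

Because the generator is seeded, rerunning the split yields the same partition, which is what allows a reader to check that test data were held out from model training.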
Given that age and gender were identified early as population vulnerability characteristics to the virus [21, 22], the self-selected samples were supplemented with paid subsamples that were representative of a given country's population distribution about handwashing, endorsement of stringent regulations for violating quarantine) did not exist prior to the pandemic, we crafted face valid items to assess these constructs. Bavel and colleagues' [14] discussion of candidate domains of inquiry for pandemic behavior, it touches on nearly all of these topics, including navigating threats, stress and coping, science communication, moral decision-making, and political leadership. Personal factors adapted or informed by prior work included affective states (incl. valence and arousal [23]); boredom [24]; coping and avoidance [25, 26]; financial strain [27]; loneliness [28]; neuroticism [29]; happiness and well-being [30-32]; time perception, management, and temporal focus [33, 34]; and working conditions and job insecurity [35-37]. The social attitudes and norms domain included generic conspiracy beliefs and paranoia [38, 39]; immigrant attitudes [40-42]; norm perceptions and preferences (adapted [43]); and societal discontent and disempowerment [44, 45]. Virus-relevant personal concerns included perceived norms (both descriptive and injunctive, adapted [46]); virus-relevant beliefs and perceived knowledge, virus exposure risk and economic risk, and severity of virus and economic consequences (adapted [46, 47]); trust in government pandemic communication and response (adapted [43, 48, 49]); and attitudes towards prosocial responses and extraordinary societal responses [48]. This list is not exhaustive; see prevention behaviors were advised across most countries and contexts: washing hands, avoiding crowds, and self-isolation/self-quarantine (wearing a face covering was not universally recommended by the WHO until June 2020 [50]).
Participants were presented with a single screen that read, "to minimize my chances of suffering from coronavirus, I..." and indicated their agreement with "1. …wash my hands more often", "2. ...avoid crowded spaces", and "3. ...put myself in quarantine/self-isolate", each rated on a seven-point scale from -3 (strongly disagree) to 3 (strongly agree). To ensure items could be combined into a unidimensional scale, we conducted Horn's parallel analysis [51]. Only one component had an eigenvalue exceeding that obtained from randomly permuted data. This component explained 70% of the variance in the three items, which is high. The three factor loadings were high and approximately equal in size (range: .78 to .89), indicating that it is justifiable to combine these three items into a mean score representing infection prevention behaviors (M = 2.20, SD = 1.00, α = .75). Note that the items were specifically framed to assess the behavioral intent to reduce the risk of infection, consistent with theories of health behavior that people engage in self-protective actions because they are perceived as instrumental for threat reduction [46].

We subsequently cleaned the data in several steps. First, to ensure that there were enough data at the country level, we excluded observations from countries that accounted for less than 1% of total observations. The final sample included N = 56,072 respondents across 28 countries (see table S4 for samples that remained in the data). Second, we excluded any columns and rows from the data that had a proportion of missing values of more than 20%. Third, we computed mean scores for multi-item scales using the tidySEM R-package [52]. For instance, responses to all 4 items on job insecurity [37] were summarized by creating a single composite score for job insecurity. Scales with low reliability were excluded (Cronbach's alpha < .65

Zúñiga 67 contributed to project design, data collection, translation, and review of the manuscript.
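The data cleaning steps described in the analysis procedure above (excluding small countries, excluding sparse columns and rows, and building composite scores for reliable scales) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code, which used R and the tidySEM package. All function names, the `country` key, and the use of `None` for missing values are assumptions, as is the order in which columns and rows are filtered.

```python
from collections import Counter
from statistics import mean, variance


def drop_small_countries(rows, min_share=0.01):
    """Keep rows from countries with at least min_share of all observations."""
    counts = Counter(r["country"] for r in rows)
    total = len(rows)
    return [r for r in rows if counts[r["country"]] / total >= min_share]


def drop_sparse(rows, columns, max_missing=0.20):
    """Drop columns, then rows, with more than max_missing missing values.

    Assumption: columns are filtered before rows; the source does not
    state the order.
    """
    kept_cols = [c for c in columns
                 if sum(r[c] is None for r in rows) / len(rows) <= max_missing]
    kept_rows = [r for r in rows
                 if sum(r[c] is None for c in kept_cols) / len(kept_cols)
                 <= max_missing]
    return kept_rows, kept_cols


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item score lists (complete cases only)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent sum
    item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


def mean_score(row, item_names):
    """Mean of a respondent's non-missing items; None if all are missing."""
    values = [row[c] for c in item_names if row[c] is not None]
    return mean(values) if values else None


if __name__ == "__main__":
    # Hypothetical responses to a three-item scale from four respondents.
    items = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 3, 3, 5]]
    alpha = cronbach_alpha(items)
    keep_scale = alpha >= 0.65   # reliability threshold reported in the text
    print(round(alpha, 3), keep_scale)
```

A scale would be retained and summarized with `mean_score` only if its alpha reaches the threshold; otherwise it would be excluded from the set of candidate predictors.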
The authors declare no competing interests.

The PsyCorona data were made available for theory-testing studies by the researchers who helped to collect the data. Portions of the PsyCorona data have been previously reported in specific hypothesis tests [57-61]. This machine learning analysis was planned a priori as part of the onset of PsyCorona, is the only paper that uses inductive analysis, and is based on the total dataset.

Covid prosocial: Pro-social willingness to protect vulnerable groups from the coronavirus (4 items)
4. Contact immigrants: Days of in-person (face-to-face) contact with immigrants
5. Home.leave.often: How many days in the last week did you leave your home?
6. Contact people: Days of in-person (face-to-face) contact with other people in general
7. Do social distance: Descriptive norm ("Right now, people in my area..." "...do self-isolate and engage in social distancing.")
8. Econ prosocial: Pro-social willingness to protect vulnerable groups from economic consequences of the coronavirus (3 items)
9. Problem solving: Problem-focused coping style (3 items)
10. Consequence contracting: How personally disturbing would it be if… "You were infected with coronavirus"
11. Covid hopeful: "I have high hopes that the coronavirus situation will soon improve"

2. Global Health Security (GHS) Index: Country-level ratings of pandemic preparedness and general health security.
Country-level health care resources and health infrastructure.
Per-country data on aggregate ratings of: Voice and accountability, regulatory quality, political stability and absence of violence, rule of law, government effectiveness, and control of corruption.
Governmental responses and policies with respect to COVID-19 by date per country.

• The strongest predictors related to injunctive norms.
In a study of 56,072 participants from 28 countries, we used a machine learning approach to identify the strongest predictors of COVID-19 infection prevention behavior (pre-vaccine). Few country-level variables predicted outcomes. Instead, individual psychological variables predicted outcomes. Injunctive norms, such as believing people should engage in the behaviors and support for behavioral mandates, were the strongest predictors of infection prevention behavior. The results highlight how both data- and theory-driven approaches can increase understanding of complex human behavior.

In the absence of a vaccine or cure, virus containment depended on individual-level compliance with behaviors recommended by the World Health Organization. We used machine learning to identify the most important indicators of compliance, based on a large international psychological survey and country-level secondary data. The most important indicators were not the "usual suspects", such as personal threat of virus infection, but rather injunctive norms, namely, the belief that one's community should engage in such behavior and that society should take restrictive virus containment measures. People who tend to engage in infection prevention behaviors also tend to believe that general compliance is necessary to defeat the pandemic, which extends to endorsement of 'ought' norms and support for behavioral mandates. These results highlight the potential to intervene by shaping social norms and expectations.
Distancing and handwashing could lower flu rates
Exploring Heterogeneity in Meta-Analysis using Random Forests (0.1.2)
COVID-19 vaccine hesitancy - a scoping review of the literature in high
Vaccination against COVID-19: A systematic review and meta-analysis of acceptability and its predictors
The elements of statistical learning: Data mining, inference, and prediction
Random forests
Theory-guided exploration with structural equation model forests
Self-interest and collective action: The economics and psychology of public goods
The logic of collective action: public goods and the theory of groups
Adolescents' motivations to engage in social distancing during the COVID-19 pandemic: Associations with mental and social health
A study of normative and informational social influence upon individual judgments
Autonomous and controlled motivational regulations for multiple health-related behaviors: Between- and within-participants analyses
The psychological impact of quarantine and how to reduce it: rapid review of the evidence
Using social and behavioural science to support COVID-19 pandemic response
The Health Belief Model: A decade later
The Health Belief Model.
in Predicting health behaviour: Research and practice with social cognition models
Attitudes, personality, and behavior
Efficacy of the Theory of Planned Behaviour: A meta-analytic review
Prospective prediction of health-related behaviours with the Theory of Planned Behaviour: a meta-analysis
An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests
COVID-19 guidance for older adults
COVID-19: the gendered impacts of the outbreak
A circumplex model of affect
Development and validation of the Multidimensional State Boredom Scale
Assessing coping strategies: a theoretically based approach
The cognitive avoidance questionnaire: validation of the English translation
Mind your mannerisms: Behavioral mimicry elicits stereotype conformity
Social psychology and health
Is it a dangerous world out there? The motivational bases of American gun ownership
Toward an integrative social identity model of collective action: a quantitative research synthesis of three socio-psychological perspectives
American handgun owners differ?
Advice on the use of masks in the context of COVID-19: Interim guidance
A rationale and test for the number of factors in factor analysis
Generate tidy SEM-syntax
A Fast Implementation of Random Forests for High Dimensional Data
Hyperparameters and tuning strategies for random forest

"This research received support from the New York University Abu Dhabi (VCDSF/75-71015), the University of Groningen (Sustainable Society & Ubbo Emmius Fund), and the Instituto de Salud Carlos III (COV20/00086) co-funded by the European Regional Development Fund (ERDF) 'A way to make Europe'"

The study was designed by Caspar J. van

• We were careful to choose semantically correct translations over more literal ones aiming to accommodate cultural differences.
In some cases, it was necessary to add more words for correct understanding.
• "Online vs. offline contact" was translated as "online vs. direct contact".
• The item about "belief in one God/more than one God" was translated as "belief in one God", as all the religions in Kosovo and Albania are monotheistic. Because these beliefs were combined within one item rather than separated into two, the item may have been confusing for respondents, so the translation was adapted culturally.
• Items pertaining to political orientation (left/right wing) may not be relatable due to the terminology used. Longer descriptions may have been needed to explain the terms and ensure they are correctly understood.

Arabic
• Some words/phrases were changed or removed to accommodate regional religious beliefs.
• There was no word for 'local community' in Arabic, so the term 'society' was used instead.
• In the question where there is a distinction between should and do isolate/social distance myself, and want/have to … a formatting error caused the wrong term to be bolded in some items, but the wording was the same.

Bengali: None

• Difficulty translating formidability items as the word formidability does not translate well. We adjusted the translation for better understanding.
• Formidability was translated as 'powerful' as the Dutch word for formidability is rarely used.
• "Online vs. offline contact" was translated as "online vs. face-to-face contact".

Farsi
• Multiple questions did not translate well.
• Attitude about politics is a relatively western way to categorize people into groups.

French: None

• Some items were hard to translate. E.g., 'community' does not translate well.
• Semantically correct translations were sometimes chosen over more literal ones to accommodate cultural differences.

Greek
• "Online" in the item "In the past 7 days, how many days did you have online (video or voice) contact with …" was translated "Internet".
• "Community" (in the present context) does not translate well into Greek.

• Some items were too technical and did not translate well, so simpler translations conveying the meaning were chosen.
• Some items were difficult to translate accurately.
• Some items were difficult to translate accurately due to the inequality of meanings.

Italian: None

Japanese: None

Korean: None

• Some items were difficult to translate literally due to cultural considerations. E.g., the item about belief in one God/more than one God may be perceived as offensive to Malay Muslims when it is written as one item. Agreeing with the item may indicate that the individual believes in either one, and this is unacceptable to the majority of Muslims in Malaysia.
• Items pertaining to political orientation (left/right wing) may not be relatable to many locally, due to the terminology used. Longer descriptions may be needed to explain the terms to ensure they could be understood correctly.

Polish
• Tightness-looseness construct was difficult to translate.

Romanian: None

• Tightness-looseness construct is difficult to express in Russian.
• The terms "economic left-right" and "libertarian-authoritarian" make little sense without explanation to most Russians.
• Difficulty translating formidability items.
• Identification item translated as "I feel close to" instead of "I identify with".
• Care taken when finding equivalence between standard Spanish and Latin American Spanish.
• Difficulty in translating cross-cultural research terms.
• Some items were difficult to translate literally and accurately.
• Questions about bodies were confusing.
• Translated "in my country" to "in the place I live" in order to accommodate both Taiwan and Hong Kong (which is not a country, but a special administrative region).

Ukrainian
• Questions about 'bodies' were confusing since the metaphor itself might not have been fully clear for the local population.
• The same concerns formidability.
Many sentences had to be restructured in order to save the meaning of the question.

Urdu: None

• Some translated items were difficult to express accurately in Vietnamese due to political and social issues (e.g., protest/protesting) and some were not popular to most Vietnamese people (e.g., economic left-right or libertarian-authoritarian).