key: cord-0982582-2ho05cnn
authors: Zhang, S.; Cooper-Knock, J.; Weimer, A.; Harvey, C.; Julian, T.; Wang, C.; Li, J.; Furini, S.; Frullanti, E.; Fava, F.; Renieri, A.; Pan, C.; Song, J.; Billing-Ross, P.; Gao, P.; Shen, X.; Timpanaro, I. S.; Kenna, K.; VA Million Veteran Program,; GEN-COVID Network,; Davis, M.; Tsao, P.; Snyder, M.
title: Common and rare variant analyses combined with single-cell multiomics reveal cell-type-specific molecular mechanisms of COVID-19 severity
date: 2021-06-21
journal: medRxiv : the preprint server for health sciences
DOI: 10.1101/2021.06.15.21258703
sha: a4a75540d2b7b9556215f475ad6ebb13c5b92883
doc_id: 982582
cord_uid: 2ho05cnn

The determinants of severe COVID-19 in non-elderly adults are poorly understood, which limits opportunities for early intervention and treatment. Here we present novel machine learning frameworks for identifying common and rare disease-associated genetic variation, which outperform conventional approaches. By integrating single-cell multiomics profiling of human lungs to link genetic signals to cell-type-specific functions, we have discovered and validated over 1,000 risk genes underlying severe COVID-19 across 19 cell types. Identified risk genes are overexpressed in healthy lungs but relatively downregulated in severely diseased lungs. Genetic risk for severe COVID-19, within both common and rare variants, is particularly enriched in natural killer (NK) cells, which places these immune cells upstream in the pathogenesis of severe disease. Mendelian randomization indicates that failed NKG2D-mediated activation of NK cells leads to critical illness. Network analysis further links multiple pathways associated with NK cell activation, including type-I-interferon-mediated signalling, to severe COVID-19. Our rare variant model, PULSE, enables sensitive prediction of severe disease in non-elderly patients based on whole-exome sequencing; individualized predictions are accurate independent of age and sex, and are consistent across multiple populations and cohorts. Risk stratification based on exome sequencing has the potential to facilitate post-exposure prophylaxis in at-risk individuals, potentially based around augmentation of NK cell function. Overall, our study characterizes a comprehensive genetic landscape of COVID-19 severity and provides novel insights into the molecular mechanisms of severe disease, leading to new therapeutic targets and sensitive detection of at-risk individuals.

276 Next, we investigated the baseline expression pattern of RefMap genes in healthy 277 lungs. In particular, we calculated mean expression levels of genes in different cell types 278 based on the lung snRNA-seq data from Wang et al. 33 , and then compared the 279 expression of RefMap genes with the total set of expressed genes in each cell type.

10 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 21, 2021. ; https://doi.org/10.1101/2021.06.15.21258703 doi: medRxiv preprint 280 Interestingly, although the gene expression level was not an input to the RefMap model, 281 RefMap genes are expressed at a higher level compared to the background 282 transcriptome in all 19 cell types, including immune and epithelial cells (false discovery 283 rate (FDR)<0.1, one-tailed Wilcoxon rank-sum test; Fig. 3c ) with the exception of 284 pericytes (FDR=0.11, Z -score=1.25); it is interesting to note that pericytes may be 285 downstream in the pathogenesis of COVID-19 because they are protected by an 286 endothelial barrier 47 . This supports the functional significance of RefMap genes across 287 multiple cell types in healthy human lungs. As a negative control, we performed a 288 similar expression comparison between non-developmental genes and all expressed 289 genes in lungs, which yielded no significant difference ( Supplementary Fig. 2 ; 290 Methods ). In summary, our transcriptome analyses indicate that RefMap genes are 291 expressed above background in relevant cell types, supporting their important role in 292 lung function. 293 To obtain further insights into the function of RefMap COVID-19 regions, we tested 294 whether RefMap regions are enriched with cell-type-specific candidate cis-regulatory 295 elements (cCREs, or enhancers and promoters) defined by H3K27ac and H3K4me3. 296 We obtained cCREs for lung tissues and primary cells from the ENCODE project 48 297 ( Supplementary Table 6 ), and examined the overlap between those cCREs and 298 RefMap regions by permutation test 49 . We discovered that RefMap regions specific to 299 Fig. 3 ). 307 We have shown that RefMap COVID-19 regions are enriched within promoters and 308 enhancers responsible for regulating gene expression in healthy lung tissue. On this 309 basis, we hypothesized that genetic variation within RefMap regions would alter the 11 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 21, 2021. ; 518 activation; thus the genetic architecture we have discovered places NK cell activation 519 upstream in determining severe COVID-19. 520 To further characterize the function of the identified NK-cell modules, we investigated 521 the expression of module genes based on scRNA-seq data from healthy and diseased 522 lung tissues. Genes in all three modules are relatively over-expressed in NK cells of 523 healthy lungs 33 than the background transcriptome (FDR<0.01, one-tailed Wilcoxon 524 rank-sum test; Fig. 6g ). In contrast, in lung tissues infected with SARS-CoV-2, we 543 Our study of common and rare genetic variation associated with severe COVID-19 544 converges on common biology, despite non-overlapping datasets and orthogonal 545 analytical methods. We have achieved this because we have developed effective 546 machine learning methods which offer advantages over traditional methods: RefMap to 19 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 21, 2021. ; 547 integrate common variants with epigenetic profiles 32 , and PULSE for rare variant 548 discovery by prediction. The evolution of clinical COVID-19 involves the interaction of 549 multiple viral and host factors in what is likely to be a nonlinear system; our work 550 supports this proposal and suggests that traditional methods may be inadequate given 551 current sample sizes. This study is the first time we have presented PULSE and we 552 have demonstrated a significant power advantage compared to standard methodology. 553 Both methods are ready for application in other disease areas. 554 Our network analysis highlights NK cell activation through type I interferon signalling 555 ( Fig. 6f ) as a key upstream determinant of COVID-19 severity. This links to previous 556 literature describing a delayed interferon response as a precursor of later 557 hyperinflammation associated with potentially fatal ARDS 27 , 65 . NK cells can also be 558 activated via MHC signalling through NKG2 proteins. The CD94/NKG2C/HLA-E axis 559 has been shown to be key to the NK antiviral response 66 but so has the recognition of 560 induced-self antigens via the NKG2D receptor 67 . Deletions of NGK2C have previously 561 been linked to severe COVID-19 28 , whereas both our Mendelian randomization and 562 transcriptome analyses highlight a role for NKG2D+ NK cells. We suggest that all three 563 mechanisms for NK cell activation are critical to the host immune response to 564 SARS-CoV-2. Indeed, a recent study has revealed that autoantibodies which impair NK 565 cell activation are associated with severe COVID-19, and that manipulating the 566 activation of NK cells in a mouse model resulted in a significantly higher viral burden 29 . 567 In the cancer field, NK cell stimulation has been postulated as a therapeutic strategy 68 . 568 We propose that this strategy could protect at-risk individuals in future waves of 569 COVID-19. 570 It is important to note that our analyses also identified genetic risk of severe COVID-19 571 associated with non-NK cell types, including other immune cells and epithelial cells such 572 as AT2 cells, which is consistent with the previous literature 69 . Indeed, the PULSE 573 prediction is based on a total genetic architecture and not limited to NK cell 574 genomics. Future work will determine how these other cell types are essential and how 575 they interact with NK cell activation. 20 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 21, 2021. ; 576 We present a validated prediction of COVID-19 severity derived entirely from host 577 characteristics, including age, sex, and genetics. The average AUROC of ~0.72 578 outperforms all comparable strategies 9 ; and we achieve a very high sensitivity of ~85% 579 with a specificity >50%. Our prediction could be applied in advance of infection or even 580 exposure, and thus has the potential to be very useful clinically. We anticipate future use 581 and refinement of our prediction model to guide administration of post-exposure 582 prophylaxis to at-risk individuals, in a similar manner to current standard practice for 583 HIV 70 . 584 Our analyses are based on the largest available datasets to date but increasing sample 585 size could improve the precision of our discovery and prediction. In addition, the vast 586 majority of our data was taken from populations and at times when recently identified 587 SARS-CoV-2 variants were not prevalent in the population (before November 2020, 588 https://covariants.org/per-country ). It is unlikely, but not impossible, that the NK cell 589 responses we have identified as essential determinants of severe COVID-19 are not 590 applicable to new variants. 591 In conclusion, we have uncovered a comprehensive genetic architecture of severe 592 COVID-19 integrated with single-cell-resolution biological functions. Both common and 593 rare variant analyses have highlighted NK cell activation as a potential key factor in 594 determining disease severity. Our novel rare variant method has also achieved age-, 595 sex-, and ancestry-independent prediction of COVID-19 severity from personal 596 genomes. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 21, 2021. ; All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. 780 In this study, we ran the MFVI algorithm per chromosome to accelerate the 781 computation. The Q + -and Q --scores were defined as and , 782 respectively, and we also defined the Q -score as Q = Q + + Q -. RefMap regions were 783 identified by Q + -or Q --score >0.95.

784 Mapping cell-type-specific genes from RefMap regions 28 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 21, 2021. ; https://doi.org/10.1101/2021.06.15.21258703 doi: medRxiv preprint All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. 816 In total, 46 GWAS measures of NK cell subtypes were identified from the IEU Open 817 GWAS Project, including "prot-a-180", "met-b-124", "met-b-245", "met-b-242", 818 "prot-c-5244_12_3", "met-b-237", "met-b-258", "prot-a-1669", "prot-c-2917_3_2", 819 "met-b-246", "prot-a-1671", "met-b-249", "met-b-140", "met-b-240", "prot-a-3159", 820 "prot-c-5104_57_3", "prot-c-3056_11_1", "prot-a-13", "prot-a-3160", "met-b-123", 821 "met-b-250", "met-b-239", "met-b-120", "met-b-154", "prot-a-3162", "met-b-247", (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 21, 2021. ; https://doi.org/10.1101/2021.06.15.21258703 doi: medRxiv preprint 843 were omitted from the analysis in order to reduce the risk of errors due to strand 844 issues 84 . 845 The MR measure with the greatest power is the inverse-variance weighted (IVW) 846 method, but this is contingent upon the exposure IV assumptions being satisfied 85 indicating that Egger is more likely to be biased towards the null 90 . Finally, we performed 871 a leave-one-out analysis using the method of best fit for each exposure SNP within the 31 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 21, 2021. ; https://doi.org/10.1101/2021.06.15.21258703 doi: medRxiv preprint All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 21, 2021. ; https://doi.org/10.1101/2021.06.15.21258703 doi: medRxiv preprint All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Supplementary Notes for "Common and rare variant analyses combined with single-cell multiomics reveal cell-type-specific molecular mechanisms of COVID-19 severity"

1 Update rules of variational inference for PULSE

We provide update rules for the local and global variational parameters in PULSE.

As described in the Methods section in our main text, we used the local variational method [1] to handle the sigmoid function in variational inference (VI). Indeed, the sigmoid function involved in the Bernoulli distribution in Eq. 19 in the Methods section can be lower bounded by

where

c i = w 1 X i w 2 , and ξ i is a local variational parameter introduced to control the bound tightness. Therefore, the log-likelihood of observations is also lower bounded, i.e., ln p(y 1:N |X 1:N ) = ln p(y 1:

On one hand, we aim to perform variational inference based on the tractability of the lower bound h(c i , ξ i ), matching the proposal distribution with the true posterior. On the other hand, the variational parameters ξ i 's need to be optimized by maximizing the lower bound L(ξ 1:N ) of the marginal likelihood, which achieves a better approximation after each update. Therefore, we adopted the variational expectation-maximization (VEM) algorithm that solves both optimization problems simultaneously. We first note that the "joint distribution", denoted byp , after lower bounding is not a proper density function, but by normalization, the inequality may not hold any more. Indeed, after normalizing, we get

with A(ξ 1:N ) = 

resulting in a similar decomposition of the marginal log-likelihood to that in conventional VI.

As a consequence, we can perform the VEM as follows. (i) In the E-step where the variational parameters ξ 1:N are fixed, the standard variational inference is performed to maximize the computationally feasible ELBOp (q, ξ 1:N ) with respect to q. Here, everything in the mean-field variational inference (MFVI) keeps unchanged except replacing the sigmoid functions in the joint distribution by their lower bounds given by Eq. 1. This computes the approximate distribution best matching the true posterior, i.e., minimizing the KL divergence between q and p (see the last equation in Eq. 6). After the E-step, we approximately tighten the gap between L(ξ 1:N ) and the ELBO, and obtain L(ξ 1:N ) ≈ ELBOp (q, ξ 1:N ). (ii) In the M-step, we fix q and maximize the ELBO with respect to ξ 1:N , which increases L(ξ 1:N ) accordingly, as it is obvious that the inequality L(ξ 1:N ) ≥ ELBOp (q, ξ 1:N ) holds. Using VEM, we update q's and ξ i 's iteratively, gradually increasing the log-likelihood lower bound until reaching a local optimum and simultaneously yielding approximate posteriors with performance guarantee.

According to above discussions, we first get the lower bound of the log-likelihood of the conditional distribution over observations, i.e.,

which serves as the basis for the inference of w 1 and w 2 . Then based on this lower bound and the update principle of MFVI, the logarithm of q(w 1 ) can be calculated as

This indicates that q(w 1 ) follows a Gaussian defined as

The update rules expressed in Eqs. 10 and 11 are batch based, which is inefficient for large sample size or large feature dimension. We will transform this batch update into stochastic or mini-batch one based on the stochastic variantional inference (SVI), scaling up the inference algorithm to big data. More details are shown in Section 1.3.

To perform VI over the spike-and-slab prior defined in Eq. 22 in the Methods section of the main text, we adopted the reparameterization trick introduced in [4] . In particular, as discussed in the Methods section, w 2 can be reparameterized by two additional variables s andw 2 with

where • means element-wise product. It can be easily shown that the new variable constructed byw 2j s j follows the same distribution as w 2j . Then we can perform MFVI over w 2 and s. However, the solution derived from a direct application of the fully factorized MFVI will deviate from the true posterior q(w 2 ) a lot, as the former is unimodal while the latter exponentially multimodal. To solve this problem, we followed [4] , in whichw 2j and s j are bundled together in the factorization. In particular, we assume the proposal distributions factorize as

Given the MFVI principle, after substituting w 2j withw 2j s j in Eq. 7, we get

where we define

and 1 j is a vector with all zeros but the j-th element one.

Since q(w 2j |s j ) ∝ q(w 2j , s j ), based on Eq. 14, we have q(w 2j |s j = 0) = N w 2j ;μw 2j |sj =0 ,λ −1 w2j |sj =0 , 

Similarly, q(w 2j |s j = 1) also follows a Gaussian given by q(w 2j |s j = 1) = N w 2j ;μw 2j |sj =1 ,λ −1 w2j |sj =1 ,

whereμw

The stochastic updates of Eqs. 20 and 21 are shown in Section 1.3.

To derive q(s j ), we use the Bayes' rule given by q(s j ) = q(w 2j , s j )/q(w 2j |s j ), yielding q(s j ) = Bern (s j ;π j ) ,

whereπ j =ρ 1j ρ 0j +ρ 1j ,

and

The posterior statistics of w 2j , including the expectation and variance, can be easily calculated based on Eqs. 12, 17, 18, 20 and 21. In particular, the posterior statistics of the marginal q(w 2j ) can be derived based on the laws of total expectation and variance, respectively.

As discussed in the Methods section in the main text, to scale up the inference algorithm to big data, we adopted SVI proposed in [3] . SVI updates variational parameters by summarizing data points based on stochastic gradient optimization, in which the natural gradient is used to account for measuring similarity between probability distributions. Thanks to the conditional conjugacy introduced in our model, the natural gradient enjoys a simple form without the calculation of the Hessian [3] . Then we can approximate the natural gradient by randomly sampling a single or a mini-batch of samples, greatly reducing the computational complexity per epoch. Here in our inference process, there are two steps where SVI needs to be applied.

(i) For the update of q(w 1 ) whose batch update is given by Eqs. 10 and 11, its stochastic update is given by

where φ 1 and φ 2 are natural parameters in the exponential family form for multivariate Gaussian, and I is a randomly sampled index set from 1 : N with size B. Then the distribuiton parameters in q(w 1 ) can be recovered bỹ

(ii) Similarly, for the update of q(w 2j |s j = 1), its stochastic version is given by

where ψ 1j and ψ 2j are natural parameters in the exponential family form of Gaussian. In particular, the parameters in q(w 2j |s j = 1) can be recovered bỹ

λw 2j |sj =1 = −2ψ 2j .

For other variational parameters, we perform standard MFVI and have

q π;α π ,β π = Beta π;α π ,β π ,

q λ;ã λ ,b λ = Gamma λ;ã λ ,b λ ,

in whichW −1 Λ = W −1 0 + E[w 1 w 1 ], ν Λ = ν 0 + 1, 

In addition to calculating posteriors, we also need to determine the local variational parameters ξ i 's. According to our discussion in Section 1.1, we seek to optimizing ξ i 's by maximizing the lower bound L(ξ 1:N ) in Eq. 3. This corresponds to the M-step, in which the expected complete-data log-likelihood is maximized, i.e.,

in which

By setting the derivate of Eq. 43 with respect to ξ i to zero, we get

indicating that

Note that we can force ξ i 's to be nonnegative without loss of generality due to the monotonicity of χ(ξ i ) when ξ i ≥ 0.

To terminate the algorithm, we need to monitor the change of ELBO, whose computation is intense and undesirable. In this study, we followed the suggestions proposed in [2] , in which we computed the average log predictive for a small held-out dataset to track ELBO evolution. We terminated the updates once the change of average log predictive fell below a threshold, indicating convergence. Here, we set tol = 10 −5 and terminate the algorithm when the proportion of change in ELBO is less than the tolerance. The inference algorithm is summarized in Algorithm 1.

where only rare variants (i.e., absent from the EUR cohort test the prediction performance of PULSE, we first performed 5-fold 374 cross-validation (CV) based on the GEN-COVID discovery cohort

To further validate the prediction power of PULSE, we analyzed whole-genome 388 sequencing (WGS) data of an independent cohort from the Veterans Health

Genome annotation and feature 392 engineering were conducted as for the GEN-COVID cohort. Similarly, to remove the 393 effect of age, we focused on non-elderly adults (age >30 and <65 years) who were 394 critically ill or not hospitalized

Next, we asked if the prediction accuracy is generalizable across different populations

We constructed a test set of non-EUR non-elderly adults (age >30 and <60 years) that 402 passed all other QC criteria within the GEN-COVID cohort, resulting in 12 cases 403 (critically ill) and 6 controls (not hospitalized)

Combining two scores further increased the prediction performance

Furthermore, we applied the same trained PULSE model to predict severe 408 disease for African (AFR) individuals (10 cases versus 92 controls) within the VA cohort

These results demonstrate the 412 prediction power of PULSE and suggest that the rare-variant genetic architecture of 413 COVID-19 severity is conserved across multiple populations. Importantly, we observed 414 that the prediction in the VA cohort based on just age and sex information trained on the 415 GEN-COVID

we 424 discovered that PULSE yielded a significantly higher sensitivity (median values: 0.857 425 versus 0.688, 0.900 versus 0.525, and 0.875 versus 0.560 for EUR, AFR, and all VA 426 samples, respectively; Fig. 4f ). High sensitivity is important for the clinical application of 427 severity prediction to guide the identification of at-risk individuals. Predictions for the 428 GEN-COVID non-EUR samples yielded similar sensitivity and specificity

PULSE genes across different cell types by observing their non-random overlapping 461 with lung snATAC-seq peaks 33 (FDR<0.1, permutation test). Furthermore, the 462 expression levels of PULSE genes measured by scRNA-seq were examined, where we 463 found that PULSE genes are higher expressed in B cells, club cells, lymphatics, matrix 464 fibroblast 1, and NK cells (FDR<0.1, one-tailed Wilcoxon rank-sum test; B cells

Our transcriptome study demonstrates the functional role of 475 PULSE genes in severe disease across multiple cell types. Notably, among all the cell 476 types we investigated, only NK cells are consistently associated with severe COVID-19 COVID-19, we examined the function of NK-cell genes identified by either 484 RefMap or PULSE (377 genes; Supplementary Table 11 ). Indeed, genes do not 485 function in isolation 58,59 and therefore, rather than examining individual genes, we 486 mapped NK-cell genes to the global protein-protein interaction (PPI) network and then 487

we extracted high-confidence (combined score >700) PPIs from STRING 489 v11.0 60 , which include 17,161 proteins and 839,522 protein interactions

Supplementary Table 12 ; Methods ). Next, this smoothed PPI network was

NK-cell COVID-19 genes were mapped to individual modules, and four modules were 498 found to be significantly enriched with NK-cell genes: M237 ( n =471 genes; FDR<0.1, 499 hypergeometric test; Fig. 6a ), M1164 ( n =396 genes

M1311 ( n =14 genes; FDR<0.1, hypergeometric test), and M1540 ( n =226 501 genes; FDR<0.1, hypergeometric test; Fig. 6c

Human Gene Atlas), demonstrating their specificity in the NK cell 506 function. Moreover, these three modules relate to different stages of NK cell activation. 507 M237 is enriched with GO/KEGG terms including 'mRNA processing

Rap1 512 signalling pathway') key for NK cell activation (adjusted P <0.1; Fig. 6e , Supplementary 513 Tables 14 and 15 ). M1540 is highly enriched with GO/KEGG terms linked to type I 514 interferon signalling (e.g., 'type I interferon signalling pathway (GO:0060337)' and 515 'Antigen processing and presentation') (adjusted P <0.1; Fig. 6f , Supplementary Tables 516 14 and 15 ). In summary, the functional enrichment of M237, M1164, and M1540 genes 517 includes extracellular, cytoplasmic, and nuclear processes necessary for NK cell 604 genes are mapped using single-cell multiomic profiling

Mendelian randomization (Panel 5), and transcriptome analysis (Panel 6) validate the 606 functional importance of RefMap genes, particularly for NK cells

Jaccard index. f , RefMap regions overlap significantly with COVID-19-associated 614 genetic variation in an independent COVID-19 GWAS study. cCRE: candidate 615 cis-regulatory element

Figure 2. Severe-COVID-19-associated common variants are linked to NK cell 617 function

Enrichment was 619 calculated as the proportion of total SNP-based heritability adjusted for SNP number. b , 620 Proportion of SNP-based heritability associated with risk genes identified using RefMap 621 or conventional methodology. c , d , e , Significant Mendelian randomization results for 622 three exposures linked to severe COVID-19

CD314+ NK cells. f , Sensitivity analyses and 624 robust tests for MR analyses ( Methods ). g , Comparative gene expression analysis of 625 NK-cell RefMap genes in NKG2D+ and NKG2D-NK cells. Fold change was calculated 626 as the ratio of gene expression levels in NKG2D+ NK cells to NKG2D-NK cells. The each group

The red dashed line denotes the median value of fold

Figure 3. Functional enrichment and transcriptome analyses of RefMap COVID-19

GO) terms that are significantly enriched in cell-type-specific RefMap 635 gene lists corresponding to hematopoietic cell types; only terms with adjusted P <0.05, 636 OR>3, and character number<60 are visualized. b , KEGG Pathways that are genes for each cell type ( Methods ). Violin plots show the distributions of log 642 expression levels within each group, and point plots indicate the median and IQR. d , 643 Overlap between cell-type-specific

ChIP-seq peaks from ENCODE lung and immune cell samples. Z -scores calculated by 645 regionR 49 (1,000 permutations) were normalized into the 0-1 range for visualization. e , f , 646 Comparative gene expression analysis of cell-type-specific RefMap genes in severe 647 COVID-19 patients versus moderately affected patients based on scRNA-seq datasets 648 from ( e ) Ren et al. and ( f ) Liao et al., respectively. The Z -score of Wilcoxon rank-sum each group

Enrichment analysis of cell-type-specific RefMap COVID-19 different technologies: 661 whole-exome sequencing (WES) and whole-genome sequencing (WGS) (Panel 1)

Variants are annotated using ANNOVAR (Panel 2) and encoded as input for the PULSE 663 model

Receiver operating characteristic (ROC) curves of different models, including PULSE, 668 age+sex, and integrative models, in the 5-fold cross-validation. Solid lines represent the 669 mean values, and the grey area indicates the standard errors. d , Summary statistics of 670 the VA COVID-19

PULSE, age+sex, and integrative models. f , Comparison of prediction sensitivity 673 between PULSE and age+sex models. EHRs: electronic health records

Transcriptome analysis of PULSE COVID-19 genes

Analysis of convergence between PULSE and RefMap COVID-19 genes. The

Z -scores were calculated per cell-type by Wilcoxon rank-sum test of the difference in 677 PULSE weights between RefMap genes and the background transcriptome

Gene expression analysis of PULSE and IQR. c , d , Comparative gene expression analysis of cell-type-specific 684 PULSE genes in severe COVID-19 patients versus moderate patients based on 685 scRNA-seq datasets from ( c ) Ren et al. and ( d ) Liao et al., respectively. The Z -score of changes within each group

Q1−1.5×IQR, and Q3+1.5×IQR. *: FDR<0.1. +: FDR<0.01

including ( a ) M237, ( b ) M1164, and ( c ) M1540, are 693 significantly enriched with NK-cell genes identified in either common or rare variant 694 analysis. Blue nodes represent NK-cell genes and yellow nodes indicate other genes 695 within each module

Violin plots show the distributions of log expression levels 701 within each group, and boxplots indicate the median, IQR, Q1−1.5×IQR, and 702 Q3+1.5×IQR. The red dashed line indicates the median expression level of the 703 transcriptome. h, i, Comparative gene expression analysis of module genes in severe 704 COVID-19 patients versus moderate patients based on scRNA

respectively. The Z -score of Wilcoxon rank-sum test was , and boxplots indicate the median, IQR, Q1−1.5×IQR, and Q3+1.5×IQR. The red 709 dashed line indicates the median expression change of the transcriptome

Allele Z -scores were calculated as Z = b / se , where b and se are effect size and standard 714 error, respectively, as reported by the COVID-19 GWAS 7 (COVID-19 Host Genetics 715 Initiative, Release 5, phenotype definition A2, EUR only) where the sample age, sex, 716 and ancestry information were included as covariates. Given Z -scores and lung 717 snATAC-seq peaks, we aim to identify functional genomic regions in which the Z -score disequilibrium (LD) blocks, where each LD block contains J k ( k = 1

regions and each region harbors I j,k ( j = 1 , ..., J k , I j,k >0) SNPs, the Z -scores follow a 721 multivariate normal distribution

the j -th region of the k -th block is denoted as z i,j,k 724 ( i = 1 , ..., I j,k ) and u k are the effect sizes that can be expressed as 725 gene body (i.e., the region up to 10kb either side of the 789 annotated gene body) overlapped with any of the cell-type-specific regions. To get the 790 final gene lists, RefMap genes were further filtered based on their expression levels. In 791 particular, with the lung snRNA-seq data 33 , we defined expressed genes in each cell 792 type as those with Seurat 74 log-normalized value>0.6931

non-adult ones, and defined non-developmental genes (nDG) as those with FC>1

Only RefMap genes that were identified as expressed and non-developmental in each 798 cell type were kept for downstream analysis

COVID-19 (A1), hospitalized COVID-19 (B2), and COVID-19 overall (C2), respectively

Heritability partitioning within genes identified by traditional methods and within 810 cell-type-specific RefMap genes was performed as previously described 76 . Briefly, for all 811 gene lists, we examined the proportion of total SNP-based heritability carried by SNPs

On the sample level, we (1) computed inbreeding 932 coefficients (Fhat1, Fhat2, and Fhat3 in GCTA 97 ) and removed genomes that resided 933 more than 3 standard deviation from the mean; (2) computed identity-by-descent (IBD) singleton calls, SNV count

On the variant level, we (1) removed multiallelic sites; (2) kept variants in autosomes

940 (3) removed variants on blacklisted regions, compiled by the ENCODE Project

Phase 4); (4) removed variants identified other than ''PASS,'' such as ''low 942 quality

For deriving high-quality variants for downstream analysis, we removed 946 samples with kinship >0.03, sample call rate <0.97, or mean sample coverage <=18X. to have genotype quality (GQ) >=20, and for non-reference calls, sufficient portion 951 (>0.9) of reads was required to cover the alternate alleles. In addition, we removed 952 genomic positions with cohort-wise call rate <0.95 and computed Hardy-Weinberg 953 equilibrium (HWE), which was required to be <1e-05 for common variants and <1e-06 954 for rare variants. With all these filtering completed, we assessed the sample-level 955 genomic parameters, such as Ti/Tv ratios, het/hom ratios, and number of super populations

Phase 963 3) and inferred the ancestry for each genome. For the GEN-COVID cohort, samples 964 with >90% EUR ancestry fraction were kept in the discovery cohort

results ), where only semi-significant genes 992 ( P <1e-03, REGENIE 103 ) were available. Data consists of exome-wide association 993 studies of various COVID-19 outcomes across 662,403 individuals (11,356 with 994 COVID-19) aggregated from four studies: UK Biobank (UKB; n =455,838)

Fisher's exact test, assuming a background of 19,396 coding genes in the genome 1001 which is the total number profiled by Regeneron 12

The PULSE model

Given the variant annotations from ANNOVAR, we calculated 1004 gene-level mutation profiles for each individual. Here we only focused on rare 1005 nonsynonymous and splicing-site SNVs as well as frameshift and splicing-site indels

Phase 1007 3) samples. For nonsynonymous and splicing-site SNVs, we calculated the 1008 accumulative mutation burdens for each gene based on individual annotations (32 in 1009 total

where the sample size or feature 1079 dimension is large. Here, stochastic variational inference (SVI) 106 was used to scale up 1080 our model for the large amount of genome data. In fact, borrowing the idea from 1081 stochastic optimization, we can update parameters per epoch by using only one or a 1082 mini-batch of samples instead of the whole dataset. Specifically, with SVI we first 1083 calculated the natural gradient of ELBO with respect to the variational parameter whose 1084 update rule contains sample points. Thanks to the conditional conjugacy predefined in 1085 our model, the natural gradient enjoys a simple form (see Supplementary Notes for 1086 details). Then based on stochastic optimization, we sampled a minibatch and rescaled 1087 the term involving sample points, resulting in a noisy but cheaply computed and 1088 unbiased natural gradient. At last, the variational parameter was updated from this 1089 gradient according to the gradient-based optimization algorithm 107 . This SVI update can 1090 be easily embedded into CAVI without many changes

We integrated all above techniques into our VI algorithm. Details on the update rules for 1095 both local and global variational parameters and the VI algorithm are provided in 1096 Supplementary Notes

The exact Bayesian prediction for test samples needs to integrate out 1098 all hidden variables, which is computationally intense and usually not necessary. Here, 1099 we adopted maximum a posteriori (MAP) and predicted new coming sample by 1100

Similarly, the importance weights for individual genes (referred to as PULSE gene

To eliminate the bias caused by hub proteins, we first 1108 carried out the random walk with restart algorithm 108 over the PPI network, wherein the 1109 restart probability was set to 0.5, resulting in a smoothed network after retaining the top 1110 5% predicted edges. To decompose the network into different subnetworks/modules, we 1111 performed the Leiden algorithm 62 , a community detection algorithm that searches for 1112 densely connected modules by optimizing the modularity

1114 Supplementary Table

Transcriptome analysis 1116 Four single-cell RNA-seq datasets were used in the transcriptome analyses, including 1117 human healthy lungs 33 , 39 and COVID-19 patients 22,23 . Data after QC was acquired for 1118 each study

6931 was used to define expressed genes in the 1120 transcriptome throughout the study if not specified. For the disease samples, we 1121 removed the overlap of severe patients between the two cohorts 22,23 . In the comparative 1122 expression analysis of severe versus moderate patients, to stabilize the analysis we 1123 estimated the change of gene expression levels using the Z -score estimated from 1124 Wilcoxon rank-sum test

the Italian multicenter study aimed at 1140 identifying the COVID-19 host genetic bases. Specimens were provided by the 1141 COVID-19 Biobank of Siena, which is part of the Genetic Biobank of Siena, member of 1142 BBMRI-IT

it ) for its support. We thank private donors for the support 1146 provided to A.R. (Department of Medical Biotechnologies, University of Siena) for the 1147 COVID-19 host genetics research project (D.L n

We also thank Intesa San Paolo for the 2020 charity 1152 fund dedicated to the project "N. B/2020/0119 Identificazione delle basi genetiche 1153 determinanti la variabilità clinica della risposta a COVID-19 nella popolazione italiana" 1154 and "Bando FISR 2020

conceived and designed the study. S.Z. contributed to the 1157 design, implementation, training and testing of RefMap and PULSE

were 1159 responsible for data acquisition

drafted the manuscript 1163 with assistance from all authors. All authors meet the four ICMJE authorship criteria, 1164 and were responsible for revising the manuscript, approving the final version for 1165 publication

Mendelian randomization for COVID-19 GWAS with phenotypes B2 and C2

plot of P -value distribution for SKAT analysis on the GEN-COVID cohort

Sex distributions of the GEN-COVID and VA cohorts after sample filtering 1189 Supplementary Figure 9

1190 Normalized rank versus specificity of PULSE prediction for the VA COVID-19 cohort 1191 Supplementary Figure 10

1192 Normalized rank versus specificity and sensitivity of PULSE prediction for GEN-COVID 1193 non-EUR samples

Partitioned heritability analysis by LDSC for COVID-19 GWAS phenotypes A2, B2, and 1204 C2

); these data are not publicly available due to US Government and 1237 Department of Veteran's Affairs restrictions relating to participant privacy and consent

All other data used in this study are available from the original studies

An interactive web-based dashboard to track 1245 COVID-19 in real time

Scoring systems for predicting mortality for severe patients with 1247 COVID-19

Predictive indicators of severe COVID-19 independent of comorbidities 1249 and advanced age: a nested case-control study

& The COVID-19 Host Genetics Initiative

Host Genetics Initiative, a global initiative to elucidate the role of host genetic 1253 factors in susceptibility and severity of the SARS-CoV-2 virus pandemic

Trans-ancestry analysis reveals genetic and nongenetic 1256 associations with COVID-19 susceptibility and severity

Genomewide Association Study of Severe Covid-19 with Respiratory Failure

& Others. Mapping the human genetic architecture of 1261 COVID-19 by worldwide meta-analysis

Genetic mechanisms of critical illness in COVID-19

Initial whole-genome sequencing and analysis of the host genetic 1265 contribution to COVID-19 severity and susceptibility

Clinical and molecular characterization of COVID-19 hospitalized 1267 patients

Analysis of ACE2 genetic variants by direct exome sequencing in 1269 99 SARS-CoV-2 positive patients

A catalog of associations between rare coding variants and 1271 COVID-19 outcomes. medRxiv (2021)

Clinical features of patients infected with 2019 novel coronavirus in 1273

Immune determinants of COVID-19 disease presentation and severity

COVID-19: consider cytokine storm syndromes and 1277 immunosuppression

Deep immune profiling of COVID-19 patients reveals distinct 1279 immunotypes with therapeutic implications

B Cell Subsets as Severity-Associated Signatures in 1281 COVID-19 Patients

Longitudinal analyses reveal immunological misfiring in severe 1283 COVID-19

Systems biological assessment of immunity to mild versus 1285 severe COVID-19 infection in humans

Single-cell landscape of immunological responses in patients 1287 with COVID-19

The cellular immune response to COVID-19 deciphered by single cell multi-omics across three UK centres

Single-cell landscape of bronchoalveolar immune cells in patients 1291 with COVID-19

COVID-19 immune features revealed by a large-scale single-cell 1293 transcriptome atlas

A molecular single-cell lung atlas of lethal COVID-19

COVID-19 tissue atlases reveal SARS-CoV-2 pathology and 1297 cellular targets

SARS-CoV-2 Orf6 hijacks Nup98 to block STAT nuclear import and 1299 antagonize interferon signaling

Imbalanced Host Response to SARS-CoV-2 Drives 1302 Development of COVID-19

Deletion of the NKG2C receptor encoding KLRC2 gene and 1304 HLA-E variants are risk factors for severe COVID-19

Diverse Functional Autoantibodies in Patients with COVID-19

Natural killer cell immunotypes related to COVID-19 disease 1309 severity

Solid Organ Transplantation: A Review Article

Genome-wide 1313 Identification of the Genetic Basis of Amyotrophic Lateral Sclerosis

Single-cell multiomic profiling of human lungs reveals 1315 cell-type-specific and age-dynamic control of SARS-CoV2 host genes

LD Score regression distinguishes confounding from 1318 polygenicity in genome-wide association studies

A single-cell and spatial atlas of autopsy tissues reveals 1320 pathology and cellular targets of SARS-CoV-2. bioRxiv

Mendelian Randomization for Strengthening Causal Inference in 1323 Observational Studies

The genetic architecture of the human immune system: a 1326 bioresource for autoimmunity and disease pathogenesis

Roles of the NKG2D immunoreceptor and its ligands

A molecular cell atlas of the human lung from single-cell RNA 1330 sequencing

Enrichr: a comprehensive gene set enrichment analysis web 1332 server 2016 update

1335 CD16 surface antigens takes place by different mechanisms. Involvement of the 1336 phospholipase D pathway in tumor necrosis factor alpha synthesis

Signal transduction during activation and inhibition of 1339 natural killer cells

Natural killer cells in HIV-1 1341 infection and therapy

Killer Cell Function Is Well Preserved in Asymptomatic Human Immunodeficiency 1344

HIV-2) Infection but Similar to That of HIV-1 Infection When CD4

SARS-CoV-2 Cell Entry Depends on ACE2

Marked T cell activation, senescence, exhaustion and skewing 1349 towards TH17 in patients with COVID-19 pneumonia

Pericyte-specific vascular expression of SARS-CoV-2 receptor ACE2 -1352 implications for microvascular inflammation and hypercoagulopathy in COVID-19

Expanded encyclopaedias of DNA elements in 1355 the human and mouse genomes

regioneR: an R/Bioconductor package for the association analysis of 1357 genomic regions based on permutation tests

Optimal unified approach for rare-variant association testing with 1360 application to small-sample case-control whole-exome sequencing studies

ACE2 gene variants may underlie interindividual variability and 1363 susceptibility to COVID-19 in the Italian population

Employing a systematic approach to biobanking and analyzing 1366 clinical and genetic data for advancing COVID-19 research

ANNOVAR: functional annotation of genetic 1369 variants from high-throughput sequencing data

A global reference for human genetic 1372 variation

Combining effects from rare and common genetic variants in an 1374 exome-wide association study of sequence data

Are rare variants responsible for susceptibility to complex diseases?

The allelic architecture of human disease genes: 1379 common disease-common variant… or not?

Decoding the Genomics of Abdominal Aortic Aneurysm

Gene-Environment Interaction in the Era of 1384 Precision Medicine

STRING v11: protein-protein association networks with 1386 increased coverage, supporting functional discovery in genome-wide experimental 1387 datasets

Genome-wide prediction and functional characterization of the 1389 genetic basis of autism spectrum disorder

From Louvain to Leiden: guaranteeing 1391 well-connected communities

Signals of hope: gauging the impact of a rapid 1393 national vaccination campaign

Covid-19: variants and vaccination

Factors associated with hospital admission and critical illness 1397 among 5279 people with coronavirus disease

IL-12-producing monocytes and HLA-E control HCMV-driven 1400

NKG2C+ NK cell expansion

Decoding the patterns of self and nonself by the 1402 innate immune system

COVID-19 severity correlates with airway epithelium-immune cell 1407 interactions identified by single-cell analysis

A Modified Cholesky Algorithm Based on a 1411 Symmetric Indefinite Factorization

Variational learning for rectified factor analysis

Variational Inference: A Review for 1416

Integrated analysis of multimodal single-cell data

BEDTools: a flexible suite of utilities for comparing 1421 genomic features

Partitioning heritability by functional annotation using 1423 genome-wide association summary statistics

Genomic atlas of the human plasma proteome

Connecting genetic risk to disease end points through the human blood plasma proteome

Assessment of Bidirectional Relationships Between Physical 1429 Activity and Depression Among Adults: A 2-Sample Mendelian Randomization 1430

Evaluation of the causal effects between subjective wellbeing 1432 and cardiometabolic health: mendelian randomisation study

Physical exercise is a risk factor for amyotrophic lateral sclerosis: 1435 Convergent evidence from mendelian randomisation, transcriptomics and risk 1436 genotypes

LDlink: a web-based application for exploring 1440 population-specific haplotype structure and linking correlated alleles of possible 1441 functional variants

Mendelian randomization: avoiding the downsides of a powerful, widely applicable 1444 but potentially fallible technique

Interpreting findings from Mendelian randomization 1446 using the MR-Egger method

Guidelines for performing Mendelian randomization 1448 investigations

from weak instruments in Mendelian randomization studies

Invited Commentary: Detecting Individual 1453 and Global Horizontal Pleiotropy in Mendelian Randomization-A Job for the 1454

Humble Heterogeneity Statistic?

Detection of widespread horizontal 1457 pleiotropy in causal relationships inferred from Mendelian randomization between 1458 complex traits and diseases

Assessing the suitability of summary data for two-sample 1460

Mendelian randomization analyses using MR-Egger regression: the role of the I2 1461 statistic

MAGMA: generalized 1464 gene-set analysis of GWAS data

Fast and accurate long-read alignment with Burrows-Wheeler 1466 transform

Scaling accurate genetic variant discovery to tens of thousands of 1468 samples

Million Veteran Program: A mega-biobank to study genetic 1470 influences on health and disease

Phenome-wide association of 1809 phenotypes and COVID-19 1472 disease progression in the Veterans Health Administration Million Veteran Program

Functional equivalence of genome sequencing analysis 1475 pipelines enables harmonized variant calling across human genetics projects

GCTA: a tool for 1478 genome-wide complex trait analysis

Enhancements to the ADMIXTURE algorithm for 1480 individual ancestry estimation

0: A One-Stop Database of 1482 Functional Predictions and Annotations for Human Nonsynonymous

In silico prediction of splice-altering single 1485 nucleotide variants in the human genome

RegSNPs-intron: a computational framework for predicting pathogenic 1488 impact of intronic single nucleotide variants

On efficient and accurate calculation of 1490 significance P-values for sequence kernel association testing of variant set

Computationally efficient 1493 whole genome regression for quantitative and binary traits

GTEx Consortium. Genetic effects on gene expression across 1495 human tissues

Spike and Slab Variational Inference for 1497

Multi-Task and Multiple Kernel Learning

Stochastic variational inference

A Stochastic Approximation Method

Exploiting ontology graph for 1505 predicting sparsely annotated gene function

Pattern Recognition and Machine Learning (Information Science and Statistics

Variational inference: A review for statisticians

Stochastic variational inference

Algorithm 1: Stochastic MFVI for PULSE Input : Model p, hyperparameters Θ and learning rate t

Initialize variational parameters

Randomly split the dataset into N/B mini-batches D 1:N/B . 4 for i = 1 : N/B do 5 1. Update local variational parameters ξ 1:N based on Eq

Based on mini-batch D i , update φ 1 , φ 2 , ψ 1j and ψ 2j according to Eqs. 26, 27, 30 and 31, respectively, and then update the corresponding global variational parameters based on Eqs

Update other global variational parameters according to Eqs