key: cord-0267524-strj8e96
authors: Nakayama, Jun; Yamamoto, Yusuke
title: Single-cell meta-analysis of cigarette smoking lung atlas
date: 2021-12-10
journal: bioRxiv
DOI: 10.1101/2021.12.09.472029
sha: 993974e6db72f9c0703a8db85ea80c77cbf4a494
doc_id: 267524
cord_uid: strj8e96

Single-cell RNA-seq (scRNA-seq) technologies have been broadly utilized to reveal the molecular mechanisms of respiratory diseases and physiology at single-cell resolution. Here, we constructed a cigarette smoking lung atlas by integrating data from 8 public datasets, including 104 lung scRNA-seq samples with patient state information. The cigarette smoking lung atlas generated by this single-cell meta-analysis (scMeta-analysis) revealed early carcinogenesis events and defined the alterations of single-cell gene expression, cell population, fundamental properties of biological pathways, and cell–cell interactions induced by cigarette smoking. In addition, we developed two novel scMeta-analysis methods incorporating clinical metadata: VARIED (Visualized Algorithms of Relationships In Expressional Diversity) and AGED (Aging-related Gene Expressional Differences). VARIED analysis revealed the expressional diversity associated with smoking carcinogenesis in each cell population. AGED analysis revealed differences in gene expression related to both aging and smoking states. Our scMeta-analysis provided new insights into the effects of smoking and into cellular diversity in the human lung at single-cell resolution.

Smoking is the leading risk factor for early death, and its negative effects present individual and public health hazards (1, 2). Cigarette smoke is a mixture of thousands of chemical compounds generated from tobacco burning (3) that causes chronic airway inflammation, reactive oxygen species (ROS) production, and DNA damage.
Specifically, smoking injures the respiratory organs and cardiovascular system and causes carcinogenesis, chronic obstructive pulmonary disease (COPD), and atherosclerosis (4). In particular, the incidence of lung squamous carcinoma is significantly increased by cigarette smoking (5, 6).

Single-cell RNA-seq (scRNA-seq) technologies have been broadly utilized to reveal the molecular mechanisms of respiratory diseases and physiology at single-cell resolution. scRNA-seq in human lungs identified novel cell populations and cellular diversity (7-13). However, there are several concerns regarding scRNA-seq analysis. One concern is sample size: clinical scRNA-seq analyses can be biased by insufficient sample sizes. A possible solution is meta-analysis of scRNA-seq data. The recently developed single-cell meta-analysis (scMeta-analysis) method has been considered a powerful tool for large-scale analysis of integrated single-cell cohorts. The scMeta-analysis shows robust statistical significance and the capacity to compare the results among different studies at the single-cell level. In fact, integrated scMeta-analysis of a number of cohorts has revealed a previously unappreciated diversity of cell types and gene expression; for example, scMeta-analysis of lung endothelial cells, including human and mouse datasets, revealed novel endothelial cell populations (14-17). In addition, comparative analysis of scRNA-seq cohorts revealed pan-cancer tumor-specific myeloid lineages (18).

In this study, we integrated 8 publicly available datasets comprising 104 lung scRNA-seq samples and analyzed a total of 257,663 single cells to construct a cigarette smoking lung atlas. The scMeta-analysis of the cigarette smoking lung atlas defined single-cell gene expression according to smoking, age, and gender.
In addition, we developed novel scMeta-analysis methods: VARIED (Visualized Algorithms of Relationships In Expressional Diversity) analysis and AGED (Aging-related Gene Expressional Differences) analysis with clinical metadata. VARIED analysis revealed the diversity of gene expression associated with cancer-related events in each cell population, and AGED analysis revealed the expressional differences in relation to both aging and smoking states.

According to scRNA-seq collection criteria (see methods), we chose 8 publicly available datasets of lung scRNA-seq data to construct a cigarette smoking lung atlas (Figure 1A). To this end, we collected data from 374,658 single cells from 104 scRNA-seq samples (smoker: 55 samples, never-smoker: 49 samples, Figure 1A). In the process of quality control with Seurat in R, 116,995 low-quality single cells (nFeatures < 10^3 or mt.percent > 20%) were removed. Integration of the 8 datasets was performed by the Harmony algorithm with the smoking states of scRNA-seq samples (19) (Supplementary Figure S1A). Integrated single-cell transcriptome data were linked with clinical metadata such as smoking states, age, gender, and race (Supplementary Table S1). Comparison of the atlases by smoking states revealed that most of the cell populations in the UMAP plot overlapped; however, parts of epithelial clusters were specific to the never-smoker group (Supplementary Figure S2A). To confirm that the integration of the 8 datasets reduced bias, we showed the atlas marked with the datasets (Figure 1D). All major clusters seemed to overlap among the 8 datasets (Supplementary Figure S2B), although the populations of cells were different in each dataset (Figure 1E). This difference in cell populations could be caused by differences in tissue collection and cell isolation processes.
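The quality-control step described above can be sketched in a few lines. This is a minimal illustration in Python rather than the authors' Seurat/R code; the function name `passes_qc` and its parameter names are hypothetical, and the thresholds are taken from the retention criteria stated in the Methods (nFeature_RNA > 1000 and percent.mt < 20). The last lines simply check the cell counts reported in the text.

```python
# Minimal QC sketch (assumption: not the authors' code, which used Seurat in R).
# A cell is retained only if it has more than 1000 detected features and
# less than 20% mitochondrial reads.

def passes_qc(n_features, percent_mt, min_features=1000, max_percent_mt=20.0):
    """Return True if a cell satisfies both retention criteria."""
    return n_features > min_features and percent_mt < max_percent_mt


# Toy cells as (n_features, percent_mt); the values are invented for illustration.
toy_cells = [(2500, 4.2), (800, 3.0), (1800, 35.0), (1200, 19.9)]
retained_cells = [cell for cell in toy_cells if passes_qc(*cell)]

# Arithmetic check of the counts reported in the text.
collected = 374_658   # single cells collected from 104 samples
removed = 116_995     # low-quality cells removed during QC
retained = collected - removed
print(len(retained_cells), retained)
```

The difference between the collected and removed cell counts matches the 257,663 cells reported as analyzed in the atlas.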
In the atlas with all cell types (Figure 1B), we first identified the cell types present within the atlas according to the lung cell markers in the human lung scRNA-seq atlas (7) (Supplementary Figure S3). To investigate the cell types in further detail, we extracted subsets of "epithelia", "fibroblasts", "endothelia", "lymphoids", and "myeloids" and repeated the UMAP procedure with each subset, which comprised 44 subpopulations in total (Figure 2A). For example, the number of basal lineage cells decreased (20), and the number of basophils increased (21) in smoking lungs. The atlas showed differences in the numbers of 44 cell subpopulations by smoking states (Figure 2C). Evidently, the cell numbers of basal, basal-proximal (px), ionocyte, mucous, proliferating epithelia, and tracheal basal clusters significantly decreased. Previous bulk studies have reported that the number of bronchial epithelial cells is altered by smoking (9, 20, 22). Consistent with these reports, our data confirmed that smoking had a devastating effect on epithelial cells in the bronchus and bronchiole. On the other hand, the numbers of alveolar type 1 cells (AT1), alveolar fibroblasts, adventitial fibroblasts, B cells, CD4+ memory/effector T cells, CD8+ T cells, natural killer (NK) cells, NK T cells (NKT), and basophils significantly increased. Previously, the number of basophils infiltrating lung tissue has been reported to increase in COPD models, and basophils contribute to emphysema formation by cytokine production in the early phase of COPD (21). The atlas confirmed the increase in basophil cell number with smoking. We also examined the cell cycle in each cell cluster. The cell cycle indices in each subpopulation were not obviously changed between the smoking and never-smoking groups (Supplementary Figure S9A and B).

Cigarette smoking is the highest risk factor for carcinogenesis of squamous carcinoma in the bronchi and trachea of the lung (2, 5).
To comprehensively understand the effects of smoking in the lung, we developed VARIED (Visualized Algorithms of Relationships In Expressional Diversity) analysis to quantify the alteration in gene expressional diversity. VARIED analysis is based on graph-theoretic network centrality computed on a correlation network of single cells (23). The differences in the centrality between smokers and never-smokers represent the alteration of gene expressional diversity in each cell cluster (Figure 3A). VARIED analysis revealed greater diversity in epithelial clusters, suggesting that cigarette smoking primarily perturbed epithelial populations, particularly in the bronchi and trachea (Figure 3B and 3C). These data are consistent with the fact that epithelial cells located at the bronchi are considered to be the origin of lung squamous carcinoma (24). Interestingly, the diversity in basophils was also remarkably altered by cigarette smoking. To examine the molecular basis for diversity in gene expression, we extracted differentially expressed genes (DEGs) in the basal-px cluster between smokers and never-smokers, focusing on basal-px because this cluster was the most influenced by cigarette smoking (Figure 3B, Supplementary Table S3). Enrichment analysis of the DEGs revealed that cancer-related categories were significantly enriched in the smoker basal-px cluster (Figure 3D and E, Supplementary Table S4). The cigarette smoking lung atlas and VARIED analysis confirmed the early oncogenic events in bronchial and tracheal epithelial cells. Our data indicate that smoking adversely affects bronchial epithelial cells and alters gene expressional diversity in carcinogenesis.

As the cigarette smoking lung atlas provided high-resolution expression data in 44 cell types, we explored gene expression profiles from a genome-wide association study (GWAS) of lung squamous carcinoma with smoking (25).
To identify the expressional patterns and the broad contributions of different lung cell types to squamous carcinoma susceptibility, the average expression levels of 92 GWAS genes were examined in all lung cell types (Supplementary Figure S10A). High expression of squamous carcinoma GWAS genes was observed in specific clusters, and cigarette smoking affected the expression of GWAS-related genes in some clusters. In particular, the expression of MUC1 was increased in the smoker epithelial clusters (Supplementary Figure S10B), and the expression of HLA-A was increased in the smoker myeloid clusters (Supplementary Figure S10C). Mutated MUC1 has oncogenic roles in carcinogenesis in the human lung (26, 27). Truncating mutations in HLA-A carry a risk of dysregulation of cancer-related pathways (28).

We also examined the effect of gender differences on gene expression at single-cell resolution in all epithelial clusters (Supplementary Figure S11A). As a first step, we analyzed the cell cycle distribution in males and females in the smoker group. The results showed almost no difference in cell cycle state between males and females; however, we found subtle differences. For example, the female basal-px cluster exhibited an increased S/G2M index ratio compared to the male basal-px cluster; in contrast, the tracheal basal-px cluster in males exhibited an increase in the S index ratio compared to that in females (Supplementary Figure S11B). Next, we performed pathway enrichment analysis to identify the differences in epithelial clusters between male and female smokers. There were differences between males and females; however, the gender-specific alterations were shared across the epithelial clusters rather than confined to particular clusters (Supplementary Figure S11C).
Given that cigarette smoking has a significant impact on carcinogenesis in bronchial epithelial cell clusters, we next focused on the alteration of cancer-associated fibroblasts (CAFs) and tumor endothelial cells (TECs). These types of cells are well known to contribute to tumor malignancy (29-31). We examined the expression of marker genes such as ACTA2, PDPN, and COL1A1 in CAFs and COL18A1, COL4A1, and COL4A2 in TECs by smoking states (Figure 4A and 4B). A typical CAF marker, ACTA2, was significantly induced in the adventitial fibroblast, alveolar fibroblast, and myofibroblast clusters in the smoker group (Figure 4A, top panel). Likewise, other CAF markers such as PDPN and COL1A1 were also significantly upregulated in the adventitial fibroblast, alveolar fibroblast, and myofibroblast clusters in the smoker group (Figure 4A, middle and bottom panels). Additionally, TEC markers such as COL18A1, COL4A1, and COL4A2 were increased in several endothelial cell clusters (Figure 4B). For further investigation of CAF marker expression, we divided the smoker adventitial fibroblast cluster into a high-ACTA2 group and a low-ACTA2 group and analyzed the DEGs between them (Supplementary Figure S12A). The DEG analysis showed that collagen family and SPARC expression increased in the smoker high-ACTA2 group (Supplementary Figure S12B). Likewise, DEG analysis was performed between an ANGPT2-high lymphatic group and an ANGPT2-low lymphatic group, and the results suggested that FABP4 was highly expressed in the ANGPT2-high group. FABP4 is a key regulator of tumor angiogenesis (32). These results suggested that transformation of cancer-associated stromal cells was induced in the early phase of carcinogenesis promoted by cigarette smoking.

Next, we performed module analysis with cancer-related gene sets, such as senescence, ROS production, IFN signaling, heme metabolism, and epithelial to mesenchymal transition (EMT) genes.
The module analysis depicted the alteration of cancer-related events by smoking in each cluster (Figure 4C). Several modules were drastically altered between the smoker and never-smoker groups, such as IFN signaling in endothelial and myeloid clusters; EMT in epithelial, fibroblastic, and endothelial clusters; and mitophagy in lymphoid and myeloid clusters. Because increased expression of EMT module genes in endothelial clusters was observed, we examined the expression of endothelial to mesenchymal transition (EndMT) marker genes (FN1, POSTN, VIM) (17, 33). These EndMT markers were significantly upregulated, suggesting that smoking induced EndMT in some endothelial clusters (Figure 4D top, Supplementary Figure S12E). Autophagy in immune cells is important for cellular immunity, differentiation and survival (34). The autophagy module was especially increased in NKT cells from lymphoid clusters and some myeloid clusters (Figure 4C).

From the module analysis, we observed increased IFN signaling throughout the lung cells. These data suggested that smoking produced chronic inflammation in the lung and prompted us to examine the interactions between epithelial and immune cells via inflammatory signaling. For this purpose, we performed cell-cell interaction (CCI) analysis using 7,200 interactions between interferon, interleukin, and chemokine family genes at single-cell resolution (Figure 5A). CXCL8 (interleukin 8: IL8) is produced by lymphocytes, endothelial cells, fibroblasts, and epithelial cells in the lung and has important roles in pulmonary diseases and cancers (36, 37). In epithelial-immune cell interactions, the CXCL8-interaction network was expanded by increased CXCL8 expression in club, goblet, and serous cells of the smoker group (Figure 5B). The CCI networks between epithelial and lymphoid cell clusters showed increased epithelial to lymphoid cluster interactions in smokers compared to never-smokers (Figure 5C top).
On the other hand, the lymphoid to epithelial cluster interactions showed smaller differences between groups (Figure 5C bottom, Supplementary Figure S13A left and S13B left). This result suggested that the epithelial to lymphoid interaction was mainly unidirectional, and it is consistent with the module analysis result that the IFN signaling module did not increase in the lymphoid clusters (Figure 4C). In contrast, epithelial-myeloid interactions (both "from epithelia to myeloid" and "from myeloid to epithelial") were clearly enhanced in the smoker group compared to the never-smoker group (Supplementary Figure S13A-C). Therefore, cigarette smoking enhanced the mutual interaction between epithelial and myeloid cells via inflammatory signaling.

As the majority of the samples in the atlas had patient age information, we aimed to identify aging-related genes associated with cigarette smoking (Figure 6A). We developed AGED (Aging-related Gene Expressional Differences) analysis based on regression analysis with single-cell transcriptome data (see methods). Briefly, by regressing gene expression on age separately in the smoker and never-smoker groups, we calculated the differences in slopes (Δ) for all genes in 44 cell clusters (Figure 6B). For selected genes that were obviously changed with advancing age between the smoker and never-smoker groups, the Δ values were plotted as AGED results in a heatmap (Figure 6C). These data showed that the lung surfactant proteins SFTPC and SFTPB decreased in secretory epithelial clusters with advancing age in smokers (Figure 6C and 6D left). These lung surfactant proteins maintain the activation of alveolar macrophages and promote recovery from injuries induced by smoking (38). Additionally, secretoglobins (SCGB3A1, SCGB3A2, and SCGB1A1) were also decreased with advancing age in smokers (Figure 6C and 6D middle and right).
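The slope difference Δ behind these AGED results can be illustrated with a small sketch. It is written in Python rather than the authors' R code and makes two assumptions that are not confirmed by the text: the regression is ordinary least squares of cluster-average expression on patient age, and Δ is taken as the smoker slope minus the never-smoker slope. The function names `ols_slope` and `aged_delta` and the toy values are hypothetical.

```python
# AGED-style slope difference, as a hedged sketch (not the authors' implementation).

def ols_slope(ages, expression):
    """Ordinary least-squares slope of expression regressed on age."""
    n = len(ages)
    mean_age = sum(ages) / n
    mean_expr = sum(expression) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    if sxx == 0:
        raise ValueError("ages must not all be identical")
    sxy = sum((a - mean_age) * (e - mean_expr) for a, e in zip(ages, expression))
    return sxy / sxx


def aged_delta(smoker_ages, smoker_expr, never_ages, never_expr):
    """Assumed sign convention: smoker slope minus never-smoker slope."""
    return ols_slope(smoker_ages, smoker_expr) - ols_slope(never_ages, never_expr)


# Toy data for one gene in one cluster (invented values).
smoker_ages, smoker_expr = [30, 40, 50, 60], [4.0, 3.0, 2.0, 1.0]  # falls with age
never_ages, never_expr = [30, 40, 50, 60], [2.0, 2.0, 2.0, 2.0]    # flat with age

delta = aged_delta(smoker_ages, smoker_expr, never_ages, never_expr)
print(delta)
```

Under this convention, a negative Δ corresponds to a gene whose expression declines with age more steeply in smokers than in never-smokers, which is the pattern described for the surfactant and secretoglobin genes.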
MALAT1 is a well-known lncRNA in lung cancer, and its expression contributes to malignancy (39). AGED analysis showed that MALAT1 expression increased in most cell types with advancing age in smokers (Figure 6C and 6E), suggesting that the oncogenic risk associated with MALAT1 increased with age. From the module analysis, heme metabolism was dysregulated in the lung (Figure 4C). The expression levels of FTL and FTH1 genes (ferritin) were significantly altered with advancing age in the smokers (Figure 6C). In the "CD68+ macrophage" and "macrophage" clusters, ferritin significantly increased with smoking and aging. In addition, the expression patterns of several mitochondrial genes were altered with advancing age in smokers (Supplementary Figure S14). The module analysis showed that the mitophagy, ferroptosis, and ROS production modules, which are related to mitochondrial dysfunction, were also altered by smoking (Figure 4C). AGED analysis confirmed that age-related mitochondrial dysregulation contributed to the progression of respiratory diseases. Collectively, the AGED analysis revealed changes in aging-related gene expression with smoking in each cell cluster.

In this study, we presented a human cigarette smoking lung atlas, generated via the meta-analysis of 104 samples from 8 public scRNA-seq datasets. Our integrated smoking atlas confirmed the alteration of gene expression in the lung at single-cell resolution and identified the early oncogenic events induced by cigarette smoking. Additionally, the novel VARIED and AGED analyses revealed cell type and gene expressional diversity with smoking and age. One of the significant contributions of this study is that the scMeta-analysis of integrated datasets identified expressional diversity in the early phase of lung squamous carcinoma at the single-cell level.
In fact, expression analysis following VARIED revealed early oncogenic signaling in epithelial, fibroblastic, and endothelial cells, expression changes in GWAS-related genes, and gender-dependent alterations in the smoking lung. In previous studies of the effects of smoking, genetic mutations in oncogenes and tumor suppressor genes were discovered (40-42). Bronchial epithelial cells from smokers have mutations in TP53, NOTCH1, FAT1, CHEK2, PTEN, ARID1A and other genes (40). Our atlas showed that survival AKT-mTOR signaling, mitochondrial dysregulation, and sirtuin signaling pathways were altered in bronchial basal cells by smoking (Supplementary Table S4). Mutations in PTEN contribute to the activation of AKT-mTOR signaling (43). FAT1 controls mitochondrial functions (44), and its mutations induce the dysregulation of mitochondria. Additionally, cigarette smoking promotes lung carcinogenesis by IKKβ- and JNK-dependent inflammation (45). DEG analysis of basal-px clusters indicated that JUN and FOS expression levels were increased in the smoker basal-px cluster (Supplementary Table S3). Our module analysis and CCI analysis showed enhancement of inflammatory signaling in the epithelial clusters. Furthermore, our results showed that sirtuin signaling was enhanced in bronchial epithelial cells in smokers. The atlas confirmed the signaling related to genetic mutations induced by smoking.

The first scMeta-analysis was performed to investigate severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-related genes by The Human Cell Atlas Lung Biological Network (14). Further scMeta-analyses were reported for endothelial cells in the human and mouse lung (15) and liver-specific immune cells (16), which revealed the alteration of cell populations and expressional heterogeneity with single-cell resolution.
Additionally, the study of pan-cancer scRNA-seq cohorts revealed heterogeneity in tumor-infiltrating myeloid cell composition and the functions of cancer-specific myeloid cells (18). scMeta-analysis is a powerful tool and strategy to overcome the problem of sample bias in small clinical cohorts. Additionally, our integrated datasets enabled us to perform single-cell analysis linked with clinical information in meta-cohorts such as AGED analysis, which identified aging-related gene expression with single-cell resolution. Furthermore, it revealed correlations in the alterations of gene expression associated with smoking and aging. Further scMeta-analyses incorporating additional clinical information will be helpful for understanding homeostasis and diseases.

Our study has limitations. First, differences in the tissue sampling and single-cell isolation methods generated bias in the cell populations used in this study. This bias could not be completely removed by computational normalization. In fact, our integrated datasets showed the differences in cell subpopulations in each dataset (Supplementary Figure S2B). Next, clinical information such as smoking states, gender, and age depended on the collection in the primary studies. The atlas has only a simple classification: smoker or never-smoker; we could not consider detailed smoking information such as the amount of smoking, years of smoking, and Brinkman index (Supplementary Tables S1 and S2). Additionally, patient age was significantly different between the smoker and never-smoker populations (Figure 6A). Moreover, clinical information such as age and gender was not available for all datasets. In the future, it will be necessary to expand the integrated dataset following the publication of new appropriate datasets for a more robust analysis.
The integrated atlas presented herein contributed to the characterization of the alterations caused by cigarette smoking that are related to carcinogenesis of lung squamous carcinoma. However, lung cancer also develops in never-smokers, in whom lung adenocarcinoma is predominant (5, 6). scMeta-analysis focused on lung adenocarcinoma in different clinical states has the potential to reveal the nature of genetic carcinogenesis. As a future study, the integration of scRNA-seq data from normal lungs (never-smokers) and lung adenocarcinoma could be a feasible approach to discover the mechanism of carcinogenesis and elucidate the cellular diversity in lung adenocarcinoma. In addition, clinical scRNA-seq and scMeta-analysis will be powerful tools in combination with data from pan-cancer multiomics analyses, such as those in The Cancer Genome Atlas (TCGA) (46, 47). Therefore, the integration of scMeta-analysis data with clinical and omics data paves the way for an in-depth understanding of the nature of cancer.

The scRNA-seq cohorts were downloaded from the public Gene Expression Omnibus (GEO) and European Genome-Phenome Archive (EGA) databases (Supplementary Table S1). We collected scRNA-seq samples of human lungs for which smoking state information was available. From physiological studies of the lung airway, all 10 never-smoker samples were extracted from the EGA00001004082 dataset (48), and 1 never-smoker and 3 smoker samples were extracted from the GSE130148 dataset (13). From idiopathic pulmonary fibrosis (IPF) studies, 5 never-smoker and 3 smoker samples were extracted from a total of 17 samples in the GSE122960 dataset (49), 1 never-smoker and 7 smoker samples were extracted from a total of 34 samples in the GSE135893 dataset (12), and 22 never-smoker and 23 smoker samples were extracted from a total of 78 samples in the GSE136831 dataset (11).
From studies of lung disease in smokers, 3 never-smoker and 3 smoker samples were extracted from the GSE123405 dataset (50), and 3 never-smoker and 9 smoker samples were extracted from the GSE173896 dataset (23). From lung cancer studies, 4 never-smoker and 7 smoker samples were extracted from a total of 58 samples in the GSE131907 dataset (51). A total of 104 samples (never-smoker: 49, smoker: 55) were collected, and the details of the extracted samples are shown in Supplementary Table S2.

These datasets were imported into R software version 3.6.3 and transformed into Seurat objects with the package Seurat version 3.2 (52). The Seurat objects from the different datasets were then integrated in R. The integrated dataset was subjected to normalization, scaling, and principal component analysis (PCA) with Seurat functions. Low-quality cells were removed from the merged dataset before batch effect removal; cells were retained according to the following criteria: nFeature_RNA > 1000 and percent.mt < 20. To remove the batch effect between cohort studies, the Harmony (version 1.0) algorithm was applied to the integrated datasets (19, 53) following the instructions in the Quick start vignettes (https://portals.broadinstitute.org/harmony/articles/quickstart.html). Clustering of neighboring cells was performed by the functions 'FindNeighbors' and 'FindClusters' from Seurat using Harmony reduction. First, the clusters were grouped based on the expression of tissue compartment markers (for example, EPCAM for epithelia, CLDN5 for endothelia, COL1A2 for fibroblasts, and PTPRC for immune cells) (Figure 1C and Supplementary Figure S3) and then annotated in detail according to "A molecular cell atlas of the human lung" (7). Cell cycle analysis was performed with the 'CellCycleScoring' function of Seurat.

To evaluate the expressional heterogeneity in the cell populations, we calculated the correlation coefficients for each cell population between smokers and never-smokers.
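The idea of scoring a cell population by centrality on a correlation network can be sketched as follows. This is only an illustration under explicit assumptions and not the authors' formula, which is not reproduced in this text: cells are nodes, the edge length between two cells is assumed to be 1 - |r| (r being Pearson's correlation of their expression profiles), distances are shortest-path lengths, and normalized closeness centrality is (n - 1) divided by the sum of distances to all other cells. All function names and toy values are hypothetical.

```python
import math

# Closeness centrality on a cell-cell correlation network (hedged sketch).
# Assumption: edge length = 1 - |Pearson r|; this choice is illustrative only.

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def normalized_closeness(profiles):
    """Normalized closeness centrality of every cell in one cluster."""
    n = len(profiles)
    dist = [[0.0 if i == j else 1.0 - abs(pearson(profiles[i], profiles[j]))
             for j in range(n)] for i in range(n)]
    # Floyd-Warshall: replace direct edge lengths by shortest-path distances.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    scores = []
    for i in range(n):
        total = sum(dist[i])
        scores.append((n - 1) / total if total > 0 else float("inf"))
    return scores


# Toy clusters: three cells by four genes each (invented values).
similar_cells = [[1, 2, 3, 4], [2, 4, 6, 8.5], [1, 2.1, 3, 4.2]]
varied_cells = [[1, 2, 3, 4], [4, 1, 3, 2], [2, 4, 1, 3]]

mean_similar = sum(normalized_closeness(similar_cells)) / 3
mean_varied = sum(normalized_closeness(varied_cells)) / 3
print(mean_similar > mean_varied)
```

In this toy setting, cells with strongly correlated profiles sit close together and receive high centrality, so comparing per-cluster centrality between two groups gives one number summarizing how much their expressional diversity differs.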
In each cluster, normalized closeness centrality was calculated in R as previously described (23, 54), where r is the absolute value of Pearson's correlation coefficient and n is the number of cells in the cluster.

Module analysis was performed by the 'AddModuleScore' function in Seurat using the [...]

We performed enrichment analysis against the marker gene list in each cluster between male and female smokers by the 'clusterProfiler' (55) and 'ReactomePA' (56) packages in R. Gene symbols were converted to ENTREZ IDs using the 'org.Hs.eg.db' package version 3.10.0. Pathway datasets were downloaded from the Reactome database. Pathway enrichment analysis was performed using the 'enrichPathway' function, with P values adjusted by the Benjamini-Hochberg (BH) method. Marker genes of the basal-px cluster in smokers and never-smokers were calculated by 'FindMarkers' with the MAST method (57). Enrichment analysis of basal-px was performed using QIAGEN Ingenuity Pathway Analysis software.

Gene-gene interactions, including ligand-receptor interactions, were obtained from the interaction database of the Bader laboratory at the University of Toronto (https://baderlab.org/CellCellInteractions#Download_Data). We selected the genes that were categorized as 'interferons', 'interleukins' and 'TNFSF superfamily' in the HUGO Gene Nomenclature Committee database (https://www.genenames.org/). In each subpopulation, we counted the cells with expression values greater than 2. Only subpopulations whose expressing cell ratio exceeded 10% were extracted for CCI network analysis, and the CCI score between epithelial and immune cell subpopulations in smokers and never-smokers was calculated as previously described (23), where L is the ligand subpopulation (ligand gene expression > 2), R is the receptor subpopulation (receptor gene expression > 2), and n is the cell number.

We calculated the average expression of all genes in each cluster in both smokers and never-smokers and performed regression analysis of gene expression against patient age in R.
Next, we calculated the differences in slopes (delta) between smokers and never-smokers from the regression analysis and extracted the genes with the highest delta to be shown in a heatmap.

The datasets GSE122960, GSE123405, GSE130148, GSE131907, GSE135893, GSE136831, and GSE173896 are available in the NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo/). The EGA00001004082 dataset is available in the EGA database (https://ega-archive.org/). The source code of scMeta-analysis and integrated datasets is available on GitHub (https://github.com/JunNakayama/scMeta-analysis-of-cigarette-smoking).

The dimensionality-reduced cell clustering is shown as a UMAP plot by the function 'RunUMAP'. Heatmaps were drawn by Morpheus from the Broad Institute. A ridge plot was drawn using the 'ggridges' package in R. Bubble plots and violin plots were drawn using the 'ggplot2' package in R. Sankey plots were drawn using the 'networkD3' package in R.

Correlation coefficients were calculated by Spearman correlation in R. Welch's t test or Tukey's or Dunnett's multiple comparison test was used for comparison of the datasets. Significance was defined as P < 0.05.

We are grateful to all members of the lab for stimulating discussions during the preparation of this manuscript.

The data used for the scMeta-analysis are available from the NCBI GEO database and the EGA database. Detailed information is shown in Supplementary Table S1.

The authors have declared that no conflict of interest exists.

D. Gene and pathway networks of marker genes for the basal-px cluster. The network plot was generated by IPA. E. Enrichment analysis of marker genes for the basal-px cluster. Significantly enriched pathways are shown based on IPA data.

Supplementary Table S1. A list of the 8 public datasets used for the atlas.
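As a consistency check on the dataset list, the per-dataset sample counts given in the Methods can be tallied. The sketch below is plain Python written for this purpose, not part of the authors' pipeline; the dictionary name `SAMPLES` is hypothetical, and the smoker count of 0 for EGA00001004082 is an assumption based on the statement that all 10 extracted samples were never-smokers.

```python
# Tally of extracted samples per dataset as (never-smoker, smoker),
# transcribed from the counts stated in the Methods.
SAMPLES = {
    "EGA00001004082": (10, 0),  # smoker count of 0 is inferred, see lead-in
    "GSE130148": (1, 3),
    "GSE122960": (5, 3),
    "GSE135893": (1, 7),
    "GSE136831": (22, 23),
    "GSE123405": (3, 3),
    "GSE173896": (3, 9),
    "GSE131907": (4, 7),
}

never_total = sum(never for never, _ in SAMPLES.values())
smoker_total = sum(smoker for _, smoker in SAMPLES.values())
print(len(SAMPLES), never_total, smoker_total, never_total + smoker_total)
```

The eight datasets sum to 49 never-smoker and 55 smoker samples, 104 in total, which agrees with the counts given in the Results.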
[Figure residue: gene labels from a three-panel (A, B, C) plot of GWAS-related genes, for example MUC1, HLA-A, TP63, TERT, CHRNA5, and CHEK2]

1. Smoking prevalence and attributable disease burden in 195 countries and territories, 1990-2015: a systematic analysis from the Global Burden of Disease Study
2. How cigarette smoke skews immune responses to promote infection, lung disease and cancer
3. Cigarette smoking and inflammation: cellular and molecular mechanisms
4. Systemic effects of smoking
5. Lung cancer in never smokers--a different disease
6. Genetic alterations defining NSCLC subtypes and their therapeutic implications
7. A molecular cell atlas of the human lung from single-cell RNA sequencing
8. Dissecting the cellular specificity of smoking effects and reconstructing lineages in the human airway epithelium
9. Characterizing smoking-induced transcriptional heterogeneity in the human bronchial epithelium at single-cell resolution
10. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte
11. Single-cell RNA-seq reveals ectopic and aberrant lung-resident cell populations in idiopathic pulmonary fibrosis
12. Single-cell RNA sequencing reveals profibrotic roles of distinct epithelial and mesenchymal lineages in pulmonary fibrosis
13. A cellular census of human lungs identifies novel cell states in health and in asthma
14. Single-cell meta-analysis of SARS-CoV-2 entry genes across tissues and demographics
15. Integrated Single-Cell Atlas of Endothelial Cells of the Human Lung
16. Creation of a Single Cell RNASeq Meta-Atlas to Define Human Liver Immune Homeostasis
17. An Integrated Gene Expression Landscape Profiling Approach to Identify Lung Tumor Endothelial Cell Heterogeneity and Angiogenic Candidates
18. A pan-cancer single-cell transcriptional atlas of tumor infiltrating myeloid cells
19. Fast, sensitive and accurate integration of single-cell data with Harmony
20. Goblet and Clara cells of human distal airways: evidence for smoking induced changes in their numbers
21. Basophils trigger emphysema development in a murine model of COPD through IL-4-mediated generation of MMP-12-producing macrophages
22. Goblet cell hyperplasia and epithelial inflammation in peripheral airways of smokers with both symptoms of chronic bronchitis and chronic airflow limitation
23. Single-cell Transcriptome Analysis Reveals an Anomalous Epithelial Variation and Ectopic Inflammatory Response in Chronic Obstructive Pulmonary Disease. medRxiv
24. Cell of origin of lung cancer
25. A Decade of GWAS Results in Lung Cancer
26. MUC1: a multifaceted oncoprotein with a key role in cancer progression
27. Targeting the oncogenic MUC1-C protein inhibits mutant EGFR-mediated signaling and survival in non-small cell lung cancer cells
28. Association Analysis of Driver Gene-Related Genetic Variants Identified Novel Lung Cancer Susceptibility Loci with 20,871 Lung Cancer Cases and 15,971 Controls
29. A New Reservoir of Stromal Targets in Pancreatic Cancer
30. A framework for advancing our understanding of cancer-associated fibroblasts
31. Tumor Endothelial Cells (TECs) as Potential Immune Directors of the Tumor Microenvironment - New Findings and Future Perspectives
32. Antiangiogenic and tumour inhibitory effects of downregulating tumour endothelial FABP4
33. Single-Cell Transcriptomics Reveals Endothelial Plasticity During Diabetic Atherogenesis
34. Autophagy and cellular immune responses
35. Smoking-induced iron dysregulation in the lung
36. Our evolving view of neutrophils in defining the pathology of chronic lung disease
37. Pathophysiological roles of interleukin-8/CXCL8 in pulmonary diseases
38. Involvement of type II pneumocytes in the pathogenesis of chronic obstructive pulmonary disease
39. The noncoding RNA MALAT1 is a critical regulator of the metastasis phenotype of lung cancer cells
40. Tobacco smoking and somatic mutations in human bronchial epithelium
41. Chronic Cigarette Smoke-Induced Epigenomic Changes Precede Sensitization of Bronchial Epithelial Cells to Single-Step Transformation by KRAS Mutations
42. Association between Smoking History and Tumor Mutation Burden in Advanced Non-Small Cell Lung Cancer
43. PTEN loss in the continuum of common cancers, rare syndromes and mouse models
44. Control of mitochondrial function and cell growth by the atypical cadherin Fat1
45. Tobacco smoke promotes lung tumorigenesis by triggering IKKbeta- and JNK1-dependent inflammation
46. The Cancer Genome Atlas: Creating Lasting Value beyond Its Data
47. The Cancer Genome Atlas Pan-Cancer analysis project
48. A Single-Cell Atlas of the Human Healthy Airways
49. Single-Cell Transcriptomic Analysis of Human Lung Provides Insights into the Pathobiology of Pulmonary Fibrosis
50. Cell-specific expression of lung disease risk-related genes in the human small airway epithelium
51. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma
53. A benchmark of batch-effect correction methods for single-cell RNA sequencing data
54. Comparative analysis of gene regulatory networks of highly metastatic breast cancer cells established by orthotopic transplantation and intra-circulation injection
55. clusterProfiler: an R package for comparing biological themes among gene clusters
56. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization
57. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data

GSE131907 (n = 11 Smoking: 6; never: 3)