key: cord-0026964-k3nz41kc
authors: Ou, Min; Leung, Henry Chi-Ming; Leung, Amy Wing-Sze; Luk, Ho-Ming; Yan, Bin; Liu, Chi-Man; Tong, Tony Ming-For; Mok, Myth Tsz-Shun; Ko, Wallace Ming-Yuen; Law, Wai-Chun; Lam, Tak-Wah; Lo, Ivan Fai-Man; Luo, Ruibang
title: HKG: an open genetic variant database of 205 Hong Kong cantonese exomes
date: 2022-02-08
journal: NAR Genom Bioinform
DOI: 10.1093/nargab/lqac005
sha: 1be57466c27416dbb0f955e67264fc0ed4145d87
doc_id: 26964
cord_uid: k3nz41kc

HKG is the first fully accessible variant database for Hong Kong Cantonese, constructed from 205 novel whole-exome sequencing datasets. There has long been a research gap in the understanding of the genetic architecture of southern Chinese subgroups, including Hong Kong Cantonese. HKG detected 196 325 high-quality variants, 5.93% of which were novel, and 25 472 variants were found to be unique to HKG compared to three Chinese populations sampled from 1000 Genomes (CHN). PCA illustrates the uniqueness of HKG in CHN, and the admixture study estimated the ancestral composition of HKG and CHN, with a gradient change from north to south, consistent with their geographical distribution. ClinVar, CIViC and PharmGKB annotated 599 clinically significant variants and 360 putative loss-of-function variants, substantiating our understanding of population characteristics for future medical development. Among the novel variants, 96.57% were singletons and 6.85% were of high impact. With a good representation of Hong Kong Cantonese, we demonstrated better variant imputation when HKG data were added to the reference, thus successfully filling the data gap in southern Chinese to facilitate the regional and global development of population genetics.

Hong Kong is a densely populated city in southern China.
Its population dynamics are strongly associated with its historical background, especially regarding the transition from a British colony into a Special Administrative Region (SAR) of China in 1997, so it has a unique migration history (1). About 90% of the early settlers in Hong Kong are thought to have originated from Guangdong Province in southern China (2), and the majority of them were Cantonese (3). Cantonese is, however, often loosely defined and refers mostly to a Yue-speaking Han Chinese sub-group in the large southern China region. The Yue people also comprise other subgroups, including Teochew and Hakka, possibly with different ancestries. However, all of them are genetically under-represented in current studies and are vaguely regarded as a mixture of southern Han Chinese (4-6). Possibly because of the high population complexity, genetic correlations between subgroups can be found only in a few old studies (6, 7). The limited availability of sequencing data in the southern Han Chinese subgroups, including Hong Kong Cantonese, also restricts further investigation into their genetic specificity and connections with other Chinese subpopulations. Therefore, the genetic architecture and composition of the Hong Kong population is still unclear, so there is a need to fill the gap of genetic diversity in Chinese populations. Several large-scale population genetic studies are available as good references for East Asian and Chinese populations. The Genome Aggregation Database (gnomAD, https://gnomad.broadinstitute.org/) complements the world's largest variant database dbSNP (8), with variants from Whole-Exome Sequencing (WES) and Whole-Genome Sequencing (WGS) samples from East Asia, providing annotations at the super-population level. The NARD database sampled 1690 WGS data of several Northeast Asian populations, including samples from Korea, Japan and China (9).
Note that only about 3.4% of the samples in that study were obtained from Hong Kong, and these were not specifically Cantonese. Wang et al. compared seven genome-wide data sets from 46 recent groups and 166 ancient groups of East Asians, revealing the evolution of genomic composition in East Asia (10). The worldwide 1000 Genomes Project (1KGP) sampled 388 individuals from three well-represented Chinese populations, including Han Chinese in Beijing, southern Han Chinese, and Chinese Dai in Xishuangbanna (7). The GenomeAsia project analyzed WGS data of 1,739 individuals from 219 populations and groups across Eastern and Southern Asia, including China. Recent large-scale Chinese population studies, including ChinaMap and Nyuwa, also involved intensive Chinese sampling (11, 12). However, although Hong Kong is representative of southern China, there are very few large-scale genomic resources specific to it for characterizing the Cantonese population, which is essential to elucidate its adaptive changes (13) and promote further medical development (14). Owing to the increasing need to gather genetic data for population-wide studies, Yu et al. recently analyzed the WES data of 1,116 Hong Kong samples and identified a set of variants for potential pharmacogenetic use in Hong Kong (15). Most of their variant data, however, are not publicly available. The government announced plans to organize a Hong Kong-specific genome institute (HKGP), but it is still at an early stage of development (15). Therefore, our Hong Kong (HKG) database was developed to broaden the availability of freely accessible genomic resources for Hong Kong Cantonese to facilitate more effective intra- and inter-population-wide comparative analyses. In this study, we describe HKG, the first and by far the largest openly available variant database for Hong Kong Cantonese, extracted from 205 high-quality whole exome sequencing datasets.
Exome sequencing can capture the most informative sections of the entire genome as efficiently as WGS (16), allowing highly confident annotations and effective consequence interpretations. We also show the ability of HKG to better position Hong Kong Cantonese among other Chinese populations, update the population-specific information, especially for clinically significant variants, and improve variant imputation and correlation with local samples. HKG can potentially be a pioneer in providing a reference for development in upcoming regional and local genetics studies. This project was reviewed and approved by the University of Hong Kong Human Research Ethics Committee (HREC reference number: EA200067). This was a secondary data analysis project for sequencing protocol and bioinformatics algorithm development, and informed consent was not required by the Committee. All procedures were performed in accordance with relevant guidelines and regulations. The paired-end 150 bp raw reads used in the current study are from the Clinical Genetic Service of the Department of Health of Hong Kong. The samples originated from 205 individuals in the Hong Kong SAR who self-reported as Cantonese. The samples were target captured using SureSelect Human All Exon V6 (Target Size 60M) of Agilent Technologies and were sequenced using Illumina NovaSeq 6000 in the Novogene Tianjin Sequencing Center & Clinical Lab to a target depth of 200× according to the manufacturer's instructions. Cleaned reads were aligned with BALSA (17) to the GRCh38 reference release 5 (GCA_000001405.20) with decoy hs38d1 (GCA_000786075.2), and then sorted with samtools v1.10 (18). Duplicated reads were marked by Picard v2.0.1 (19). Variants were first called from individual samples using the HaplotypeCaller module in GATK v4.1.3.0 (20) and were stored in GVCF format. GenomicsDBImport and GenotypeGVCFs modules were then used to perform joint variant calling on all 205 samples.
The GATK Variant Quality Score Recalibration (VQSR) was used to remove low-quality variants. The SNP VQSR model was trained using HapMap (21) v3.3 and 1KGP Omni v2.5 SNP sites, and the INDEL VQSR model was trained using the Mills et al. (22) 1KGP gold standard and Axiom Exome Plus indel sites. Sensitivity thresholds of 99.6% and 95.0% were used to filter SNPs and INDELs, respectively. In addition, we filtered out sites using the same filter criteria used by ExAC (23): (i) InbreedingCoeff (inbreeding coefficient) < -0.2; (ii) AC (allele count) = 0; (iii) DP (depth) < 10 or (iv) GQ (Genotype Quality) < 20. The filtration was performed using bcftools (v1.10.2). The stats module in bcftools was used to get the variant statistics. We used bcftools to split multi-allelic records into multiple bi-allelic records for variant annotation and conducted annotations of each allele with the Variant Effect Predictor (VEP) tool (Ensembl GRCh38 release 100). dbNSFP4.0a was used to get the annotations of dbSNP ID, gnomAD population allele frequency, ExAC population allele frequency, SIFT, Polyphen2, MetaSVM, MetaLR, CADD, GERP++, phyloP100way, phastCons100way, 1KGP population allele frequency, Exome Variant Server population allele frequency, and ClinVar Clinical significance. ClinVar (v2020-07-06) was used as a custom annotation in VEP to retrieve the pathogenic variants. We used GATK LiftoverVcf to lift over the CIViC (v2020-08-07) database from GRCh37 to GRCh38 before using it to get druggable variants. The LOFTEE (24) plugin of VEP was used to generate the loss-of-function variant annotations. Scripts from (24) were used to retrieve the multi-nucleotide variants (MNVs). We used DAVID (http://david.abcc.ncifcrf.gov) to perform the enrichment analysis of gene ontology (GO) biological processes and KEGG pathways. (25) or not in the SureSelect Human All Exon V6 bed regions were filtered out.
The R package gdsfmt was used to convert the datasets into gds format for SNPRelate to use as input for PCA analysis (26, 27). For admixture analysis, we aggregated the variants in the HKG exome bed regions, 1KGP CHN, other 1KGP EAS (i.e. JPT, CHB and KHV), and SAS (GIH, PJL, BEB, STU and ITU). We used PLINK (v1.90b6.10 64-bit) (28) to remove the variants (i) with missing call frequencies greater than 0.05, (ii) with minor allele frequency lower than 0.05, or (iii) with Hardy-Weinberg equilibrium exact test P-values below 0.0001 (--geno 0.05 --maf 0.05 --hwe 0.0001). For each window of 1000 SNPs, we calculated the Linkage Disequilibrium (LD) between each pair of SNPs in the window and filtered out SNPs in pairs with LD > 0.2. The filtration was repeated by moving 100 SNPs forward until all SNPs were scanned (--indep-pairwise 1000 100 0.2). After the filtrations, ADMIXTURE v1.3.0 (29) was applied to the rest of the variants, with the estimated number of subpopulations (K) ranging from 2 to 15. The output of ADMIXTURE was visualized by Pophelper v2.3.0 (30). The mapped genes from the high-impact novel variants of HKG were obtained, as well as those of the three CHN populations of 1KGP, namely CHS, CHB, and CDX. DisGeNET (version 7.0) data was used to identify the gene-disease associations. The significance of the enrichment of the non-overlapping genes in HKG among the four populations was estimated with the R function 'phyper'. The overlap between the disease-gene sets of DisGeNET and the 634 HKG uniquely affected genes (i.e., those not contained in the mapped genes from CHN) was used to calculate the P-values of the enrichment. Only enrichments with P < 0.01 and FDR < 0.25 were selected. To verify the improvement of the imputation accuracy using the HKG data as a local reference panel in addition to the 1KGP samples, two imputations on autosomes were performed: (i) using only the 1KGP samples as the reference panel (1KGP) and (ii) using both 1KGP and HKG samples (1KGP + HKG).
We randomly divided the 205 HKG samples into two sets, one with 204 samples used as the reference panel, and the other with one sample used as the test data. This step was repeated five times to generate five reference-test pairs for 5-fold cross validation. The 204 samples of each pair were merged with the 1KGP samples to create the 1KGP + HKG reference panels. Variants with AC < 3 or (AN - AC) < 3 or with missing genotypes were excluded. Only biallelic variants were kept for analysis. For each test sample, we obtained the imputable variant intersection of the 1KGP, 1KGP + HKG, and the Infinium OmniZhongHua-8 (v1.4) SNP array panel. In each test sample, we randomly masked 200 variants per autosome. The imputation by BEAGLE 4.1 (31, 32) using the default setting along with niterations = 10 and ne = 20 000 was performed on each test sample with the two reference panels (1KGP and 1KGP + HKG). To assess the imputation quality, the info score for each imputed variant was calculated. An imputed variant was classified as 'poor' if its info score was < 0.4 and as 'confident' if it was > 0.7. The change in info scores after the addition of HKG samples was further compared under different HKG MAF ranges. The performance evaluation was also repeated with higher percentages of missing SNPs. To assess the performance of HKG for correlation analysis using local samples, 58 whole genome sequenced Hong Kong individuals from the Northeast Asian Reference Database (NARD; (9)) and 532 pharmaceutical-related variants from Yu et al. (15) were used. The NARD dataset (NARD MAF.hg38.vcf.gz) was downloaded from https://nard.macrogen.com/, and 17 224 variants having AC larger than 1 in its Hong Kong samples were considered. The 532 variants from Yu et al. were lifted over from GRCh37 to GRCh38 using GATK LiftoverVcf. Variant intersection was obtained using bcftools isec. Variant allele frequency regression and calculation of Cook's distances were performed using the statsmodels module in Python.
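The allele-frequency regression and Cook's distance screen described above can be sketched as follows. This is a minimal, dependency-free illustration of the textbook formulas for simple linear regression, not the authors' statsmodels code; the function name `ols_cooks_distance` and the toy allele frequencies are hypothetical.

```python
def ols_cooks_distance(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (intercept, slope, r2, cooks), where cooks[i] is the Cook's
    distance of observation i (textbook formula, two fitted parameters).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)
    r2 = 1.0 - sse / syy

    p = 2                      # fitted parameters: intercept and slope
    mse = sse / (n - p)        # residual variance estimate
    cooks = []
    for xi, e in zip(x, residuals):
        leverage = 1.0 / n + (xi - mean_x) ** 2 / sxx
        cooks.append((e * e) / (p * mse) * leverage / (1.0 - leverage) ** 2)
    return intercept, slope, r2, cooks


# Hypothetical allele frequencies for eight shared variants (not real data).
# The last variant is rare in one cohort but common in the other.
hkg_af = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.02]
other_af = [0.012, 0.048, 0.11, 0.19, 0.31, 0.39, 0.51, 0.45]

intercept, slope, r2, cooks = ols_cooks_distance(hkg_af, other_af)
flagged_index = max(range(len(cooks)), key=cooks.__getitem__)

# Refit without the most influential variant to see its effect on the fit.
_, _, r2_clean, _ = ols_cooks_distance(hkg_af[:-1], other_af[:-1])
```

In practice a fixed cut-off on Cook's distance would replace the "largest value" rule used here; any cut-off depends on the number of variants, so the choice for the toy data is only illustrative.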
Deep whole exome sequencing obtained data from ∼38 Mbp targeted positions (92.24% exomic, 4.29% intronic, 3.47% others such as intragenic positions). The average on-target mean depth was ∼159× (96-277× per sample), with 85% of target regions covered, on average, by at least 52× (24-101×) and an average of 64% (40-85% per sample) covered by at least 100×. An in-house pipeline was used to process and interpret the WES data (Supplementary Figure S1, also see Materials and Methods). We found 196 325 high-quality variants (83.99% of all called variants), including 186 466 SNPs, 3709 insertions and 6150 deletions, after intensive quality filtering. On average, 26 221.6 variants were called per sampled individual. The transition/transversion ratio (Ti/Tv) was 2.89 for all bi-allelic variants that passed quality control. Broken down by minor allele frequency (MAF), the Ti/Tv ratios for variants with MAF > 5% and MAF ≤ 5% were 2.93 and 2.87, respectively. All variants were categorized based on their AC in HKG (Figure 1A). In both HKG and the worldwide distribution available in gnomAD over the HKG targeted region, approximately half of the variants were singletons (AC = 1). We found that HKG had more common variants (i.e. MAF > 1%). Using the Variant Effect Predictor (VEP) of Ensembl, the identified variants were annotated with different consequences. Five main public databases (dbSNP151, 1000 Genomes, ESP6500, ExAC, gnomAD) were used to classify each HKG variant as either known (annotated in at least one database) or otherwise novel. 93.85% of known HKG variants were annotated in at least two databases; 88.54% of HKG singleton variants and 98.87% of HKG doubleton variants were known (Figure 1B). The relative contribution of annotations from each database is: 36.18% dbSNP151, 33.93% gnomAD, 13.72% ExAC, 10.03% 1KGP and 6.14% ESP6500. There were 11 659 novel HKG variants, with over 96.57% of them being singletons.
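The Ti/Tv ratio quoted earlier in this section is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base substitutions). A minimal sketch of that calculation is given below; the function name `ti_tv_ratio` and the toy variant list are hypothetical, and this is not the bcftools implementation the authors used.

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def ti_tv_ratio(variants):
    """Return transitions / transversions for (ref, alt) allele pairs.

    Records that are not single-base substitutions (e.g. INDELs) are skipped.
    """
    ti = tv = 0
    for ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a bi-allelic SNV record
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


# Hypothetical records: three transitions, one transversion, one deletion.
toy_variants = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
toy_ratio = ti_tv_ratio(toy_variants)
```

Exome call sets typically show a higher Ti/Tv than whole-genome call sets, which is why the ratio is commonly used as a quality check on filtered variants.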
Since the variants were confidently identified using GATK joint variant calling, the singleton variants were likely true variants rather than errors. The large proportion of novel singleton variants indicates that our analysis using high-depth WES data was highly sensitive. However, the number of singletons detected kept increasing as samples were added (Figure 1C), suggesting that the sampling is not saturated. High-impact variants are defined as potentially altering the protein structure, and therefore affecting functionality. HKG variants were grouped into four categories by their functional impact on a transcript or coding gene, in decreasing severity: (i) HIGH, including stop-gain or stop-loss variants, frameshift variants, splice donor or acceptor variants and initiator codon variants, (ii) MODERATE, (iii) LOW and (iv) MODIFIER (Table 1). The variants (SNPs or INDELs) of higher impact were more likely to be singletons than those of lower impact (Figure 1D). A large number of variants were missense (42.46%), synonymous (31.32%) or intron (8.81%) variants. However, there were relatively few missense mutations at high MAF: at MAF > 1%, the distribution became 15.26% missense, 23.99% synonymous, 23.50% intron. On the other hand, deleterious mutations, such as frameshift, gain of stop codon, and splice site variants, were mostly singletons. INDELs were generally less common as they are likely to cause deleterious frameshifts, so many more singletons in the HKG exome data were SNPs than INDELs (Figure 1D). There was also a lower proportion of INDEL singletons with LOW and MODIFIER impact (15% and 28%, respectively, in INDELs; 45% and 43%, respectively, in SNPs; details in Table 1). A multi-nucleotide variant (MNV) is a combination of multiple variants coexisting in the same codon on the same haplotype. There were 800 MNVs that occurred in at least two individuals, and 254 of them were caused by two consecutive nucleotide changes.
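To illustrate why variants in the same codon must be interpreted jointly, the sketch below translates a codon after applying each single-nucleotide change alone and then both together. It is a hypothetical example using the standard genetic code: the reference codons (TTC and CAG) and the substitutions are assumptions chosen for illustration, not codons taken from the HKG data, and the helper names are invented here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def apply_changes(codon, changes):
    """Apply (position, alt_base) substitutions to a coding-strand codon."""
    bases = list(codon)
    for position, alt in changes:
        bases[position] = alt
    return "".join(bases)


def consequence(ref_codon, alt_codon):
    """Classify the change between two codons on the coding strand."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


# Assumed example 1: two missense changes that jointly create a stop codon
# ("gained nonsense"). TTC encodes phenylalanine.
ref = "TTC"
snv_a, snv_b = (1, "A"), (2, "A")
alone_a = consequence(ref, apply_changes(ref, [snv_a]))        # TTC -> TAC
alone_b = consequence(ref, apply_changes(ref, [snv_b]))        # TTC -> TTA
together = consequence(ref, apply_changes(ref, [snv_a, snv_b]))  # TTC -> TAA

# Assumed example 2: a nonsense change that a second change in the same
# codon turns into a missense change ("rescued nonsense").
ref2 = "CAG"
stop_alone = consequence(ref2, apply_changes(ref2, [(0, "T")]))            # TAG
rescued = consequence(ref2, apply_changes(ref2, [(0, "T"), (1, "G")]))     # TGG
```

The first example mirrors the kind of phenylalanine-to-stop change reported for HLA-DRB1 later in this section; for genes on the reverse strand, the genomic alleles would first have to be complemented before being applied to the coding-strand codon.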
Since exome regions have higher GC content than other regions (33) and CpG sites have higher mutation rates (34), we observed an expectedly high percentage (36.81%) of MNVs related to transitions at CpG sites (Supplementary Figure S2). We found 658 (82.25%) MNVs that might alter the protein in a different way than the two single variants considered separately (35). The alteration of consequence is listed in Supplementary Table S1. 48.88% of the MNVs were missense variants. All nonsense single variants found in MNV pairs were rescued in HKG, so they were no longer causing protein truncation or disease. Rescued nonsense MNVs were involved in disease-related genes COL9A2, SYNE2 and DNAH11 (36). COL9A2 is associated with autosomal dominant Multiple Epiphyseal Dysplasia 2, and SYNE2 is related to autosomal dominant Emery-Dreifuss muscular dystrophy 5. DNAH11 is associated with autosomal recessive primary ciliary dyskinesia-7. Severely affected consequences, such as 'gained nonsense' and 'rescued nonsense' MNVs, were found in various genes. Among the gained nonsense MNVs, one of the affected genes was HLA-DRB1, which is related to rheumatoid arthritis. The variants chr6:32584314G > T and chr6:32584315A > T in HLA-DRB1 were originally classified as separate missense variants, while their combined MNV was identified as a nonsense mutation (i.e., changing the phenylalanine to a stop codon). This MNV was heterozygous in 4 HKG samples, affecting only one copy of the gene. We found two other heterozygous nonsense MNVs in HLA-DRB5 (chr6:32522110GA > TT) and SLC30A8 (chr8:117172544CG > TA), in 4 and 12 individuals respectively. We found a stop-loss MNV chr14:105907227TT > CC in IGHD2-8 shared by 33 individuals, none of whom were homozygous. IGHD2-8 is one of the IG D genes (diversity chain immunoglobulin genes) that undergo somatic recombination before transcription.
During somatic recombination (37), a pair of IG D and IG J genes (joining chain immunoglobulin gene) are joined together by randomly deleting the DNA between them. After somatic recombination, somatic hypermutations in the mRNA can also increase immunoglobulin diversity. The random somatic recombination and somatic hypermutations could dilute the impact of this MNV by deleting or changing the sequence that contains the MNV. The variant chr11:71816849CG > TA on ZNF705E was homozygous, found in 12 individuals, which might introduce an early truncation to the protein. It was also located at the last exon of the gene with only one transcript, and therefore might have functional implications. The 1KGP project contains the largest amount of publicly available Chinese (CHN) population genetic data, including southern Han Chinese (CHS), Beijing Han Chinese (CHB) and Dai minority in southwestern China (CDX). Other recently developed Chinese databases (11, 12) do not provide the full variant list for batch download, so they were not included for HKG comparison. There were 128,470 variants recorded among the four populations. There were 25,472 variants uniquely found in HKG compared with CHN, and 4,366 shared variants with 5-fold differences in MAF in at least one CHN. The CHS has the highest number of shared variants with HKG (81 658 shared variants), while CHB has the largest number of unique variants shared with HKG (7673 variants) (Figure 2A ). All four populations had a similar percentage of singletons. Our PCA analysis suggests a unique composition of HKG among CHN, even compared with the closely related CHS, although there was no clear separation boundary among CHB, CHS and HKG ( Figure 2B ). This suggests that the genetic relatedness of CHN and HKG has been correlated with their geographic location. In addition, as expected, we confirmed that HKG was part of the East Asian (EAS) population by PCA (Supplementary Figure S3) . 
To investigate the ancestral population structure of HKG, we performed an ADMIXTURE analysis of CHN, HKG, five EAS populations, and six South Asian populations using 1,159,511 autosomal markers. The lowest cross-validation error was achieved when the number of hypothetical ancestral components was set to 5 (K = 5), as shown in Figure 2C with full illustration in Supplementary Figure S4. With two hypothetical ancestral components (K = 2), the results did not show a big difference between HKG and CHN, but HKG also captured a few individuals with completely different major ancestral components than the rest of HKG. These outliers consistently (from K = 2 to K = 15) showed a similar composition to South Asian populations, such as Punjabi of Pakistan and Bengali of Bangladesh, which have been common minorities in Hong Kong for generations (https://www.bycensus2016.gov.hk/en/Snapshot-10.html, Point 2). K = 5 showed a clear separation of the populations, with three ancestral components dominating in HKG and CHN (and in Vietnam as well). Two of the major ancestral components gradually changed in proportion from the northern to southern Chinese subpopulations. The dominant ancestral population in the northern Chinese (purple) was progressively diluted towards the south, whereas the proportion of the minor ancestral population in the northern Chinese (red) gradually increased towards the southern Chinese. The compositions of these two major components also showed differences between the closely related CHS and HKG, suggesting deviation in the ancestry of Hong Kong Cantonese from CHS. Further separation from K = 7 onwards becomes ambiguous, suggesting a simple ancestral population in China, which agrees with other Chinese population studies (11, 12). The application of exome data for Identical-by-Descent (IBD) analysis is likely to cause bias in the size and number of IBD segments detected due to many uncovered positions.
In this study, we conducted the IBD analysis with intensive post-filters to characterize the population of HKG, only for comparing to the other CHN populations (methods and results in Supplementary Note and Supplementary Table S2 ). The result suggests that HKG is not as isolated as the CDX and has a lower population mixture than the CHB. Again, the interpretation of exome-based IBD results should be done cautiously. We expect a more reliable IBD study will be conducted when whole genome sequencing of more Hong Kong individuals are available from projects such as the Hong Kong Genome Project. A total of 96 795 variants (49.3% of all high-quality HKG variants) identified in at least two HKG individuals were classified as high-confidence variants, 31.26% of which were rare variants (MAF < 1%). To evaluate the potential contribution of HKG variants for biomedical use, we investigated 189 variants with pathogenicity, 410 druggable variants, and 360 LoF variants in greater detail. Among the 189 ClinVar pathogenic variants found in HKG, only seven were reported in existing studies with Hong Kong samples (9, 15) . There were 169 annotated pathogenic rare variants (MAF ≤ 1% in HKG and worldwide) located in 141 genes. The details of these annotated pathogenic variants and their annotations are listed in Supplementary Table S3 . Among the 20 annotated pathogenic variants which were common (MAF > 1%) in HKG, were reported as pathogenic by other Chinese population studies in ClinVar. The pathogenicity of these variants among Hong Kong Cantonese should therefore be further evaluated. Among the rest, nine pathogenic variants were found to have much higher AF (>5 times higher) in HKG compared to the worldwide records, and 13 were also found to be common variants in gnomAD of other EAS samples (Supplementary Table S4 ). Also, 12 common variants were annotated pathogenic with reference only to Western and East Asian populations studies, so they lack support from Chinese data. 
Thus, it is reasonable to conjecture that these high MAF variants are not pathogenic in HKG. In total, 410 pharmaceutical-related high-confidence variants were annotated in CIViC and PharmGKB databases, providing a possible guide for drug use. Only four of them were also reported in a recent pharmacogenetics study (15). Among the 24 CIViC variants in HKG, 22 are common variants (MAF > 1%), so they are more likely to be relevant to drug effectiveness in the population. We also found that some of these variants had significantly different MAF between the Chinese and non-Chinese populations. These variants might affect drug usage and treatment plans. For example, the variant rs1799782, which is related to Lung Non-small Cell Carcinoma and can increase the response rate of chemotherapy (38), has a much higher MAF in HKG (30.50%) and gnomAD EAS (30.52%) than in the worldwide distribution (9.46%). There were 401 HKG variants with PharmGKB annotations, six of which were not found in gnomAD but were common in HKG (MAF > 1%). Two of these variants, rs6318 and rs3758581, had an extremely high allele frequency of over 95%. The G allele of variant rs6318 accounted for 98.5% in HKG, which could be associated with an increased likelihood of drug-related weight gain based on three different human studies (39-41). Another variant, rs3758581, had an allele frequency of 95.1% and is associated with drug response to fluoxetine, fluvoxamine, and voriconazole based on three in vitro studies (42-44). Other than database annotations, we found 1276 LoF variants that occurred in at least two HKG samples using the LOFTEE annotation in VEP, 878 of which were removed due to low confidence. Three variants were marked as false positives because the alternative allele was the ancestral state (33). Common variants (MAF > 0.5) in gnomAD were regarded as errors in the reference (33) and were not analyzed.
Therefore, we found 360 highly reliable LoF variants on 301 genes in HKG; 68.5% of the variants were SNPs and 31.9% INDELs; 175 variants were rare (MAF ≤ 1%) and 185 were common (MAF > 1%). However, no significantly enriched GO term, KEGG pathway or disease association was found. By comparing five main public databases (dbSNP, gnomAD, 1KGP, ExAC and ESP6500), we obtained 11 659 novel variants (10 641 SNPs/366 insertions/652 deletions) after stringent filtering. Out of these novel variants, we identified 26 common variants (MAF > 5%), 88 low-frequency variants (MAF = 1-5%), and 54 rare variants (MAF < 1%). More than 96% of the novel variants were singletons (i.e. AC = 1), which was higher than in ChinaMAP (75.3%) and NyuWa (86.8%), both of which used whole genome sequencing data. The impact distribution of these novel variants was: 18.99% HIGH, 6.56% MODERATE, 3.94% LOW and 6.39% MODIFIER. The proportion of high-impact variants in the novel set is significantly higher than in the known set (Figure 3A). A comparison of novel and known variants in different AF categories under each impact level is shown in Supplementary Table S5. We computed pathogenicity scores using the C-scores reported by CADD (Combined Annotation-Dependent Depletion), a widely used database of pathogenicity. A higher C-score or pathogenicity score means more deleterious outcomes. The mean pathogenicity scores of novel variants and all (i.e. novel + known) variants of HKG were 14.4 and 10.5, respectively. A larger proportion of novel variants fell into the high score range (> 20) compared to all variants of HKG (Figure 3B), indicating that a larger percentage of novel variants were more deleterious. Among the 175 rare LoF variants in HKG, six (3.43%) were also present in the 799 VEP-annotated high-impact novel variants (Figure 3C).
The biological association study mapped 799 high-impact novel HKG variants to 731 coding genes, which were significantly enriched with cilium assembly and morphology, ECM-receptor interaction, protein transport, microtubule cytoskeleton organization, sister chromatid cohesion, and protein digestion and absorption (Figure 3D). We also obtained a set of high-impact variant-mapped genes that were present only in HKG but not in other CHN. Other disease-associated genes which were potentially related to high-impact novel variants, based on DisGeNET and their association network, are listed in Supplementary Table S6 and Supplementary Figure S5 for reference. We found eight of the disease-associated genes with novel high-impact variants in at least three individuals (Supplementary Table S7). Our detailed studies suggest that these high-impact novel variants often existed on the same exon of the affected genes. Known pathogenic variants could also be found on the same exons of MUC4 (45) and CACNA1A (46) genes. However, owing to the limited sample size and the uncertainty of in silico predictions, further confirmation of their pathogenicity awaits larger sampling in the Hong Kong Genome Project (HKGP) (https://www.info.gov.hk/gia/general/202011/05/P2020110500465.htm) in the future. Among the 26 common novel variants (MAF > 0.05), six were in coding regions, although these variants are likely to be benign in HKG given their high AF. The coding region variant chr7:100773854G > GTT of the ZAN gene occurs in 7.80% of the population. Deleterious mutations in ZAN might affect the adhesion of sperm to eggs, thus reducing fertility (https://www.genecards.org/cgi-bin/carddisp.pl?gene=ZAN). There is also a known benign record listed in ClinVar at the same genomic position as the frameshift G > GT variant. Another high AF variant associated with immune response was found on chr7:142796847A > T of the TRBJ2-3 gene, with AF 20.97%.
The closely associated pseudogene in the same family, TRBJ2-2P (chr7:142796707C > G), also had a high MAF of 20.20%, but no functional significance was recorded. All the HKG common novel variants are listed in Supplementary Table S8. We validated the effective usage of HKG variants using imputation and correlation studies. The imputation accuracy was evaluated using HKG + 1KGP and 1KGP reference panels to impute variants at 200 randomly selected positions per autosome in five phased Hong Kong Cantonese samples. The addition of HKG to the reference panel yielded a significant increase of 2.189% in the average info score in autosomes compared with using just 1KGP data as the panel (Supplementary Tables S9 and S10). For variants with MAF > 5% and MAF ≤ 5%, the info score increment reached 2.86% and 3.38% respectively. Improvements in imputation quality could be observed as there were 1.33% more high-confidence variants (info score > 0.7) and 1.495% fewer low-confidence variants (info score < 0.4). These improvements were significant according to Student's t-test (marked ** in Figure 4A), suggesting that Hong Kong-specific data should be included when imputing local samples. For correlation analysis, we obtained a limited number of Hong Kong samples from two existing studies and checked their variant consistency with HKG. The Northeast Asian Reference Database (NARD; (9)) involves 58 whole genome sequenced Hong Kong samples. It is the database known to hold the greatest number of variants (8 898 677 variants) from a Hong Kong population. The variants in these samples are hereafter abbreviated as NARD HK. We found 17 224 variants in both NARD HK and HKG. Figure 4B shows that allele frequencies in NARD HK are well correlated (r² = 0.985) with HKG, suggesting a strong positive linear association. Using Cook's distance (> 0.0002), seven variants were identified to be common in NARD HK but rare in HKG (Supplementary Table S11).
Among them, only two variants, chr11:55639048C>T and chrX:71104307C>T, could be found in dbSNP, and neither had a record of clinical significance. We also correlated the novel variants of HKG with the pharmacogenetic variants reported by Yu et al. (15). A linear relationship between their variant AFs was still observed (Figure 4C), but with a lower r² of 0.909, owing to the strong sampling bias towards pharmaceutical use in that study. These results show that HKG correlates well with other Hong Kong samples and is therefore a good representation of local genetic diversity.

Among 612 PubMed search results for 'Hong Kong Chinese variant' and 'genotyping'/'WES'/'WGS' (on 14 December 2020), 192 used samples collected in Hong Kong, but only 84 of them provided the count or frequency of the variants in individuals (details can be found in Supplementary Table S12). As discussed above, there has long been a research gap in the development of a Hong Kong Cantonese-specific genomic database, which has hindered the regional and local development of genetic studies. Population-specific genomic data are undoubtedly among the most valuable knowledge resources, and they evolve with the ancestry and demography of a population (47). They allow us to trace the origin of a population or predict its susceptibility to future environmental changes (48). The number of population-wide genetic studies is increasing (49, 50), driven not only by the availability of resources but also by the practical health care benefits. In this study, we illustrated the power of WES as an alternative approach for describing the landscape of genetic variation and defining the genomic characteristics of HKG. This approach holds the potential for novel gene characterization at a lower cost than WGS (16). As HKG is the first public Hong Kong Cantonese variant database, it is the leading project to promote local genetic studies for research and clinical applications.
Hong Kong Cantonese evidently consist mainly of Chinese Cantonese. The backbone of HKG is closest to the Chinese CHB, CHS and CDX populations sampled by 1KGP, with over 60% similarity. The majority of Hong Kong Cantonese share ancestry with CHS, and both have a similar effective population size. Our analysis also suggests that HKG is sufficiently distinct to justify a population-specific variant database. HKG is the first to demonstrate the potential association between the ancestral composition and geographical distribution of Hong Kong Cantonese, and it provides evidence for the historical interpretation of Hong Kong Cantonese population migration, filling the gap of genetic diversity among Cantonese people. Interestingly, HKG also captured a portion of non-Chinese diversity among the Hong Kong Cantonese. These samples might represent South Asians who migrated to Hong Kong many years ago for historical reasons (1). This also suggests the comprehensiveness of our sampling.

The development of precision medicine has been one of the most important medical challenges in recent years (51). Although a large proportion of the variants identified in HKG are known, we updated the population-specific frequency and haplotype for 599 potentially pathogenic and druggable variants, based on the CIViC, PharmGKB and ClinVar annotations. The dynamics of population allele frequencies influence the functional interpretation of variants, which may assist local medical consultations and treatment decisions (51, 52). HKG found 11,659 novel variants, 6.85% of which were high-impact variants. Since these novel variants had not been documented, biological associations were based only on the corresponding genes and their associations in disease databases. As different exomic variants can map to the same coding gene, these biological associations should be interpreted carefully.
Therefore, the HKG-specific disease-gene network presented in this study should be regarded only as predicted patterns and not be taken as clinical advice. Instead, we encourage further investigation of possible genetically related health issues in the community, using HKG data as a stepping stone.

HKG is by far the largest public variant database for Hong Kong Cantonese. With its high-confidence variant data, HKG can serve as a pioneering genomic resource region-wide. As part of the southern Chinese population, HKG also provides additional data for genetic studies of human traits in Chinese populations. The effectiveness of our data in application is reflected in the imputation study and correlation analysis. Although HKG has already shown high discovery power in detecting high-quality variants from high-depth exome data, the number of samples is still limited at this stage, and we believe that large-scale sampling in the future, such as HKGP, would be meaningful for further investigation.

HKG is publicly available in the European Variation Archive (EVA) under study number PRJEB41688 (https://www.ebi.ac.uk/ena/browser/view/PRJEB41688). Genotype imputation can be performed freely for academic purposes at the HKG imputation server at http://www.bio8.cs.hku.hk/HKGimputationServer. The scripts used in this study are available at https://github.com/HKU-BAL/HKG.

Supplementary data are available at NARGAB online.
1. In: A Concise History of Hong Kong
2. Mode of migration, age at arrival, and occupational attainment of immigrants from mainland China to Hong Kong
3. Social change, cohort quality and economic adaptation of Chinese immigrants in Hong Kong
4. Genomes project
5. Genomes project
6. The GenomeAsia 100K project enables genetic discoveries across Asia
7. Genomes project
8. dbSNP: the NCBI database of genetic variation
9. NARD: whole-genome reference panel of 1779 Northeast Asians improves imputation accuracy of rare and low-frequency variants
10. Genomic insights into the formation of human populations in East Asia
11. The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals
12. NyuWa Genome resource: a deep whole-genome sequencing-based variation profile and reference panel for the Chinese population
13. Advances and limits of using population genetics to understand local adaptation (Trends in Ecology & Evolution)
14. Towards precision medicine
15. Actionable pharmacogenetic variants in Hong Kong Chinese exome sequencing data and projected prescription impact in the Hong Kong population
16. Use of whole exome and genome sequencing in the identification of genetic causes of primary immunodeficiencies
17. BALSA: integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU
18. The sequence alignment/map format and SAMtools
19. GitHub repository
20. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data
21. The international HapMap project
22. Natural genetic variation caused by small insertions and deletions in the human genome
23. Analysis of protein-coding genetic variation in 60,706 humans
24. The mutational constraint spectrum quantified from variation in 141,456 humans
25. Insights into human genetic variation and population history from 929 diverse genomes
26. A high-performance computing toolset for relatedness and principal component analysis of SNP data
27. SeqArray-a storage-efficient high-performance data format for WGS variant calls
28. PLINK: a tool set for whole-genome association and population-based linkage analyses
29. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation
30. pophelper: an R package and web app to analyse and visualize population structure
31. Genotype imputation with millions of reference samples
32. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering
33. Whole-exome sequencing in an isolated population from the Dalmatian island of Vis
34. The promise of discovering population-specific disease-associated genes in South Asia
35. Genome Aggregation Database Production, T. and Genome Aggregation Database, C. (2020) Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes
36. Online Mendelian Inheritance in Man (2021) McKusick-Nathans Institute of Genetic Medicine
37. Immunobiology 5: the immune system in health and disease
38. XRCC1 rs1799782 (C194T) polymorphism correlated with tumor metastasis and molecular subtypes in breast cancer
39. Pharmacogenetics of risperidone therapy in autism: association analysis of eight candidate genes with drug efficacy and adverse drug reactions
40. Pharmacogenetic associations of antipsychotic drug-related weight gain: a systematic review and meta-analysis
41. Pharmacogenomic associations with weight gain in olanzapine treatment of patients without schizophrenia
42. Effects of CYP2C19 variants on fluoxetine metabolism in vitro
43. The effects of cytochrome P450 2C19 polymorphism on the metabolism of voriconazole in vitro
44. Evaluation of the effects of 18 non-synonymous single-nucleotide polymorphisms of CYP450 2C19 on in vitro drug inhibition potential by a fluorescence-based high-throughput assay
45. Axed MUC4 (MUC4/X) aggravates pancreatic malignant phenotype by activating integrin-β1/FAK/ERK pathway
46. Spinocerebellar ataxia type 6: molecular mechanisms and calcium channel genetics
47. Tracing the peopling of the world through genomics
48. The major genetic risk factor for severe COVID-19 is inherited from Neanderthals
49. The UK10K project identifies rare variants in health and disease
50. A 1000 Arab genome project to study the Emirati population
51. The need for multi-omics biomarker signatures in precision medicine
52. Human genomics projects and precision medicine