key: cord-0332861-ihrzo8oc authors: Brumpton, B. M.; Graham, S.; Surakka, I.; Skogholt, A. H.; Loset, M.; Fritsche, L. G.; Wolford, B.; Zhou, W.; Nielsen, J.; Holmen, O. L.; Gabrielsen, M. E.; Thomas, L.; Bhatta, L.; Rasheed, H.; Zhang, H.; Kang, H. M.; Hornsby, W. E.; Moksnes, M. R.; Coward, E.; Melbye, M.; Giskeodegard, G.; Fenstad, J.; Krokstad, S.; Naess, M.; Langhammer, A.; Boehnke, M.; Abecasis, G.; Asvold, B. O.; Hveem, K.; Willer, C. J. title: The HUNT Study: a population-based cohort for genetic research date: 2021-12-25 journal: nan DOI: 10.1101/2021.12.23.21268305 sha: 25c9c04af678ddd0313603b4a357f6548e07649a doc_id: 332861 cord_uid: ihrzo8oc The Trondelag Health Study (HUNT) is a population-based cohort of ~229,000 individuals recruited in four waves beginning in 1984 in Trondelag County, Norway. ~88,000 of these individuals have available genetic data from array genotyping. HUNT participants were recruited during 4 community-based recruitment waves and provided information on health-related behaviors, self-reported diagnoses, family history of disease, and underwent physical examinations. Linkage via the Norwegian personal identification number integrates digitized health care information from doctor visits and national health registries including death, cancer and prescription registries. Genome-wide association studies of HUNT participants have provided insights into the mechanism of cardiovascular, metabolic, osteoporotic and liver-related diseases, among others. Unique features of this cohort that facilitate research include nearly 40 years of longitudinal follow-up in a motivated and well-educated population, family data, comprehensive phenotyping, and broad availability of DNA, RNA, urine, fecal, plasma, and serum samples. Norway, like other Nordic countries, has characteristics that are uniquely favorable for recruitment to population studies, establishing biobanks, and identifying clinical outcomes and prospective disease trajectories. 
These include a unique personal identification number applied throughout the life span, a universal and digitized public health care system, and accessible harmonized electronic health records. In addition, seventeen mandatory and validated national health registries are used for health analysis, administration, and emergency preparedness, and fifty-two national medical quality registries provide disease-specific data on diagnosis and treatment parameters. Finally, Norwegians are an altruistic, highly motivated population for participating in biomedical research, as reflected in survey response rates of up to 89%. These factors have supported the establishment and maintenance of the Trøndelag Health Study (HUNT), a large population-based prospective Norwegian cohort, linked to registries and biobanks dating back more than 50 years (Figure 1). To understand the genetic basis of diseases, as well as follow individuals with genetic and epidemiological risk factors in a well-ascertained county in Norway, we established a comprehensive collaboration in 2005 between the HUNT study at the Norwegian University of Science and Technology, Norway and the University of Michigan, USA. This paper presents the history and status of this collaboration by describing the study population, the strategy incorporating genotyping, sequencing, and imputation-based approaches in HUNT, the vast phenotype data collected by decades of HUNT researchers, the linkage to the digitized public health care system and key findings to date.
HUNT is an ongoing population-based health study in Trøndelag County, Norway. The study collects health-related data from questionnaires, interviews and clinical examinations from individuals within this geographical region (Figure 2). More than 229 000 adults (20 years or older at recruitment) have participated in the study to date, of whom 95 000 have provided at least one biological sample (https://www.ntnu.edu/hunt/hunt-samples) [1] [2] [3] [4].
The periodic survey design includes four recruitment waves. HUNT1, HUNT2 (1995-97), HUNT3 (2006-08) and HUNT4 concentrated primarily on the North-Trøndelag area, where all adults (age ≥20 years) were invited. In addition, HUNT4 expanded to collect basic questionnaire data from the adult population of South-Trøndelag (105,797 additional participants) 3. ~19 000 adults have participated in all four HUNT waves, thus having longitudinal questionnaire and physical exam information spanning over 35 years. Complementing the surveys in adult participants, four separate Young-HUNT surveys gathered data from ~25 000 adolescents in junior high and high school, concurrent with HUNT2-4. No genotyping has been performed on Young-HUNT; however, 4 212 Young-HUNT participants have subsequently participated in the adult version of HUNT. The HUNT Study has a high level of participation (ranging from 54% to 89% between surveys among those invited), making the cohort a good representation of the general Norwegian population. The HUNT and Young-HUNT cohorts are described in more detail elsewhere 1-5.
~88 000 individuals provided DNA for medical research during at least one of the HUNT recruitment periods. Initially, our efforts were focused on identifying genetic variants associated with myocardial infarction (MI) [6] [7] [8]. Towards this goal, we genotyped exome variants and performed low-pass whole-genome sequencing (Table 1), including early-onset MI cases and equal numbers of sex- and age-matched controls. Although no novel significant associations were found, likely due to the limited sample size, this set of low-pass sequences provided important insights into genetic variants present in the Norwegian population and contributed Norwegian reference sequences to the Haplotype Reference Consortium (HRC) imputation panel 9. We next undertook genome-wide genotyping on all HUNT2-3 participants (n=71 860) with available DNA (Figure 3).
Motivated by a goal of capturing high-quality common, low-frequency and Norwegian-specific variants, we used a variety of approaches to observe or estimate genotypes: 1) direct genotyping using standard and customized HumanCoreExome arrays from Illumina; 2) genotyping and imputation with a merged HRC and 358 964 polymorphic variants. We next used the 2 201 sequenced samples (HUNT-WGS) for joint imputation with the HRC panel 10. We previously showed that imputation with a HUNT-specific reference panel improved imputation of low-frequency and population-specific variants compared to using either the 1000 Genomes or the HRC reference panel alone 11. Lastly, we imputed 25 million variants from the TOPMed imputation panel (minor allele count greater than 10), which resulted in slightly lower imputation quality compared to the population-specific reference panel but captured a larger number of variants (Supplementary Figure 1). These two imputed datasets can be used separately in downstream analysis; we recommend using the HRC and HUNT-WGS imputation for the investigation of Norwegian-specific variants. Together, the imputations resulted in 33 million variants in 70 517 individuals from HUNT2 or HUNT3, of which 3.3 million variants are not found in UK Biobank. Finally, 18 722 new samples from HUNT4 have recently been genotyped using the same approaches (HumanCoreExome array, UM HUNT Biobank v2.0); following imputation, these will create a new, larger data freeze of ~88,000 individuals from HUNT2-4. Further details of the quality control and imputation in HUNT can be found in the Supplementary Note.
All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this this version posted December 25, 2021.
history (including age of diagnosis) of a range of diseases including cardiovascular events; basic demographics including sex and participation age; anthropometrics including weight, height, BMI, and waist-to-hip ratio; blood pressure measurements; and lifestyle information including smoking status (Table 1). HUNT data categories have been previously described 2, 3, and are described in detail on the HUNT databank website (https://www.ntnu.edu/hunt/databank). Importantly, many measurements and questionnaire items have been intentionally kept identical or similar across HUNT surveys to enable longitudinal analyses.
*Age at first attendance of HUNT is reported. ## Non-fasting glucose. ## ~70% of participants from HUNT4 also have BMD measured in total hip, which is undergoing quality control.
The unique personal identification number given to all Norwegian citizens allows for longitudinal follow-up by linkage between HUNT data, regional and national registries and electronic health records. Norway currently has 17 national health registries (https://helsedata.no/no/) that are mandatory and cover the entire population (Supplementary Table 2).
anus and anal canal
Dementia 290, 294, 331 F00, F01, F02, F03, G30, G31.1 4509
Mood (affective) disorders 296, 298, 300, 301, 311 F30, F31, F32, F33, F34, F
Pregnancy, childbirth and the puerperium
Gestational hypertension 342 O13 1666
Prescription data #
Low dose aspirin -- 22
Previously, methods had been developed to account for relatedness for analysis of quantitative traits 14, but methods to properly account for relatedness and control for unbalanced case-control ratios for binary traits were lacking. We therefore developed statistical methods to allow for the analysis of all individuals, and to control for case-control imbalance of binary phenotypes, which is commonly observed in biobanks such as HUNT. These methods, which are computationally efficient in biobank-scale data, allowed us to perform association testing in HUNT for both single variants (using SAIGE) and gene-based burden tests (using SAIGE-GENE), while accounting for sample relatedness with a sparse identical by state (IBS) sharing matrix 13, 15-17. These methods account for case-control imbalance of binary phenotypes, typical in a population-based sample, by using the saddlepoint approximation to calibrate score tests based on logistic mixed models when case-control ratios are unbalanced [Add SAIGE reference]. We demonstrated a vast improvement in reducing type I error rates when analyzing unbalanced case-control ratios with SAIGE in HUNT. For example, venous thromboembolism with 2 325 cases and 65 294 controls and a case:control ratio of 0.036 had substantial inflation of type I error with methods available prior to the development of SAIGE (Supplementary Figure 4).
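To illustrate why a saddlepoint approximation matters under case-control imbalance, the following is a minimal sketch, not SAIGE's implementation. It assumes independent individuals with known null probabilities `mu` (no mixed model, no relatedness) and a score statistic S = Σ g_i (Y_i − μ_i), and it compares the usual normal tail with a Barndorff-Nielsen-type saddlepoint tail. All function names and the toy numbers (1 000 carriers, 1% prevalence) are hypothetical.

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def score_variance(g, mu):
    """Null variance of S = sum_i g[i] * (Y[i] - mu[i])."""
    return sum(gi * gi * m * (1.0 - m) for gi, m in zip(g, mu))

def spa_upper_tail(g, mu, s):
    """Saddlepoint approximation to P(S >= s) for s > 0 and genotypes g >= 0,
    with Y[i] ~ Bernoulli(mu[i]) independent (a simplifying assumption)."""
    if s <= 0 or s >= sum(gi * (1.0 - m) for gi, m in zip(g, mu)):
        raise ValueError("s must lie strictly inside the upper range of S")

    def cgf(t):        # cumulant generating function K(t)
        return sum(math.log(1.0 - m + m * math.exp(gi * t)) - t * gi * m
                   for gi, m in zip(g, mu))

    def tilted(t):     # returns K'(t) and K''(t)
        k1 = k2 = 0.0
        for gi, m in zip(g, mu):
            e = m * math.exp(gi * t)
            p = e / (1.0 - m + e)
            k1 += gi * (p - m)
            k2 += gi * gi * p * (1.0 - p)
        return k1, k2

    # K'(t) is increasing, so solve K'(t) = s by bracketing and bisection.
    lo, hi = 0.0, 1.0
    while tilted(hi)[0] < s:
        hi *= 2.0
        if hi > 256.0:
            raise ValueError("could not bracket the saddlepoint")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tilted(mid)[0] < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)

    w = math.sqrt(2.0 * (t * s - cgf(t)))
    v = t * math.sqrt(tilted(t)[1])
    return normal_upper_tail(w + math.log(v / w) / w)

# Toy example: 1 000 carriers of one allele each, 1% null disease probability,
# and 25 observed cases among them (10 expected), so s = 15.
g = [1.0] * 1000
mu = [0.01] * 1000
p_norm = normal_upper_tail(15.0 / math.sqrt(score_variance(g, mu)))
p_spa = spa_upper_tail(g, mu, 15.0)
print(f"normal approximation: {p_norm:.2e}, saddlepoint: {p_spa:.2e}")
```

In this toy setting the normal approximation gives a much smaller p-value than the saddlepoint tail, which is the kind of anti-conservative behaviour that inflates type I error when cases are rare.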
To demonstrate the application of SAIGE-GENE, we investigated 13 416 genes with at least two rare (MAF ≤ 1%) missense and/or stop-gain variants that were directly genotyped or imputed from the joint HRC and HUNT-WGS reference panel among 69 716 Norwegian samples from HUNT2-3 with measured high-density lipoprotein. We identified eight genes with p-values below the exome-wide significance threshold (P ≤ 2.5 × 10⁻⁶), seven of which remained significant after conditioning on nearby single-variant associations, suggesting independent rare coding variants within these genes 16. Importantly, using SAIGE and SAIGE-GENE, we were able to use all samples, account for sample relatedness and case-control imbalance, and maintain well-controlled type I error rates.
were not replicated at genome-wide significance in a genome-wide association study (GWAS) of 359,432 genotyped variants in HUNT. However, after imputing the dataset with the HRC and HUNT-WGS reference panel to cover more variants, or after meta-analysis in GLGC, significant associations in all 5 linkage peaks were observed. This study demonstrates one of the benefits of linkage analysis over GWAS, namely the ability to test for linkage in regions that are difficult to genotype, such as rare variants, structural variants, copy number variants or variants in highly repetitive regions, as long as identical by descent segments in the region can be identified 18. Finally, linkage analysis may improve statistical power when investigating rare risk variants which segregate within families and reduce confounding effects of population stratification.
The high degree of relatedness in the HUNT Study participants has enabled analysis methods that are tailored to this study design. These include GWAS by proxy 19, 20, in which the phenotypes of non-genotyped family members of genotyped HUNT participants can be used to identify proxy-cases, individuals with a proportion (0.5 for first degree relatives) of the genetic risk of cases.
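The "0.5 of the genetic risk" idea for first-degree relatives can be made concrete with a small worked calculation. This is only a sketch under simplifying assumptions: it treats the control allele frequency as the population frequency and mixes case and control frequencies in proportion to the expected allele sharing. The function names and the three-level phenotype coding are hypothetical and are not taken from the cited GWAS-by-proxy methods.

```python
def case_allele_frequency(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)

def proxy_case_allele_frequency(control_freq, odds_ratio, sharing=0.5):
    """Expected frequency in proxy-cases: a fraction `sharing` of alleles is
    drawn from the case distribution, the rest from the control distribution."""
    f_case = case_allele_frequency(control_freq, odds_ratio)
    return sharing * f_case + (1.0 - sharing) * control_freq

def phenotype_code(is_case, has_affected_first_degree_relative):
    """Hypothetical coding: case 1.0, proxy-case 0.5, control 0.0."""
    if is_case:
        return 1.0
    return 0.5 if has_affected_first_degree_relative else 0.0

# Allele with frequency 0.21 in controls and an odds ratio of 1.1.
f_control = 0.21
f_case = case_allele_frequency(f_control, 1.1)
f_proxy = proxy_case_allele_frequency(f_control, 1.1)
print(f"controls {f_control:.4f}, proxy-cases {f_proxy:.4f}, cases {f_case:.4f}")
```

Leaving proxy-cases among the controls pulls the control frequency towards the case frequency and shrinks the observable difference, which is one intuition for why modelling them separately can recover power.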
These proxy-cases can be appropriately modelled to increase statistical power in GWAS. For example, the power to detect an allele with an odds ratio of 1.1 and MAF of 0.21 at an alpha of 5 × 10⁻⁸ increases from 0.419 to 0.644 when proxy-cases are appropriately modelled instead of being used as controls, as in standard GWAS (Supplementary Figure 5).
The wealth of phenotypic and genetic data available in the HUNT cohort has led to the discovery of many new genetic associations across a broad range of traits (Table 3). Early genetic studies of HUNT participants used exome arrays and focused on cardiovascular disease. We identified a novel coding variant in TM6SF2 associated with total cholesterol, MI, and liver enzymes 6 and replicated known MI associations at the 9p21 locus and a low-frequency missense variant in the LPA gene (p.Ile1891Met) 7. Following the genotyping of nearly 70 000 participants in HUNT2 and HUNT3 and the development of a combined HRC and HUNT-WGS imputation reference panel, we extended our analyses to a genome-wide search (Figure 5). Through imputation of indels called from low-pass HUNT-WGS, we discovered a rare mutation in the MEPE gene, enriched in the Norwegian population (0.8% in HUNT, 0.1% in non-Finnish Europeans), that was associated with low forearm bone mineral density and increased risk of osteoporosis and fractures 21. Although this region had been previously identified as associated with bone mineral density 22, the association in HUNT with replication in the UK Biobank 23 pinpointed MEPE as the likely causal gene in the region by identifying an insertion/deletion polymorphism that likely resulted in a loss-of-function protein.
In another study we paid special attention to loss-of-function mutations associated with favorable blood lipid profiles (reduced LDL cholesterol and reduced coronary artery disease (CAD) risk) which were not associated with altered liver enzymes or liver damage. We additionally found one elderly individual with a homozygous ZNF529 loss-of-function variant showing no signs of cardiovascular disease or diabetes, suggesting that the full knock-out of this gene is viable. This highlighted ZNF529 as a potential therapeutic target for lipids 24 identified from sequencing and custom content genotyping.
In addition to the association studies performed using HUNT data only, we have contributed to many international consortium efforts aimed at aggregating GWAS data across cohorts. By performing GWAS meta-analyses that included HUNT and other cohorts, efforts driven by our research team have identified genetic variants associated with atrial fibrillation which may act through a mechanism of impaired muscle cell differentiation and tissue formation during fetal heart development 25 and cardiac structural remodeling 26; variants associated with estimated glomerular filtration rate exhibiting a sex-specific effect 27, 28; and variants associated with thyroid stimulating hormone (TSH) that revealed an inverse relationship between TSH levels and thyroid cancer 29. Later studies using the TOPMed reference panel 30 identified variants associated with circulating cardiac troponin I level and investigated its role as a non-causal biomarker for MI using Mendelian randomization 31, and identified variants associated with iron-related biomarker levels and explored their relationship with all-cause mortality 32.
The high degree of relatedness in the HUNT Study offers a unique opportunity to use family-based designs to investigate causal associations. Mendelian randomization (MR), which uses genetic variants as instrumental variables to investigate modifiable (non-genetic) factors, was first proposed using parent-offspring designs 40. Alleles that are inherited from each parent are randomly determined during the meiotic process. This random allocation is essential to providing reliable comparisons in MR studies. However, due to the lack of genotyped family data, previous studies applied MR on the population level, where the random allocation of alleles is only approximate. We were able to use the ~15 000 families in HUNT to perform MR as originally proposed, in family-based designs 36. Using this approach in HUNT, we showed empirically that MR estimates from samples of unrelated individuals for the associations of taller height and lower BMI with increased educational attainment were likely induced by population structure, assortative mating or dynastic effects. We observed no clear associations in within-family MR analyses in HUNT or in a replication cohort of 222 368 siblings from 23andMe 36. This approach has since grown in popularity and, together with HUNT, many cohorts now contribute to the investigation of causal associations with family-based designs 37. Further leveraging the family structure information in HUNT, we have performed and have future opportunities to investigate causal effects between family members, for example parent-offspring effects, assortative mating and sibling effects 43.
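The contrast between population-level and within-family MR described above can be sketched with a toy calculation. This is an illustrative simplification only: it uses a single genetic instrument, a Wald ratio without standard errors, and made-up sibling data in which a family-level confounder `c` shifts both exposure and outcome. The function names and data are hypothetical and do not reproduce the analyses in reference 36.

```python
from collections import defaultdict

def _centre(values, groups):
    """Subtract each group's mean from the values of its members."""
    sums, counts = defaultdict(float), defaultdict(int)
    for val, grp in zip(values, groups):
        sums[grp] += val
        counts[grp] += 1
    return [val - sums[grp] / counts[grp] for val, grp in zip(values, groups)]

def _slope(u, v):
    """OLS slope of v on u for values that are already centred."""
    return sum(a * b for a, b in zip(u, v)) / sum(a * a for a in u)

def wald_ratio(family, g, x, y, within_family=True):
    """Ratio of the gene-outcome slope to the gene-exposure slope.
    With within_family=True, all variables are centred on family means first,
    so only differences between siblings contribute."""
    groups = family if within_family else [0] * len(g)
    gc, xc, yc = (_centre(v, groups) for v in (g, x, y))
    return _slope(gc, yc) / _slope(gc, xc)

# Three sibling pairs; the true causal effect of x on y is 0.5 by construction.
family = [1, 1, 2, 2, 3, 3]
g = [0, 1, 1, 2, 0, 2]                       # instrument (allele count)
c = [0, 0, 5, 5, -3, -3]                     # family-level confounder
x = [2 * gi + ci for gi, ci in zip(g, c)]    # exposure
y = [0.5 * xi + 4 * ci for xi, ci in zip(x, c)]  # outcome

print("population-level estimate:", wald_ratio(family, g, x, y, within_family=False))
print("within-family estimate:   ", wald_ratio(family, g, x, y, within_family=True))
```

Because the confounder is constant within each family, centring on family means removes it and the within-family ratio recovers the constructed effect, whereas the population-level ratio is biased in this toy example.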
of genotyped family data and this has both limited causal inference (as mentioned above) and the ability of typical GWAS to distinguish between direct and indirect genetic effects 37. HUNT data allows for study designs to disentangle these sources of genotype-phenotype associations in humans. In one such example, we used 26 057 mother-offspring and 9 792 father-offspring pairs to investigate whether adverse environmental factors in utero increased future risk of cardiometabolic disease in the offspring. We observed that an adverse maternal intrauterine environment, as proxied by maternal SNPs that influence offspring birthweight, was unlikely to be a major determinant of late-life cardiometabolic outcomes of the offspring 41.
We contribute to genetic studies worldwide through participation in consortia focused on a variety of diseases including cardiovascular disease 44, 45, lipids 46, 47, type 2 diabetes 48, osteoarthritis 49, decline in kidney function 50, Alzheimer's disease 51, bipolar disorder 52, intracranial aneurysms 53, insomnia 54, respiratory health 55 and sleepiness 56. We also contributed HUNT data to studies of anthropometric traits 57, alcohol and nicotine use 58, 59, COVID-19 60, phenome-wide discovery 61, and genetic risk prediction 62, among others. We believe that team science by consortia 61 fulfills the goals of the HUNT study and moves the science fastest towards new discoveries and improved human health.
Together, the multifaceted genetic discovery strategy incorporating genotyping, sequencing, and imputation-based approaches in HUNT has aided the identification of likely causal genes and variants for disease and human traits. It has also proved to be a valuable resource for genetically informed methods of causal inference, supporting the identification of modifiable risk factors.
We owe this success to the willingness and high participation rates of the people of Trøndelag, the vast phenotyping collected by decades of HUNT researchers, and access to digitized public health care systems. We anticipate that this rich data collection will remain a unique resource for longitudinal and family-based designs, genetic discoveries, Mendelian randomization, meta-analysis and polygenic score validation well into the future.
approval for such linkage from the Regional Committee for Medical and Health Research Ethics, Norway and each registry owner. GWAS summary statistics from publications including HUNT are available from NTNU Open Research Data (https://dataverse.no/dataverse/root) and the Willer lab (http://csg.sph.umich.edu/willer/public/).
The genotyping in HUNT and work presented in this cohort profile was approved by the Regional Committee for Ethics in Medical Research, Central Norway (2014/144, 2018/1622, 152023). All participants signed informed consent for participation and the use of data in research.
Cohort Profile: the HUNT Study, Norway
Cohort Profile Update: The HUNT Study
The Nord-Trøndelag Health Study 1995-97 (HUNT 2): objectives, contents, methods and participation
Cohort profile of the Young-HUNT Study, Norway: a population-based study of adolescents
Systematic evaluation of coding variation identifies a candidate causal variant in TM6SF2 influencing total cholesterol and myocardial infarction risk
No large-effect low-frequency coding variation found for myocardial infarction
Meta-analysis of gene-level tests for rare variant association
A reference panel of 64,976 haplotypes for genotype imputation
A reference panel of 64,976 haplotypes for genotype imputation
Improving power of association tests using multiple sets of imputed genotypes from distributed reference panels
Genetic Risk Factors for Lung Cancer: Relationship to Smoking Habits and Nicotine Addiction: The Nord-Trøndelag (HUNT) and Tromsø Health Studies
Methods for association analysis and meta-analysis of rare variants in families
Efficient Bayesian mixed-model analysis increases association power in large cohorts
Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies
Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts
Set-based rare variant association tests for biobank scale sequencing data sets
Methods and Applications for Collection, Contamination Estimation, and Linkage Analysis of Large-scale Human Genotype Data
Case-control association mapping by proxy using family history of disease
Incorporating family disease history and controlling case-control imbalance for population based genetic association studies
MEPE loss-of-function variant associates with decreased bone mineral density and increased fracture risk
Twenty bone-mineral-density loci identified by large-scale meta-analysis of genome-wide association studies
Exome sequencing and characterization of 49,960 individuals in the UK Biobank
Loss-of-function genomic variants highlight potential therapeutic targets for cardiovascular disease
Genome-wide Study of Atrial Fibrillation Identifies Seven Risk Loci and Highlights Biological Pathways and Regulatory Elements Involved in Cardiac Development
Biobank-driven genomic discovery yields new insight into atrial fibrillation biology
Sex-specific and pleiotropic effects underlying kidney function identified from GWAS meta-analysis
Discovery and prioritization of variants and genes for kidney function in >1.2 million individuals
GWAS of thyroid stimulating hormone highlights pleiotropic effects and inverse association with thyroid cancer
Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program
Genome-wide association study of cardiac troponin I in the general population
Genome-wide meta-analysis of iron status biomarkers and the effect of iron on all-cause mortality in HUNT
Systematic evaluation of coding variation identifies a candidate causal variant in TM6SF2 influencing total cholesterol and myocardial infarction risk
Cardiovascular Disease Risk, and an Investigation of Potential Unanticipated Effects of PCSK9 Inhibition
Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies
Avoiding dynastic, assortative mating, and population stratification biases in Mendelian randomization through within-family analyses
Within-sibship GWAS improve estimates of direct genetic effects
The causal effects of serum lipids and apolipoproteins on kidney function: multivariable and bidirectional Mendelian-randomization analyses
Trans-ethnic Mendelian-randomization study reveals causal relationships between cardiometabolic factors and chronic kidney disease
'Mendelian randomization': can genetic epidemiology contribute to understanding environmental determinants of disease?
Mendelian randomization study of maternal influences on birthweight and future cardiometabolic risk in the HUNT cohort
Investigating a Potential Causal Relationship Between Maternal Blood Pressure During Pregnancy and Future Offspring Cardiometabolic Health
Deconstructing the sources of genotype-phenotype associations in humans
Discovery and systematic characterization of risk variants and genes for coronary artery disease in over a million participants
Genetic Architecture of Abdominal Aortic Aneurysm in the Million Veteran Program
Discovery and refinement of loci associated with lipid levels
The power of genetic diversity in genome-wide association studies of lipids
Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes
Deciphering osteoarthritis genetics across 826,690 individuals from 9 populations
The CKDGen Consortium: ten years of insights into the genetic basis of kidney function
A genome-wide association study with 1,126,563 individuals identifies new risk loci for Alzheimer's disease
Genome-wide association study of more than 40,000 bipolar disorder cases provides new insights into the underlying biology
Genome-wide association study of intracranial aneurysms identifies 17 risk loci and genetic overlap with clinical risk factors
Biological and clinical insights from genetics of insomnia symptoms
Genetic associations and architecture of asthma-chronic obstructive pulmonary disease overlap
Genome-wide association analysis of self-reported daytime sleepiness identifies 42 loci that suggest biological subtypes
Genetic studies of body mass index yield new insights for obesity biology
Association studies of up to 1.2 million individuals yield new insights into the genetic etiology of tobacco and alcohol use
Model-based assessment of replicability for genome-wide association meta-analysis
Mapping the human genetic architecture of COVID-19
Global Biobank Meta-analysis Initiative: powering genetic discovery across human diseases
Sex-specific survival bias and interaction modeling in coronary artery disease risk prediction
A special thanks to all HUNT participants for donating their time, samples and information to help others.