key: cord-0000352-mxjtj5c0
authors: Zhang, Yonqing; De, Supriyo; Garner, John R; Smith, Kirstin; Wang, S Alex; Becker, Kevin G
title: Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information
date: 2010-01-21
journal: BMC Med Genomics
DOI: 10.1186/1755-8794-3-1
sha: c8910b943170c80f2f3c5e5846b615d9c42622e1
doc_id: 352
cord_uid: mxjtj5c0

BACKGROUND: The genetic contributions to human common disorders and mouse genetic models of disease are complex and often overlapping. In common human diseases, unlike classical Mendelian disorders, genetic factors generally have small effect sizes, are multifactorial, and are highly pleiotropic. Likewise, mouse genetic models of disease often have pleiotropic and overlapping phenotypes. Moreover, phenotypic descriptions in the literature in both human and mouse are often poorly characterized and difficult to compare directly. METHODS: In this report, human genetic association results from the literature are summarized with regard to replication, disease phenotype, and gene specific results; and organized in the context of a systematic disease ontology. Similarly summarized mouse genetic disease models are organized within the Mammalian Phenotype ontology. Human and mouse disease and phenotype based gene sets are identified. These disease gene sets are then compared individually and in large groups through dendrogram analysis and hierarchical clustering analysis. RESULTS: Human disease and mouse phenotype gene sets are shown to group into disease and phenotypically relevant groups at both a coarse and fine level based on gene sharing. CONCLUSION: This analysis provides a systematic and global perspective on the genetics of common human disease as compared to itself and in the context of mouse genetic models of disease.

Common complex diseases such as cardiovascular disease, cancer, and autoimmune disorders; metabolic conditions such as diabetes and obesity; and neurological and psychiatric disorders account for the majority of morbidity and mortality in developed countries. The specific genetic contributions to disease etiology and their relationships to environmental factors in common disorders remain unclear. They are complicated by many factors, such as gene-gene interactions, the balance between susceptibility and protective alleles, copy number variation, the low relative risk contributed by each gene, and a myriad of complex environmental inputs.

Genetic association studies using a candidate gene approach, and more recently genome-wide association studies (GWAS), have produced a large and rapidly increasing amount of information on the genetics of common disease. In parallel, mouse genetic models for human disease have provided a wealth of genetic and phenotypic information. While not always perfect models for human common complex disorders, the genetic purity and experimental flexibility of mouse disease models have produced valuable insights relevant to human disease.

Gene nomenclature standardization [1], database efforts [2] [3] [4], and phenotype ontology projects [5] in both human and mouse over the past decade have provided the foundation for integration of information on genetic contributions to disease and phenotypes. This provides the opportunity for systematic comparison and higher order systems analysis of disease and phenotypic information. In this report, we summarize and integrate large scale human genetic association information and mouse genetically determined phenotypic information with the goal of identifying fundamental relationships in human disease and mouse models of human disease.

The Genetic Association Database [2] (GAD) http://geneticassociationdb.nih.gov is an archive of summary data of published human genetic association studies of many common disease types. GAD is primarily focused on archiving information on common complex human disease rather than rare Mendelian disorders as found in the Online Mendelian Inheritance in Man (OMIM) [6]. GAD contains curated information on candidate gene studies and more recently on genome wide association studies. It builds on the curation of the CDC HuGENet info literature database [3], in part by adding molecular and ontological annotation, creating a bridge between epidemiological and molecular information. This allows the large-scale integration of disease based genetic association information with genomic and molecular information as well as with the software tools and computational approaches that use genomic information [7] [8] [9] [10] [11] [12]. This report is a summary and analysis of the genes and diseases with positive associations in the Genetic Association Database with regard to replication, comparisons between diseases, and within broad phenotypic disease classes. Although GAD contains information on gene variation, this report is at the gene level only and does not consider specific gene variation or genetic polymorphism.

The Genetic Association Database (GAD) currently contains approximately 40,000 individual gene records of genetic association studies taken from over 23,000 independent publications. Importantly, a large number (11,568) of the records in GAD have a designation of whether the gene of record was reported to be associated (Y) or not associated (N) with the disease phenotype for that specific record. Many records, for various reasons, do not have such a designation. In addition, a portion of the database records have been annotated with standardized disease phenotype keywords from the MeSH http://www.nlm.nih.gov/mesh/ vocabulary. The GAD summations shown below are a subset of the records in GAD. They include only those records that both a) are positively associated with a disease phenotype and b) have a MeSH disease phenotype annotation. This subset comprises 10,324 records. Records designated as not associated (N) with a disease phenotype and those without MeSH disease annotation are not considered in this report.

The mouse phenotypic information described here was obtained from the Mouse Genome Informatics (MGI) database [4] http://www.informatics.jax.org/ Phenotypes, Alleles and Disease Models section. The file used for mouse phenotypic information (see methods) is comprised of 5011 unique genes and 5142 unique phenotypic terms derived from information from specific gene mutations in multiple mouse strains. The mouse phenotypic information had been annotated to the mouse gene mutation records using Mammalian Phenotype terms and codes in the mouse phenotype database as a component of the Mouse Phenotyping Project [5, 13] .

Quantitation of how often a disease phenotype was positively associated with a gene was performed as follows. GAD records having both recorded positive associations and annotated MeSH disease keywords were extracted and stored in a database according to their relationships. Using a Perl script, the number of times each MeSH disease keyword was positively associated with a specific gene in the GAD database was recorded. These counts were sorted in declining order for each unique gene, grouped by the MeSH disease term with which they are associated.
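The counting step described above can be sketched as follows. This is a minimal Python illustration, not the authors' Perl script; the record layout (gene, MeSH term, Y/N flag), the function names, and the toy records are all assumptions made for the example and are not GAD data.

```python
from collections import Counter, defaultdict

# Hypothetical record layout: (gene symbol, MeSH disease term, association flag).
# These toy records are illustrative only; they are not taken from GAD.
records = [
    ("GENE_A", "Disease X", "Y"),
    ("GENE_A", "Disease Y", "Y"),
    ("GENE_A", "Disease X", "Y"),
    ("GENE_A", "Disease Y", "N"),   # reported as not associated: ignored
    ("GENE_B", "Disease Y", "Y"),
    ("GENE_B", None, "Y"),          # no MeSH annotation: ignored
]

def count_positive_associations(records):
    """Count how often each (gene, MeSH term) pair was positively associated."""
    counts = Counter()
    for gene, mesh_term, flag in records:
        if flag == "Y" and mesh_term:
            counts[(gene, mesh_term)] += 1
    return counts

def gene_to_disease(counts):
    """Group counts by gene, sorted in declining order (ties broken by term)."""
    by_gene = defaultdict(list)
    for (gene, term), n in counts.items():
        by_gene[gene].append((term, n))
    for pairs in by_gene.values():
        pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(by_gene)

summary = gene_to_disease(count_positive_associations(records))
```

Each entry of `summary` then has the shape of one row of a GENE-to-Disease/Phenotype list: a gene followed by its disease terms and counts in declining order.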

The mouse phenotypic information described here was obtained from the Mouse Genome Informatics (MGI) database http://www.informatics.jax.org/, Phenotypes, Alleles and Disease Models section (ftp://ftp.informatics.jax.org/pub/reports/index.html#pheno), using these three files downloaded on 4-4-2008:
ftp://ftp.informatics.jax.org/pub/reports/MPheno_OBO.ontology
ftp://ftp.informatics.jax.org/pub/reports/MGI_PhenotypicAllele.rpt
ftp://ftp.informatics.jax.org/pub/reports/MGI_PhenoGenoMP.rpt

The mouse phenotype files were extracted using a perl script annotating each gene with the phenotype term associated with each Mammalian Phenotype (MP) code.
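As a rough illustration of this annotation step, the sketch below reads term names from OBO-style text and substitutes them for MP codes in gene rows. It is written in Python rather than Perl, the MP identifiers and names are placeholders rather than real MGI entries, and the two-column (gene, MP code) row layout is an assumption, not the actual layout of the MGI report files.

```python
from collections import defaultdict

# Toy OBO-style text. The identifiers and names are placeholders, not MGI terms.
OBO_TEXT = """\
[Term]
id: MP:9000001
name: example phenotype one

[Term]
id: MP:9000002
name: example phenotype two

[Typedef]
id: is_a
name: is_a
"""

def parse_obo_names(text):
    """Map each [Term] id to its name; other stanza types are skipped."""
    names, current_id, in_term = {}, None, False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            in_term, current_id = (line == "[Term]"), None
        elif in_term and line.startswith("id: "):
            current_id = line[len("id: "):]
        elif in_term and current_id and line.startswith("name: "):
            names[current_id] = line[len("name: "):]
    return names

def annotate_genes(gene_mp_rows, names):
    """Replace each MP code with its term name; unknown codes are kept as-is."""
    annotated = defaultdict(list)
    for gene, mp_code in gene_mp_rows:
        term = names.get(mp_code, mp_code)
        if term not in annotated[gene]:
            annotated[gene].append(term)
    return dict(annotated)

# Hypothetical (gene, MP code) rows.
rows = [
    ("GeneA", "MP:9000001"),
    ("GeneA", "MP:9000002"),
    ("GeneB", "MP:9000002"),
    ("GeneB", "MP:9000003"),  # code absent from the toy ontology
]
mp_names = parse_obo_names(OBO_TEXT)
gene_terms = annotate_genes(rows, mp_names)
```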

Individual GAD primary gene sets were analyzed using Venny [14] http://bioinfogp.cnb.csic.es/tools/venny/index.html. Pathway Venn diagram comparisons were performed by placing individual GAD primary gene sets into WebGestalt [15] http://bioinfo.vanderbilt.edu/webgestalt/ to identify KEGG pathways, then placing the resulting pathway names into Venny.

Relationships between diseases were identified by a method similar to phylogenetic classification. First, the distance between each pair of diseases was calculated by counting the genes common to the pair and dividing that count by the size of the smaller gene set of the pair. This number was then subtracted from 1, so that two identical lists (100% match) have a resultant distance of 0. This is represented in the formula:

$$d_{ij} = 1 - \frac{N(C_i \cap C_j)}{\min\left(N(C_i),\, N(C_j)\right)}$$

Where: $C_k$: genes in each disease set (where k = i, j); $N(C_k)$: number of genes in each disease set (where k = i, j); $d_{ij}$ is the pairwise distance; i, j: indices of the disease sets being compared, where i = 1, 2, 3, ..., n; j = 1, 2, 3, ..., m.
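The pairwise distance described above (one minus the number of shared genes divided by the size of the smaller set) can be sketched in a few lines of Python. The function names and the toy gene sets are hypothetical; this is an illustration of the stated rule, not the authors' code.

```python
from itertools import combinations

def overlap_distance(genes_i, genes_j):
    """1 - (shared genes) / (size of the smaller gene set)."""
    genes_i, genes_j = set(genes_i), set(genes_j)
    if not genes_i or not genes_j:
        raise ValueError("both gene sets must be non-empty")
    shared = len(genes_i & genes_j)
    return 1.0 - shared / min(len(genes_i), len(genes_j))

def distance_matrix(gene_sets):
    """Symmetric matrix of pairwise distances, stored as nested dicts."""
    names = sorted(gene_sets)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = overlap_distance(gene_sets[a], gene_sets[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix

# Toy gene sets (placeholder names, not GAD data).
toy_sets = {
    "disease_1": {"g1", "g2", "g3"},
    "disease_2": {"g1", "g2", "g4", "g5"},
    "disease_3": {"g6", "g7", "g8"},
}
dm = distance_matrix(toy_sets)
```

One consequence of normalizing by the smaller set is that a gene set fully contained in a larger one has distance 0 to it, even though the two lists are not identical.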

The disease relationships were calculated from the distance matrix using the Fitch program from the Phylip package [16]. It calculates the relationships based on the Fitch and Margoliash method of constructing phylogenetic trees [17] using the following formula (from the Phylip manual):

$$\text{Sum of squares} = \sum_{i} \sum_{j} \frac{n_{ij}\,\left(D_{ij} - d_{ij}\right)^{2}}{D_{ij}^{P}}$$

where D is the observed distance between gene sets i and j (the pairwise gene-sharing distance described above) and d is the expected distance, computed as the sum of the lengths of the segments of the tree from gene set i to gene set j. The quantity n is the number of times each distance has been replicated. In simple cases n is taken to be one. If n is chosen to be more than 1, the distance is then assumed to be a mean of those replicates. The power P is what distinguishes between the Fitch and Neighbor-Joining methods. For the Fitch-Margoliash method P is 2.0 and for the Neighbor-Joining method it is 0.0. Because running Fitch took a long time when the gene-set size was large (weeks for the human gene sets and months for the mouse gene sets), the Neighbor-Joining method was used to create the replicate dendrograms (not shown) after randomizing the input order for greater confidence. The resulting coefficient matrix files were displayed using the Phylodraw graphics program [18].
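To make the role of the quantities n, D, d, and P concrete, the following Python sketch evaluates the weighted least-squares criterion for a given set of observed distances and tree path lengths. It only scores a candidate tree; it does not search tree space the way the Fitch program does. The function name, the dictionary layout, and the toy numbers are assumptions for illustration.

```python
def weighted_sum_of_squares(observed, expected, power=2.0, replicates=None):
    """Sum over pairs of n * (D - d)^2 / D^P.

    observed:   {(i, j): D_ij} observed pairwise distances
    expected:   {(i, j): d_ij} path lengths through a candidate tree
    power:      the exponent P (2.0 weights by 1/D^2; 0.0 is unweighted)
    replicates: optional {(i, j): n_ij}; defaults to 1 for every pair
    """
    total = 0.0
    for pair, D in observed.items():
        d = expected[pair]
        n = 1 if replicates is None else replicates.get(pair, 1)
        if D <= 0 and power != 0:
            # Assumption of this sketch: zero distances cannot be weighted.
            raise ValueError("observed distances must be positive when P != 0")
        total += n * (D - d) ** 2 / (D ** power)
    return total

# Toy distances (illustrative only).
observed = {("A", "B"): 0.5, ("A", "C"): 1.0, ("B", "C"): 1.0}
expected = {("A", "B"): 0.4, ("A", "C"): 1.0, ("B", "C"): 0.8}

ss_p2 = weighted_sum_of_squares(observed, expected, power=2.0)
ss_p0 = weighted_sum_of_squares(observed, expected, power=0.0)
```

With P = 2.0 the same absolute error counts for more on a short observed distance than on a long one; with P = 0.0 every pair is weighted equally.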

Ward's minimum variance method [19] was used to find the distance between two diseases. The distance between two clusters is the ANOVA sum of squares between the two clusters added up over all the variables. At each generation, the within-cluster sum of squares is minimized over all partitions obtainable by merging two clusters from the previous generation. Ward's method joins clusters to maximize the likelihood at each level of the hierarchy under the assumptions of multivariate normal mixtures, spherical covariance matrices, and equal sampling probabilities. Distance for Ward's method is:

$$D_{KL} = \frac{\lVert \bar{x}_K - \bar{x}_L \rVert^{2}}{\dfrac{1}{N_K} + \dfrac{1}{N_L}}$$

where $N_K$ is the number of observations in $C_K$, the Kth cluster (a subset of {1, 2, ..., n}, where n is the number of observations), and $\bar{x}_K$ is the mean vector for cluster $C_K$; $N_L$ and $\bar{x}_L$ are defined likewise for cluster $C_L$.
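A small numerical check can make Ward's distance more tangible. The Python sketch below computes the squared distance between cluster means divided by (1/N_K + 1/N_L) and compares it with the increase in within-cluster sum of squares caused by merging the two clusters. The observation vectors are made-up points; how disease gene sets were encoded as vectors is not specified here, so that encoding is left out.

```python
def mean_vector(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def within_ss(points):
    """Sum of squared deviations from the cluster mean."""
    m = mean_vector(points)
    return sum((p[k] - m[k]) ** 2 for p in points for k in range(len(m)))

def ward_distance(cluster_k, cluster_l):
    """||mean_K - mean_L||^2 / (1/N_K + 1/N_L)."""
    mk, ml = mean_vector(cluster_k), mean_vector(cluster_l)
    squared_gap = sum((a - b) ** 2 for a, b in zip(mk, ml))
    return squared_gap / (1.0 / len(cluster_k) + 1.0 / len(cluster_l))

# Toy observations (illustrative only).
cluster_k = [(0.0, 0.0), (2.0, 0.0)]
cluster_l = [(4.0, 0.0)]

d_kl = ward_distance(cluster_k, cluster_l)
ss_increase = (within_ss(cluster_k + cluster_l)
               - within_ss(cluster_k) - within_ss(cluster_l))
```

In this toy case both quantities agree, which is the sense in which merging the pair with the smallest Ward distance minimizes the growth of the within-cluster sum of squares.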

Each record in GAD represents a specific gene from a unique publication of a human population based genetic association study and is categorized into one of 24 general disease classes corresponding to broad MeSH disease or disease phenotypic groupings. Table 1 is a summary of the number of positively associated human genes in each MeSH human disease class. As represented by these disease classes, the GAD database covers a broad selection of diseases falling into major disease classes including aging studies, cancer, immune disorders, psychiatric diseases, metabolic conditions, pharmacogenomic studies, and studies of chemical dependency, among others. Similarly, each record in the phenotype files from the MGI phenotype database represents a unique mouse gene specific genetic model. Table 2 shows the general categories represented by the mouse phenotype summary files and the number of mouse genes found in each top level phenotype class. The mouse files contain a greater number of intermediate developmental and morphological phenotypes (e.g. insulin resistance, absent CD4+ T cells, abnormal spatial learning) while the human files tend to comprise a greater number of end stage clinical disease phenotypes (e.g. Type 2 Diabetes, multiple sclerosis, autism). Table 3 introduces examples of human genes from fundamental biological pathways that have been consistently associated with major disease phenotypes, highlighting the sometimes broad pleiotropic effects that major regulatory molecules have on multiple disease phenotypes. Genes such as NOS3, nitric oxide synthase 3, regulating nitric oxide production; HLA-DQB1, the MHC class II molecule DQ beta 1, involved in antigen presentation; ACE, angiotensin I converting enzyme, central to the renin-angiotensin system; and PPARG, peroxisome proliferator-activated receptor gamma, regulating transcription in pathways important in lipid metabolism, are examples of genes that affect multiple tissues and different organ systems through the complex course of disease progression. Importantly, all the mouse orthologs of the human genes in Table 3 have experimentally determined phenotypes that are similar or broadly overlapping with human clinical disease phenotypes (see below).

The majority of this report is built upon large non-redundant general summary lists for both human and mouse, shown below. These lists take two complementary forms in both human and mouse. The first sets are GENE-to-Disease/Phenotype lists. These are non-redundant lists of genes showing the diseases or phenotypes that have been associated with each gene (Table 4 human, Table 5 mouse, and Table 6 human-mouse). The second sets of basic lists are DISEASE/PHENOTYPE-to-Gene lists. These are non-redundant lists of diseases or phenotypes with the genes that have been associated with that disease or phenotype (Table 7 human and Table 8 mouse). Table 4 shows examples of selected genes in each row that have been positively associated with specific disease phenotype keywords. Each human gene symbol is followed by a specific MeSH disease term and the number of times that gene has been positively associated with the term, in declining order. A major feature of Table 4 is that individual genes have been positively associated with sometimes overlapping disease phenotypes over a broad range, from more frequently to less frequently. Table 4 is a small representative subset, truncated in the number of genes (rows) and the number of MeSH terms (columns). The complete list of 1,584 human genes with additional information can be found in Table S1a [20]. An interactive version of the same list can be found in Table S1b [21].

Quite often the resulting list of phenotypes associated with a specific gene may include the major disease phenotype followed by specific sub-phenotypes of the disease that contribute distinct aspects to the overall clinical disease phenotype. For example, IL13 has been associated with asthma at least 11 times as well as to the asthma sub-phenotype immediate hypersensitivity 4 times. Similarly, the gene CFH has been associated with macular degeneration at least 19 times, as well as to the endophenotype of macular degeneration, choroidal neovascularization 3 times. Although replication in genetic association studies has been widely debated [22] , consistent replication by independent groups, although sometimes with both modest risk and significance values [23] , suggests a fundamental measure of scientific validity. This is true for both candidate gene as well as GWAS studies.

In other cases, individual genes have been associated with independent but related disorders that may share fundamental biological pathways in disease etiology, such as HLA-DQB1, CTLA4, and PTPN22 as in the case of autoimmune disorders. This gene overlap emphasizes the fundamental, often step-wise biochemical role each gene plays in shared disease etiology [24] [25] [26] [27] . That is, HLA-DQB1 in antigen presentation, CTLA4 in regulation of the expansion of T cell subsets, and PTPN22 in T cell receptor signaling, all contributing to immunological aberrations and progression to clinical disease, as in rheumatoid arthritis, systemic lupus erythematosus, and type 1 diabetes. In other cases, the same gene has been associated with quite different clinical phenotypes, suggesting sharing of complex biological mechanisms at a more underlying level. For example, the gene CFTR, widely recognized as the cause of cystic fibrosis, has been consistently associated with pancreatitis, may be implicated in chronic rhinitis [28] , and may play a protective role in gastrointestinal disorders [29] .

Tables 5 and S2 are the mouse equivalents of the human GENE-to-Disease/Phenotype lists (Tables 4 and S1 for human). These were developed from the mouse phenotype table of genes with mouse phenotype ontological codes ftp://ftp.informatics.jax.org/pub/reports/index.html#pheno, downloaded on 4-4-08. To build Tables 5 and S2, the matching phenotypic terms were exchanged for each Mammalian Phenotype code (MP:#). This resulted in the mouse GENE-to-Disease/Phenotype tables (Tables 5 and S2) similar in structure to the human GENE-to-Disease/Phenotype tables (Tables 4 and S1). Unlike the human tables, the mouse GENE-to-Disease/Phenotype tables come from individual mouse experimental knockout or other genetic studies. They are not based on population based epidemiological studies. They also do not have the quantitative aspect of the human tables with publication frequency counts tagged to each record. In addition, although they include a wide variety of physiological, neurological, and behavioral phenotypes, they do emphasize developmental studies and observational morphological phenotypes common in mouse knockout studies. Table 5 is a small representative subset, truncated in the number of genes (rows) and the number of Phenotype terms (columns). The complete list of 5011 mouse genes with annotated phenotypes and additional information can be found in Table S2a [30]. An interactive version of the same list can be found in Table S2b [31].

We can now compare these tables directly, thereby allowing gene-by-gene comparison of human disease phenotypes and mouse genetic phenotypes. Tables 6 and S3 are comparisons of the genes that overlap between the human and mouse gene lists (Table S1 and Table S2), showing mouse gene symbols and their human orthologs. Table 6 is a small subset of selected gene-phenotype cross species comparisons. Even though in some cases the human studies have not been replicated, there is often a striking concordance between human disease phenotypes and mouse genetically determined phenotypes. For example, the human gene inhibin alpha (INHA) has been associated with premature ovarian failure [32], and shows mouse phenotypes of abnormal ovarian follicle morphology, female infertility, and ovarian hemorrhage [33], among other phenotypes relevant to human disease. Similarly, in humans the engrailed homeobox 2 gene (EN2) has been associated with autistic disorder [34], while genetic mutations in mouse En2 are involved in abnormal social integration, spatial learning, and social/consecutive interaction, among others [35]. Importantly, the few mouse studies highlighted above, and many found in the main Table S3, were published after the corresponding human genetic population based epidemiological studies. Given concerns of false positives and publication bias in human genetic association studies, direct comparisons to related mouse phenotypes may provide supporting evidence that a given gene may be relevant to a specific human disease phenotype. Table S3 [36] is a full listing of the 1104 shared genes between the human disease and mouse phenotype summaries.

The second type of main summary tables are DISEASE/PHENOTYPE-to-Gene lists. Disease/Phenotype gene summaries are essentially transposed versions of the GENE-to-Disease/Phenotype summaries (Tables S1 & S2) that allow different types of comparisons. These are non-redundant lists of phenotype keywords, MeSH disease terms in the case of human and Mammalian Phenotype (MP) terms in the case of mouse, followed by the genes associated or annotated to those disease phenotype keywords. Table 7 shows examples of selected human disease phenotypes in each row positively associated with specific human genes for 8 major MeSH disease classes including cardiovascular, digestive system diseases, diseases of environmental origin, immune system diseases, mental disorders, nervous system diseases, nutritional and metabolic diseases, and eye diseases. Each MeSH phenotype term is followed by the number of times that a specific disease term has been positively associated with a particular gene in each row, in decreasing order. Table 7 is a small representative set, truncated in the number of disease phenotypes (rows) and the number of genes (columns). The complete list of 1,318 MeSH disease phenotype terms with additional information can be found in Table S4a [37]. An interactive version of the complete list can be found in Table S4b [38].
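The transposition from a GENE-to-Disease/Phenotype list to a DISEASE/PHENOTYPE-to-Gene list can be sketched as below. The gene/term counts are a few of the values quoted for AGT, NOS3, GNB3, and PPARG in this report; the function name and the dictionary layout are assumptions made for this Python illustration.

```python
from collections import defaultdict

# A small excerpt in GENE-to-Disease form: gene -> [(MeSH term, count), ...].
gene_to_disease = {
    "AGT":   [("Hypertension", 24), ("Coronary Disease", 6)],
    "NOS3":  [("Hypertension", 20), ("Myocardial Infarction", 18)],
    "GNB3":  [("Hypertension", 18), ("Obesity", 3)],
    "PPARG": [("Diabetes Mellitus, Type 2", 18), ("Obesity", 11)],
}

def transpose(gene_to_terms):
    """Build term -> [(gene, count), ...], sorted by decreasing count."""
    term_to_genes = defaultdict(list)
    for gene, pairs in gene_to_terms.items():
        for term, count in pairs:
            term_to_genes[term].append((gene, count))
    for pairs in term_to_genes.values():
        pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(term_to_genes)

disease_to_gene = transpose(gene_to_disease)
# The gene names under one term form that term's gene set.
hypertension_gene_set = [gene for gene, _ in disease_to_gene["Hypertension"]]
```

Dropping the counts, as in the last line, yields the plain disease gene sets that are compared in the later analyses.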

Tables 8 and S5 constitute the mouse DISEASE/PHENOTYPE-to-Gene summaries. Table 8 consists of selected mouse phenotypes which fall into similar general classes of the human Table 7, followed by 6 representative genes that have been assigned to the appropriate phenotypic term due to a specific mouse genetic model. Unlike the human Disease/Phenotype-to- [...] Table 8 is also a small representative set, truncated in the number of disease phenotypes (rows) and the number of genes (columns). The complete list of 5,142 mouse phenotype terms with their corresponding Mammalian PhenoCode designations can be found in Table S5a [39]. An interactive version of the complete list can be found in Table S5b [40].

[Table fragment: human genes (gene ID, gene symbol) with positively associated MeSH disease terms and the number of positive associations]
[...] (7); BREAST NEOPLASMS (7); MYOCARDIAL [...] (7); Celiac Disease (6); Tuberculosis, Pulmonary (5)
1493 CTLA4: Diabetes Mellitus, Type 1 (28); Graves Disease (21); Thyroiditis, Autoimmune (10); Autoimmune Diseases (8)
183 AGT: Hypertension (24); Coronary Disease (6); Diabetic Nephropathies (5); Myocardial Infarction (5)
1814 DRD3: Schizophrenia (24); Dyskinesia, Drug-Induced (6); Psychotic Disorders (5); Alcoholism (2)
4846 NOS3: Hypertension (20); Myocardial Infarction (18); Coronary Artery Disease (15); Cardiomyopathy, Dilated (2)
5468 PPARG: Diabetes Mellitus, Type 2 (18); Obesity (11); Diabetes Mellitus (6); Insulin Resistance (4)
2784 GNB3: Hypertension (18); Insulin Resistance (4); Diabetes Mellitus, Type 2 (3); Obesity (3)
1815 DRD4: Attention Def. Dis. with Hyperact. (17); Schizophrenia (8); Substance-Related Disorders (4); Mood Disorders (4)
1813 DRD2: Alcoholism (17); Schizophrenia (14); Personality Disorders (2); Depressive Disorder (2)
155 ADRB3: Obesity (17) [...]

The purpose of this project is not simply to generate lists and information. It is to provide a distillation of disease and phenotype information that can be used in dissecting the complexities of human disease and mouse biology. Now that we have generated GENE-to-Disease/Phenotype summaries and DISEASE/PHENOTYPE-to-Gene summaries for both mouse and human, they can be used for systematic analysis, comparison, and integration of orthologous data with the goal of providing higher order interpretations of human disease and mouse genetically determined phenotypes.

Gene sets have been defined simply as groups of genes that share common biological function, chromosomal location, or regulation [41]. Gene sets are used in high-throughput systematic analysis of microarray data using a priori knowledge. Unlike previously defined gene sets based on biological pathways or differentially expressed genes [41], GAD disease gene sets are unique in that they are composed of genes that have been previously shown to be polymorphic and have been determined to be genetically positively associated with a specific disease phenotype in a human population based genetic association study. Similarly, Table S5a [39], the mouse DISEASE/PHENOTYPE-to-Gene list, is used as a source for gene sets for mouse phenotypes (MP gene sets) comprised of unique gene based mouse genetic models. These gene set files are currently the largest set of gene set files publicly available and the only gene set files where each gene is based on direct human or mouse genetic studies.

One aspect of common complex disease is that the development of disease and disease phenotypes quite often presents along a broad spectrum of symptoms and shares clinical characteristics, endophenotypes, or quantitative traits with closely related disorders [25]. This is evident in gene sharing, as mentioned above, and equally in the overlap of biological pathways between related disorders. Using GAD disease gene sets, Venn diagram comparisons among related disorders show modest gene sharing. However, when gene sets are then placed into biological pathways and compared by Venn analysis, there is a marked increase in the overlap in pathways between related disorders. This was not found in gene sets from unrelated disorders. For example, major autoimmune disorders quite often share endophenotypes of lymphoproliferation, autoantibody production, and alterations in apoptosis, as well as other immune cellular and biochemical aberrations. As shown in Figure 1a, genes that have been positively associated with type 1 diabetes, rheumatoid arthritis, and Crohn's disease show a modest overlap. However, when individual gene sets are fitted into biological pathways, then compared for overlap of pathway membership, there is a striking increase in the overlap at the pathway level. This is true in a comparison of genes and pathways for type 2 diabetes, insulin resistance, and obesity as well (Figure 1b). This pattern of major pathway overlap does not seem to occur between unrelated disorders, such as insulin resistance, rheumatoid arthritis, and bipolar disorder (Figure 1c). This disease related sharing at the pathway level suggests common regulatory mechanisms between these disorders and that the original positive associations are not necessarily due to random chance alone.
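The difference between gene-level and pathway-level overlap can be illustrated with plain set operations. In the Python sketch below, the gene names, pathway names, and gene-to-pathway map are entirely hypothetical; in the analysis described above, pathway membership came from KEGG via WebGestalt and the Venn comparison was done in Venny.

```python
def shared_by_all(*sets):
    """Elements common to every set (the center of a Venn diagram)."""
    return set.intersection(*map(set, sets))

def to_pathways(genes, gene_to_pathways):
    """Collect every pathway that contains at least one gene of the set."""
    pathways = set()
    for gene in genes:
        pathways.update(gene_to_pathways.get(gene, ()))
    return pathways

# Hypothetical gene-to-pathway membership.
gene_to_pathways = {
    "g1": {"pathway_X"},
    "g2": {"pathway_X", "pathway_Y"},
    "g3": {"pathway_X"},
    "g4": {"pathway_Y"},
    "g5": {"pathway_Z"},
}

# Hypothetical gene sets for three related disorders.
disease_a = {"g1", "g4"}
disease_b = {"g2"}
disease_c = {"g3", "g4", "g5"}

shared_genes = shared_by_all(disease_a, disease_b, disease_c)
shared_pathways = shared_by_all(
    *(to_pathways(d, gene_to_pathways) for d in (disease_a, disease_b, disease_c))
)
```

Here no single gene is shared by all three toy disorders, yet two pathways are, because different genes from the same pathway appear in different sets.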

Dendrogram analysis of human disease gene sets

As archival information grows, analysis of complex molecular and genetic datasets using clustering or network approaches has become increasingly useful [13] [42] [43] [44] [45]. Therefore, in addition to comparisons between individual diseases using human and mouse gene sets, we analyzed large gene groups using dendrogram and clustering approaches based on gene sharing between gene sets. Figure 2 shows a broad based dendrogram comparison based on gene sharing between 480 GAD disease gene sets, using gene sets each containing at least 3 genes. A striking feature of this analysis is that at a coarse level, major disease groups cluster together in space, demonstrating shared genes between major clinically important disease groups. Disease domains are represented by groups such as cardiovascular disorders, metabolic disorders, cancer, immune and inflammatory disorders, vision, and chemical dependency. At finer detail within a specific broader group, it becomes clear that individual diseases with overlapping phenotypes are found close in space, such as asthma, allergic rhinitis, and atopic dermatitis. This overlap due to gene sharing recapitulates an overlap in clinical characteristics between these related disorders. Similarly, phenotypes within the metabolic group related to diabetes are closely aligned in space, including insulin resistance, hyperglycemia, hyperinsulinemia, and hyperlipidemia. This close apposition of related disease phenotypes and sub-phenotypes at both a coarse and fine level is a consistent feature of the overall display. The human gene sets used in creating this tree diagram can be found in Table S6 [46]. It is important to emphasize that this display and the distance relationships between diseases are calculated through an unbiased gene-sharing algorithm independent of disease phenotype labels and not as a result of an imposed logical hierarchy or an ontological annotation system. This grouping of major disease phenotypes based solely on gene sharing provides supporting evidence that the underlying disease based gene sets may have a fundamental relevance to disease and may not be reported in the literature by chance alone.

Dendrogram analysis of mouse phenotypic gene sets

Figure 3 is a similar dendrogram to the human tree using 1056 mouse phenotypic gene sets, using gene sets each containing at least 10 genes. This was produced using the same gene sharing algorithm as for the human gene sets in Figure 2. As with the human dendrogram, the mouse tree displays informative groupings at both a coarse and fine level. This tree groups into major groupings nominally assigned as brain development and brain function, embryonic development, cardiovascular, reproduction, inflammation, renal function, bone development, metabolism, and skin/hair development. The identification of major groupings emphasizing developmental processes reflects the emphasis of gene knockouts and developmental models resulting in observable morphological traits and less so with regard to end stage clinical diseases as in the human dendrogram. As in the human dendrogram (Figure 2), discrete major functional groupings in the mouse dendrogram suggest that individual experimental observations are not random. Fundamental complex processes such as metabolism, cardiovascular phenomena, and developmental processes are integrated by extensive sharing of related pleiotropic genes. Moreover, like the human tree, fine structure in the mouse tree shows related mouse phenotypes are closely positioned in space. For example, in the metabolism major grouping, the individual phenotypes of body mass, adipose phenotypes, and weight gain are closely positioned. Similarly, in the brain function group, the behavioral phenotypes of anxiety, exploration, and responses to novel objects are found next to one another. This pattern is a fundamental feature of this tree. Like the human tree, the mouse dendrogram shown here is based solely on a gene sharing algorithm using genes assigned to individual phenotypes. It is not based on an imposed predetermined hierarchy or ontology. Importantly, unlike the human tree, the information contained in the mouse tree is derived from individual independent mouse genetic studies and phenotypic observations and not from large case controlled population based epidemiological studies. Controversial issues such as publication bias or study size, which confound human genetic association studies, are not as relevant here in the context of studies of experimentally determined individual mouse gene knockouts and related studies. The mouse gene sets used in creating this tree diagram can be found in Table S7 [47].

Hierarchical clustering of human and mouse gene sets

Hierarchical clustering has become a common tool in the analysis of large molecular data sets [48], allowing identification of similar patterns in a scalable fashion from the whole experiment down to a level of fine structure. To provide further evidence of disease relevance and biological content contained in both the human and mouse gene sets, hierarchical clustering was performed on both human and mouse. Four hundred and eighty human gene sets were clustered, producing 46 major disease clusters. In the mouse, clustering was performed on 2067 mouse phenotype gene sets, using gene sets containing at least 3 genes. This resulted in 165 major subgroups of functional phenotypic specificity. Hierarchical clustering is shown for human [Additional file 1 and Additional file 2] and for mouse [Additional file 3 and Additional file 4]. Like the human and mouse dendrograms, this hierarchical clustering showed functional disease grouping at both a coarse group level and at a fine level within major phenotypic groupings. That these clusters in both human and mouse fall into closely defined broad functional groups, as well as closely related clinical, physiological, and developmental phenotypes, demonstrates a general pattern of relevance to disease in their original underlying genetic associations. As in the dendrogram displays, this suggests that the genes nominally positively associated with these disorders, drawn from the medical literature, are not pervasively randomly assigned or due to a widespread pattern of random false positive associations.

This report describes a summary of the positive genetic associations to disease phenotypes found in the Genetic Association Database as well as a summary of mouse genetically determined phenotypes from the MGI phenotypes database. The gene and disease lists described here were derived from a broad literature mining approach. We have shown disease relevance in three distinct ways: a) in comparing individual gene lists and pathways, b) in comparing between species, and c) in broad based comparative analysis utilizing complex systems approaches. Moreover, we identify disease based gene sets for 1,317 human disease phenotypes as well as 5,142 mouse experimentally determined phenotypes. These resources are the largest gene set files currently publicly available and the only gene set files derived from population based human epidemiological genetic studies and mouse genetic models of disease.

Each individual GAD disease gene set (i.e. a single disease term followed by a string of genes) or mouse phenotype gene set becomes a candidate for a number of uses and applications including: a) contributing to complex (additive, multiplicative, gene-environment) statistical models for any given disease phenotype [49] [50] [51] [52] [53]; b) use in comparative analysis of disease between disease phenotypes; c) use in interrogating other related data types, such as microarray (see below), proteomic, or SNP data [54] [55] [56]; and d) integration into annotation engines [57], genome browsers [58], or other analytical software to add disease information in comparative genomic analysis. In a sense, each individual human or mouse disease/phenotype gene set becomes a unique hypothesis, testable in a variety of ways. Increasingly, combinations of genes, as opposed to single candidate genes, may have important predictive value as combinatorial biomarkers of disease risk [59, 60].
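As a sketch of how such a gene set record might be consumed by other software, the example below parses lines of the form "disease term followed by a string of genes" and compares two diseases. The tab delimiter, the function names, and the example lines are assumptions made for illustration; the actual layout of the supplementary gene set files should be checked before reuse.

```python
def parse_gene_set_line(line, sep="\t"):
    """Split 'TERM<sep>GENE1<sep>GENE2...' into (term, set of genes).

    The separator is an assumption; adjust it to match the real file.
    """
    fields = [field.strip() for field in line.rstrip("\n").split(sep)]
    term = fields[0]
    if not term:
        raise ValueError("gene set line has no disease/phenotype term")
    genes = {field for field in fields[1:] if field}
    return term, genes


def load_gene_sets(lines, sep="\t"):
    """Build {term: genes} from an iterable of lines, merging repeated terms."""
    gene_sets = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        term, genes = parse_gene_set_line(line, sep)
        gene_sets.setdefault(term, set()).update(genes)
    return gene_sets


# Illustrative records only, not rows copied from the supplementary tables.
example_lines = [
    "Rheumatoid Arthritis\tPTPN22\tCTLA4\tPADI4\n",
    "\n",
    "Autoimmune Diseases\tPTPN22\n",
]

gene_sets = load_gene_sets(example_lines)
shared = gene_sets["Rheumatoid Arthritis"] & gene_sets["Autoimmune Diseases"]
print(sorted(shared))
```

Once loaded as sets, gene sets can be intersected between diseases (use b above) or matched against a list of genes from an expression or SNP experiment (use c above).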

In addition, in an ongoing parallel set of experiments (De S, Zhang Y, Garner JR, Wang SA, Becker KG: Disease and phenotype gene set analysis of disease based gene expression, unpublished), a Gene Set Analysis (GSA) approach was applied to orthologous microarray data using the web tool Disease/Phenotype web-PAGE. In PAGE [61] gene set analysis of previously published microarray based gene expression studies from numerous independent laboratories, both the human and mouse disease/phenotype gene sets defined above demonstrate striking disease specificity in both a species specific and cross species manner. This was true for gene expression studies of type 2 diabetes, obesity, myocardial infarction, and sepsis, among others, providing further evidence of the disease and clinical relevance of both the human and mouse gene sets.
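For readers unfamiliar with PAGE [61], its core statistic can be sketched in a few lines. This is a minimal illustration of the published Z score, Z = (Sm - mu) * sqrt(m) / sigma, and not the web-PAGE tool itself. The function name, the toy fold change values, and the use of the population standard deviation are assumptions.

```python
from math import sqrt
from statistics import mean, pstdev


def page_z_score(fold_changes, gene_set):
    """Parametric gene set Z score.

    mu and sigma are the mean and standard deviation of all fold changes in
    the experiment, Sm is the mean fold change of the gene set members that
    were measured, and m is the number of such members.
    """
    all_values = list(fold_changes.values())
    members = [fold_changes[g] for g in gene_set if g in fold_changes]
    if not members:
        raise ValueError("no gene set members were measured in this experiment")
    mu = mean(all_values)
    sigma = pstdev(all_values)  # population SD; an assumption of this sketch
    if sigma == 0:
        raise ValueError("fold changes have zero variance")
    return (mean(members) - mu) * sqrt(len(members)) / sigma


# Toy log fold changes for six hypothetical genes.
toy_fold_changes = {"A": 2.0, "B": 2.0, "C": 0.0, "D": 0.0, "E": -2.0, "F": -2.0}

z_up = page_z_score(toy_fold_changes, {"A", "B"})
z_down = page_z_score(toy_fold_changes, {"E", "F"})
print(f"up-regulated set Z = {z_up:.3f}, down-regulated set Z = {z_down:.3f}")
```

A disease gene set whose members are shifted together in a relevant expression data set yields a large positive or negative Z score, which is the sense in which a gene set shows disease specificity in this kind of analysis. Real gene sets would need far more members than this toy example for the normal approximation to be reasonable.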

This approach is limited in a number of ways. In particular, the GAD database compares the results of human population based epidemiological studies performed using different sample sizes, populations, and statistical models, and at different times over approximately the last 16 years. In addition, the GAD database draws on association studies of widely varying quality with different degrees of detail provided. Although every human genetic association discussed here was individually reported as positive for a disease or phenotype in a peer reviewed journal, we make no assertion that any individual study is correct, and we recognize the controversy in the genetics community regarding the statistical and biological significance of genetic association studies. Moreover, although the GAD database contains information on polymorphism and variation, and each GAD record is fundamentally based on polymorphism, this report does not consider variation or polymorphism in the summaries shown. Likewise, mouse genetic models are in many cases weighted toward gene knockouts, which may not necessarily be directly representative of multifactorial human common complex disease.

However, even with these limitations, we believe valuable insights can be gained from broad based literature assessments of the genetic contribution in human common complex disease and in mouse phenotypic biology. More importantly, this suggests greater opportunities for systematic mining and analysis of published data and in cross comparison of archival molecular databases in both human and animal models of disease with regard to genetic variation, population comparisons, and integration with many different types of orthologous information.

The HUGO Gene Nomenclature Database, 2006 updates

The genetic association database

Tracking the epidemiology of human genes in the literature: the HuGE Published Literature database

The Mouse Genome Database (MGD): mouse biology and model systems

Mouse Phenotype Database Integration Consortium: integration [corrected] of mouse phenome data resources

Mendelian Inheritance in Man and its online version, OMIM

SUSPECTS: enabling fast and effective prioritization of positional candidates

SNPs3D: candidate gene and SNP selection for association studies

T1DBase, a community web-based resource for type 1 diabetes research

DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis

WholePathwayScope: a comprehensive pathway-based analysis tool for high-throughput data

GenomeTrafac: a whole genome resource for the detection of transcription factor binding site clusters associated with conventional and microRNA encoding genes conserved between mouse and human gene orthologs

Creation and implications of a phenome-genome network

VENNY. An interactive tool for comparing lists with Venn Diagrams

WebGestalt: an integrated system for exploring gene sets in various biological contexts

Construction of phylogenetic trees

PhyloDraw: a phylogenetic tree drawing system

Hierarchical Grouping to optimize an objective function

Table S1a-Human GENE-to-Disease/Phenotype. A file of Human Genes followed by Disease Phenotype MeSH terms

Table S1b-Human GENE-to-Disease/Phenotype interactive. The same list as Table S1a, but with direct searches back to GAD

Why most published research findings are false

On the synthesis and interpretation of consistent but weak gene-disease associations in the era of genome-wide association studies

Clustering of non-major histocompatibility complex susceptibility candidate loci in human autoimmune diseases

The common variants/multiple disease hypothesis of common complex genetic disorders

The PTPN22 C1858T functional polymorphism and autoimmune diseases-a meta-analysis

Replication of putative candidate-gene associations with rheumatoid arthritis in >4,000 samples from North America and Sweden: association of susceptibility with PTPN22, CTLA4, and PADI4

Increased prevalence of chronic rhinosinusitis in carriers of a cystic fibrosis mutation

Potential role for the common cystic fibrosis DeltaF508 mutation in Crohn's disease

Table S2a-Mouse GENE-to-Disease/Phenotype. A file of Mouse Genes followed by Disease Phenotype Mammalian Phenotype (MP) terms

Table S2b-Mouse GENE-to-Disease/Phenotype interactive. The same list as Table S2a, but with direct searches back to MGI and GAD

Shelling AN: INHA promoter polymorphisms are associated with premature ovarian failure

Interrelationship of growth differentiation factor 9 and inhibin in early folliculogenesis and ovarian tumorigenesis in mice

Association of the homeobox transcription factor, ENGRAILED 2, 3, with autism spectrum disorder

En2 knockout mice display neurobehavioral and neurochemical alterations relevant to autism spectrum disorder

A list of 1105 genes that overlap between the Human GENE-to-Disease Phenotype list (S1) and the Mouse GENE-to-Disease phenotype list

Table S4a-Human DISEASE/PHENOTYPE-to-Gene. A file of Human Disease Phenotype MeSH terms followed by associated genes

A file of Human Disease Phenotype MeSH terms followed by associated genes, but with direct searches back to GAD

A file of Mouse Disease-Phenotype Mammalian Phenotype (MP) terms followed by assigned mouse genes

Table S5b-Mouse DISEASE/PHENOTYPE-to-Gene (mouse) Interactive. A file of Mouse Disease-Phenotype Mammalian Phenotype (MP) terms followed by assigned mouse genes, but with direct searches back to MGI

Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles

Network-based analysis of affected biological processes in type 2 diabetes models

The human disease network

Genetics of gene expression and its effect on disease

A genomewide functional network for the laboratory mouse

A file of the GAD Human gene sets used in the dendrogram fig 2

A file of the Mouse gene sets used to build the mouse dendrogram fig 3

Cluster analysis and display of genome-wide expression patterns

Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk

Prediction of individual genetic risk of complex disease

The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases

Multifactor dimensionality reduction-phenomics: a novel method to capture genetic heterogeneity with use of phenotypic variables

Exchangeable models of complex inherited diseases

On the utility of gene set methods in genomewide association studies of quantitative traits

GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies

GLOSSI: a method to assess the association of genetic loci-sets with complex diseases

Extracting biological meaning from large gene lists with DAVID

The UCSC Genome Browser Database: update

Classification and prediction of clinical Alzheimer's diagnosis based on plasma signaling proteins

Cumulative Association of Five Genetic Variants with Prostate Cancer

PAGE: parametric analysis of gene set enrichment

The authors would like to thank Dr. Ilya Goldberg for helpful discussions, and Drs. Goldberg, David Schlessinger, and Chris Cheadle for critical reading of the manuscript. This research was supported by the Intramural Research Program of the NIH, National Institute on Aging and Center for Information Technology.

The authors declare that they have no competing interests. 

Authors' contributions
YZ performed statistical analysis, gene set assembly, and contributed to the manuscript. SD performed dendrogram and clustering analysis and contributed to the manuscript. JG, KS, and SAW did database curation and analysis. KGB organized the project, did database curation, performed comparisons, and wrote the manuscript. All authors read and approved the manuscript.