key: cord-102935-cx3elpb8
authors: Hassani-Pak, Keywan; Singh, Ajit; Brandizi, Marco; Hearnshaw, Joseph; Amberkar, Sandeep; Phillips, Andrew L.; Doonan, John H.; Rawlings, Chris
title: KnetMiner: a comprehensive approach for supporting evidence-based gene discovery and complex trait analysis across species
date: 2020-04-24
journal: bioRxiv
DOI: 10.1101/2020.04.02.017004
sha:
doc_id: 102935
cord_uid: cx3elpb8

Generating new ideas and scientific hypotheses is often the result of extensive literature and database reviews, overlaid with scientists’ own novel data and a creative process of making connections that were not made before. We have developed a comprehensive approach to guide this technically challenging data integration task and to make knowledge discovery and hypothesis generation easier for plant and crop researchers. KnetMiner can digest large volumes of scientific literature and biological research to find and visualise links between the genetic and biological properties of complex traits and diseases. Here we report the main design principles behind KnetMiner and provide use cases for mining public datasets to identify unknown links between traits such as grain colour and pre-harvest sprouting in Triticum aestivum, as well as an evidence-based approach to identify candidate genes under an Arabidopsis thaliana petal size QTL. We have developed KnetMiner knowledge graphs and applications for a range of species including plants, crops and pathogens. KnetMiner is the first open-source gene discovery platform that can leverage genome-scale knowledge graphs, generate evidence-based biological networks and be deployed for any species with a sequenced genome. KnetMiner is available at http://knetminer.org.

[...] which is prone to information being overlooked and subjective biases being introduced.
Even when the task of gathering information is complete, it is demanding to assemble a coherent view of how each piece of evidence might come together to "tell a story" about the biology that can explain how multiple genes might be implicated in a complex trait or disease. New tools are needed to provide scientists with a more fine-grained and connected view of the scientific literature and databases than the conventional information retrieval tools currently at their disposal can offer.

Scientists are not alone in facing these challenges. Search systems form a core part of the duties of many professions. Studies have highlighted the need for search systems that give confidence to the professional searcher, and therefore trust, explainability, and accountability remain a significant [...]

KnetMiner provides search term suggestions and real-time query feedback. From a search, a user is presented with the following views: Gene View is a ranked list of candidate genes along with a summary of related evidence types. Map View is a chromosome-based display of QTL, GWAS peaks and genes related to the search terms. Evidence View is a ranked list of query-related evidence terms and enrichment scores along with linked genes. By selecting one or more elements in these three views, the user can get to the Network View to explore a gene-centric or evidence-centric knowledge network related to their query and the subsequent selection.

[...] (Nilsson-Ehle, 1914) and that the red pigmentation of wheat grain is controlled by R genes on the long arms of chromosomes 3A, 3B, and 3D (Sears, 1944). [...] (Figure 3A). This network is displayed in the Network View, which provides interactive features to hide specific evidence types or add them to the network. Nodes are displayed in a defined set of shapes, colors and sizes to distinguish different types of evidence.
A shadow effect on nodes indicates that more information is available but has been hidden. The auto-generated network, however, does not yet tell a story that is specific to our traits of interest and is limited to evidence that is phenotypic in nature.

To further refine and extend the search for evidence that links TT2 to grain color and PHS, we can provide additional keywords relevant to the traits of interest. Seed germination and dormancy are the underlying developmental processes that activate or prevent pre-harvest sprouting in many grains and other seeds. The colour of the grain is known to be determined through accumulation of proanthocyanidin, an intermediate in the flavonoid pathway, found in the seed coat. These terms and phrases can be combined using Boolean operators (AND, OR, NOT) and used in conjunction with a list of genes. Thus, we search for TRAESCS3D02G468400 (TT2) and the keywords: "seed germination" OR "seed dormancy" OR color OR flavonoid OR proanthocyanidin. This time, KnetMiner filters the extracted TT2 knowledge network (823 nodes) down to a smaller subgraph of 68 nodes and 87 relations in which every path from TT2 to another node corresponds to a line of evidence to phenotype or molecular characteristics based on our keywords of interest (Figure 3B).

[...] Overall, the exploratory link analysis has generated a potential link between grain color and PHS due to TT2-MFT interaction and suggested a new hypothesis linking two traits (PHS and root hair density) that were not part of the initial investigation and were previously thought to be unrelated. Furthermore, it raises the possibility that TT2 mutants might lead to increased root hairs and to higher nutrient and water absorption, and therefore cause early germination of the grain. More data and experiments will be needed to address this hypothesis and close the knowledge gap.
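The keyword-based filtering of the TT2 network in this use case can be illustrated with a minimal Python sketch. It is not KnetMiner's implementation. It assumes a toy graph stored as an adjacency dictionary, treats the query as a plain OR over keywords with case-insensitive substring matching, and keeps only the nodes on a shortest path from the gene to a matching node. The function names, node identifiers and labels are hypothetical; only the names TT2 and MFT and the keyword list come from the text.

```python
from collections import deque


def matches(text, keywords):
    """Return True if any keyword occurs in the text (case-insensitive OR query)."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def keyword_subgraph(adjacency, labels, gene, keywords):
    """Return the nodes lying on a shortest path from `gene` to a keyword-matching node.

    Assumption: one shortest path per matching node is enough for display.
    """
    # Breadth-first search from the gene, remembering each node's predecessor.
    parent = {gene: None}
    queue = deque([gene])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)

    # Walk back from every matching node to the gene and keep the whole path.
    kept = set()
    for node in parent:
        if node != gene and matches(labels[node], keywords):
            step = node
            while step is not None:
                kept.add(step)
                step = parent[step]
    return kept


# Toy data (hypothetical, for illustration only).
adjacency = {
    "TT2": ["MFT", "pub_flavonoid", "awn"],
    "MFT": ["germination"],
}
labels = {
    "TT2": "TT2",
    "MFT": "MFT",
    "germination": "seed germination",
    "pub_flavonoid": "Flavonoid accumulation in the seed coat",
    "awn": "awn length",
}
keywords = ["seed germination", "seed dormancy", "color", "flavonoid", "proanthocyanidin"]

kept = keyword_subgraph(adjacency, labels, "TT2", keywords)
print(sorted(kept))
```

In this toy graph the "awn length" node is dropped because neither it nor any node reached through it matches a keyword, while MFT is retained because it lies on the path from TT2 to "seed germination".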
[...] biologists would generally agree to be informative when studying the function of a gene. Searching a KG for such patterns is akin to searching a book for relevant sentences containing evidence that supports a particular point of view. Such evidence paths can be short, e.g. Gene A was knocked out and phenotype X was observed; alternatively, the evidence path can be longer, e.g. Gene A in species X has an ortholog in species Y, which was shown to regulate the expression of a disease-related gene (with a link to the paper). In the first example, the relationship between gene and disease is directly evident and experimentally proven, while in the second example the relationship is indirect and less certain but still biologically meaningful. There are many evidence types that should be considered for evaluating the relevance of a gene to a trait. In a KG context, a gene is considered to be, for example, related to 'early flowering' if any of its biologically plausible graph patterns contain nodes related to 'early flowering'. In this context, the word 'related' doesn't necessarily mean that the gene in question will have an effect on 'flowering [...]

[...] shown to a user; let alone if combining GCSs for tens to hundreds of genes. There is therefore a need to filter and visualise the subset of information in the GCSs that is most interesting to a specific user. However, the interestingness of information is subjective and will depend on the biological question or the hypothesis that needs to be tested. A scientist with an interest in disease biology is likely to be interested in links to publications, pathways, and annotations related to diseases, while someone studying the biological process of grain filling is likely more interested in links to physiological or anatomical traits. To reduce information overload and visualise the most interesting pieces of information, we have devised two strategies.
1) In the case of a combined gene and keyword search, we use the keywords as a filter to show only paths in the GCS that connect genes with keyword-related nodes, i.e. nodes that contain the given keywords in one of their node properties. In the special case where too many publications remain even after keyword filtering, we select the most recent N publications (default N=20). Nodes not matching the keyword are hidden but not removed from the GCS.
2) In the case of a simple gene query (without additional keywords), we initially show all paths between the gene and nodes of type phenotype/trait, i.e. any semantic motif that ends with a trait/phenotype, as this is considered the most important relationship to many KnetMiner users.

Gene Ranking

We have developed a simple and fast algorithm to rank genes and their GCS by importance. We give every node in the KG a weight composed of three components, referred to as SDR, standing for the Specificity to the gene, Distance to the gene and Relevance to the search terms.

Specificity reflects how specific a node is to a gene in question. For example, a publication that is cited (linked) by hundreds of genes receives a smaller weight than a publication which is linked to one or two genes only. We define the specificity of a node x as: [equation missing], where n is the frequency of the node occurring in all N GCS.

Distance assumes that information more closely associated with a gene can generally be considered more certain than information further away; for example, inference through homology and other interactions increases the uncertainty of annotation propagation. A short semantic motif is therefore given a stronger weight, whereas a long motif receives a weaker weight.
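To make the SDR idea concrete, the following Python sketch scores genes on a toy graph. It is an illustration under stated assumptions, not the published algorithm: specificity is taken here to be an IDF-style log(N / n), distance weight is the inverse of the breadth-first hop count from the gene, relevance is supplied by the caller (1 for every node when no search terms are given), and the three weights are multiplied and summed over a gene's GCS nodes. All function names, node identifiers and data are hypothetical.

```python
import math
from collections import deque


def specificity(node, gcs_by_gene):
    """Assumed IDF-style specificity: log(N / n), with n = number of GCS containing the node."""
    total = len(gcs_by_gene)
    n = sum(1 for nodes in gcs_by_gene.values() if node in nodes)
    return math.log(total / n)


def shortest_distances(adjacency, gene):
    """Breadth-first search returning the hop count from `gene` to each reachable node."""
    dist = {gene: 0}
    queue = deque([gene])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def gene_score(gene, adjacency, gcs_by_gene, relevance=None):
    """Sum S * D * R over the gene's GCS nodes (the combination rule is an assumption)."""
    dist = shortest_distances(adjacency, gene)
    score = 0.0
    for node in gcs_by_gene[gene]:
        if node == gene or node not in dist:
            continue  # skip the gene itself and unreachable nodes
        r = 1.0 if relevance is None else relevance.get(node, 0.0)
        score += specificity(node, gcs_by_gene) * (1.0 / dist[node]) * r
    return score


# Toy data (hypothetical): one publication shared by all genes, plus gene-specific nodes.
adjacency = {
    "geneA": ["pub_shared", "orthA"],
    "orthA": ["trait_x"],
    "geneB": ["pub_shared"],
    "geneC": ["pub_shared", "trait_y"],
}
gcs_by_gene = {
    "geneA": {"pub_shared", "orthA", "trait_x"},
    "geneB": {"pub_shared"},
    "geneC": {"pub_shared", "trait_y"},
}

scores = {gene: gene_score(gene, adjacency, gcs_by_gene) for gene in gcs_by_gene}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Under these assumptions the publication shared by every gene contributes nothing to any score, and a trait reached through an ortholog (two hops) counts half as much as a directly linked node, which mirrors the intent of the specificity and distance components.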
Thus, we define the second weight as the inverse shortest path distance of a gene g and a node x: $D(g, x) = 1 / d(g, x)$, where $d(g, x)$ denotes the shortest path distance between g and x. Both weights S and D are not influenced by the search terms and can therefore be pre-computed for every node in the KG.

Relevance reflects the relevance or importance of a node to user-provided search terms using the well-established measures of inverse document frequency (IDF) and term frequency (TF) (Salton & Yang, 1973). [...] we define the KnetScore of a gene as: [equation missing]. The sum considers only GCS nodes that contain the search terms. In the absence of search terms, we sum over all nodes of the GCS with R=1 for each node. The computation of the KnetScore [...]

[...] biologists, such as tables and chromosome views, allowing them to explore the data, make choices as to which gene to view, or refine the query if needed. These initial views help users to reach a certain level of confidence with the selection of potential candidate genes. However, they do not tell the biological story that links candidate genes to traits and diseases. In a second step, to enable the stories and their evidence to be investigated in full detail, the Network View visualises highly complex information in a concise and connected format, helping to facilitate biologically meaningful conclusions. Consistent graphical symbols are used for representing evidence types throughout the different views, so that users develop a certain level of familiarity before being exposed to networks with complex interactions and rich content.

Scientists spend a considerable amount of time searching for new clues and ideas by synthesizing many different sources of information and using their expertise to generate hypotheses. KnetMiner is a user-friendly platform for biological knowledge discovery and exploratory data mining.
It allows humans and machines to effectively connect the dots in life science data and literature and to search the connected data in an innovative way, and it then returns the results in an accessible, explorable, yet concise format that can be easily interrogated to generate new insights. We [...]

Discovering Protein Drug Targets Using
The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species
A wheat homolog of MOTHER OF FT AND TFL1 acts in the regulation of germination
Zur Kenntnis der mit der keimungsphysiologie des weizens in zusammenhang stehenden inneren faktoren
Bioinformatics meets user-centred design: a perspective
Meta-analysis of the heritability of human traits based on fifty years of twin studies
Information retrieval in the workplace: A comparison of professional search practices
Progress in Biomedical Knowledge Discovery: A 25-year
On the Specification of Term Values in Automatic Indexing
Cytogenetic Studies with Polyploid Species of Wheat
Knowledge Graphs and Knowledge Networks: The Story in Brief
KnetMaps: a BioJS component to visualize biological knowledge networks
Identification of loci governing eight agronomic traits using a GBS-GWAS approach and validation by QTL mapping in soya bean
Big Data: Astronomical or Genomical?
Sensitivity to "sunk costs" in mice, rats, and humans
IWGSC whole-genome assembly principal investigators Whole-genome sequencing and assembly
Shifting the limits in wheat research and breeding using a fully annotated reference genome
Trend Analysis of Knowledge Graphs for Crop Pest and Diseases
MOTHER OF FT AND TFL1 regulates seed germination through a negative feedback loop modulating ABA signaling in Arabidopsis
Use of Graph Database for the Integration
Allelic Variation and Transcriptional Isoforms of Wheat TaMYC1 Gene Regulating Anthocyanin Synthesis in Pericarp

The authors declare that they have no competing interests.