key: cord-0321122-wixs7a3o
authors: Shany, Ofaim; Giulia, Menichetti; Michael, Sebek; László, Barabási Albert
title: Genomics-based annotations help unveil the molecular composition of edible plants
date: 2022-01-25
journal: bioRxiv
DOI: 10.1101/2022.01.24.477528
sha: 441e055f44d454f2895937c10b335601970f3444
doc_id: 321122
cord_uid: wixs7a3o

Given the important role food plays in health and wellbeing, the past decades have seen considerable experimental efforts dedicated to mapping the chemical composition of food ingredients. As the composition of raw food is genetically predetermined, here we ask to what degree we can rely on genomics to predict the chemical composition of natural ingredients. We therefore developed tools to unveil the chemical composition of 75 edible plants from their genomes, finding that genome-based annotations increase the number of compounds linked to specific plants by 42 to 100%. We rely on Gibbs free energy to identify compounds that accumulate in plants, i.e., those that are more likely to be detected experimentally. To quantify the accuracy of our predictions, we performed untargeted metabolomics on 13 plants, allowing us to experimentally confirm the detectability of the predicted compounds. For example, we find 59 novel compounds in corn that are predicted by genomics annotations and supported by our experiments, but were previously not assigned to the plant. Our study shows that genome-based annotations can lead to an integrated metabologenomics platform capable of unveiling the chemical composition of edible plants and the biochemical pathways responsible for the observed compounds.

[...] anthocyanins are well known for their bioactive properties [8].
Furthermore, the genetically predetermined polyphenols, alkaloids, carotenoids and phytosterols have well-documented antioxidant and anti-inflammatory activities affecting multiple diseases, from cancer to diabetes or hypertension [9].

Our current knowledge of food composition is limited to 150 nutrients catalogued by USDA, despite the fact that the true number of chemical compounds in food ranges from tens [10] to hundreds of thousands across all known plant species [11]. The bulk of our knowledge on the chemical composition of food comes from mass spectrometry and other low-throughput analytical methods and is compiled in repositories such as FooDB [12] and The Dictionary of Food Compounds [13] (DFC), which catalogue comprehensive information on the detected compounds, including both evidence-based and predicted annotations.

Our work is driven by the hypothesis that the full list of known and yet unknown chemicals [...] based platform adapted to annotate the functional diversity of plant genomes.

Here we explore to what degree genome-based annotations can offer a valuable resource to expand the knowledge of the compound composition of foods. To do so we rely on metabologenomics [17-19], which integrates genomics and metabolomics and has been used in the past to discover novel natural products [17, 20]. Specifically, we develop a systematic metabologenomics pipeline, coupled with thermodynamic feasibility analysis, that aims to predict the composition of edible plants. We validate our predictions by comparing them to the chemical knowledge curated by food composition databases like FooDB, DFC, and USDA [4]. We also collect new experimental data to explore the chemical composition of 13 plants.
Our findings indicate that genomics-based annotations offer a predictive platform capable of systematically capturing the chemical composition of plant-based food, from fruits to vegetables, and allow us to predict and experimentally test the presence of novel compounds in plants.

The existing knowledge on food composition

We collected data for 75 edible plants with published and annotated genomes from two [...] (unique and common) to the composition of plants in our catalogue (Figure S1B).

To illustrate the contribution of genomics-based annotations to edible plants (Figure 1b), we focused on corn (Zea mays), a highly consumed staple crop [21] worldwide and in the US. Existing knowledge for corn includes 1,221 compounds from FooDB, 311 compounds from DFC and 127 compounds from USDA. Considering overlaps between all sources, these compile to a total of 1,038 unique compounds. Next, we set out to explore the value of adding genome-based annotations to corn's existing knowledge.

One contribution of genomics-based annotations is the metabolic context of these compounds, carried by a network of pathways. Some known pathways are only partially annotated even after considering multiple databases. For example, in Monoterpenoid biosynthesis in corn (Figure 2b), six out of nine compounds are annotated in both databases. Of the three remaining compounds, (R)-Ipsdienol is known to be present in food but was not annotated to any plant in our collection. The two remaining compounds, Ipsdienone, a product of the reaction catalyzed by EC 1.1.1.386 directly from (R)-Ipsdienol, and (6E)-8-Oxolinalool, a product of the reaction catalyzed by EC 1.14.14.84, are currently documented in food but not in corn (white circles, Figure 2b).
To keep our analysis stringent, these compounds, whose presence is documented in food but not known to be associated with corn, are not included in the plant's catalogue.

Consider another example, the tocopherol biosynthesis pathway. Tocopherols are an important class of compounds for health and nutrition [22] (α-tocopherol is better known as vitamin E). Figure 2c shows the tocopherol biosynthesis pathway in corn and the delineated contribution of each database to its compounds. We find that while databases such as FooDB and DFC annotate the lower half of the pathway, capturing products such as vitamin E and its derivatives. [...] we collected 789 pathways, of which 35% belong to primary and 65% to secondary metabolism. We asked if certain pathways are represented more than others in our catalogue. To test for pathway bias we performed a hypergeometric enrichment test capturing the chance for a set of compounds mapped to a pathway in a certain plant to exceed the expected overlap with the general reference pathway (p-value < 0.05) (Figure S2). In corn, we found 479 enriched pathways spanning both primary (45%) and secondary metabolism (55%), indicating that our corn dataset is metabolically diverse.

We next asked about specialized metabolism occurring only in corn, scanning our plant [...]

To estimate the predictive power of our approach, we first asked if kinetics-based annotations have increased the predictive power of our platform compared to total genomics-based annotations. We find that kinetics-based annotations show a more significant overlap with the experiments compared to genomics (p-value = 0.0018, SI section 1, Figure S4), and they are characterized by a high degree of structural similarity, significantly different from a random sample of the same size from genomics annotations (p-value < 0.001, SI section 1).
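The hypergeometric pathway enrichment test mentioned earlier in this section can be illustrated with a minimal sketch. It assumes a one-sided test of the form P(X >= observed overlap), where a plant's compounds are treated as draws without replacement from a compound universe. The function name, argument names and toy numbers are hypothetical, and the exact setup used in the paper (choice of universe, multiple-testing handling) is not specified in this excerpt.

```python
from math import comb


def pathway_enrichment_pvalue(universe, pathway, plant, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    universe: total number of compounds considered (assumed background)
    pathway:  number of compounds in the reference pathway
    plant:    number of compounds annotated to the plant
    overlap:  number of plant compounds that fall in the pathway
    """
    max_overlap = min(pathway, plant)
    total = comb(universe, plant)
    # Sum the probability mass of every overlap at least as large as observed.
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, plant - k)
        for k in range(max(overlap, 0), max_overlap + 1)
    )
    return tail / total


# Toy numbers (not from the paper): 1000 compounds overall, a 20-compound
# pathway, a plant with 100 annotated compounds, 8 of them in the pathway.
p_value = pathway_enrichment_pvalue(1000, 20, 100, 8)
is_enriched = p_value < 0.05  # threshold quoted in the text
print(f"p-value = {p_value:.3g}, enriched = {is_enriched}")
```

With these toy numbers the expected overlap is 2 compounds, so an observed overlap of 8 falls far in the upper tail. The same function could be applied per pathway and per plant, under the assumptions stated above.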
Next, we used the experimentally detected compounds in the 13 plants for which we performed untargeted metabolomics to estimate the performance of our approach. Similar to known machine learning methods, we use ΔG scores as a 'classifier' predicting the likelihood of a compound to accumulate. We then compare these predictions to our experimentally detected compound catalogue, which serves as ground truth (a binary classification denoting presence or absence). We calculated standard performance metrics, such as the true positive and false positive rates and the area under the receiver operating characteristic (ROC) curve, AUC_ROC. In addition, we calculated the precision, recall and F1 scores. Since our data may be imbalanced, we initially set the threshold of prediction to be larger than zero (positive values). We then performed a moving threshold analysis (see Methods) to determine the optimal threshold for best performance in each plant found in both our annotation and experimental catalogue (Table S1) [...]

Since the first block of the InChIKey represents several stereoisomers, we assign a bit vector to each first block as the union of all bit vectors representing the related isomers. We then compare the degree of structural similarity between any pair of compounds by computing the Jaccard similarity between binary vectors.
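The isomer merging and Jaccard comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: fingerprints are represented as plain Python integers used as bit masks, the InChIKeys are made-up placeholders, and the fingerprint type actually used is not stated in this excerpt (in practice the bit vectors would come from a cheminformatics toolkit). Returning 0.0 for two empty vectors is also an assumed convention.

```python
from collections import defaultdict


def first_block(inchikey: str) -> str:
    """Return the first block of an InChIKey (the part before the first hyphen)."""
    return inchikey.split("-")[0]


def merge_isomers(fingerprints):
    """Union (bitwise OR) the bit vectors of all entries sharing a first block.

    fingerprints: iterable of (inchikey, bits) pairs, with bits as an int mask.
    """
    merged = defaultdict(int)
    for inchikey, bits in fingerprints:
        merged[first_block(inchikey)] |= bits
    return dict(merged)


def jaccard(a: int, b: int) -> float:
    """Jaccard similarity: shared on-bits divided by on-bits in either vector."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumed convention for two all-zero vectors
    return bin(a & b).count("1") / union


# Placeholder keys and 4-bit toy fingerprints (hypothetical, for illustration).
toy = [
    ("AAAAAAAAAAAAAA-BBBBBBBBBB-N", 0b0011),  # isomer 1 of compound A
    ("AAAAAAAAAAAAAA-CCCCCCCCCC-N", 0b0110),  # isomer 2 of compound A
    ("DDDDDDDDDDDDDD-BBBBBBBBBB-N", 0b0101),  # compound D
]
merged = merge_isomers(toy)
similarity = jaccard(merged["AAAAAAAAAAAAAA"], merged["DDDDDDDDDDDDDD"])
print(merged, similarity)
```

Computing `jaccard` for every pair of merged first blocks would give the pairwise similarity matrix for a set of compounds.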
We assess structural similarity for a given set of N chemical compounds by calculating the intrinsic dimension of their Jaccard similarity matrix, a function of the spectrum of that matrix [...]

A Flavonoid that has multiple cardio-protective effects and its molecular mechanisms
The unmapped chemical complexity of our diet
Exploring food contents in scientific literature with [...]
Gene Ontology Meta Annotator for Plants (GOMAP)
The C- and G-value paradox with polyploidy
Precision phenotyping and association between morphological traits and nutritional content in Vegetable Amaranth (Amaranthus spp.)
Functional relationship of vegetable colors and bioactive compounds: Implications in human health
Choosing suitable food vehicles for functional food products
Evolutionary routes to biochemical innovation revealed by integrative analysis of a plant-defense related [...]
The exposome and health: Where chemistry meets biology
Dictionary of food compounds with CD-ROM
KEGG: kyoto encyclopedia of genes and genomes
Genome-wide prediction of metabolic enzymes, pathways, and gene clusters in plants
The BioCyc collection of microbial genomes and metabolic pathways
Comparative Metabologenomics Analysis of Polar Actinomycetes
Discovery of the Tyrobetaine Natural Products and Their Biosynthetic Gene Cluster via Metabologenomics
[...] Define a Novel Class of Chimeric Lanthipeptides
Natural products targeting strategies involving molecular networking: Different manners, one goal
Global maize production, utilization [...]
Acylsugar acylhydrolases: Carboxylesterase-catalyzed hydrolysis of acylsugars in tomato trichomes
A role for differential glycoconjugation in the emission of phenylpropanoid volatiles from tomato fruit discovered using a metabolic data fusion approach
Comparative mass spectrometry-based metabolomics strategies for the investigation of microbial secondary metabolites
Diversity and relationship among grain, flour and starch characteristics of Indian Himalayan colored corn accessions
Purification and characterization of a galactose-specific lectin from corn (Zea mays) coleoptyle
Management of symptoms, pain and mobility with supplementary managements (including Movardol Forte) in osteoarthrosis: A 6-month, morphology supplement study
LCMS-based metabolomics for quantitative measurement of NAD+ metabolites
Nicotinamide Riboside - The Current State of Research and Therapeutic Uses
Methylthioadenosine and Cancer: old molecules, new understanding
Cell Death Discovery
ADO/hypotaurine: a novel metabolic pathway contributing to glioblastoma development
Hypotaurine evokes a malignant phenotype in glioma through aberrant hypoxic signaling
The Effect of L-Carnitine Supplementation on the Quality of Cryopreserved Chicken Semen
Pipecolic acid confers systemic immunity by regulating free radicals
METABOLIC ENGINEERING AND SYNTHETIC BIOLOGY-REVIEW Expanding lysine industry: industrial biomanufacturing of lysine and its derivatives
An economically and environmentally acceptable synthesis of chiral drug intermediate l-pipecolic acid from biomass-derived lysine via artificially engineered microbes
Pantoea ananatis converts MBOA to 6-methoxy-4-nitrobenzoxazolin-2(3H)-one (NMBOA) for cooperative degradation with its native root colonizing microbial consortium
High fructose induced adipogenesis and inhibitory potential of trigonelline on murine mesenchymal stem cells: A morphological study
Syringic acid (SA) - A Review of Its Occurrence, Biosynthesis, Pharmacological and Industrial Importance
Diferuloylputrescine and p-coumaroyl-feruloylputrescine, abundant polyamine conjugates in lipid extracts of maize kernels
The ModelSEED Biochemistry Database for the integration of metabolic annotations and the reconstruction, comparison and analysis of metabolic models for plants, fungi and microbes
Kernel amino acid composition and protein content of introgression lines from Zea mays ssp. mexicana into cultivated maize
Large-scale metabolite quantitative trait locus analysis provides new insights for high-quality maize improvement
Towards a More Reliable Identification of Isomeric [...]
Mass-Spectrometry-Based Identification of Synthetic Drug [...]
Linking genomics and metabolomics to chart specialized metabolic diversity
ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data
SciPy 1.0: fundamental algorithms for scientific computing in Python
Rdkit: Open-source cheminformatics software
Multidose evaluation of 6,710 drug repurposing library identifies potent SARS-CoV-2 infection inhibitors In Vitro and In Vivo

Figure 1 - Plant phylogenetic diversity and schematic overview of genomics contribution to food composition knowledge. (a) Phylogenetic tree representing the 75 plants in our collection colored by plant order. (b) Collection and evaluation of genomics-based annotations to food composition knowledge. Functional annotations include compounds, reactions and pathways. (c) Kinetics-based annotations help us to infer compounds likely to accumulate and hence experimentally detectable. (d) Validation of our kinetics-based approach against new metabolomics experiments.

Figure 2 - Genomics-based annotations boost corn composition knowledge. (a) Database annotations for corn, indicating that some compounds are annotated in multiple databases. The total number (N=4,721) represents the number of unique compounds for corn after the addition of genomics-based annotations. Other food related databases used in this work are DFC (the Dictionary of Food Compounds) [...] experiments denote the set of compounds collected by metabolomics experiments reported here for corn.
(b) A partial adaptation of the Monoterpenoid biosynthesis pathway in corn (KEGG) showing annotation availability, overlap and gaps in the coverage of different databases and genomics-based annotations. The different colors denote annotation sources as single or multiple concentric circles. (c) Vitamin E biosynthesis (tocopherols) pathway in corn (PlantCyc). Circles denote compounds and edges denote reactions.