key: cord-289447-d93qwjui
authors: Helmy, Mohamed; Smith, Derek; Selvarajoo, Kumar
title: Systems biology approaches integrated with artificial intelligence for optimized food-focused metabolic engineering
date: 2020-10-09
journal: Metab Eng Commun
DOI: 10.1016/j.mec.2020.e00149
sha: 
doc_id: 289447
cord_uid: d93qwjui

Metabolic engineering aims to maximize the production of bio-economically important substances (compounds, enzymes, or other proteins) through the optimization of the genetics, cellular processes and growth conditions of microorganisms. This requires detailed understanding of underlying metabolic pathways involved in the production of the targeted substances, and how the cellular processes or growth conditions are regulated by the engineering. To achieve this goal, a large system of experimental techniques, compound libraries, computational methods and data resources, including the multi-omics data, is used. The recent advent of multi-omics systems biology approaches significantly impacted the field by opening new avenues to perform dynamic and large-scale analyses that deepen our knowledge on the manipulations. However, with the enormous amount of transcriptomics, proteomics and metabolomics data available, it is a daunting task to integrate the data for a more holistic understanding. Novel data mining and analytics approaches, including Artificial Intelligence (AI), can provide breakthroughs that traditional low-throughput, experiment-only methods cannot easily achieve. Here, we review the latest attempts of combining systems biology and AI in metabolic engineering research, and highlight how this alliance can help overcome the current challenges facing industrial biotechnology, especially for food-related substances and compounds using microorganisms.

With the growing population of our planet, food security remains a major challenge facing mankind. This is especially true for countries that do not possess large land spaces for agriculture, such as those in the Middle East (deserts), Japan (mostly mountainous), and Singapore (land scarce). Moreover, nature conservationists are mostly against the clearing of wild flora and fauna to feed the world. Thus, looking at the long term, food security can become a pressing issue for many nations. The Rome Declaration (1996) by the Food and Agriculture Organization (FAO) defines food security as "Food security, [is achieved] when all people, at all times, have physical and economic access to sufficient, safe and nutritious food to meet their dietary needs and food preferences for an active and healthy life" [1] .

On the other hand, there is also a growing awareness of healthy diets, as diet is considered to be the most significant risk factor that affects general health and causes diseases, disability or premature death. The trending diets are mostly focused on eating habits that are nutritious, help lose weight, and avoid processed foods, especially those with preservatives, or foods with artificial ingredients such as artificial flavours or colours [2]. These include plant-based diets (such as vegan) and low calorie fat-burning diets (such as ketogenic) [3, 4]. Thus, the challenge is not only to produce enough food but also to produce food that is safe, nutritious and appealing to the customer's preference.

Food security has become even more important during the ongoing COVID-19 pandemic when countries have largely closed their borders, affecting the food import-export trade [5] .

There are several types of modelling approaches today that can be largely grouped into i) parametric approaches such as dynamic modeling using ordinary differential equations [23], and ii) non-parametric models using Boolean logic, stoichiometric matrices and Bayesian inference algorithms [25, 26]. A dynamic model built using differential equations constructs an organism's metabolism step by step using known biochemical reactions and reaction kinetics from its genomic, enzymatic and biochemical information derived from experiment (Figure 2A). Using this information, the models are used to predict metabolic outcomes for different in silico perturbations, or to understand the key regulatory mechanisms (such as bottlenecks) and flux distributions in response to a given perturbation [27, 28]. In other words, the dynamic models utilize a priori knowledge of metabolic pathways, enzymatic mechanisms and temporal experimental data to simulate the concentrations of metabolites over time. These models are usually referred to as kinetic models [29].
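As a minimal sketch of what such a kinetic model looks like in practice, the following simulates a two-step pathway (S to I to P) with Michaelis-Menten rate laws and then applies an in silico perturbation to the second enzyme. The pathway, the species names and all parameter values are hypothetical and chosen only for illustration; a real model would take its kinetics from experimental data.

```python
# Toy kinetic model of a two-step pathway: S -> I -> P.
# All species names and parameter values are hypothetical.

def michaelis_menten(vmax, km, conc):
    """Rate of one enzymatic step under Michaelis-Menten kinetics."""
    return vmax * conc / (km + conc)


def derivatives(state, params):
    """Right-hand side of the ODE system d[S, I, P]/dt."""
    s, i, p = state
    v1 = michaelis_menten(params["vmax1"], params["km1"], s)  # S -> I
    v2 = michaelis_menten(params["vmax2"], params["km2"], i)  # I -> P
    return (-v1, v1 - v2, v2)


def simulate(initial_state, params, dt=0.01, steps=2000):
    """Integrate with the classical fourth-order Runge-Kutta scheme."""
    state = tuple(initial_state)
    trajectory = [state]
    for _ in range(steps):
        k1 = derivatives(state, params)
        k2 = derivatives(tuple(x + 0.5 * dt * k for x, k in zip(state, k1)), params)
        k3 = derivatives(tuple(x + 0.5 * dt * k for x, k in zip(state, k2)), params)
        k4 = derivatives(tuple(x + dt * k for x, k in zip(state, k3)), params)
        state = tuple(
            x + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        trajectory.append(state)
    return trajectory


baseline = {"vmax1": 1.0, "km1": 0.5, "vmax2": 0.8, "km2": 0.3}
# In silico perturbation: reduce the second enzyme to a quarter of its activity.
knockdown = dict(baseline, vmax2=0.2)

start = (10.0, 0.0, 0.0)  # initial concentrations of S, I, P
product_baseline = simulate(start, baseline)[-1][2]
product_knockdown = simulate(start, knockdown)[-1][2]
print(f"P at t=20: baseline={product_baseline:.2f}, knockdown={product_knockdown:.2f}")
```

Comparing the two runs shows how a bottleneck at the second step lowers product formation over the same time window, which is the kind of question such models are typically asked.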

Although kinetic models have been widely used and have proven their benefits [24] , for large-scale modeling, such as genome-scale modeling, it is a daunting challenge to use dynamic modeling due to the absence of large-scale experimentally measured and reliable kinetics [23] .

To overcome this major challenge, as a trade-off, scientists use other types of modeling such as the parameter-less stoichiometric constraint-based modeling approaches. Constraint-based models have constraints for each decision that represent the minimum and maximum values of the decision (e.g. the minimum and maximum reaction rates) [30]. A widely used constraint-based modeling approach is flux balance analysis (FBA) [31]. FBA models thousands of metabolites and reactions with reasonable computational cost and prediction outcome (Figure 2B).

Figure 2. Schematic representation of different modeling approaches used in metabolic engineering. A) Mathematical modeling of metabolic pathways. B) Flux balance analysis (FBA) modeling. C) Steps of promoter-strength modeling using statistical models and mutations data. D) Ensemble modeling.
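The constraint-based idea behind FBA can be illustrated with a small linear program: fluxes must balance at steady state (S v = 0), each flux is bounded by a minimum and maximum rate, and one flux is maximized. The sketch below uses SciPy's linprog on a hypothetical five-reaction network; the reactions, bounds and objective are assumptions made for this example and are not taken from any published model.

```python
from scipy.optimize import linprog

# Toy network (hypothetical): uptake -> A, A -> B, A -> C, B -> product, C -> byproduct.
reactions = ["R1_uptake", "R2_A_to_B", "R3_A_to_C", "R4_product", "R5_byproduct"]

# Stoichiometric matrix S: rows are metabolites A, B, C; columns are reactions.
S = [
    [1, -1, -1,  0,  0],  # A
    [0,  1,  0, -1,  0],  # B
    [0,  0,  1,  0, -1],  # C
]

# Flux bounds (min, max) per reaction: uptake is capped at 10 and the
# side branch R3 is forced to carry at least 2 units.
bounds = [(0, 10), (0, 1000), (2, 1000), (0, 1000), (0, 1000)]

# linprog minimizes, so maximize the product flux by minimizing its negative.
objective = [0, 0, 0, -1, 0]

result = linprog(objective, A_eq=S, b_eq=[0, 0, 0], bounds=bounds, method="highs")
flux = dict(zip(reactions, result.x))
print(result.status, {name: round(value, 3) for name, value in flux.items()})
```

In this toy case the optimum sends 8 of the 10 units of uptake to the product, because the forced side branch consumes the other 2. No kinetic parameters are needed, which is the trade-off the text describes.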

Although numerous works have used metabolic regulation to control the production of targeted metabolites, recent works indicate that transcriptional and translational control can provide a significant fold increase in the intended yield output [13, 33]. Transcriptional control changes the way the gene of interest is regulated by manipulating its promoter region. This includes modifications such as mutating the ribosomal binding sites (RBS) or the transcription factor binding sites (TFBS), designing and inserting short sequences (e.g. new binding sites), or designing an artificial promoter region [33]. Transcriptional control requires deep understanding of how the gene of interest is regulated (activators, enhancers and suppressors) as well as knowledge of its genomic structure around the binding sites (such as nucleosome positions) [34] (Figure 2C). Thus, modeling transcriptional control remains a challenge, as it requires complex data involving quantitative gene expression under each mutation condition to train a model that simulates the effect of each mutation and can then be used to predict the impact of a new mutation. Nevertheless, statistical approaches such as position weight matrix (PWM) modeling, which scores aligned sequences that are likely functionally related, have shown promise for understanding the mutational impact on transcriptional regulation in mammalian disease cells [35, 36]. Such methods could be explored in the future for controlling transcriptional efficiency for metabolic engineering outcomes.

[...] crucial role in the development of the ensemble predictions, thereby reducing the number of models to a smaller set [38].
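The position weight matrix (PWM) scoring mentioned above in the context of transcriptional control can be sketched in a few lines. The example builds a log-odds PWM from a handful of aligned binding-site sequences and compares the score of a reference site with that of a single-base mutant. The aligned sites, the pseudocount and the uniform background frequency are all hypothetical choices made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    pwm = []
    total = len(sites) + 4 * pseudocount
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, sequence):
    """Sum the per-position log-odds weights of a sequence."""
    return sum(weights[base] for weights, base in zip(pwm, sequence))


# Hypothetical aligned binding sites (not taken from any database).
aligned_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_pwm(aligned_sites)

wild_type_score = score(pwm, "TATAAT")
mutant_score = score(pwm, "TAGAAT")  # third base mutated from T to G
print(f"wild type: {wild_type_score:.2f}, mutant: {mutant_score:.2f}")
```

A drop in score for the mutant suggests a weaker match to the motif; whether that translates into lower expression would still have to be measured.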

In one example, ensemble modeling robustness analysis (EMRA) was applied to two non-native central pathways for carbon conservation, the non-oxidative glycolysis (NOG) and the reverse glyoxylate cycle (rGC) pathways. EMRA successfully determined the probability of system failure and identified possible targets for flux improvement [39]. In another study, ensemble modeling was used to help in developing an L-lysine-producing strain in E. coli [40]. Nevertheless, ensemble modeling comes with some major challenges.

Building an ensemble with different modeling algorithms is more difficult than using any standard modeling strategy; the requirement of perturbation-response data makes it similar to many other data-dependent modeling strategies that perform poorly in the absence of reliable data; and its overall results are difficult to interpret. These limitations hinder the utility of this powerful modeling approach.

Another widely used modeling approach for metabolic engineering is in silico three-dimensional (3D) molecular modeling for the study of receptor/enzyme-ligand docking and protein homology design [41]. It has a wide range of applications in drug design and metabolism, the design of research and therapeutic antibodies, and molecular interactions research (protein-protein and protein-DNA interactions). In metabolic engineering, 3D modeling is used to design and simulate engineered enzymes that are indispensable for the optimization process of the microorganism's metabolism [42]. In protein engineering, where no structural data is available, molecular modelling is used to model the 3D structures of enzymes and, coupled with enzyme-substrate docking studies, can be used to target regions of interest to improve various attributes, such as specificity, activity and stability under a given environment. This has been used to great effect for single enzymes as in vitro industrial biocatalysts (e.g. sitagliptin [43]), as well as for entire enzyme cascades (e.g. islatravir [44]) for the production of active pharmaceutical ingredients.

Dynamic modeling strategies, as mentioned above, often depend on the parameters that are used to build the model. The parameters (such as reaction kinetics or flux ranges) can be determined using bottom-up or top-down approaches [45]. The bottom-up approach is highly dependent on experiments (such as in vitro enzymatic assays) since it requires information on the reaction kinetics of each enzyme, which is highly challenging to determine for all the enzymes in a pathway or network. Furthermore, even if information is obtained from in vitro experiments, the data are often several orders of magnitude different from actual in vivo experiments [46].

Moreover, modeling usually requires data (kinetics or flux rates) for multiple conditions or time points to train the model and test its accuracy or applicability, which requires iterative experimental work [18]. Although bottom-up modeling approaches often use optimization algorithms, such as the genetic algorithm, to estimate the model parameters, the complex and non-linear nature of the relationships between metabolites limits the usefulness of the model fitting algorithms [45, 47].
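To make the parameter estimation step concrete, the sketch below fits the two parameters of a Michaelis-Menten rate law to synthetic rate measurements with a very small evolutionary search (selection, uniform crossover and Gaussian mutation). It is a simplified stand-in for the genetic algorithms used in the cited work, not a reproduction of any of them; the data, population size, mutation scale and function names are assumptions for illustration.

```python
import random


def mm_rate(vmax, km, substrate):
    """Michaelis-Menten rate for a given substrate concentration."""
    return vmax * substrate / (km + substrate)


def loss(params, data):
    """Sum of squared errors between model rates and measured rates."""
    vmax, km = params
    return sum((mm_rate(vmax, km, s) - v) ** 2 for s, v in data)


def evolve(data, pop_size=40, generations=60, seed=0):
    """Minimal evolutionary search over (vmax, km); the best candidates are always kept."""
    rng = random.Random(seed)
    population = [(rng.uniform(0.1, 10.0), rng.uniform(0.1, 10.0)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda p: loss(p, data))
        history.append(loss(population[0], data))
        parents = population[: pop_size // 4]  # selection with elitism
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by multiplicative Gaussian mutation.
            child = tuple(
                max(1e-6, rng.choice(pair) * (1.0 + rng.gauss(0.0, 0.1)))
                for pair in zip(a, b)
            )
            children.append(child)
        population = parents + children
    best = min(population, key=lambda p: loss(p, data))
    return best, history


# Synthetic, noise-free "assay" data generated from hypothetical true values.
true_vmax, true_km = 5.0, 2.0
data = [(s, mm_rate(true_vmax, true_km, s)) for s in (0.5, 1.0, 2.0, 4.0, 8.0, 16.0)]

best, history = evolve(data)
print("estimated (vmax, km):", tuple(round(x, 3) for x in best))
```

With only two parameters and clean data the search is easy; the difficulty described in the text arises when many coupled, non-linear rate laws have to be fitted at once to sparse and noisy measurements.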

Another limitation is the scale of the model. Since the bottom-up approach requires detailed experimental measurements, it is more suitable for small-scale models. Extending the model size requires either more experiments (higher cost and longer time) or greater reliance on computational estimation of the parameter values (lower accuracy). Thus, an accurate dynamic model based on a bottom-up approach is difficult to establish due to the extended level of uncertainty in the kinetic properties of the enzymes and their reactions [48]. Ensemble modeling helps in building large-scale models; however, it also suffers from major limitations, as mentioned earlier.

On the other hand, the top-down approaches utilize time series metabolomic data to indirectly infer the kinetics, flux rates or concentrations of metabolites, through the establishment of correlation and causation networks between metabolites [45] . The causation network establishes the cause-effect relationships between the metabolites in the networks and is usually built using time series metabolomic data, while the correlation network uses mathematical and statistical methods to determine the probable relation between the enzymes and metabolites in the network [47] . Nevertheless, the top-down approach has shown notable success in analyzing cellular pathways with simple linear response or mass-action kinetic models with little parameter sensitivity [29, 49] .
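A correlation network of the kind used in top-down approaches can be sketched as follows: compute the Pearson correlation between every pair of metabolite time series and keep the pairs whose absolute correlation exceeds a threshold. The metabolite names, the concentration values and the threshold are hypothetical, and a correlation edge on its own says nothing about causation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)  # assumes neither series is constant


def correlation_network(series, threshold=0.9):
    """Return {(metabolite_a, metabolite_b): r} for strongly correlated pairs."""
    edges = {}
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


# Hypothetical time-series concentrations at five time points.
time_series = {
    "glucose": [10.0, 8.0, 6.0, 4.0, 2.0],
    "pyruvate": [1.0, 2.0, 3.0, 4.0, 5.0],
    "lactate": [0.5, 1.1, 1.4, 2.1, 2.4],
    "atp": [3.0, 1.0, 4.0, 1.0, 5.0],
}

edges = correlation_network(time_series)
for pair, r in edges.items():
    print(pair, round(r, 3))
```

Here glucose, pyruvate and lactate end up connected (glucose with negative coefficients, since it is consumed), while the weakly correlated ATP series has no edges.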

Comparative 3D protein modelling is most commonly performed using template-based methods, where homologous protein structures are used to generate models using stand-alone programs such as MODELLER [50] or through online servers such as ROBETTA, which incorporates the RosettaCM method [51], HHPRED [52], and I-TASSER [53]. These methods produce useful models where good templates are available, but many protein sequences of interest have limited template information, and so poor-quality models are common, which hinders their practical application in guiding protein engineering work.

Most of the above-mentioned modeling strategies require the availability of sufficient and high-quality experimental data. The data include metabolite concentrations and their chemical structures, properties, pathways, reaction rates, genomic sequences, genome annotations, transcriptome sequences, gene expression data and many other types of data, as required for the respective modeling strategies. Fortunately, a large number of bioinformatics databases and servers are now freely available with most of this data. Many of them are meta-databases that collect and aggregate data from multiple sources, such as KEGG, Pathway Commons and MetaCyc [54] [55] [56]. Despite the benefits of these bioinformatics resources, the challenge is in finding the correct dataset and modeling/analytical approaches to take advantage of this wealth of data. This, therefore, raises the need for novel data mining and data analytics approaches, such as artificial intelligence (AI).

Artificial intelligence (AI) provides computers the ability to make decisions based on analyzing the data independently by following predetermined rules or pattern recognition models. Since its introduction in 1956, AI has become a hot research area after proving useful in solving several challenges across many fields [57] . AI and many of its modern techniques such as machine learning (ML) contribute significantly to things that we use in our daily life; from the voice recognition that we use when interacting with smart devices to the algorithms that decide the contents that we see on our social media to the modern-day autonomous cars that will soon be cruising our streets. AI can now read, write, listen, respond to questions, play games or even engage in conversations [58] . It is also playing a significant role in science, technology and research.

In the biomedical and biotechnology fields in particular, AI is heavily employed in addressing certain research challenges while being under-utilized in other aspects. The drug and vaccine discovery fields, for instance, are employing AI to address the challenges of developing new drugs, repurposing existing drugs, understanding drug mechanisms, designing and optimizing clinical trials and identifying biomarkers [59] . Recent surveys show that more than 40 pharma companies and 230 startup companies are employing AI in different aspects of drug discovery [60, 61] . This has resulted in the development of over one hundred drugs that are in different development phases in the fields of oncology, neurology and infectious diseases [62] .

Furthermore, research on COVID-19 drug and vaccine development is employing AI, and this has resulted in dozens of promising drug lead compounds and vaccines in such a short period of time [63, 64]. AI is also employed in the fields of genomics, protein-protein interaction prediction, signaling pathways prediction and analysis, protein-DNA binding, cancer diagnosis, and genomic mutation variant calling, among several other applications [65] [66] [67] [68] [69].

On the other hand, AI is not similarly utilized in the fields of metabolomics and metabolic engineering, especially for food applications. Although the idea of combining systems biology and AI (machine learning in particular) to study metabolism is relatively old [70], its applications are still underexplored. Machine learning (ML) is the field of AI concerned with developing computer programs that learn and improve their performance automatically based on experience, without being explicitly programmed [71]. In the last few years, ML research and techniques have improved as large datasets generated by modern analytical lab instruments have become available. Therefore, in recent reports we are starting to see ML-based research in identifying weight loss biomarkers [72], the discovery of food identity markers [73], farm animal metabolism [74] and many other applications in untargeted metabolomics [75, 76]. In metabolic engineering, several areas are starting to take advantage of ML and systems biology integration, including pathway identification and analysis, modeling of metabolism and growth, and 3D protein modeling (Figure 3).

Pathway identification and analysis is crucial for metabolic engineering. It is common that the biochemical pathway of a targeted substance (e.g. enzyme or compound) is unknown or poorly studied. Furthermore, in many cases, the gene(s) or gene cluster responsible for producing the targeted substance needs to be transferred to a model organism so that it can be easily manipulated and optimized [12]. As mentioned above, the different modeling techniques have their limitations, and when omics data are combined and analyzed with standard data analysis approaches for pathways, the final predictions come with their own uncertainty [77].

ML can be utilized to identify the pathways upstream of the substance. For instance, ML models that used naive Bayes, decision trees, logistic regression and pathway information of many organisms were used in MetaCyc to predict the presence of a novel metabolic pathway in a newly-sequenced organism. The analysis of the model performance showed that most of the information about the presence of a pathway in an organism is contained in a small set of the features used. In particular, the number of reactions along the path from input to output compound was the most informative feature [45]. In general, the ML models used for pathway prediction showed better performance than the standard mathematical and statistical methods [78]. Nevertheless, pathway discovery still relies heavily on traditional approaches such as gene sequence similarity and network analysis. Thus, better ML algorithms/methods for pathways discovery are

ML can be invaluable for the identification of important genes or enzymes in the pathways of interest. ML classifiers, such as support vector machine, logistic regression and decision tree-based models, have been instrumental in predicting gene essentiality within metabolic pathways through training and testing models on labeled data of essential and non-essential genes [79]. ML was also used in finding new drug targets by determining the essential enzymes in a metabolic network, with each enzyme characterized by its local network topology, co-expression, gene homologies and flux balance analyses [80]. Plaimas et al used an ML model that was trained to distinguish between essential and non-essential reactions, followed by experimental validation using the phenotypic outcome of single knockout mutants of E. coli (KEIO collection) [80].
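As a rough illustration of an essentiality classifier of the kind described above, the sketch below trains a logistic regression model by batch gradient descent on labeled examples. The two features (a normalized network degree and an FBA-predicted growth ratio after knockout), the training values and the hyperparameters are all hypothetical; the cited studies used richer feature sets and other classifiers as well.

```python
import math

# Hypothetical features per gene: (normalized network degree,
# FBA-predicted growth ratio after knockout). Label 1 = essential.
training = [
    ((0.9, 0.00), 1), ((0.7, 0.10), 1), ((0.8, 0.05), 1), ((0.6, 0.20), 1),
    ((0.2, 0.90), 0), ((0.3, 1.00), 0), ((0.1, 0.95), 0), ((0.4, 0.80), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Predicted probability that a gene is essential."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(data, learning_rate=0.5, epochs=2000):
    """Fit logistic regression with plain batch gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for features, label in data:
            error = predict(weights, bias, features) - label
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


weights, bias = train(training)

# Score two unseen, hypothetical genes.
likely_essential = predict(weights, bias, (0.85, 0.03))
likely_dispensable = predict(weights, bias, (0.25, 0.91))
print(round(likely_essential, 3), round(likely_dispensable, 3))
```

In practice the predictions would be checked against held-out labels or knockout phenotypes, as in the experimental validation mentioned in the text.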

In an earlier study, the side effects of drugs on the metabolic network were investigated by predicting an enzyme inhibitory effect through building an ML model. The model used network topology, functional classes of inhibitors and enzymes as background knowledge, with logic-based representation and a combination of abduction and induction methods to predict drug inhibitory side effects [81] .

Newly sequenced genomes undergo two types of annotation: structural annotation and functional annotation. Structural annotation is the process of identifying the genome components and their structures (e.g. identifying genes, their exons, introns and UTRs or their regulatory regions), while functional annotation identifies the functions of the genes and their products. Both types of annotation are important for metabolic engineering research. Structural annotation identifies the genes, their sequences, length and structure and, therefore, helps in finding alternative organisms where the same gene, pathways or gene clusters exist. Functional annotation helps in identifying organisms that produce the same substance or tolerate the same growth conditions. Comparative genomics, network biology and traditional bioinformatics methods, such as sequence alignment, are usually utilized in this process [82, 83].

The rapid advancements in genome sequencing technologies and the significant drop in their cost in the last decade raised the need for fast and accurate annotation methods [84]. This resulted in the development of several new annotation methods that analyse newly sequenced genomes from different sequencing platforms. These methods addressed many of the challenges; however, others remain, such as missed short genes and erroneous exon start and end annotations [85, 86]. Thus, several other methods were introduced with the idea of combining multi-omics data, in particular proteomic and transcriptomic data, in the process of genome annotation [87] [88] [89] [90]. Despite these efforts, over 20% of the sequenced genomes in the Genomes OnLine Database (GOLD) are still awaiting annotation [91].

The high-volume and multi-dimensional nature of genome sequencing data makes it very suitable for applications of machine learning algorithms [92]. The ML model is trained using annotated genomes to identify genome structures, e.g. genes or regulatory regions, and uses their features to identify the same structures in newly sequenced genomes [93]. Yip [...] DeepAnnotator, an annotation tool that outperformed the NCBI annotation pipeline in RNA gene annotation [95]. The new versions of the annotation tool GeneMarkS for prokaryotic genome annotation (GeneMarkS2+) and the eukaryotic self-training gene finder (GeneMark-EP+) both utilize ML algorithms in the annotation process [94, 96]. Deep convolutional neural networks were used to annotate gene-start sites in different species by training the model using the sites from one species as the positive sample and random sequences from the same species as the negative sample. The model was able to identify gene-start sites in other species [97].

Although the idea of employing ML in functional annotation started relatively early, it is still underutilized in functional annotation compared to structural annotation. An early attempt at using ML in gene functional annotation from biomedical literature utilized Hierarchical Text Categorization (HTC) [98], while Tetko et al provided high-quality curated functional annotation data as a benchmark dataset for the developers of ML-based functional annotation methods for bacterial genomes [99]. Recent reports show the application of ML-based methods in a wide variety of functional annotation tasks, such as the discovery of missing or wrong protein function annotations [100], predicting gene functions in plants [101], controlling the false discovery rate (FDR) and increasing the accuracy of protein functional predictions [102], and genome-wide functional annotation of splice variants in eukaryotes [103].

The advancements of -omics technologies have resulted in a huge accumulation of data (genomics, transcriptomics, proteomics and metabolomics) that is estimated to grow in size to exceed astronomical levels by 2025 [104] . This enormous amount of data has shifted scientific research more towards data-driven approaches such as ML [45] . Combining ML methods with omics data is a typical systems biology approach to address several biomedical challenges. An ML approach was used to replace the traditional kinetic models in estimating the metabolite concentrations over time by combining ML models, proteomic and metabolomic time series data [58] . Also, proteomic and metabolomic data of yeast were combined under several perturbation conditions (97 kinase knockouts), and ML was used to predict the yeast metabolome using the enzyme expression proteome of each kinase-deficient condition. The ML quantifies the role of enzyme abundance through mapping the regulatory enzyme expression patterns then utilizing them in predicting the metabolome under the knockout condition [70] .

The availability of transcriptome data and the ability of ML methods to deal with big data led to the development of several genome-scale methods to predict the phenotype using ML models. To take advantage of the accumulated transcriptome data, a biology-guided deep learning system named DeepMetabolism was developed [105]. DeepMetabolism uses transcriptomics data to predict cell phenotypes. It integrates unsupervised pre-training with supervised training to predict the phenotype with high accuracy and high speed. On the other hand, Jervis et al implemented an ML algorithm to model the sequence-phenotype relationship of bacterial ribosome binding sites (RBSs) and accurately predicted the optimal high-producers, an approach that is directly applicable to a wide range of metabolic engineering applications [106].

Despite the progress in applying ML techniques in metabolic research, ML is still far from being fully utilized in some important aspects of metabolic engineering, especially in metabolic pathways identification, analysis and bioprocess optimization for the food-based research and industries.

In the field of 3D protein modeling, several AI-based advances are also noted. The most recent Critical Assessment of protein Structure Prediction (CASP) meeting in 2018 saw AI methods come of age. The program AlphaFold [107] used a neural net to extract covariant residue pairs from sequence alignments, coupled with estimated distances between them (from 2 to 20 Å), and then used the ROSETTA energy function [108] to fold the protein based on these AI-derived restraints. AlphaFold performed exceptionally well in the competition, giving high-accuracy models with template-modelling scores of 0.7 or higher for 24 out of 43 domains (as compared with 14/43 for the next best method). This has been developed into a lab-based version called ProSPr [109]. Yang et al used a similar protocol, but with added estimation of relative residue orientations, resulting in trROSETTA [110], which improved predictions still further. These 3D modelling methods may be implemented into a comprehensive metabolic engineering platform.
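For readers unfamiliar with the template-modelling score (TM-score) quoted above, the sketch below computes it for one fixed superposition of a model onto a target, using the standard length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8. The full TM-score additionally takes the maximum over superpositions, which this sketch does not attempt; the function name and the input distances are hypothetical.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: C-alpha distances (in angstroms) between aligned residue pairs.
    target_length: number of residues in the target (reference) structure.
    Residues of the target that are not aligned simply contribute zero.
    """
    if target_length < 19:
        # Below this length the scale d0 would not be positive.
        raise ValueError("target too short for the d0 formula used here")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical 100-residue target.
perfect_model = tm_score([0.0] * 100, 100)
half_aligned = tm_score([0.0] * 50, 100)
two_angstrom_model = tm_score([2.0] * 100, 100)
print(perfect_model, half_aligned, round(two_angstrom_model, 3))
```

The score lies between 0 and 1 and, because of the length-dependent d0, is less sensitive to protein size than a raw root-mean-square deviation.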

One area that could be addressed in the improvement of 3D protein modelling methods is the inclusion of cofactors. Many enzymes are folded around cofactors: small-to-large organic molecules which form part of the catalytic machinery, such as flavin adenine dinucleotide (FAD) or haem. These molecules are often removed in template-based modelling (both manual and automated versions), yet their presence is often important for the correct folding of the enzyme [111]. Their removal lowers the quality of the model because key restraints are lost from the structure, and extra docking or structure manipulation is then required to reinsert the cofactor after modelling. It should be possible to include the presence of cofactors through a survey of the Protein Data Bank [112], where ML methods can be used to identify key determinants of cofactor binding, coupled with identification of these determinants within a target sequence, and application of a combined sequence-and-template-based optimization protocol inclusive of these structural features.

An extension of this might also be used for identification of substrates for enzymes within a metabolic pathway or of unnatural substrates, which is particularly valuable for the development of synthetic biosynthetic pathways. One input would be enzyme sequence alignments of known function, as well as structural information for both enzyme families and substrates. A neural network could be used to identify common patterns of binding pocket residues across multiple families of enzymes for different substrates, and identify potential sequences that would be suitable for inclusion in a particular metabolic pathway, inclusive of sequence determinants for ease of inclusion into heterologous expression systems. Also, if no sequence is available that produces a required product, it might be possible to predict the binding pocket residues that might be mutated to give that product. Predictions made can then be experimentally tested, and results fed back into the model.

In recent years, the importance of harnessing natural and food ingredients from diverse sources, such as engineered microbes or synthetic derivation, has been increasingly realized, as highlighted in the introduction section. These approaches provide several benefits for building a more sustainable bio-based economy that relies less on precious land or limited livestock.

Nevertheless, the bioengineering processes utilized still remain suboptimal, due to the complexity of living systems' emergent behaviors (such as feedback/feedforward inhibition, cofactor imbalances, toxicity of intermediates, bioreactor heterogeneity) that tend to reduce the overall effect of any internal modifications such as adding or engineering a metabolic pathway [113, 114]. Thus, achieving economically viable large-scale production of microbial-derived metabolites or compounds requires appropriately optimized production strains that generate high yields. To date, however, metabolic engineering efforts have mainly served to broaden the range and further reduce the cost of molecules of commercial interest.

To address these issues, Brunk et al engineered eight E. coli lab strains that produced three commercially important biofuels: isopentenol, limonene, and bisabolene [115]. To understand the key regulatory or emergent bottleneck scenarios that limit their industrial applicability, they undertook a large-scale, omics-based systems biology approach where they performed time-series proteomics and metabolomics measurements, and analyzed the resultant high-throughput data using statistical analytics and genome-scale modeling. The integrated approach revealed several novel key findings. For example, they elucidated time-dependent regulation of gene, protein and metabolic pathways related to the TCA cycle and Pentose-Phosphate pathway, and the resultant coupling of the pathways that affected NADPH metabolism. These emergent responses were collectively implicated in downregulating the expected biofuel production. The findings, subsequently, led them to identify a crucial gene (ydbK) whose removal led to a 2-fold increase in the production of isopentenol in one of the E. coli strains [115].

Despite their success on one strain (out of eight), the overall dynamic changes of metabolic pathways at the different stages of growth for all strains were not understood, as they employed a steady-state genome-scale model, which provided a qualitative, rather than quantitative, inference. This, as mentioned earlier (in Dynamic Modeling), is due to the lack of kinetic parameter values that are required to develop and test a dynamic model for each strain.

To overcome this difficulty, Costello and Martin (2018) used the same time-series proteomics and metabolomics data of Brunk et al and developed an ML model to effectively predict pathway dynamics in an automated fashion [58]. Their model produced both qualitative and quantitative predictions that outperformed a traditional kinetic model in a side-by-side comparison.

Essentially, their ML model derived a mapping function between the proteomics and metabolomics datasets by fitting regression techniques and neural networks to training data, and then verified the predictions on test data. Apart from better accuracy in the predicted dynamic profiles of the metabolites, the model also did not require detailed understanding of the regulatory steps, a requirement that is a major weakness of traditional modeling approaches.
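The general idea of learning pathway dynamics from time-series data, rather than writing down a kinetic rate law, can be sketched on synthetic data. The example below learns a function that maps protein and metabolite levels to the metabolite's rate of change, trains it on the first half of a time course, and then integrates the learned function to predict the held-out second half. It is a deliberately simplified stand-in: a two-coefficient linear least-squares fit replaces the regression techniques and neural networks used by the authors, and the data-generating dynamics, variable names and parameter values are hypothetical.

```python
import math


def fit_two_feature_least_squares(rows, targets):
    """Solve targets ~ a*x0 + b*x1 through the 2x2 normal equations."""
    s00 = sum(x0 * x0 for x0, _ in rows)
    s01 = sum(x0 * x1 for x0, x1 in rows)
    s11 = sum(x1 * x1 for _, x1 in rows)
    t0 = sum(x0 * y for (x0, _), y in zip(rows, targets))
    t1 = sum(x1 * y for (_, x1), y in zip(rows, targets))
    det = s00 * s11 - s01 * s01
    return (t0 * s11 - t1 * s01) / det, (t1 * s00 - t0 * s01) / det


# Synthetic "measurements": an enzyme level p(t) and the metabolite m(t) it produces.
dt, steps = 0.1, 200
true_gain, true_decay = 0.8, 0.3  # hypothetical ground truth, unknown to the learner
protein = [1.0 + 0.5 * math.sin(0.3 * i * dt) for i in range(steps + 1)]
metabolite = [0.0]
for i in range(steps):
    rate = true_gain * protein[i] - true_decay * metabolite[i]
    metabolite.append(metabolite[i] + dt * rate)

# Training set from the first half: features (p, m) at time t,
# target = finite-difference estimate of dm/dt.
half = steps // 2
rows = list(zip(protein[:half], metabolite[:half]))
targets = [(metabolite[i + 1] - metabolite[i]) / dt for i in range(half)]
gain, neg_decay = fit_two_feature_least_squares(rows, targets)

# Prediction: integrate the learned derivative over the whole time course.
predicted = [metabolite[0]]
for i in range(steps):
    predicted.append(predicted[i] + dt * (gain * protein[i] + neg_decay * predicted[i]))

held_out_error = max(abs(a - b) for a, b in zip(predicted[half:], metabolite[half:]))
print(f"learned coefficients: {gain:.3f}, {neg_decay:.3f}; held-out max error: {held_out_error:.2e}")
```

Because the synthetic data are noise-free and linear, the fit is essentially exact here; with real omics data the learned function would be non-linear and the held-out error would be the quantity of interest.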

However, their ML model fell short of predicting effective regulator(s) for enhanced production of any of the biofuels (isopentenol, limonene, and bisabolene), and no experimental verification was performed. Although this is a major weakness of current systems metabolic engineering approaches, ML-based modeling has the potential to productively guide the bioengineering of strains without complete knowledge of metabolic regulatory processes, which is very challenging to obtain.

One interesting and popular class of industrially relevant metabolic engineering products in the food and consumer care industries is the terpenes and terpenoids: secondary metabolites, or organic compounds, found naturally in diverse living species, especially plants. Due to their high commercial value, numerous studies have focused on producing them or their derivatives at industrial scale using microbes [116] [117] [118] [119]. Although increases of several hundred-fold, or even thousand-fold, have been achieved at the test tube or flask level by engineering microbes, comparable achievements in large industrial-scale bioreactors are far from reality. It is our opinion that ML models can help to uncover the relations between output and input more accurately, and to identify sweet spots where carefully targeted steps can deliver the desired output at bioreactor scale. Although there is currently no working evidence for this, we believe the future looks promising on this front, provided large investments are made to generate the biological data that dynamic or ML models require to be effectively predictive.

Integrating systems biology and ML holds great promise for improving the way we study and understand metabolism, as well as for improving and engineering alternative food sources that are healthier, affordable and nutritious. However, as reviewed in this chapter, this integration must overcome several challenges and limitations before the power of both systems biology and ML can be fully utilized.

A major challenge facing the application of systems biology and ML in food-grade or GRAS metabolic engineering is the lack of data. Systems biology requires high-throughput data from multiple -omics levels (genomic, transcriptomic, proteomic and metabolomic), and such data are available only for a small subset of microorganisms in general and are significantly lacking for the food-grade or GRAS strains in particular. The availability of such data is necessary for a more holistic study of the organism and helps in discovering new pathways or proteins, simpler and shorter directed pathways, or new enzymes with better production rates [12]. This information also helps in choosing the most appropriate organism for engineering and production projects. Usually, certain model organisms called "chassis", such as yeast and E. coli, are used in these projects, and the gene(s) or pathways for the substance of interest are transferred into them from the donor organism. Sufficient information about both the donor organism and the chassis helps in choosing the correct chassis and in avoiding unexpected properties, such as resistance to certain conditions or the absence of important pathways [120].

In addition to the need for large-scale -omics data for building ML models, another data problem faces the application of ML in metabolic engineering research. Training an ML model for metabolic engineering requires sufficient quantitative data for multiple conditions, which can be multiple knockouts, perturbations or growth conditions. For instance, to build an ML model that predicts the engineering (e.g. knockouts) required to improve promoter strength, we need to train the model using quantitative data on downstream gene expression under multiple knockouts or mutants. Similarly, predictive ML models investigating translation control, transcription factor binding sites, ribosomal binding sites, enzyme engineering (mutation or truncation) and growth optimization require high-quality quantitative data in multiple conditions. The same data can also be used to build different mathematical and statistical models, which allows the development of more integrated methods.

However, such data are hard to find online and need to be generated for each project. We need more research that focuses on generating high-quality quantitative data, and on building online resources, such as meta-databases, that collect and combine these data and make them available to the community.
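To make the "multiple conditions" requirement concrete, the sketch below shows one possible layout for such a dataset and a leave-one-condition-out split, which checks whether a model generalizes to a knockout it has never seen. Everything here is an assumption for illustration: the condition names, feature encoding and expression values are made-up placeholders, not measurements, and `leave_one_condition_out` is a hypothetical helper.

```python
# Hypothetical records: one row per quantitative measurement under one condition.
# The numbers are placeholders for illustration, not experimental data.
records = [
    {"condition": "wild_type",  "features": [0, 0], "expression": 10.2},
    {"condition": "wild_type",  "features": [0, 0], "expression": 9.8},
    {"condition": "knockout_A", "features": [1, 0], "expression": 14.1},
    {"condition": "knockout_A", "features": [1, 0], "expression": 13.7},
    {"condition": "knockout_B", "features": [0, 1], "expression": 6.3},
    {"condition": "knockout_B", "features": [0, 1], "expression": 6.9},
]


def leave_one_condition_out(rows):
    """Yield (held_out_condition, train_rows, test_rows) for every condition."""
    conditions = sorted({row["condition"] for row in rows})
    for held_out in conditions:
        train = [row for row in rows if row["condition"] != held_out]
        test = [row for row in rows if row["condition"] == held_out]
        yield held_out, train, test


splits = list(leave_one_condition_out(records))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Splitting by condition rather than by individual row keeps replicates of the same knockout out of both the training and test sets at once, so the test score reflects prediction for a genuinely new perturbation. With only a handful of conditions, as in many projects, each split leaves very little training data, which illustrates why more conditions are needed.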

Another major challenge in the ML field is what is known as "the black box problem". The black box problem of AI techniques in general is defined as the difficulty of understanding how they work and how and why they produce their results [121]. This leaves the end user of the technique uncertain about the quality of the output, prevents the modeler, who is often unfamiliar with the biology, from intervening to improve performance, and raises some legal concerns [122] [123] [124]. For example, in the application of ML to 3D structural modelling, as well as enzyme-substrate identification, the newer AI-based modelling methods are showing promising results. However, due to the nature of neural networks, it is very difficult to interpret exactly what the programs are learning about the protein-folding problem: we can predict a structure, but without understanding the underlying model for folding. If a way could be found to capture this information, it would be of great use to the community for further study. To address the black box problem, scientists in the field of AI have developed a group of methods called explainable artificial intelligence (XAI) that aim to make the results of AI methods understandable to humans. Although this is still new, it holds the potential to solve the problems that prevent the systematic performance improvement of AI models [121, 125].
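As a small illustration of what an XAI-style probe can look like, the sketch below applies permutation feature importance, a simple model-agnostic technique chosen here for illustration and not one named in the text above. It shuffles one input column at a time and records how much the prediction error grows. The `black_box` function is a hypothetical stand-in for an opaque trained model, and the two input features are synthetic random numbers.

```python
import random


def black_box(row):
    """Stand-in for an opaque trained model; only its predictions are used."""
    return 3.0 * row[0] + 0.0 * row[1]


def mse(model, X, y):
    """Mean squared error of the model's predictions against targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, seed=0):
    """Increase in mean squared error when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the dataset with only column j permuted.
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(mse(model, X_perm, y) - baseline)
    return importances


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(30)]
y = [black_box(row) for row in X]  # targets the model reproduces exactly
importances = permutation_importance(black_box, X, y)
print(importances)
```

The first feature receives a positive score and the second a score of zero, which matches how the stand-in model actually uses its inputs. Such scores indicate which inputs a model relies on, but they do not by themselves explain the underlying mechanism, so they only partly address the interpretability gap described above.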

Although genome annotation, both structural and functional, affects most aspects of biomedical research, it has a special impact on metabolic engineering in general and on applications in the food industry in particular. The food-grade or GRAS microorganisms are a small subset of all organisms, and many of them are either not well studied or not studied at all. Hence, using these species in ML-based metabolic engineering is a big challenge, as many of them are not sequenced, are sequenced with only a draft annotation, or have no annotation at all. The annotations are usually automated using standard pipelines, which identify the genes an organism shares with other microorganisms and can miss the organism-specific features that need deeper attention. These features are exactly what make those organisms suitable for metabolic engineering and the food industry. Improved ML-based genome annotation methods will help to improve the annotation of food-safe and GRAS genomes, which will directly impact research in this area.

Another area that needs special attention is pathway prediction in the absence of a genome sequence or genome annotation. Since many of the food-safe and GRAS microorganisms are not yet sequenced, methods that predict the pathways for important substances using different -omics data are required. It is now easy to perform whole- or phospho-proteomics, or transcriptomics, under different growth conditions or at different life stages of an organism. These -omics data can be used, in the absence of a genome sequence, to predict the endogenous or biosynthetic pathways of the substance of interest. ML methods can be used instead of the traditional pathway prediction approaches because they are better suited to the nature and size of the data.

Overall, despite the challenges and limitations of AI or ML techniques in dealing with biological datasets, there is no better time than now to explore the full potential of these techniques and to develop novel methods that overcome the many challenges, including "the black box problem". In parallel, improvements over time in data collection from -omics technologies will help to narrow the gap of uncertainty or ambiguity in future systems biology and ML integration for optimal metabolic engineering strategies.

Rome Declaration and Plan of Action

Diets for Health: Goals and Guidelines

Healthy low nitrogen footprint diets

The Ketogenic Diet: Evidence for Optimism but High-Quality Research Needed

COVID-19 risks to global food security

World Bank, Food Security and COVID-19

Metabolic engineering for higher alcohol production

Metabolic engineering of vitamin C production in Arabidopsis

Bringing cultured meat to market: Technical, socio-political, and regulatory challenges in cellular agriculture

Conceptual evolution and scientific approaches about synthetic meat

Engineered microorganisms for the production of food additives approved by the European Union-A systematic analysis

Metabolic Engineering and Synthetic Biology: Synergies, Future, and Challenges

Systematic engineering for high-yield production of viridiflorol and amorphadiene in auxotrophic Escherichia coli

Microbial astaxanthin biosynthesis: recent achievements, challenges, and commercialization outlook

Emerging engineering principles for yield improvement in microbial cell design

Advancing metabolic engineering through systems biology of industrial microorganisms

Predicting Novel Features of Toll-Like Receptor 3 Signaling in Macrophages

Transcriptome-wide variability in single embryonic development cells

Order Parameter in Bacterial Biofilm Adaptive Response

Systematic determination of biological network topology: Nonintegral connectivity method (NICM)

A systems biology approach to overcome TRAIL resistance in cancer treatment

A Review of Dynamic Modeling Approaches and Their Application in Computational Strain Optimization for Metabolic Engineering

Formulation, construction and analysis of kinetic models of metabolism: A review of modelling frameworks

Bayesian inference of metabolic kinetics from genome-scale multiomics data

Flux analysis and metabolomics for systematic metabolic engineering of microorganisms

Signaling Flux Redistribution at Toll-Like Receptor Pathway Junctions

Basic and applied uses of genome-scale metabolic network reconstructions of Escherichia coli

Physical laws shape biology

Constraint-based models predict metabolic and associated cellular functions

What is flux balance analysis?

Metabolic engineering to increase crop yield: From concept to execution

Design of synthetic yeast promoters via tuning of nucleosome architecture

Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters

Impact of cancer mutational signatures on transcription factor motifs in the human genome

Genome-scale identification of transcription factors that mediate an inflammatory network during breast cellular transformation

Data Mining Process

Ensemble modeling of metabolic networks

Ensemble Modeling for Robustness Analysis in engineering non-native metabolic pathways

Ensemble modeling for strain development of l-lysine-producing Escherichia coli

Structure-Based Drug Design Strategies and Challenges

A review of metabolic and Science (80-. )

Design of an in vitro biocatalytic cascade for the manufacture of islatravir

Machine Learning Methods for Analysis of Metabolic Data and Metabolic Pathway Modeling

Can complex cellular processes be governed by simple linear rules?

Constructing kinetic models of metabolism at genome-scales: A review

Silico Approach to Characterization and Reduction of Uncertainty in the Kinetic Models of Genome-scale Metabolic Networks

Macroscopic law of conservation revealed in the population dynamics of Toll-like receptor signaling

Comparative protein modelling by satisfaction of spatial restraints

High-resolution comparative modeling with RosettaCM

A Completely Reimplemented MPI Bioinformatics Toolkit with a New HHpred Server at its Core

The I-TASSER suite: Protein structure and function prediction

The KEGG resource for deciphering the genome

Update: integration, analysis and exploration of pathway data

The MetaCyc database of metabolic pathways and enzymes-a 2019 update

Siri, in my hand: Who's the fairest in the land? On the interpretations, illustrations, and implications of artificial intelligence

A machine learning approach to predict metabolic pathway dynamics from time-series multiomics data

43 Pharma Companies Using Artificial Intelligence in Drug Discovery

230 Startups Using Artificial Intelligence in Drug Discovery

116 Drugs in the Artificial Intelligence in Drug Discovery Pipeline

COVID-19 vaccine tracker | RAPS

Dozens of coronavirus drugs are in development -what happens next?

Predicting PDZ domain mediated protein interactions from structure

Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning

A universal snp and small-indel variant caller using deep neural networks

Deep learning for genomics using Janggu

Dermatologist-level classification of skin cancer with deep neural networks

Machine Learning Predicts the Yeast Metabolome from the Quantitative Proteome of Kinase Knockouts

Understanding Machine Learning: From Theory to Algorithms

Combining Machine Learning and Metabolomics to Identify Weight Gain

Discovery of food identity markers by metabolomics and machine learning technology

Metabolomics meets machine learning: Longitudinal metabolite profiling in serum of normal versus overconditioned cows and pathway analysis

Machine learning in untargeted metabolomics experiments

Machine Learning Applications for Mass Spectrometry-Based Metabolomics

Transcriptome Analysis and Gene Expression Profiling of Abortive and Developing Ovules during Fruit Development in Hazelnut

Next generation models for storage and representation of microbial biological annotation

An integrative machine learning strategy for improved prediction of essential genes in Escherichia coli metabolism using flux-coupled features

Machine learning based analyses on metabolic networks supports high-throughput knockout screens

Application of abductive ILP to learning metabolic network inhibition from temporal data

Comparative genomics approaches to understanding and manipulating plant metabolism

Genome mining of the Streptomyces avermitilis genome and development of genome-minimized hosts for heterologous expression of biosynthetic gene clusters

Challenges in the Next-Generation Sequencing Field

Whole-Genome Alignment and Comparative Annotation

Insect genomes: progress and challenges

A perfect genome annotation is within reach with the proteomics and genomics alliance

Peptide identification by searching large-scale tandem mass spectra against large databases: bioinformatics methods in proteogenomics

Proteogenomics: From next-generation sequencing (NGS) and mass spectrometry-based proteomics to precision medicine

Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi

Genomes OnLine Database (GOLD) v.6: Data updates and feature enhancements

Machine learning and genome annotation: A match meant to be?

Introduction to machine learning

New Machine Learning Algorithms for Genome Annotation

Genome Annotation with Deep Learning

Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes

Genome annotation across species using deep convolutional neural networks

Functional Annotation of Genes Using Hierarchical Text Categorization

MIPS bacterial genomes functional annotation benchmark dataset

Machine learning for discovering missing or wrong protein function annotations

Machine learning: A powerful tool for gene function prediction in plants

Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning

Genome-wide functional annotation of human protein-coding splice variants using multiple instance learning

Big Data: Astronomical or Genomical?

DeepMetabolism: A Deep Learning System to Predict Phenotype from Genome Sequencing

Machine Learning of Designed Translational Control Allows Predictive Pathway Optimization in Escherichia coli

Improved protein structure prediction using potentials from deep learning

The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design

ProSPr: Democratized Implementation of Alphafold Protein Distance Prediction Network

Improved protein structure prediction using predicted interresidue orientations

How Do Cofactors Modulate Protein Folding?

The protein data bank

The future of metabolic engineering and synthetic biology: Towards a systematic practice

Cell-Free Metabolic Engineering: Recent Developments and Future Prospects

Characterizing Strain Variation in Engineered E. coli Using a Multi-Omics-Based Workflow

Identifying and engineering the ideal microbial terpenoid production host

A "plug-n-play" modular metabolic system for the production of apocarotenoids

Use of Terpenoids as Natural Flavouring Compounds in Food Industry, Recent Patents Food

Agrocybe aegerita Serves As a Gateway for Identifying Sesquiterpene Biosynthetic Enzymes in Higher

Protein folding and de novo protein design for biotechnological applications

Solving the Black Box Problem: A Normative Framework for Explainable Artificial Intelligence

Big data: a new empiricism and its epistemic and socio-political consequences

Why should I trust you?: explaining the predictions of any classifier

How the machine 'thinks': Understanding opacity in machine learning algorithms

Explainable artificial intelligence: A survey

The authors thank Simon Zhang Congqiang for critical comments, and the Singapore Institute of