key: cord-0311197-3wd3xzej
authors: Saltepe, Behide; Bozkurt, Eray Ulaş; Güngen, Murat Alp; Çiçek, A. Ercüment; Şeker, Urartu Özgür Şafak
title: Genetic Circuits Combined with Machine Learning Provides Fast Responding Living Sensors
date: 2020-10-29
journal: bioRxiv
DOI: 10.1101/2020.10.29.361220
sha: 85fe42e1e8075d964cb00d9576725b7d7a6b68cb
doc_id: 311197
cord_uid: 3wd3xzej

Whole cell biosensors (WCBs) have become prominent in many fields from environmental analysis to biomedical diagnostics thanks to advanced genetic circuit design principles. Despite increasing demand for cost-effective and easy-to-use assessment methods, a considerable number of WCBs retain certain drawbacks such as long response time and low precision and accuracy. Furthermore, the output signal level does not correspond to a specific analyte concentration value but shows comparative quantification. Here, we utilized a neural network-based architecture to improve the aforementioned features of WCBs and engineered a gold sensing WCB which has a long response time (18 h). Two long short-term memory (LSTM)-based networks were integrated to assess the ON/OFF and concentration dependent states of the sensor output, respectively. We demonstrated that the binary (ON/OFF) network was able to distinguish between ON/OFF states as early as 30 min with 78% accuracy and with over 98% accuracy in 3 h. Furthermore, when the data were analyzed in an analog manner, we demonstrated that the network can classify the raw fluorescence data into pre-defined analyte concentration groups with high precision (82%) in 3 h. This approach can be applied to a wide range of WCBs and improve rapidness, simplicity and accuracy, which are the main challenges in synthetic biology enabled biosensing.

Biosensors are devices composed of biological components (i.e., enzymes 1-5, tissues, antibodies 6, nucleic acids 7, or cells 8-10) that detect analytes of interest.
Developments in recombinant DNA technologies and synthetic biology have increased the design of living biosensors in many fields from medical 11-18 to environmental applications 18-25. Besides, they have plenty of advantages over other types of sensors since they are easy-to-use, cost-effective, renewable and very selective 26,27. Contrary to well-established analysis methods such as high-performance liquid chromatography (HPLC) or mass spectrometry, which are used to determine the amounts of contaminants in samples, whole cell biosensors (WCBs) detect the bioavailable fraction of chemicals. Hence, WCBs are highly sensitive, and tested chemicals can be detected at much lower concentrations 28,29. Furthermore, real-time monitoring of the bioavailability of chemicals is possible 30. Additionally, the exploitation of specific transcription factors (TFs) in circuits make

Selectivity, reproducibility, accuracy and sensitivity are the core parameters that define the functionality of a WCB 44. While selectivity is ensured by specific transcription factors, the reproducibility, accuracy and sensitivity of the results are error-prone and depend on many factors, including the experimenter, time, and the growth phase of the bacteria, which can lead to variations in sensor signal 45,46. Thus, rigorous processing of the biosensor data is of utmost importance for a WCB 47-49. Moreover, the input-output signal relationships are seldom linear due to natural limitations of the biological processes in cellular biosensing (e.g. cellular volume, concentration of available reactants, etc.). In order to properly interpret the measurements and obtain coherent results, the data have to be processed with techniques that compensate for the non-linearity caused by experimental variations. These additional steps require a quantitative (system-level) understanding of how a biosensor works 50.
Mathematical models of the biosensing system can help to better understand the behavior of sensors. However, the relevant parameters of the equations describing the system cannot always be accurately determined, as measuring them requires equipment that may not always be available or may permanently disrupt the operation of the biosensor 51,52. Furthermore, the interpretation of the output signal differs based on the type of the sensor. Digital sensors interpret the analyte in an ON/OFF fashion, and conversion of the signal to field application testing is straightforward. However, analog sensors respond proportionally to analyte concentrations. Interpreting the output signal in a concentration dependent manner and converting the output to an analyte concentration is still one of the major drawbacks of WCBs 18. Several attempts have been made to build more convenient biosensing platforms for biomedical and environmental applications and to improve the performance of biosensors. For instance, it is anticipated that the integration of wireless technology will ease biomarker detection and provide real-time monitoring. A recently developed ingestible micro-bio-electronic device (IMBED) provides wireless communication of WCBs with ultralow-power microelectronics technology, enabling real-time monitoring of disease biomarkers in the gastrointestinal tract 12. On the other hand, machine learning has been forecast to play an important role in advancing biosensing 42, as neural networks in particular have proven useful in an extremely wide spectrum of applications. Yet, there is no study that utilizes machine learning algorithms to advance WCB development. Artificial neural networks are a class of non-linear signal processing algorithms (a part of machine learning) that are used to process data that cannot be analytically solved.
Using different learning algorithms to analyze the data, the network can be trained to form its own model to fit the data, which can then be used to analyze further samples. Neural networks have been used to process biosensor outputs for several applications 53-55. One example of such networks is the recurrent neural network (RNN). In the RNN architecture, the output generated from the earlier inputs is fed to the network as an additional input in the next iteration 55. This property allows the network to remember its state and alter its weights, during training, to better adapt to the data at hand. RNNs are especially useful for applications that require the analysis of temporal or sequential data, like sound recognition or genomic analysis 56,57. The long short-term memory (LSTM) network is a more advanced type of RNN that is widely used in deep learning today. LSTMs overcome some of the issues associated with RNNs and can keep better track of temporal patterns within the data. LSTM networks are also widely used to analyze various biological functions, such as DNA-protein binding predictions 58.

In this paper, we utilized deep neural networks to analyze the output of a biosensor in order to accurately determine target concentrations in a much shorter time than is possible via manual analysis. First, we engineered a complex genetic circuit to detect gold ions, and characterized the limit of detection (LoD) and response time of the WCB. The output of the sensor reached the maximum fold change (~10-fold) in 18 h. We utilized an LSTM-based network to decrease the detection time of the sensor and to assess the concentration of the analyte accurately. First, we predicted the ON/OFF status of the sensor and achieved high precision in a short time (78% in 30 min, and 93% in 2 h). Next, we trained a second LSTM model to predict the concentration of the analyte and observed that the model made precise predictions of concentrations in 3 h.
Here, we showed that the integration of machine learning in WCBs can be utilized to decrease the long response times of sensors and to accurately predict applied gold ion concentrations. This study is unique in the field of WCBs and can be further extended to analyze different parameters, which might alleviate labor-intensive work.

Construction and characterization of bacterial gold detecting sensor. Whole-cell biosensors hold great potential in many areas, specifically in environmental contaminant analysis, and numerous WCBs have been proposed in the last two decades 59. Due to advantages such as cost, rapidness and ease of use, they have become a prominent alternative. In our engineered bacterial WCB design, we constructed a complex and tightly controlled gold detecting circuit combining a semi-specific stress biosensor based on the heat shock response (HSR) 60 and a specific biosensor for gold sensing 61. In the circuit, a constitutively expressed HSR repressor, HspR, blocks gene expression from its cognate promoter, PdnaK-IR3-IR3, controlling the expression of the gold specific transcription factor, GolS, and a site-specific recombinase, Bxb1. Blocking both elements ensures the elimination of output expression, which requires conversion of the gold specific promoter, PgolB, by the site-specific recombinase, and transcription initiation by the GolS-gold ion complex (Fig. 1a). The circuit takes action only when gold ions are introduced to the environment, causing stress to the cells, which releases HspR from the HSR promoter and initiates both Bxb1 and GolS expression. First, Bxb1 recognizes certain sequences around the gold specific promoter and converts it towards the output gene 62-64. Next, the GolS-gold ion complex 19,61 helps initiate the output expression (Fig. 1b). To begin with, we optimized the gold detecting WCB response with a dynamic range analysis (Fig. 1c).
We induced the sensor with gold ion concentrations ranging from 0 to 250 µM for 18 h at 30°C in a stable incubator. We observed that the sensor starts responding to gold ions at 5 µM, which we defined as the LoD of the sensor; the signal increases proportionally with increasing gold concentration and tends to saturate after 100 µM of gold induction. Therefore, we selected a moderate concentration (i.e. 50 µM) that yields a high response yet does not disturb cell viability. Next, we examined the specificity of the sensor with 50 µM of various heavy metal ions (Au3+, Cd2+, Fe2+, Fe3+, Co2+, Co3+, Pb2+, As3+), and the results indicated a significant increase in reporter expression only in the gold induced group after 18 h of incubation (Supplementary Figure 2). Lastly, we induced the sensor with low and mid concentrations of gold ions (10 and 50 µM, respectively) to analyze the ideal response time of the sensor (Fig. 1d). We observed that similar responses were obtained from the sensor within 16 to 22 h of incubation, which is quite late for the ideal performance of biosensors. Thus, we introduced an LSTM network to decrease the detection times of such biosensors.

Workflow of the sensor from wet lab to the LSTM network model. Data processed by the two LSTM networks were obtained from cells induced with gold ions in 96-well plates at 30°C, and the signal was tracked at 2 min intervals for 6 h. The raw fluorescence data were split into training and test (unseen) data and processed with the LSTM network to shorten the detection time and specify related concentration predictions (Fig. 2). See Methods for details.

from long hours to reach a detectable signal (i.e. 18 h for the highest fold change) which is one of the major disadvantages of WCBs 45,46. By utilizing an LSTM-based model we aimed to shorten the required time to make an assessment.
In order to optimize and shorten the response time of the sensor, multiple networks were trained and tested for different lengths of the biosensor data (see Methods). Starting from 0 min to 6 h, all data were analyzed in 30 min increments (Fig. 3a). The network was able to identify the ON/OFF states (binary classification) of the sensor in 30 min with high accuracy (78%) and reached its maximum in 3 h (over 98%). The binary classification results at 30 min are represented by confusion matrices for each of the cross-validation runs (Fig. 3b). Note that the raw fluorescence signal was not sufficient to show the ON/OFF status of the sensor in 30 min

Each cross-validation run was represented by confusion matrices (Fig. 4b). Even though the signal ratios showed only slight differences between applied concentrations in 3 h (Fig. 4c), the results indicated that the prediction ability of the network is highly satisfactory both for detecting the presence of gold ions (binary classification) and for concentration dependent (analog) classification.

WCBs are promising biosensing tools that can be engineered to detect various analytes and can be used in many fields 18. Yet, certain drawbacks, such as long response time, remain to be solved. Here, we engineered a gold detecting sensor utilizing the gold specific TF, GolS, and a site-specific recombinase, Bxb1. This biosensor architecture allows us to monitor both the presence of toxicity and the source of the toxicity. The double output capability comes at the cost of a longer response time (Figs. 1c and 1d). Therefore, we utilized neural networks to improve the features of our proposed sensor. LSTM is a neural network architecture that is effective in analyzing sequential data 65. Hence, we chose LSTM as our neural network architecture to learn temporal features from the output of the sensor in relation to the concentration of gold ions.
In our proposed work, we successfully integrated two LSTM-based neural networks to accurately and efficiently predict (i) the presence/absence of gold ions (ON/OFF) and (ii) the discrete gold concentration using the raw fluorescence signal. The ON/OFF state of sensors is widely used in applications including pathogen detection and early diagnosis 13,66,67. Therefore, the response time of a WCB is vital for decision-making and the treatment of patients. In this study, we have shown that a machine learning based solution can be integrated into WCBs to decrease the detection time. Although our WCB required 5 h to develop a distinguishable signal (~2-fold) (Supplementary Figure 7), our models were able to shorten the response time to 30 min with 78% accuracy, reaching 93% in 2 h and over 98% in 3 h (Fig. 3a). Furthermore, biosensors with analog circuits have been widely used to detect and report the presence of an analyte in a concentration dependent manner, and they play a crucial role in environmental heavy-metal detection 21,43. Although WCBs with analog circuits provide quantitative analysis, no study has reported a direct relation between the raw reporter signal and the analyte concentration. In our study, however, the LSTM-based architecture was able to classify the data based on the reporter signal behavior utilizing pre-defined concentration classes. Processing the gold-sensing WCB data with another LSTM-based network showed that conversion of the raw signal to a discrete concentration value is possible and that the data can be classified accurately (82%) in 3 h (Fig. 4). Our results showed that utilizing this model to process WCB data is favorable in terms of (i) decreasing response time, (ii) providing a simple output (rather than raw fluorescence data), and (iii) allowing concentration classification of a single sample based on the signal. We envision that this approach can be further optimized to calculate the exact analyte concentration rather than to classify it.
To do so, the dynamic range should be explicit, and the predefined concentration groups should be selected from the analog region of the dynamic range in order to get accurate results. Alternatively, the LSTM model can be trained with data from both the analog and digital responses so that a better fit can be obtained for a wider range of concentrations. Last but not least, this approach can be repurposed to lower the LoD, which is one of the main challenges in biosensors. In this study, we defined the LoD of the WCB as 5 µM based on the dynamic range analysis (Fig. 1c), which shows only a ~2-fold increase in 18 h, while we were able to decrease the detection time to 6 h with 75% accuracy

Similarly, nutrient-rich media could boost the signal accumulation, resulting in a faster response. In some cases especially, introducing additional genetic parts (i.e. recombinases, multiple TFs) to circuits results in a significant delay in response time. Although these tools bring certain advantages such as specificity, they become insufficient tools for biosensing because of the delay. Integration of machine learning based algorithms can benefit such studies. After a specific neural network model has been created for a biosensor, the trained model can be incorporated into portable microcontroller or field programmable gate array (FPGA) based systems. Such platforms can be combined with onboard portable spectrophotometers to provide on-site measurements. This enables rapid, simple and accurate results to be obtained in the field with low cost equipment. Given the current situation of the COVID-19 pandemic, developing such systems, especially to monitor the presence of biomarkers for pathogens, is of critical importance for a better healthcare system.

Supplementary information is available for this paper.

Reporter expression assays and data analysis. All assay conditions were described above.
Fluorescence for gfp expression (485 nm for excitation, 538 nm for emission, 530 nm cut off) and absorbance for optical cell density (OD600) were measured via a microplate reader. All sensor output was normalized to cell density (gfp fluorescence/OD600) at the specific time point, and the negative control group (GFP-free cells) was subtracted. The obtained data were normalized to the 0-to-1 range: the minimum value was subtracted from each value, and the result was divided by the difference between the maximum and minimum values (except Fig. 3c, Fig. 4c, and Supplementary Figure 7). Continuous GFP expression to feed the neural network was measured as follows: an overnight (O/N) culture of cells with gold-sensing circuits was diluted 0.4% in fresh MOPS media and placed in a microplate reader. Reporter expression was recorded at 2 min intervals for 6 h. The experiment temperature was set to 30°C throughout the measurements.

Prior to training, for both networks, the datasets were first split into randomly distributed subsets. The same subsets were then used to train and evaluate the results of all iterations of their respective networks. 70% of the data was used for training, 20% for validation and 10% for testing.

Neural network architecture. For the binary network, the dataset was split such that the 0 µM inputs were labeled as OFF ("0" in the network) and the others (25 µM, 50 µM, 75 µM, and 100 µM inputs) were all labeled as ON ("1" in the network). The architecture of the network consisted of the following layers: a sequential input layer (of size 2; for the raw data and its time-differentiated counterpart), a bidirectional LSTM layer (using the default tanh activation functions in the toolbox), a fully connected layer with two output neurons (each neuron corresponding to one of the outputs), a softmax layer (used to implement the activation functions of the fully connected layer), and a classification layer.
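The preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the text refers to a deep learning toolbox), and all function names are hypothetical. The finite-difference scheme used for the time-differentiated channel, the handling of a flat trace, and the shuffling seed are assumptions, since the text does not specify them.

```python
import random

SAMPLING_INTERVAL_MIN = 2  # from the text: one reading every 2 min


def min_max_normalize(values):
    """Scale values to the 0-to-1 range: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a flat trace maps to zeros to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def with_time_derivative(trace, dt=SAMPLING_INTERVAL_MIN):
    """Pair each reading with its rate of change (size-2 input per time step).

    Assumption: a backward finite difference, with 0.0 for the first reading.
    """
    diffs = [0.0] + [(b - a) / dt for a, b in zip(trace, trace[1:])]
    return list(zip(trace, diffs))


def binary_label(concentration_um):
    """0 uM is OFF (0); every other tested concentration is ON (1)."""
    return 0 if concentration_um == 0 else 1


def split_dataset(samples, seed=0):
    """Random 70/20/10 split into training, validation and test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible subsets
    n_train = (7 * len(shuffled)) // 10
    n_val = (2 * len(shuffled)) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Fixing the seed mirrors the statement that the same subsets were reused across all iterations of a network. Integer arithmetic is used for the subset sizes, so any remainder falls into the test subset.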
In the 5-class (concentration range classifying) network, there were five output classes in total, one for each concentration input (0 µM, 25 µM, 50 µM, 75 µM, and 100 µM). The outputs were in a binary vector format where the positive class was represented by a "1" and the others by "0". The architecture of the network consisted of the following layers: a sequential input layer (of size 2; for the raw data and its time-differentiated counterpart) [...]

Neural network testing. After each network was optimized with respect to the validation subset, the validation subset was combined with the training set and the network was trained from scratch with the expanded training set (consisting of 90% of the data). The previously unseen testing set (containing the remaining 10%) was then used to evaluate the performance of the network. The same networks were then used for their respective temporal accuracy analysis. In these tests, the length of each element in the dataset was cropped to its respective time-length. For instance, only the measurements for the first 30 min were used initially, followed by longer sequences corresponding to 60, 90, 120 min, etc. In both cases, the accuracy of the networks was determined by comparing the network outputs for the testing set with their true values. In order to achieve coherent results, leave-one-out cross-validation was used. Ten different networks were trained and tested with different members of the data in the training and testing sets. The overall percentage accuracy was determined as follows: The test results for the temporal accuracy have been plotted in Figures 3a and 4a for the binary and 5-class cases, respectively.
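The cropping and scoring steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code. The function names and the `classify` callable (standing in for a trained network) are hypothetical. Percentage accuracy is assumed to be the usual ratio of correct predictions to total predictions times 100, because the formula itself is not reproduced in the text. It is also assumed that the reading at index i is taken at i times 2 min, so the reading at time 0 is included in every crop.

```python
CLASSES_UM = (0, 25, 50, 75, 100)  # concentration classes named in the text


def one_hot(concentration_um, classes=CLASSES_UM):
    """Binary vector with a 1 at the true class and 0 elsewhere."""
    return [1 if c == concentration_um else 0 for c in classes]


def crop_to_duration(trace, minutes, interval_min=2):
    """Keep the readings taken within the first `minutes` of a trace.

    Assumption: reading i is taken at i * interval_min, starting at time 0.
    """
    n_points = minutes // interval_min + 1
    return trace[:n_points]


def percent_accuracy(predicted, actual):
    """Assumed definition: 100 * correct predictions / total predictions."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def temporal_accuracy(traces, labels, classify, durations=range(30, 361, 30)):
    """Score `classify` on traces cropped to 30, 60, ..., 360 min."""
    results = {}
    for minutes in durations:
        predictions = [classify(crop_to_duration(t, minutes)) for t in traces]
        results[minutes] = percent_accuracy(predictions, labels)
    return results
```

Here a single `classify` callable is applied to every duration for brevity; a separate model per sequence length could be passed in the same way, one duration at a time.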
As can be seen from the figures, the accuracy of the results increases alongside the duration of the data. Improving the performance of the system may be possible by measuring the fluorescence of the samples with a higher sampling rate.

Mathematical analysis and quantification of fluorescent proteins as transcriptional reporters
High Variation of Fluorescence Protein Maturation Times in Closely Related Escherichia coli Strains
Systems biology: parameter estimation for biochemical models
Obtaining and Estimating Kinetic Parameters from the Literature
Determination of phenolic compounds by a polyphenol oxidase amperometric biosensor and artificial neural network analysis
Determination of pesticides using electrochemical enzymatic biosensors
Speech Recognition with Deep Recurrent Neural Networks. Int Conf Acoust Spee
Recurrent Neural Network for Predicting Transcription Factor Binding Sites
DeepSite: bidirectional LSTM and CNN models for predicting DNA-protein binding
Whole-cell living biosensors - are they ready for environmental application?
Genetic Circuits To Detect Nanomaterial Triggered Toxicity through Engineered Heat Shock Response Mechanism
Bacterial sensing of and resistance to gold salts
Synapsis in phage Bxb1 integration: Selection mechanism for the correct pair of recombination sites
Synthetic recombinase-based state machines in living cells
Synthetic circuits integrating logic and memory in living cells
Long short-term memory
Engineering microbes to sense and eradicate Pseudomonas aeruginosa, a human pathogen
Programmable probiotics for detection of cancer in urine
Culture medium for enterobacteria
Enzymatic assembly of DNA molecules up to several hundred kilobases

We thank TUBITAK (Grant No 114Z653 and 118S398). UOSS and AEC acknowledge the support of TUBA GEBIP and Bilim Akademisi BAGEP awards.

UOSS conceived the idea, UOSS and AEC designed the study, BS, EUB and MG carried out the experiments. All of the authors wrote the paper.

The authors do not have a competing interest.