title: A deep learning modeling framework to capture mixing patterns in reactive-transport systems
authors: Jagtap, N. V.; Mudunuru, M. K.; Nakshatrala, K. B.
date: 2021-01-11
doi: 10.4208/cicp.oa-2021-0088

Prediction and control of chemical mixing are vital for many scientific areas such as subsurface reactive transport, climate modeling, combustion, epidemiology, and pharmacology. Due to the complex nature of mixing in heterogeneous and anisotropic media, the mathematical models of this phenomenon are not analytically tractable. Numerical simulations often provide a viable route to predict chemical mixing accurately. However, contemporary modeling approaches for mixing cannot utilize available spatial-temporal data to improve the accuracy of future predictions, and they can be compute-intensive, especially when the spatial domain is large and long-term temporal predictions are needed. To address this knowledge gap, we present a deep-learning (DL) modeling framework for predicting the progress of chemical mixing under fast bimolecular reactions. This framework uses convolutional neural networks (CNNs) to capture spatial patterns and long short-term memory (LSTM) networks to forecast temporal variations in mixing. By careful design of the framework (placement of a non-negativity constraint on the weights of the CNN and the selection of activation functions), it ensures non-negativity of the chemical species at all spatial points and at all times. Our DL-based framework is fast, accurate, and requires minimal data for training.

Reactive-transport equations arise in a wide range of scientific domains, including epidemiology [Viguerie et al., 2020], combustion [Naoumov et al., 2019], hypersonic flows [Huangwei et al., 2020; Ren et al., 2018; Sabelnikov and Vlasenko, 2018; Xiaofeng et al., 2020], subsurface energy [Xiao et al., 2018], contamination remediation [Arbogast et al., 1996], air and water quality research [Zhu and Anderson, 2002], and population dynamics [Petrovskii and Li, 2003]. Although the underlying mathematical models in these applications are similar, the associated spatial and temporal scales differ widely. For example, the spatial domains in combustion and hypersonic flows are small: an internal-combustion engine in the former case and the exterior surface of an atmospheric reentry capsule in the latter. In contrast, the spatial domains involved in subsurface remediation and epidemics could range from a few tens to several hundreds of miles. The reaction time scales in combustion are much less than a second, while those associated with an epidemic could be very long (months to years). Given complex geometries, nonlinearities, and disparate spatial/temporal scales, analytical treatment of the mathematical models arising in these applications is not viable. Thus, it is customary to resort to numerical simulations. Finite element [Codina, 2000; Mudunuru and Nakshatrala, 2016], finite volume [Tambue et al., 2010], and lattice Boltzmann methods [Gao et al., 2017] are the contemporary numerical methods for solving reactive-transport equations. However, these methods suffer from two major drawbacks. First, they cannot exploit available data to improve prediction accuracy. To elaborate, consider the modeling of an epidemic.
During an epidemic, spatial-temporal evolution data gathered by health agencies for the disease is often available [Jha et al., 2020; Viguerie et al., 2020]. However, traditional numerical methods cannot utilize such data to improve the accuracy of future predictions; the input to these methods is limited to data such as the boundary and initial conditions of the solution fields. The second drawback is that reactive-transport simulations using traditional numerical methods are, in general, compute-intensive for practical problems [Amikiya and Banda, 2018]. For example, the numerical solution of a turbulent combustion problem requires small time-steps for accurate predictions to capture the high-speed traveling reaction front and the multiple chemical species involved [Hundsdorfer and Verwer, 2013]. On the other hand, for subsurface remediation applications or for time-critical scenarios like chemical spills, the evolution of the chemical species over both short and long times needs to be understood. Also, the spatial degrees of freedom associated with subsurface remediation problems could be $O(10^6)$ to $O(10^9)$ [Chang et al., 2017]. It takes hours of simulation time to resolve such spatial and temporal scales adequately, even on state-of-the-art supercomputers using thousands of processors, because the calculations in the time domain cannot be parallelized. The central aim of this paper is to present a deep learning (DL) framework that overcomes the said deficiencies of the conventional numerical methods. The efficacy of the framework is demonstrated by applying it to a reactive-transport problem. DL is a subset of machine learning that employs multiple neural layers to extract features successively from the input. An attractive feature of DL-based methods is that they can take advantage of spatial-temporal data available for a limited portion of the total time of interest. This limited-duration data, obtained from physical experiments or high-fidelity numerical simulations, can train a DL model. The resulting trained model can then predict accurate results over the remaining duration of interest much faster than the traditional numerical schemes. Such a model will be extremely useful for real-life situations like a pandemic, or for studies requiring multiple runs with varying simulation parameters to assess their impact on the end results. Once the model is developed, the computational cost of training the model and getting results for the remaining duration of interest will be a fraction of the time required to get the entire solution using traditional numerical methods. In short, the developed model enables seamless integration of modeling and data at scale, and significantly reduces the time required to get the solution. The two main DL architectures used in the current framework are convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. CNNs have existed since the 1980s [Schmidhuber, 2015] and are particularly suitable for processing data with a grid-like topology; e.g., images, which are in fact two-dimensional grids of pixels. Their popularity and application have increased tremendously in the last decade due to exponential growth in computing speed as well as in the volume of data. CNNs offer a distinct advantage over dense/fully connected neural layers: CNNs learn local patterns in their input feature spaces, whereas dense neural layers can learn only global features.
The multiple layers of a CNN can learn hierarchical patterns. For instance, the first convolutional layer may learn small local patterns. A second convolutional layer may learn a broader set of abstract patterns composed of the features from the first layer, and so on [Chollet, 2017; Goodfellow et al., 2016]. In the current model, CNNs are used for spatial feature extraction and prediction of reactive-mixing patterns. An LSTM network is a type of recurrent neural network (RNN); it was developed in the late 1990s [Schmidhuber, 2015] in the course of research on solutions to the well-known vanishing-gradient problem in deep neural networks. CNNs are well-suited for processing grid-like topologies, while RNNs are ideal for processing sequences of values. LSTMs preserve information across many time-steps by preventing older signals from gradually vanishing as information passes across the neural network layers [Chollet, 2017; Goodfellow et al., 2016]. In the current model, the LSTM networks are used for temporal feature extraction and prediction. Thus, CNN and LSTM together are used to predict the spatial and temporal evolution of reactive mixing. The proposed DL-based framework is validated by applying it to a fast bimolecular reaction problem. High-fidelity simulation results were obtained using the finite element method for the entire time domain of interest. The DL-based model is trained using only a portion of the total simulation time. The trained model is then used to predict the subsequent spatial-temporal evolution of the chemical species. The evolution of the chemical species predicted by this model is compared with the simulation results to assess the accuracy of the model. Sensitivity analysis is performed with respect to the amount of training data and various DL algorithm parameters to finalize the hyperparameter values for generating the results. The novel features of the proposed framework are: (a) data incorporation: it utilizes the available spatio-temporal species data for accurate prediction of the evolution of the species; (b) fast: it is computationally attractive because it requires a smaller time-to-solution compared to high-fidelity simulations; (c) physically realistic: it ensures non-negative concentration fields at all times and at all spatial points, unlike many traditional numerical schemes; (d) accurate: it provides good accuracy, comparable to high-fidelity numerical solutions; (e) predictive: it captures a wide range of reactive mixing: preferential mixing, which manifests under small-scale velocity features, as well as incomplete mixing, which manifests under large-scale velocity features; (f) minimal training data: it requires a small amount of training data compared to contemporary data-driven methods; and (g) extendable: it can be incorporated into existing simulators with minimal changes to the computer code, an attractive feature for application scientists. Bereft of such a computational framework, forecasting for reactive-transport applications involving chaotic mixing will be prohibitively time-consuming, and addressing subsurface remediation problems under uncertainties would remain elusive. This paper serves as a proof-of-concept by showing the efficacy of the approach using a two-component system; however, our approach is applicable even to problems involving many chemical species. An outline of the rest of this article is as follows. We consider bimolecular reactions as the prototypical reactive-transport model and present the associated governing equations (§2).
The details of the proposed deep-learning-based modeling framework for transient reactive transport follow this presentation (§3). The predictive capabilities and computational performance of the proposed framework are illustrated using representative numerical examples (§4). The paper ends with concluding remarks along with a discussion of potential extensions of this work (§5).

Let $\Omega$ be a bounded domain with piecewise smooth boundary $\partial \Omega = \overline{\Omega} - \Omega$, where a superposed bar denotes the set closure. A spatial point is denoted by $\mathbf{x} \in \Omega$, and the spatial gradient and divergence operators are denoted by $\mathrm{grad}[\cdot]$ and $\mathrm{div}[\cdot]$, respectively. The unit outward normal vector to the boundary is denoted by $\mathbf{n}(\mathbf{x})$. The time is denoted by $t \in [0, \mathcal{I}]$, where $\mathcal{I}$ is the length of the time interval of interest. Consider a bimolecular reaction of the following form:
$$n_A \mathrm{A} + n_B \mathrm{B} \rightarrow n_C \mathrm{C},$$
where $n_A$, $n_B$ and $n_C$ are stoichiometric coefficients; A and B are the reactants; and C is the product of the reaction. We denote the molar concentrations of the chemical species by $c_A(\mathbf{x}, t)$, $c_B(\mathbf{x}, t)$ and $c_C(\mathbf{x}, t)$, respectively. The boundary is divided into two complementary parts: $\Gamma^D$ is that part of the boundary on which the concentrations of the chemical species (i.e., Dirichlet boundary conditions) are prescribed, and $\Gamma^N$ is the part of the boundary on which the diffusive fluxes of the chemical species (i.e., Neumann boundary conditions) are prescribed. Note that $\Gamma^D \cup \Gamma^N = \partial\Omega$ and $\Gamma^D \cap \Gamma^N = \emptyset$. In the absence of non-reactive volumetric sources and advection, the equations that govern the fate of the reactants and the product in a bimolecular reaction can be written as follows (with $i = A, B, C$):
$$\frac{\partial c_A}{\partial t} - \mathrm{div}\big[\mathbf{D}(\mathbf{x}, t)\, \mathrm{grad}[c_A]\big] = -n_A\, r(c_A, c_B) \quad \text{in } \Omega \times (0, \mathcal{I}],$$
$$\frac{\partial c_B}{\partial t} - \mathrm{div}\big[\mathbf{D}(\mathbf{x}, t)\, \mathrm{grad}[c_B]\big] = -n_B\, r(c_A, c_B) \quad \text{in } \Omega \times (0, \mathcal{I}],$$
$$\frac{\partial c_C}{\partial t} - \mathrm{div}\big[\mathbf{D}(\mathbf{x}, t)\, \mathrm{grad}[c_C]\big] = +n_C\, r(c_A, c_B) \quad \text{in } \Omega \times (0, \mathcal{I}],$$
$$c_i(\mathbf{x}, t) = c_i^{p}(\mathbf{x}, t) \quad \text{on } \Gamma^D \times (0, \mathcal{I}], \qquad -\mathbf{n}(\mathbf{x}) \cdot \mathbf{D}(\mathbf{x}, t)\, \mathrm{grad}[c_i] = h_i^{p}(\mathbf{x}, t) \quad \text{on } \Gamma^N \times (0, \mathcal{I}],$$
$$c_i(\mathbf{x}, 0) = c_i^{0}(\mathbf{x}) \quad \text{in } \Omega,$$
where $c_i^0(\mathbf{x})$ is the initial concentration of the $i$-th chemical species, $\mathbf{D}(\mathbf{x}, t)$ is the anisotropic dispersion tensor, and $r(c_A, c_B)$ is the reactive part of the volumetric source. $c_i^p(\mathbf{x}, t)$ is the prescribed (molar) concentration of the $i$-th chemical species on $\Gamma^D$, and $h_i^p(\mathbf{x}, t)$ is the prescribed diffusive flux of the $i$-th chemical species on $\Gamma^N$.

2.1. Invariants and fast reactions. One can simplify the solution procedure by introducing invariants, as done in several papers in the literature; for example, see [Nakshatrala et al., 2013]. The name "invariants" stems from the fact that pure transport equations (i.e., without any reaction term), rather than dispersion-reaction equations, govern the fate of these reaction invariants. For a bimolecular reaction, one can find twelve reaction invariants; however, only two of them are (linearly) independent. Herein we take the following two independent invariants:
$$c_F := c_A + \frac{n_A}{n_C}\, c_C \quad \text{and} \quad c_G := c_B + \frac{n_B}{n_C}\, c_C.$$
The fate of these two invariants is governed by pure transport equations of the form
$$\frac{\partial c_F}{\partial t} - \mathrm{div}\big[\mathbf{D}(\mathbf{x}, t)\, \mathrm{grad}[c_F]\big] = 0 \quad \text{and} \quad \frac{\partial c_G}{\partial t} - \mathrm{div}\big[\mathbf{D}(\mathbf{x}, t)\, \mathrm{grad}[c_G]\big] = 0,$$
with the initial and boundary conditions inherited from those of $c_A$, $c_B$, and $c_C$. In this paper, we assume that the bimolecular reaction is fast: the time-scale of the reaction is faster than that of the dispersion process. This assumption implies that, at any spatial point and at any instant of time, the concentration of only one of the reactants A or B can be non-zero. Said differently, if both $c_A$ and $c_B$ are non-zero, the reaction will proceed instantaneously until one of the reactants is depleted completely. For fast bimolecular reactions, the solution procedure can therefore be further simplified by noting that reactants A and B cannot coexist. Once $c_F(\mathbf{x}, t)$ and $c_G(\mathbf{x}, t)$ are known, the concentrations of the reactants and the product are calculated as follows:
$$c_A = \max\left(c_F - \frac{n_A}{n_B}\, c_G,\; 0\right), \qquad c_B = \max\left(c_G - \frac{n_B}{n_A}\, c_F,\; 0\right), \qquad c_C = \frac{n_C}{n_A}\left(c_F - c_A\right).$$
The Appendix provides details of the non-negative finite element formulation used to generate the high-fidelity simulation data employed herein.
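To make the fast-reaction recovery step concrete, the following minimal NumPy sketch evaluates the above formulas pointwise on concentration fields; the synthetic 81x81 invariant fields and the unit stoichiometric coefficients are placeholders for illustration, not values taken from the simulations.

```python
import numpy as np

n_A = n_B = n_C = 1.0   # assumed unit stoichiometry for this sketch

def recover_species(c_F, c_G):
    """Recover (c_A, c_B, c_C) from the invariants under a fast reaction."""
    c_A = np.maximum(c_F - (n_A / n_B) * c_G, 0.0)  # A survives where c_F exceeds (n_A/n_B) c_G
    c_B = np.maximum(c_G - (n_B / n_A) * c_F, 0.0)  # B survives in the complementary region
    c_C = (n_C / n_A) * (c_F - c_A)                 # product formed from the consumed reactant
    return c_A, c_B, c_C

# Example on synthetic invariant fields:
rng = np.random.default_rng(0)
c_F, c_G = rng.random((81, 81)), rng.random((81, 81))
c_A, c_B, c_C = recover_species(c_F, c_G)
assert (c_A * c_B == 0).all()   # A and B never coexist under the fast-reaction assumption
```

Note that the recovered fields are non-negative by construction, which is the same property the DL framework later enforces on its predictions.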
To summarize, the mathematical model considered in this paper makes the following assumptions: (i) advection is neglected; (ii) the non-reactive volumetric source is neglected for all chemical species; (iii) the dispersion tensor is assumed to be the same for all chemical species involved in the reaction; (iv) only the diffusive part of the flux is prescribed on the boundary; and (v) the time-scale of the reaction is much faster than the time-scale associated with the dispersion process. Some prior works [Ahmmed et al., 2020; Mudunuru and Karra, 2021; Vesselinov et al., 2019] have used a similar bimolecular reaction model and utilized unsupervised and supervised machine learning methods to mine the simulation data. Specifically, these works discovered hidden features, estimated the relative importance of the reactive-transport model input parameters, and emulated key quantities of interest (e.g., degree of mixing, product yield, species decay). However, these works did not capture the spatial-temporal mixing patterns, which is the main novelty of this paper.

In this section, we present our proposed deep learning framework. The details of the neural architecture of our non-negative CNN-LSTM model are outlined first. We then describe the training process for ingesting reactive-transport simulation data to make predictions at future time-steps.

3.1. Deep learning model architecture. It is well known that many traditional finite element formulations for the reactive-transport equations suffer from a deficiency: they can produce negative concentrations [Mudunuru and Nakshatrala, 2017]. For example, the standard single-field Galerkin formulation produces negative values and spurious node-to-node oscillations for the primary variables in advection- and reaction-dominated diffusion equations. The negative concentrations are unphysical and can erode the solution accuracy in the entire domain. Additionally, if these erroneous numerical solutions are used to train deep learning models, then the models can produce inaccurate forecasts. Therefore, ensuring non-negative concentrations in the ground truth and during DL-model training/predictions is essential. The framework developed herein ensures non-negativity of the predicted concentrations through careful selection of the trainable parameters. Figure 1 shows a pictorial description of the proposed deep learning model. Figure 1(a) shows a schematic of the non-negative CNN-LSTM architecture, which is developed to forecast the evolution of spatial patterns in the concentration of product C. The inputs to the convolutional layers are an image stack, as shown in Figure 2. The convolutional layers learn the underlying representations (e.g., feature maps) of the mixing process based on the observed patterns in the concentration, velocity field, and anisotropic dispersion at the previous time-steps. The last dense layers along with the LSTM units predict the evolution of the reactive mixing based on the learned patterns (e.g., feature maps or representations) from the convolutional layers. Figure 1(b) shows a schematic of prediction at future times based on the trained non-negative CNN-LSTM model. Concentrations (i.e., ground truth data obtained from the non-negative FEM) at the initial time-steps, along with the velocity field and dispersion tensor, are used to train the deep learning model.
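As a sketch of how such training inputs can be assembled, the snippet below pairs sliding windows of channel-stacked images with the next-step product concentration. The window length n_steps and the (deliberately small) number of time-steps are assumptions made for illustration, not values from the paper.

```python
import numpy as np

# Each time-step is rasterized into an 81x81 image with six channels:
# c_C, v_x, v_y, D_xx, D_xy, and D_yy (the image stack of Figure 2).
T_steps, H, W, C, n_steps = 50, 81, 81, 6, 10   # 1000 steps in the paper; 50 here to stay light
stack = np.random.rand(T_steps, H, W, C).astype("float32")   # placeholder for FEM data

# Input: a window of n_steps stacks; target: the c_C image one step ahead.
x_train = np.stack([stack[t:t + n_steps] for t in range(T_steps - n_steps)])
y_train = stack[n_steps:, :, :, 0]               # channel 0 holds c_C
print(x_train.shape, y_train.shape)              # (40, 10, 81, 81, 6) (40, 81, 81)
```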
As described in Figure 2 , to forecast the product C concentration at subsequent time-steps, we input these image stacks to the trained non-negative CNN-LSTM model. Keras and Tensorflow packages [Chollet, 2017] are used to build our CNN-LSTM model (see Figure 1 (a)). The convolutional neural layers in our model are trained to find a set of transformations that turn the input image stack into simpler and more general feature maps that represent the reactive-mixing process. The convolution operator learned by a filter that is applied at every spatial point on the input image stack is a local linear combination of neighboring points of that input. Pooling and activation operations are convex operations. These are applied to the convolution operations, which make our CNN a form of high-dimensional composition of spline operators [Balestriero, 2018] . Following Goodfellow et al. [2016] , the CNN layers in the proposed model can be mathematically described as follows: Let us assume L r−1 is a (r − 1)-th neuron layer of size R × R followed by a convolutional layer. A filter F r−1 of size M × M is applied to L r−1 . Then, our convolutional layer output will be of size (N − M + 1) × (N − M + 1). This convolution operation is given as follows: Nonlinearity is then applied to s ij to get the next neuron layer L r . This operation is given by l r ij = σ(s ij ), where l r ij ∈ L r and σ is the sigmoid function. There is no learning in the max-pooling layers. The pooling operation simply take certain k ×k region within the (N −M+1)×(N −M+1) and outputs a single value. This value is the maximum in that k × k region. In our case, for a given neuron layer L r , max-pooling operation will output a (N −M+1) layer. This is because each k × k matrix is reduced to just a single value through the max function. The LSTM layers are primarily composed of three gates -input gate, forget gate, and output gate. Sigmoid activation functions are used in these gates to give non-negative values at different states. Input gate tells us the new information to be stored in the LSTM cell state. Forget gate tells us the features to be throw away or irrelevant. The output gate provides the activation required to estimate the final output of the LSTM block at time-step j. Mathematically, the flow of information from these gates can be written as follows: where denotes the Hadamard product [Horn and Johnson, 2012] . At time-step j = 0, a j = 0 and h j = 0. z j is the input vector to the LSTM cell. h j is the hidden state vector, which is also known as output vector of the LSTM cell. f j , i j , and o j are the activation vectors for forget, input, and output gates, respectively.ã j is the LSTM cell input activation vector. a j is the LSTM cell state vector. W α and U α , ∀α = f, i, o, and, a, are the weight matrices of the input and recurrent connections. b α are the corresponding bias vectors. W α , U α , and b α are learned during the training process. Figure 2 shows a pictorial description of the spatial-temporal sequence forecasting (STSF) of the reactive-transport data. STSF is performed using the non-negative CNN-LSTM model that is built on the proposed DL-based model architecture as shown in Figure 1 . To summarize, our framework ingests data from zero to N -th time-step for training the CNN-LSTM model. The ingested data is an image stack consisting of c C (x, y), v x , v y , D xx , D xy , and D yy images at N -th time-step. 
$c_C(x, y)$ is the concentration of product C from the high-resolution numerical simulation using the non-negative FEM (ground truth). $v_x$ and $v_y$ are the $x$- and $y$-components of the vortex-based velocity field given by Eqs. (4.3)-(4.4). $D_{xx}$, $D_{xy}$, and $D_{yy}$ are the $xx$-, $xy$-, and $yy$-components of the anisotropic dispersion tensor given by Eq. (4.1). The trained CNN-LSTM model uses these images to produce a sequence of predictions of $c_C(x, y)$ for time-steps greater than $N$. We also note that the trained CNN-LSTM model continually ingests the data it produces after the $N$-th time-step to make forecasts until the end of the simulation time.

3.2. Non-negative CNN-LSTM model training. The architecture of our CNN consists of three convolutional neural layers with 16 filters. We initialize the filters with the Glorot uniform initializer, also called the Xavier uniform initializer [Glorot and Bengio, 2010]. The CNN layer weights are constrained to be non-negative, and the nonlinear mapping is based on the ReLU activation function [Nair and Hinton, 2010]. This allows us to learn non-negative feature maps. A max-pooling operation is applied after each convolution. The final CNN layer is flattened and followed by an LSTM layer with 400 units. The LSTM layer output is then fed to a fully connected layer of size 6561 with a sigmoid activation function. Note that the dense layer outputs the concentration at the next time-step, and its weights are also constrained to be non-negative. The output of the dense layer is then reshaped to a 2D concentration image of product C. The entire model is compiled with the Adam optimizer [Kingma and Ba, 2014] with the mean squared error as the loss. The resulting model has a total of 3,298,785 trainable parameters. Training our model is accomplished by backpropagation. The weights in the CNN-LSTM model are updated iteratively on batches of data from the training set; the batch size is 6, and the training is terminated after 100 epochs. The proposed model is updated after seeing each batch by using gradient descent and backpropagation: the weights are adjusted to minimize the mean squared error of the proposed deep learning model (i.e., through the gradient of the error with respect to the weights). Note that we performed sensitivity analyses with respect to model parameters such as the number of LSTM units, the batch size, and the number of epochs to finalize the hyperparameters used to generate the results presented in this paper.

In this section, we present numerical results to evaluate the accuracy and efficiency of the proposed framework. We first describe an initial boundary value problem that is used to generate the data; this problem provides insights into reactive mixing in the subsurface under weakly chaotic flow fields (e.g., vortex-based structures). We then summarize the input parameters used to generate the data and train the CNN-LSTM models. Next, we describe in detail the accuracy of the proposed DL-based framework; ground truth and CNN-LSTM model predictions are compared to show the predictive capability of these models. Finally, we discuss the computational cost associated with training and testing the deep learning models.

4.1. Reaction tank problem. Figure 3 provides a pictorial description of the initial boundary value problem. The computational domain is a square with side length L = 1. Zero-flux boundary conditions are enforced on the sides of the domain. The non-reactive volumetric source $f_i(\mathbf{x}, t)$ is equal to zero for all the chemical species.
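Before completing the problem description, the architecture and training choices of Sec. 3.2 can be made concrete with a short Keras sketch. The filter count, the three convolutional layers with non-negative ReLU kernels, the max-pooling after each convolution, the 400-unit LSTM, the 6561-unit non-negative sigmoid dense layer, the Adam optimizer, and the MSE loss follow the text; the kernel size, pool size, and input sequence length n_steps are assumptions, so this is an illustrative sketch rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.constraints import NonNeg

n_steps, H, W, C = 10, 81, 81, 6   # channels: c_C, v_x, v_y, D_xx, D_xy, D_yy

# Spatial feature extractor: three non-negative conv layers, each followed
# by max-pooling, then flattened into a feature vector per time-step.
cnn = models.Sequential(name="spatial_features")
cnn.add(layers.Input((H, W, C)))
for _ in range(3):
    cnn.add(layers.Conv2D(16, 3, padding="same", activation="relu",
                          kernel_initializer="glorot_uniform",
                          kernel_constraint=NonNeg()))
    cnn.add(layers.MaxPooling2D(2))
cnn.add(layers.Flatten())

model = models.Sequential([
    layers.Input((n_steps, H, W, C)),
    layers.TimeDistributed(cnn),               # apply the CNN at every time-step
    layers.LSTM(400),                          # temporal evolution of the features
    layers.Dense(H * W, activation="sigmoid",  # next-step concentration, kept non-negative
                 kernel_constraint=NonNeg()),
    layers.Reshape((H, W)),                    # back to a 2D image of product C
])
model.compile(optimizer="adam", loss="mse")
# model.fit(x_train, y_train, batch_size=6, epochs=100)   # settings from Sec. 3.2
```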
Initially, species A and species B are segregated, with species A placed in the left half of the domain and species B in the right half. The stoichiometric coefficients are selected as $n_A = n_B = n_C = 1$. The total time of interest is selected as $\mathcal{I} = 1$. The dispersion tensor is chosen from the subsurface literature [Pinder and Celia, 2006] and is given as follows:
$$\mathbf{D}(\mathbf{x}) = \left(\alpha_T \|\mathbf{v}\| + D_m\right) \mathbf{I} + \frac{\alpha_L - \alpha_T}{\|\mathbf{v}\|}\, \mathbf{v} \otimes \mathbf{v}, \tag{4.1}$$
where $D_m$ is the molecular diffusivity, $\alpha_L$ is the longitudinal diffusivity, $\alpha_T$ is the transverse diffusivity, $\mathbf{I}$ is the identity tensor, $\otimes$ is the tensor product, $\mathbf{v}$ is the velocity vector field, and $\|\cdot\|$ is the Frobenius norm. As discussed in Sec. 2.1, we have neglected advection. A model velocity field is used to define the anisotropic dispersion tensor through a time-periodic, vortex-based stream function, Eq. (4.2) [Adrover et al., 2002; Mudunuru and Nakshatrala, 2016; Tsang, 2009], in which $\nu = 0, 1, 2, \cdots$ is an integer, and $\kappa_f L$ and $T$ are the characteristic scales of the flow field. Using Eq. (4.2), the divergence-free velocity field components are obtained, as given by Eqs. (4.3)-(4.4). The non-negative finite element method described in the Appendix is used to generate high-resolution data for training and testing the DL-based framework. A low-order triangular finite element mesh of size 81 × 81, with 6,561 degrees of freedom, is employed to discretize the domain. The backward Euler time-stepping scheme with $\Delta t = 0.001$ is used to advance the simulation time from t = 0.0 to t = 1.0. Note that in our previous works [Mudunuru and Nakshatrala, 2016, 2017; Nakshatrala et al., 2013], we performed h-convergence and time-step refinement studies for a system of equations similar to that described in Sec. 2. The outcome of these studies showed that an 81 × 81 mesh with $\Delta t = 0.001$ is sufficient to achieve accurate numerical solutions that capture the fine-scale spatial-temporal variations caused by anisotropic dispersion. To demonstrate the applicability of the proposed DL-based framework, we used a total of four high-resolution numerical simulations to train and test the non-negative CNN-LSTM models. Each simulation has 1000 images of $c_C$, each of size 81 × 81. A detailed description of the simulation data (a total of 2315 realizations) generated for machine learning analysis is given in [Ahmmed et al., 2020; Mudunuru and Karra, 2021; Vesselinov et al., 2019]. Herein, we use a subset of this massive dataset to test our proposed framework. The reaction-diffusion model input parameters associated with this subset are as follows: $\kappa_f L \in \{2, 3, 4, 5\}$, $v_0 = 10^{-1}$, $\alpha_L/\alpha_T = 10^4$, $D_m = 10^{-3}$, and $T = 10^{-4}$. That is, $\kappa_f L$ is varied while the other parameters ($v_0$, $\alpha_L/\alpha_T$, $D_m$, $T$) are kept constant. The resulting data are shown in Figures 4(a), 6(a), 8(a), and 10(a). The selected parameters represent reactive mixing under high anisotropy: $\alpha_L/\alpha_T = 10^4$ corresponds to a scenario where longitudinal dispersion ($\alpha_L$, along the streamlines) is $10^4$ times higher than transverse dispersion ($\alpha_T$, across the streamlines). Higher values of $\kappa_f L$ enhance mixing, while lower values result in regions/islands where the species rarely mix. $v_0 = 10^{-1}$ results in very small perturbations in the flow field, thereby preserving the underlying symmetry in the mixing patterns. $D_m = 10^{-3}$ corresponds to reasonably high molecular diffusion. $T = 10^{-4}$ corresponds to a highly oscillatory flow field sampled at a frequency of 10 kHz. The data generated with these parameters represent preferential and incomplete reactive mixing.
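For reference, the dispersion tensor components that enter the image stack can be evaluated from a given velocity field exactly as in Eq. (4.1). In the sketch below, the velocity field is a synthetic placeholder (the actual field comes from the stream function of Eq. (4.2)), and the small regularizer guarding against division by zero is an implementation assumption.

```python
import numpy as np

alpha_L, alpha_T, D_m = 1.0, 1e-4, 1e-3   # gives alpha_L / alpha_T = 1e4

x, y = np.meshgrid(np.linspace(0.0, 1.0, 81), np.linspace(0.0, 1.0, 81))
v_x = np.cos(2.0 * np.pi * y)             # placeholder divergence-free field
v_y = np.sin(2.0 * np.pi * x)

# Eq. (4.1): D = (alpha_T ||v|| + D_m) I + (alpha_L - alpha_T) (v ⊗ v) / ||v||
v_norm = np.sqrt(v_x**2 + v_y**2) + 1e-30  # avoid division by zero at stagnation points
D_xx = alpha_T * v_norm + D_m + (alpha_L - alpha_T) * v_x * v_x / v_norm
D_yy = alpha_T * v_norm + D_m + (alpha_L - alpha_T) * v_y * v_y / v_norm
D_xy = (alpha_L - alpha_T) * v_x * v_y / v_norm   # equals D_yx by symmetry
```

These D_xx, D_xy, and D_yy images are precisely the dispersion channels that the CNN-LSTM ingests alongside c_C, v_x, and v_y.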
The system always starts in the unmixed state; depending on the choice of the model parameters, it can then evolve toward either a preferential-mixing or an incomplete-mixing state. Incomplete mixing corresponds to a parameter scenario where $\alpha_L/\alpha_T = 10^4$ with large-scale vortex structures (e.g., $\kappa_f L \leq 3$; see Figures 4(a) and 6(a)). This results in regions/islands (specifically near the centers of the vortices) where the product C concentration is zero. Preferential mixing represents a situation where $\alpha_L/\alpha_T = 10^4$ with small-scale vortex structures (e.g., $\kappa_f L \geq 4$; see Figures 8(a) and 10(a)). In this case, the system is not uniformly mixed due to anisotropy, and the small-scale features in the flow field facilitate mixing along a certain direction. As a result, training and testing our CNN-LSTM models on these data allows us to see how well they predict different spatial-temporal mixing patterns under high anisotropy at late times. The reason for choosing the high-anisotropy-contrast data is its rich information content on the spatial-temporal evolution of reactive-mixing patterns (e.g., incomplete mixing, preferential mixing, interfacial mixing). Such patterns do not form under lower anisotropic contrast, because the system then tends to move towards a uniform or well-mixed state (e.g., when $\alpha_L/\alpha_T \leq 100$) even at early times [Ahmmed et al., 2020; Mudunuru and Karra, 2021].

4.3. Accuracy of mixing patterns under the proposed DL-based framework. In this subsection, we present a detailed analysis of the accuracy of the trained models. The prediction-accuracy metrics focus on how well the proposed CNN-LSTM models forecast mixing patterns under the combined effects of high anisotropy and different characteristic scales of the vortex structures. Figure 4 compares the ground truth and the deep learning model predictions at t = 1.0. The ground truth is based on reactive-transport simulation data generated for $\kappa_f L = 2$. This value of $\kappa_f L$ corresponds to reactive mixing under large-scale vortex structures. This scenario also corresponds to incomplete mixing, which occurs due to the combined effects of high anisotropy contrast and large-scale features in the velocity field. Specifically, we can find a total of eight regions/islands in the domain (e.g., vortex centers) where the progress of mixing and the formation of product C are very slow. At the centers of these islands, the value of $c_C(x, y)$ is equal to zero. We analyze whether our trained CNN-LSTM models can capture such incomplete mixing patterns that arise from large-scale features in the velocity field. As mentioned in the prior section, the FEM simulation data were obtained from t = 0.0 to t = 1.0 with a time increment of 0.001. Figure 4(a) shows the ground truth at the end of the simulation time (at t = 1.0). Figures 4(b)-(m) show the predictions at t = 1.0 from the CNN-LSTM models trained using incrementally larger amounts of training data, as described in the individual captions. To elaborate, the forecast at t = 1.0 in Figure 4(b) is obtained using the CNN-LSTM model trained using the first 8% (from t = 0.001 to t = 0.08) of the ground truth data. The trained model then predicts $c_C(x, y)$ for the rest of the simulation time (i.e., t = 0.09, 0.10, ..., 1.00). The CNN-LSTM model makes this forecast based on the procedure summarized in Figure 2. It can be observed in Figure 4 that the forecasts are generally better for larger amounts of training data.
A very accurate prediction is obtained when we use 88% of the ground truth data. For this scenario, based on Figure 5(k), the maximum possible prediction error is less than 4% in the entire domain. Moreover, the maximum possible prediction error for accurately capturing interfacial mixing is less than 2%. Qualitatively, however, we can see that the non-negative CNN-LSTM model trained using as little as 32% of the ground truth data is able to effectively capture the underlying mixing pattern (e.g., interfacial mixing). Figure 5 compares the prediction errors in the entire domain at t = 1.0. The prediction error is expressed in percentage and is calculated by taking the difference between the ground truth and the non-negative CNN-LSTM model prediction at every point in the domain. A positive error value indicates that the CNN-LSTM model under-predicts the true concentration, while a negative value indicates over-prediction. From Figure 5, it is clear that there is a coherent spatial structure in the error, and the figure also shows where the CNN-LSTM model fails to make an accurate prediction. That is, qualitatively and quantitatively, the error sheds light on the difficulties faced by the CNN-LSTM model in predicting mixing patterns with different amounts of training data. For instance, with 8% training data, the maximum possible under-prediction and over-prediction are approximately 40% and 16%, respectively. In this scenario, large under-predictions (> 30%) by the CNN-LSTM model are seen in the regions closer to the boundary of the domain (e.g., due to boundary effects). Similarly, large over-predictions (> 10%) are seen near the mixing interface (e.g., due to local-scale features); this is where the barrier is removed and the species are allowed to interact, facilitating the progress of the chemical reaction. From Figures 5(a)-(l), it is evident that to capture the interfacial mixing (e.g., due to high anisotropy) within an error of 10% (measured using the infinity norm), 32% of the ground truth data is sufficient. To capture the global-scale mixing features (e.g., resulting from molecular diffusion) within an error of 10%, 64% of the ground truth data is needed. With this amount of training data, the prediction error associated with capturing the local-scale features near the interface is reduced by more than 50%; specifically, with 64% ground truth data, the prediction error in capturing the interfacial mixing is less than 4%. This level of accuracy (e.g., using 32%, 64%, or 88% of the data) instills confidence that the proposed CNN-LSTM model architecture makes reasonably good forecasts at both early and late times of reactive mixing. With minimal data (between 32% and 64%), we can capture the various patterns and associated multi-scale features of incomplete mixing arising from large-scale vortex structures. Figure 6 shows the ground truth and the trained CNN-LSTM model predictions for $\kappa_f L = 3$. Similar to the $\kappa_f L = 2$ scenario, we have large-scale vortex structures (≈ 18 vortices in the domain); however, the spatial scale of these vortices is smaller for $\kappa_f L = 3$ than for $\kappa_f L = 2$. As a result, the extent of incomplete mixing is reduced compared to Figure 4(a). This is because of the reduced size of the vortices, which improves the interaction between the reactants. Consequently, we have a better mixing scenario than for $\kappa_f L = 2$, even under high anisotropy. Figure 6(a) has six regions/islands where the progress of mixing and the formation of product C are very slow.
These regions are located primarily near the boundary and at the corners of the domain (e.g., (x, y) ≈ (0.1, 0.25), (0.1, 0.6), (0.25, 0.1), (0.75, 0.9), (0.9, 0.75), (0.9, 0.4)). Compared to Figure 4(a), we have fewer regions/islands where $c_C(x, y) = 0$. Additionally, the interfacial mixing in Figure 6(a) is stronger than that in Figure 4(a), because of the larger number of vortices in the domain, which produces enhanced mixing. Figure 7 compares the prediction errors at the end of the simulation time. Similar to $\kappa_f L = 2$, it is evident that the model forecasts are better for larger amounts of training data. The best and most accurate forecast is obtained when we use 80% of the ground truth data (see Figure 6(k)). Unfortunately, the model trained with 88% of the data performs poorly due to over-fitting. For 80% training data, the maximum possible prediction error is less than 8% in the entire domain. This prediction accuracy is higher than that of the $\kappa_f L = 2$ scenario at the same training level (e.g., see Figure 5(j)). The maximum possible prediction error for accurately capturing the interfacial mixing is less than 3% (roughly double that of $\kappa_f L = 2$). In a nutshell, qualitatively and quantitatively, the non-negative CNN-LSTM model trained using at least 72% of the ground truth data is able to effectively capture the various patterns of this enhanced mixing, including both interfacial mixing and mixing near the vortices. Figure 8 shows the mixing patterns under small-scale vortex structures (≈ 32 vortices in the domain). In this scenario, we observe preferential mixing aligned approximately at an angle of 15° counterclockwise from the vertical barrier. This preferential direction contains four small-scale vortices, which allow the reactants to mix faster across the interface than for $\kappa_f L = 2$ and 3. As a result, the concentration of product C is non-zero in the entire domain. However, the formation of the product is not homogeneous in the domain despite the small-scale features in the velocity field; this spatial inhomogeneity of the product formation is due to the high contrast in anisotropy, which results in preferential mixing patterns. Qualitatively and quantitatively, the predictions shown in Figures 8-9 shed light on how the CNN-LSTM model captures the different mixing patterns. For example, 72-80% of the data is reasonably sufficient to capture the product formation within 10% error in the entire domain. At this level of training data, we can accurately capture the interfacial mixing within 3-4%. Figure 10 compares the ground truth and the predictions for $\kappa_f L = 5$. This value of $\kappa_f L$ corresponds to preferential mixing under fine-scale features in the velocity field (≈ 50 vortices in the domain). The angle of preferential mixing is approximately 20-23° counterclockwise from the vertical barrier. In this direction, the maximum possible plume width (e.g., the region of the domain with $c_C(x, y) \geq 0.45$) is approximately 0.2-0.25 times the length of the domain. Based on Figures 10(i)-(k), the CNN-LSTM model predictions based on 64-80% of the ground truth data capture most of the characteristics of this plume (e.g., plume width, angle of preferential mixing). Figures 11(h)-(j) show that the associated prediction errors (measured in the infinity norm) are less than 5%.
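The pointwise percentage errors reported in Figures 5, 7, 9, and 11 can be computed as sketched below. The sign convention follows the text (positive means under-prediction); treating the pointwise difference times 100 as the percentage error is an assumption about the normalization, and the two input fields are synthetic placeholders.

```python
import numpy as np

def prediction_error_percent(c_true: np.ndarray, c_pred: np.ndarray) -> np.ndarray:
    """Pointwise error map in percent; positive entries are under-predictions."""
    return (c_true - c_pred) * 100.0

rng = np.random.default_rng(1)
c_true = rng.random((81, 81))            # placeholder for the FEM ground truth
c_pred = np.clip(c_true + 0.02 * rng.standard_normal((81, 81)), 0.0, None)

err = prediction_error_percent(c_true, c_pred)
err_inf = np.max(np.abs(err))            # infinity-norm error (%), as quoted in the text
print(f"max |error| = {err_inf:.2f}%")
```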
Such accuracy instills confidence in the predictive capability of our models to forecast mixing patterns at late times under the combined effect of high anisotropy and fine-scale features in the velocity field.

The computational cost of running a single high-fidelity numerical simulation is approximately 26 minutes (1560 seconds) on a single core and 18 minutes (1080 seconds) on an eight-core processor (Intel Xeon CPU E5-2695 v4 @ 2.10 GHz) [Mudunuru and Karra, 2021]. The simulation dataset was developed using high-performance computing (HPC) resources [Ahmmed et al., 2020; Mudunuru and Karra, 2021; Vesselinov et al., 2019]. However, researchers' access to leadership-class supercomputing resources (e.g., NERSC [NER, 2021], ALCF [ALC, 2021], OLCF [OLC, 2021]) is generally limited. Additionally, even with access to such HPC systems, there are various challenges and bottlenecks in generating data for long simulation times. These include the compute time available to users (e.g., nodes, machine hours), the waiting time in the job submission queue, and the cap on the maximum wall clock time of a run on a supercomputer. For example, at the NERSC user facility, one has to restart simulations every two days, as the cap for runs in the regular queue is 48 hours. Due to this bottleneck, simulations need to be restarted many times to generate data at future time-steps. As a result, the overall wall clock time to generate simulation data for multiple realizations can be on the order of weeks to months. Our DL-based framework provides an attractive way to overcome these challenges: this computational cost can be avoided by training the proposed DL models on partially generated data to forecast the spatial-temporal variables at future times. We trained the proposed CNN-LSTM models on a MacBook Pro laptop (2 GHz 2-core Intel i7 CPU, 8 GB DDR4 RAM). The associated computational cost of training the CNN-LSTM models and making a prediction is shown in Table 1. From this table, it is evident that model training is computationally expensive, as the wall clock times are on the order of $O(10^3)$ to $O(10^4)$ seconds. This is expected, as CPUs alone (with low-end memory) are slow for training deep learning models. However, once the CNN-LSTM model is developed, the wall clock time to make an inference (i.e., predicting the spatial concentration of product C at the next time-step) is very low; the inference time on this laptop is on the order of $O(10^{-1})$ seconds. This shows that our proposed CNN-LSTM model is fast in addition to being predictive. Our future work involves accelerating the training process using Google Cloud Platform's GPUs and TPUs [Bisong, 2019].

Table 1. Training wall clock time (TWCT) and prediction wall clock time (PWCT), in seconds, for the proposed non-negative CNN-LSTM models. TWCT is the amount of time taken to ingest the simulation data and perform the training process. PWCT is the average amount of time taken to make a spatial prediction of the concentration field per time-step for all future times. As mentioned in Sec. 3.2, the CNN-LSTM models are trained using 100 epochs with a batch size of 6.

Finally, we note that it is straightforward to perform a similar analysis for other realizations (e.g., for different parameters in the reaction-diffusion model). This can be achieved by re-training the proposed non-negative CNN-LSTM models with minimal data for the other input parameters.
The re-training process can be minimal because the weights of the convolutional layers that create the feature maps are not updated; these CNN layers have already learned the underlying representation of the reactive-transport process through the proposed training approach. Specifically, minimal re-training can be performed through transfer-learning techniques, in which only the weights of the dense layer connected to the output are updated. One should expect outcomes similar to those described in this study. Below we summarize our results:

• Our model's prediction accuracy generally increases with the amount of training data.
• However, in certain training scenarios, the prediction accuracy is reduced with larger amounts of training data. This poor performance can be attributed to model over-fitting.
• The concentration profile in the interfacial mixing zones is captured with reasonably good accuracy by our non-negative CNN-LSTM model trained using 32% of the simulation data.
• The high-concentration zones are accurately captured, within 10% error (measured in the infinity norm), by training our model using 64% of the simulation data.
• The wall clock time to make a prediction/inference is ≈ $O(10^{-5})$ of that of running a high-fidelity simulation.

Based on the accuracy of the results presented in Secs. 4.3-4.4, the proposed DL framework is an ideal candidate for modeling virus transport. Several prior studies have used reactive transport for pandemic modeling. Some concrete examples include predicting the evolution of COVID-19 [Jha et al., 2020; Viguerie et al., 2020] and capturing the transport of microbial pathogens in the terrestrial subsurface (primarily at aquatic interfaces) [Bradford et al., 2013; Hansen et al.; Harvey et al., 2007]. In the said applications, accurate forecasts of reactive transport are essential because fluctuating flow regimes (e.g., water table fluctuations, flow fields given by Eqs. (4.3)-(4.4)) can enhance virus detachment from solid-water and air-water interfaces [Zhang et al., 2012]. These flow fields can result in regions (e.g., interfacial mixing areas as shown in Figures 4, 6, 8, and 10) that have high concentrations in drinking water resources, which the cited references did not consider. Our DL framework can be applied to solve such problems. The non-negative CNN-LSTM models would be able to accurately model the spatial-temporal evolution of an epidemic such as COVID-19 and of virus transport in porous media. The DL-based model can be trained using the initial few days of the epidemic evolution over a region of interest. The trained model can then be used to predict the future evolution of the epidemic. This would provide very significant information to policymakers on when and where the epidemic infections would reach a critical level. This information would also help in planning the allocation of resources and in taking early actions to suppress the spread of the disease (e.g., lockdowns, travel bans).

There is a need for fast, reliable, and physically meaningful forecasts of the concentrations of chemical species and the evolution of reactive mixing for diverse engineering applications such as combustion and pandemic modeling. However, high-resolution numerical simulations mostly use only the initial and boundary values and cannot utilize any intermediate temporal data to improve the accuracy of the predictions.
Second, high-resolution numerical simulations are often expensive, as each simulation may take several hours or days to provide the spatial-temporal patterns of mixing, especially for problems with large spatial and temporal domains. To address the said need, we have presented a fast and predictive framework that can forecast the spatial-temporal evolution of reactive transport using minimal data. Our approach leverages recent advances in deep learning techniques: non-negativity-constrained CNNs for capturing the spatial spreading and LSTMs for the temporal evolution. The framework was used to develop a new non-negative model, which elegantly overcomes the issue of spurious negative concentration predictions by traditional numerical methods like FEM. The DL-based model developed herein needs to be trained using only a subset of the data over the total temporal domain of interest; the data can be obtained using high-fidelity simulations or experiments. Our results demonstrate that less than 40% of the simulation data is needed to capture the interfacial mixing (e.g., local-scale mixing features due to the combined effects of anisotropy and vortex structures) within an error of 10% (as measured using the infinity norm). To capture the global features at the same level of accuracy, no more than 70-80% of the data is needed. The time needed for a forecast is a fraction (≈ $O(10^{-6})$) of the time needed for training. Also, the ratio of the solution time for the proposed deep learning framework to that for the high-fidelity simulation used to generate the training data is ≈ $O(10^{-5})$. Thus, the accuracy of the predictions combined with the fast inference makes our non-negative CNN-LSTM models attractive for real-time forecasting of species concentrations in reactive-transport problems for a wide range of engineering applications. A plausible direction for future work is accelerating the training process of the proposed deep learning models using graphics processing units (GPUs) and tensor processing units (TPUs) [Bisong, 2019]. Distributed training (e.g., using Horovod [Sergeev and Del Balso, 2018]) and scalable machine learning (e.g., using DeepHyper [Balaprakash et al., 2018]) provide viable routes to achieve speedups by splitting the workload of training the non-negative CNN-LSTM model among multiple processors (i.e., worker nodes).

For completeness, this section provides details of the non-negative single-field formulation used to generate the data for training and testing the CNN-LSTM models. Let the total time interval $[0, \mathcal{I}]$ be discretized into $N$ non-overlapping sub-intervals:
$$[0, \mathcal{I}] = \bigcup_{n=0}^{N-1} [t_n, t_{n+1}],$$
where $t_0 = 0$ and $t_N = \mathcal{I}$. Assuming a uniform time-step, we define
$$\Delta t := t_{n+1} - t_n.$$
We first discretize the governing equations for the invariants with respect to time using the backward Euler time-stepping scheme. The resulting time-discrete equations corresponding to Eqs. (2.6a)-(2.6d) are:
$$\frac{c_F^{(n+1)} - c_F^{(n)}}{\Delta t} - \mathrm{div}\!\left[\mathbf{D}\, \mathrm{grad}\!\left[c_F^{(n+1)}\right]\right] = 0 \quad \text{in } \Omega,$$
$$c_F^{(n+1)}(\mathbf{x}) = c_F^{p}(\mathbf{x}, t_{n+1}) \quad \text{on } \Gamma^D, \qquad -\mathbf{n}(\mathbf{x}) \cdot \mathbf{D}\, \mathrm{grad}\!\left[c_F^{(n+1)}\right] = h_F^{p}(\mathbf{x}, t_{n+1}) \quad \text{on } \Gamma^N,$$
$$c_F^{(0)}(\mathbf{x}) = c_F^{0}(\mathbf{x}) \quad \text{in } \Omega.$$
Note that one can extend the proposed framework to other time-stepping schemes by following similar non-negative procedures. The standard finite element Galerkin formulation for Eqs. (A.3a)-(A.3d) is given as follows: find $c_F^{(n+1)}$, satisfying the prescribed Dirichlet boundary conditions, such that
$$\mathcal{B}\big(w;\, c_F^{(n+1)}\big) = L(w) \quad \text{for all admissible weighting functions } w,$$
where the bilinear form and the linear functional are, respectively, defined as
$$\mathcal{B}(w;\, c) := \int_{\Omega} w\, c\; \mathrm{d}\Omega + \Delta t \int_{\Omega} \mathrm{grad}[w] \cdot \mathbf{D}\, \mathrm{grad}[c]\; \mathrm{d}\Omega \quad \text{and} \quad L(w) := \int_{\Omega} w\, c_F^{(n)}\; \mathrm{d}\Omega - \Delta t \int_{\Gamma^N} w\, h_F^{p}\; \mathrm{d}\Gamma.$$
A similar formulation can be written for the invariant $c_G$. The Galerkin formulation given by Eq. (A.5) can be re-written as the following minimization problem:
$$\min_{c}\; \frac{1}{2}\, \mathcal{B}(c;\, c) - L(c).$$
However, it has been shown that the Galerkin formulation leads to negative concentrations for product C [Nakshatrala et al., 2013], which is unphysical.
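The remedy, described next, is to impose a non-negativity constraint on this minimization. As a minimal illustration of such a bound-constrained solve, the sketch below minimizes the discrete objective $\frac{1}{2}\mathbf{c}^{T}\mathbf{K}\mathbf{c} - \mathbf{c}^{T}\mathbf{f}$ subject to $\mathbf{c} \geq \mathbf{0}$ using SciPy's L-BFGS-B; the matrix K and vector f are small synthetic stand-ins for the assembled FEM system, and the solver choice here is an assumption, not the one used in the cited works.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
K = A @ A.T + 50.0 * np.eye(50)   # symmetric positive definite stand-in for the FEM matrix
f = rng.standard_normal(50)       # stand-in for the load vector

obj = lambda c: 0.5 * c @ K @ c - f @ c       # discrete quadratic objective
grad = lambda c: K @ c - f                    # its gradient

res = minimize(obj, x0=np.zeros(50), jac=grad, method="L-BFGS-B",
               bounds=[(0.0, None)] * 50)     # enforce c >= 0 component-wise
c_nonneg = res.x
assert (c_nonneg >= 0.0).all()                # non-negative nodal concentrations
```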
Instead, we solve the minimization problem Eq. (A.8) with the constraint that $c_F$ is non-negative; that is,
$$c_F^{(n+1)}(\mathbf{x}) \geq 0 \quad \text{in } \Omega.$$
A similar minimization problem with a non-negativity constraint can be written for $c_G$. Upon discretizing Eqs. (A.8)-(A.10a) using low-order finite elements, one arrives at:
$$\min_{\mathbf{c} \in \mathbb{R}^{ndofs}}\; \frac{1}{2} \langle \mathbf{c};\, \mathbf{K} \mathbf{c} \rangle - \langle \mathbf{c};\, \mathbf{f} \rangle \quad \text{subject to} \quad \mathbf{c} \succeq \mathbf{0},$$
where $\langle \cdot;\, \cdot \rangle$ represents the standard inner product on Euclidean spaces, and "ndofs" denotes the (nodal) degrees of freedom. $\mathbf{c}_F^{(n+1)}$ and $\mathbf{c}_G^{(n+1)}$ are the nodal concentration vectors for the invariants $c_F$ and $c_G$ at time level $t_{n+1}$, respectively. The coefficient matrix $\mathbf{K}$ is positive definite, and hence a unique global minimizer exists [Nagarajan and Nakshatrala, 2011].

The datasets used in this study were developed as part of previously published studies [Ahmmed et al., 2020; Mudunuru and Karra, 2021; Vesselinov et al., 2019]; these works are cited and acknowledged in this paper.

Figure 4. Predictions for $\kappa_f L = 2$: this figure shows the ground truth from the non-negative FEM and the predictions from the trained non-negative CNN-LSTM models at t = 1.0. It is evident that our deep learning model is able to effectively capture product C evolution with minimal training data (greater than or equal to 32%). It is also evident that with less than 64% of the training data, we can accurately capture the different mixing patterns (e.g., interfacial mixing, mixing near vortices due to molecular diffusion).
Figure 8. Predictions for $\kappa_f L = 4$ (panels (b)-(m): predictions using 8%-96% of the training data): this figure compares the ground truth and the predictions from the trained non-negative CNN-LSTM models at t = 1.0. It is evident that 72% of the ground truth data is reasonably sufficient to qualitatively capture the different mixing patterns (e.g., preferential mixing due to small-scale features in the velocity field).
Figure 10. Predictions for $\kappa_f L = 5$ (panels (b)-(m): predictions using 8%-96% of the training data): this figure compares the ground truth and the predictions from the trained non-negative CNN-LSTM models at t = 1.0. It is evident that 64% of the ground truth data is needed to reasonably capture the product formation in the entire domain (e.g., preferential mixing patterns, mixing away from the interface).
Figure 11. Prediction error in percentage for $\kappa_f L = 5$: this figure compares the prediction errors in the entire domain at t = 1.0 for different amounts of training data. Based on the error values (e.g., ≤ 10%), it is evident that with 80% training data, we can accurately capture the product formation in the entire domain. To accurately predict the preferential mixing patterns (e.g., with an error of less than 5%), 64% of the ground truth data is sufficient.
References

Argonne Leadership Computing Facility
Oak Ridge Leadership Computing Facility
A spectral approach to reaction/diffusion kinetics in chaotic flows
A comparative study of machine learning models for predicting the state of reactive mixing
Modelling and simulation of reactive transport phenomena
Computational methods for multiphase flow and reactive transport problems arising in subsurface contaminant remediation
DeepHyper: Asynchronous hyperparameter search for deep neural networks
A spline theory of deep learning
Building Machine Learning and Deep Learning Models on Google Cloud Platform
Transport and fate of microbial pathogens in agricultural settings
Large-scale optimization-based non-negative computational framework for diffusion equations: Parallel implementation and performance studies
On stabilized finite element methods for linear systems of convection-diffusion-reaction equations
Reactive transport in porous media for CO2 sequestration: Pore scale modeling using the lattice Boltzmann method
Understanding the difficulty of training deep feedforward neural networks
Deep Learning
Terrestrial-aquatic linkages: Understanding the flow of energy and nutrients across ecosystem boundaries
Transport of microorganisms in the terrestrial subsurface: In situ and laboratory methods
Large eddy simulation of turbulent supersonic hydrogen flames with OpenFOAM
Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations
Bayesian-based predictions of COVID-19 evolution in Texas using multispecies mixture-theoretic continuum models
Adam: A method for stochastic optimization
Physics-informed machine learning models for predicting the progress of reactive mixing
On enforcing maximum principles and achieving element-wise species balance for advection-diffusion-reaction equations under the finite element method
On mesh restrictions to satisfy comparison principles, maximum principles, and the non-negative constraint: Recent developments and new results
Enforcing the non-negativity constraint and maximum principles for diffusion with decay on general computational grids
Rectified linear units improve restricted Boltzmann machines
A numerical framework for diffusion-controlled bimolecular-reactive systems to enforce maximum principles and the non-negative constraint
A numerical methodology for enforcing maximum principles and the non-negative constraint for transient diffusion equations
Chemical Kinetics in Combustion and Reactive Flows: Modeling Tools and Applications
An exactly solvable model of population dynamics with density-dependent migrations and the Allee effect
Numerical analysis on interactions of vortex, shock wave, and exothermal reaction in a supersonic planar shear layer laden with droplets
Combustion in Supersonic Flows and Scramjet Combustion Simulation
Deep learning in neural networks: An overview
Horovod: Fast and easy distributed deep learning in TensorFlow
An exponential integrator for advection-dominated reactive transport in heterogeneous porous media
Predicting the evolution of fast chemical reactions in chaotic flows
Unsupervised machine learning based on non-negative tensor factorization for analyzing reactive-mixing
Diffusion-reaction compartmental models formulated in a continuum mechanics framework: application to COVID-19, mathematical analysis, and numerical study
Reactive Transport Modeling

MKM was supported by the ExaSheds project during the writing of this paper; the project is supported by the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
KBN acknowledges the support from the University of Houston through a High Priority Area Research Seed Grant.