key: cord-0188898-7f3vjoho
authors: Ziyin, Liu; Hartwig, Tilman; Ueda, Masahito
title: Neural Networks Fail to Learn Periodic Functions and How to Fix It
date: 2020-06-15
sha: 784752120d5093baa592eaf85fde021df0a5db3b
doc_id: 188898
cord_uid: 7f3vjoho

Previous literature offers limited clues on how to learn a periodic function using modern neural networks. We start with a study of the extrapolation properties of neural networks; we prove and demonstrate experimentally that the standard activation functions, such as ReLU, tanh, and sigmoid, along with their variants, all fail to learn to extrapolate simple periodic functions. We hypothesize that this is due to their lack of a "periodic" inductive bias. As a fix for this problem, we propose a new activation, namely $x + \sin^2(x)$, which achieves the desired periodic inductive bias to learn a periodic function while maintaining the favorable optimization properties of ReLU-based activations. Experimentally, we apply the proposed method to temperature and financial data prediction.

Periodic functions are among the most basic functions of importance to human society and natural science: the world's daily and yearly cycles are dictated by periodic motions in the Solar System [23]; the human body has an intrinsic biological clock that is periodic in nature [18, 31]; the number of passengers on the metro follows daily and weekly modulations; and the stock market also experiences (semi-)periodic fluctuations [25, 39]. The global economy also follows complicated and superimposed cycles of different periods, including but not limited to the Kitchin cycle and the Juglar cycle [9, 20]. In many scientific scenarios, we want to model a periodic system in order to predict its future evolution based on current and past observations. While deep neural networks are excellent tools for interpolating between existing data, their fiducial versions are not suited to extrapolation beyond the training range, especially not for periodic functions. If we know beforehand that the problem is periodic, we can easily solve it, e.g., in Fourier space or after an appropriate transformation. However, in many situations we do not know a priori whether the problem is periodic or contains a periodic component. In such cases it is important to have a model that is flexible enough to model both periodic and non-periodic functions, in order to overcome the bias of committing to a particular modelling approach.

In fact, despite the importance of being able to model periodic functions, no satisfactory neural-network-based method seems to solve this problem. Some previous methods propose to use periodic activation functions [34, 41, 27]. This line of work proposes using standard periodic functions such as sin(x) and cos(x), or their linear combinations, as activation functions. However, such activation functions are very hard to optimize due to large degeneracy in local minima [27], and the experimental results suggest that using sin as the activation function does not work well except for some very simple models, and that it cannot compete against ReLU-based activation functions [30, 6, 22, 38] on standard tasks.

Figure 1: ... (first column), y = tanh(x) (second column), y = sin(x) (third column), and y = x^2 (last column). The red curves represent the median model prediction and the shaded regions show the 90% credibility interval from 21 independent runs. Note that the horizontal range is re-scaled so that the training data lies between -1 and 1.
The contribution of this work is threefold: (1) we study the extrapolation properties of a neural network beyond a bounded region; (2) we show that standard neural networks with standard activation functions are insufficient to learn periodic functions outside the bounded region where data points are present; (3) we propose a handy solution to this problem, which is shown to work well on toy examples and real tasks. However, we think that our proposed method is not perfect, and the question remains open as to whether better activation functions or methods can be designed.

A key property that makes periodic functions differ from regular functions is their extrapolation behaviour. With period 2π, a periodic function f(x) = f(x + 2π) repeats itself ad infinitum. Learning a periodic function therefore not only requires fitting a pattern on a bounded region; the learned pattern must also extrapolate beyond that region. In this section, we experiment with the inductive bias that the common activation functions offer. While it is hard to investigate the effect of using different activation functions in a general setting, one can still hypothesize that the properties of the activation functions are carried over to the properties of the neural networks. For example, a tanh network will be smooth and extrapolate to a constant function, while a ReLU network is piecewise-linear and extrapolates in a linear way. We set up a small experiment in the following way: we use a fully connected neural network with one hidden layer consisting of 512 neurons. We generate training data by sampling from four different analytical functions in the interval [-5, 5] with a gap in the range [-1, 1]. This allows us to study the inter- and extrapolation behaviour of various activation functions. The results can be seen in Fig. 1. We see that the extrapolation behaviour is dictated by the analytical form of the activation function: ReLU diverges to ±∞, and tanh levels off towards a constant value.
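The setup above is compact enough to sketch directly. The snippet below is a minimal PyTorch sketch of this experiment for one of the four target functions (sin); the optimizer, learning rate, step count, and sample size are illustrative assumptions, since the text specifies only the architecture, the training interval, and the gap.

```python
# Sketch of the Figure-1-style experiment: fit a single-hidden-layer network
# (512 units) on data from [-5, 5] with the gap [-1, 1] removed, then look at
# how it behaves far outside the training range.
import torch
import torch.nn as nn

def make_data(fn, n=256):
    x = torch.empty(n).uniform_(-5.0, 5.0)
    x = x[(x < -1.0) | (x > 1.0)].unsqueeze(1)   # remove the gap [-1, 1]
    return x, fn(x)

def fit(activation, x, y, steps=2000, lr=1e-3):
    # Optimizer and step count are assumptions; the text only fixes the width.
    net = nn.Sequential(nn.Linear(1, 512), activation, nn.Linear(512, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(net(x), y).backward()
        opt.step()
    return net

x, y = make_data(torch.sin)
for name, act in [("relu", nn.ReLU()), ("tanh", nn.Tanh())]:
    net = fit(act, x, y)
    with torch.no_grad():
        far = net(torch.tensor([[10.0], [50.0], [100.0]]))
    print(name, far.flatten().tolist())
```

In line with Figure 1, the ReLU network keeps growing roughly linearly with the input, while the tanh network flattens out to an almost constant value.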
In this section, we study and prove the incapability of standard activation functions to extrapolate. Consider the function

f_σ(x) := W_h σ(W_{h-1} σ( ⋯ σ(W_1 x + b_1) ⋯ ) + b_{h-1}) + b_h,

where σ is the activation function applied element-wise to its input vector, and W_i ∈ R^{d_i × d_{i+1}}. f_σ(x) is called a feedforward neural network with activation function σ; d_1 is called the input dimension, and d_{h+1} the output dimension. Now, one can show that for arbitrary feedforward neural networks the following two extrapolation theorems hold. While the theorems specifically refer to ReLU and tanh as examples, they only require the asymptotic properties of the activation functions, and so we expect any activation function with similar asymptotic behavior to obey the respective extrapolation theorems.

Theorem 1. Consider a feedforward network f_ReLU(x), with arbitrary but fixed depth h and widths d_1, ..., d_{h+1}. Then lim_{z→∞} ||f_ReLU(zu) − (z W_u u + b_u)|| = 0, where z is a real scalar, u is any unit vector of dimension d_1, and W_u, b_u depend only on u.

The above theorem says that any feedforward neural network with ReLU activation converges to a linear transformation W_u in the asymptotic limit, and this extrapolated linear transformation only depends on u, the direction of extrapolation. See Figure 1 for an illustration. Next, we prove a similar theorem for the tanh activation. Naturally, a tanh network extrapolates like a constant function.

Theorem 2. Consider a feedforward network f_tanh(x), with arbitrary but fixed depth h and widths d_1, ..., d_{h+1}. Then lim_{z→∞} ||f_tanh(zu) − v_u|| = 0, where z is a real scalar, u is any unit vector of dimension d_1, and v_u ∈ R^{d_{h+1}} is a constant vector that only depends on u.

We note that these two theorems can be proved through induction, and we give their proofs in the appendix. The above two theorems show that any neural network with the ReLU or tanh activation function cannot extrapolate a periodic function.

3 Proposed Method: x + sin^2(x)

Figure 3: The proposed activation is shown as a blue dashed curve. We see that Snake is easier to optimize than the other periodic baselines. Also interesting is that Snake (and x + sin(x)) are easier to train than the standard ReLU on this task.

As we have seen in the previous section, the choice of the activation function plays a crucial role in the interpolation and extrapolation properties of neural networks, and such interpolation and extrapolation properties in turn affect the generalization of the network equipped with that activation function. We propose to use x + sin^2(x) as the activation function; for ease of reference, we call it the "Snake" function. One can augment it with a factor a to control the frequency of the periodic part. We thus propose the Snake activation with frequency a:

Snake_a(x) := x + (1/a) sin^2(ax).

We plot Snake for a = 0.2, 1, 5 in Figure 2. We see that larger a gives higher frequency.
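The definition translates directly into code. Below is a sketch of Snake as a PyTorch module; the learnable-a flag anticipates the extension discussed later in this section and is an implementation choice rather than something prescribed by the text.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake_a(x) = x + sin^2(a*x) / a, applied element-wise."""
    def __init__(self, a=1.0, learnable=False):
        super().__init__()
        if learnable:
            # One trainable frequency per module; a per-neuron parameter is
            # another possible variant.
            self.a = nn.Parameter(torch.tensor(float(a)))
        else:
            self.register_buffer("a", torch.tensor(float(a)))

    def forward(self, x):
        return x + torch.sin(self.a * x) ** 2 / self.a

# Example: drop-in replacement for ReLU in a small fully connected network.
net = nn.Sequential(nn.Linear(1, 64), Snake(a=1.0), nn.Linear(64, 1))
print(net(torch.randn(4, 1)).shape)  # torch.Size([4, 1])
```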
There are also two conceivable alternative choices for a periodicity-biased activation function. One is the sin function, which has been proposed in [27], along with cos and their linear combinations, as proposed in Fourier neural networks [41]. However, the problem with these functions does not lie in their generalization ability, but in their optimization. In fact, sin is not a monotonic function, and using sin as the activation function creates infinitely many local minima in the solutions (since shifting the preactivation value by 2π gives the same function), making sin very hard to optimize. See Figure 3 for a comparison on training a 4-layer fully connected neural network on MNIST. We identify the root cause of the problem with sin as its non-monotonicity. Since the gradient of the model parameters is only a local quantity, it cannot detect the global periodicity of the sin function. Therefore, the difficulty in biasing an activation function towards periodicity is that it needs to achieve monotonicity and periodicity at the same time. We also propose two other alternatives, x + sin(x) and x + cos(x). They are easier to optimize than sin, similar to the commonly used ReLU. In the neural architecture search in [30], these two functions were found to be in the list of the best-performing activation functions found using reinforcement learning; while the authors commented that these two are interesting, no further discussion was given regarding their significance. While these two and x + sin^2(x) have the same expressive power, we conjecture that x + sin^2(x) is technically better for the following reason. First, it is important to note that the preactivation values are centered around 0, and standard initialization schemes such as Kaiming init normalize the preactivation values to unit variance [36, 14]. By the law of large numbers, the preactivation roughly obeys a standard normal distribution. This makes 0 a special point for the activation function, since most of the preactivation values will lie close to 0. However, x + sin(x) seems to be a choice inferior to x + sin^2(x) around 0.

Expanding around 0, x + sin(x) = 2x − x^3/6 + O(x^5), while x + sin^2(x) = x + x^2 − x^4/3 + O(x^6). Of particular interest to us is the non-linear term in the activation, since this is the term that drives the neural network away from its linear counterpart, and the learning of a non-linear network is explained by this term to leading order. One finds that the first non-linear term in the expansion of x + sin(x) is already third order, while that of x + sin^2(x) contains a non-vanishing second-order term, which can probe non-linear behavior that is even in x. We hypothesize that this non-vanishing second-order term gives Snake a better approximation property than x + sin(x). In Table 1, we compare the properties of each activation function.

Table 1: Comparison of different periodic and non-periodic activation functions.

Extension: We also suggest an extension in which the parameter a becomes a learnable parameter for each preactivation value. The benefit is that a no longer needs to be determined by hand. While we do not study this extension in detail, one experiment is carried out with a learnable a; see the atmospheric temperature prediction experiment in Section 6.2.

In this section, we regress a simple 1-d periodic function, sin(x), with the proposed activation function, in comparison with other standard activation functions. We use a two-layer neural network with 512 hidden neurons with the specified activation function as the non-linearity, trained with stochastic gradient descent (SGD) for 15000 steps (where the training loss stops decreasing for all activation functions). The learning rate is set to 1e-2 with 0.9 momentum. See Figure 4. As expected, all three activation functions learn to regress the training points. However, neither ReLU nor tanh seems able to capture the periodic nature of the underlying function; both baselines inter- and extrapolate in a naive way, with tanh being slightly smoother than ReLU. On the other hand, Snake learns to both interpolate and extrapolate very well; even though the learned amplitude is a little different from the ground truth, it has grasped the correct frequency of the underlying periodic function, both in the interpolation regime and in the extrapolation regime. This shows that the proposed method has the desired flexibility towards periodicity and the potential to model such problems. Snake can also be used in a recurrent neural network, and is also observed to improve upon ReLU and tanh for predicting long-term periodic time evolution. Due to the space constraint, we discuss this in Section A.2.
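Reusing the Snake module sketched above, the sin(x) regression experiment can be reproduced along the following lines. The architecture, optimizer, learning rate, momentum, and step count are as stated in the text; the training interval, the number of training points, and the extrapolation range are assumptions.

```python
# Sketch of the sin(x) regression experiment: same architecture and optimizer
# as stated in the text; the training interval [-4*pi, 4*pi] and the
# extrapolation interval [6*pi, 8*pi] are assumptions.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-4 * math.pi, 4 * math.pi, 400).unsqueeze(1)
y = torch.sin(x)

def train(act):
    net = nn.Sequential(nn.Linear(1, 512), act, nn.Linear(512, 1))
    opt = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)
    for _ in range(15000):
        opt.zero_grad()
        nn.functional.mse_loss(net(x), y).backward()
        opt.step()
    return net

x_far = torch.linspace(6 * math.pi, 8 * math.pi, 200).unsqueeze(1)
for name, act in [("relu", nn.ReLU()), ("tanh", nn.Tanh()), ("snake", Snake(a=1.0))]:
    net = train(act)
    with torch.no_grad():
        err = nn.functional.mse_loss(net(x_far), torch.sin(x_far)).item()
    print(f"{name}: extrapolation MSE = {err:.3f}")
```

Comparing the extrapolation error of the three activation functions mirrors the qualitative comparison in Figure 4.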
In contrast to the well-known universal approximation theorems [17, 7, 11], which qualify a neural network on a bounded region, we prove a theorem that we refer to as the universal extrapolation theorem, which focuses on the behavior of a neural network with Snake on an unbounded region. This theorem says that a Snake neural network with sufficient width can approximate any well-behaved periodic function.

Theorem 3. Let f(x) be a piecewise C^1 periodic function with period L. Then a Snake neural network f_{w_N} with one hidden layer and width N can converge to f(x) as N → ∞, i.e., there exist parameters w_N for all N ∈ Z^+ such that lim_{N→∞} f_{w_N}(x) = f(x) for all x ∈ R, i.e., the convergence is pointwise. If f(x) is continuous, then the convergence is uniform.

As a corollary, this theorem implies the classical approximation theorem [17, 28, 8], which states that a neural network with a suitable non-linearity can approximate any continuous function on a bounded region.

Corollary 1. Let f(x) be a two-layer neural network parameterized by two weight matrices W_1 and W_2, and let w be the width of the network. Then, for any bounded and continuous function g(x) on [a, b], there exists m such that for any w ≥ m we can find W_1 and W_2 such that f approximates g on [a, b] to any prescribed accuracy.

This shows that the proposed activation function is a more general method than the ones previously studied; its practical usefulness is demonstrated experimentally to be on par with standard activation functions on standard tasks, and to outperform previous methods significantly on learning periodic functions. Notice that the above theorem applies not only to Snake but also to the basic periodic functions such as sin and cos, and to monotonic variants such as x + sin(x) and x + cos(x).

As shown in [14], different activation functions actually require different initialization schemes (in terms of the sampling variance) to keep the output of each layer at unit variance, thus avoiding divergence or vanishing of the forward signal. Let W ∈ R^{d_1×d_2} be a weight matrix whose input activations h ∈ R^{d_2} have unit variance in each element; the goal is to set the variance of each element of W such that Snake(Wh) has unit variance. To leading order, Snake looks like the identity function, and so one can make this approximation in finding the required variance: Var(W_{ij}) = 1/d_2, i.e., a uniform initialization on [−sqrt(3/d_2), sqrt(3/d_2)], which is a factor of sqrt(2) smaller in range than the Kaiming uniform initialization. We notice that this initialization is often sufficient. However, when a higher-order correction is necessary, we provide the following exact solution, which is a function of a in general: under a standard normal input, the variance of Snake_a is σ_a^2 = 1 + (1 + e^{-8a^2} − 2e^{-4a^2})/(8a^2), which is maximized at a_max ≈ 0.56045. The second term can be thought of as the "response" to the non-linear term sin^2(x). Therefore, one should also correct the additional variance induced by the sin^2(x) term by dividing the post-activation value by σ_a. Since the positive effect of this correction is most pronounced when the network is deep, we compare the difference between having such a correction and having no correction with ResNet-101 on CIFAR-10. The results are presented in the appendix, Section A.6.1. We note that using the correction leads to better training speed and better converged accuracy. We find that for standard tasks such as image classification, setting 0.2 ≤ a ≤ a_max works very well. We thus set the default value of a to 0.5. However, for tasks with expected periodicity, larger a, usually from 5 to 50, tends to work well.
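A sketch of a Snake layer with the initialization and correction described above is given below; it uses the variance expression σ_a^2 restated in the appendix, and the way the correction is applied (dividing the post-activation values by σ_a) follows the description in the text.

```python
# Sketch of Snake-aware initialization: weights get variance 1/fan_in (a factor
# of sqrt(2) smaller in range than Kaiming-uniform, which targets ReLU), and the
# post-activation values are divided by the standard deviation of Snake under a
# standard normal input to keep the forward signal at roughly unit variance.
import math
import torch
import torch.nn as nn

def snake_std(a):
    # sqrt of Var[Snake_a(x)] for x ~ N(0, 1); maximized near a ~ 0.56.
    var = 1.0 + (1.0 + math.exp(-8 * a * a) - 2.0 * math.exp(-4 * a * a)) / (8 * a * a)
    return math.sqrt(var)

class SnakeLinear(nn.Module):
    def __init__(self, d_in, d_out, a=0.5):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.a = a
        bound = math.sqrt(3.0 / d_in)        # uniform(-bound, bound) => Var = 1/d_in
        nn.init.uniform_(self.linear.weight, -bound, bound)
        nn.init.zeros_(self.linear.bias)

    def forward(self, x):
        h = self.linear(x)
        h = h + torch.sin(self.a * h) ** 2 / self.a
        return h / snake_std(self.a)         # secondary variance correction

layer = SnakeLinear(256, 256, a=0.5)
x = torch.randn(1024, 256)
print(layer(x).var().item())  # roughly 1
```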
In this section, we demonstrate that the wide applicability of Snake is not limited to learning periodic functions. We start with a standard image classification task, where Snake is shown to perform competitively against popular activation functions, showing that Snake can be used as a general activation function. We then focus on the tasks where we expect Snake to be very useful, including temperature and financial data prediction.

Experiment Description. We train ResNet-18 [15], with roughly 10M parameters, on the standard CIFAR-10 dataset. We simply replace the ReLU activation functions with the specified ones for comparison. CIFAR-10 is a 10-class image classification task of 32 × 32 pixel images; it is a standard dataset for measuring progress in modern computer vision methods. We use LaProp [42] with its default hyperparameters as the optimizer. We set the learning rate to 4e-4 for the first 50 epochs and 4e-5 for the last 50 epochs. Standard data augmentation techniques such as random crop and flip are applied. We note that our implementation reproduces the standard performance of ResNet-18 on CIFAR-10, around 92−93% testing accuracy. This experiment is designed to test whether Snake is suitable for the standard and large-scale tasks one encounters in machine learning. We also compare this result against other standard or recently proposed activation functions, including tanh, ReLU, Leaky-ReLU [38], Swish [30], and sin [27].

Result and Discussion. See Figure 5. We see that sin shows performance similar to tanh, agreeing with what was found in [27], while Snake shows performance comparable to ReLU and Leaky-ReLU both in learning speed and in final performance. This hints at the generality of the proposed method, which may be used as a replacement for ReLU in a straightforward way. We also test against the other baselines on ResNet-101, which has 4 times more parameters than ResNet-18, to check whether Snake can scale up to even larger and deeper networks, and we find that, consistently, Snake achieves performance (94.1% accuracy) similar to the most competitive baselines.

For illustration, we first show two real-life applications of our method: predicting the atmospheric temperature of a local island and the human body temperature. These can be very important for medical applications. Many diseases and epidemics are known to have a strong correlation with atmospheric temperature, such as SARS [5] and the COVID-19 crisis currently ongoing in the world [40, 32, 24]. Therefore, being able to model temperature accurately could be important for policy making.

Atmospheric temperature prediction. We start by testing a feedforward neural network with two hidden layers (both with 100 neurons) to regress the temperature evolution in Minamitorishima, an island south of Tokyo (longitude: 153.98, latitude: 24.28). The data represent the average weekly temperature after April 2008, and the results are shown in Fig. 6. We see that the tanh- and ReLU-based models fail to optimize this task and do not make meaningful extrapolations. On the other hand, the Snake-based model succeeds in optimizing the task and makes a meaningful extrapolation with the correct period. Also see Figure 7. We see that Snake achieves vanishing training and generalization loss, while the baseline methods all fail to optimize to zero training loss, and their generalization loss is also not satisfactory.

Human body temperature. Modeling the human body temperature may also be of huge importance; for example, fever is known as one of the most important symptoms signifying a contagious condition, including COVID-19 [13, 35].

Experiment Description. We use a feedforward neural network with 2 hidden layers (both with 64 neurons) to regress the human body temperature. The data are measured irregularly from an anonymous participant over a 10-day period in April 2020, with 25 measurements in total. While this experiment is also rudimentary in nature, it reflects many of the obstacles the community faces when applying deep learning to real problems such as medical or physiological prediction [16, 29]: very limited data (only 25 points for training) and insufficient measurements taken over irregular intervals. In particular, we have a dataset where data points from certain periods of the day are missing, for example from 12am to 8am, when the participant is physically at rest (see Figure 8b), and for those data points we do have, the intervals between two contiguous measurements are irregular, with 8 hours being the average interval; yet this is often the case for medical data, where exact control over variables is hard to realize.
The goal of this task is to predict the body temperature at every hour. The model is trained with SGD with learning rate 1e-2 for 1000 steps, 1e-3 for another 1000 steps, and 5e-4 for another 1000 steps.

Results and Discussion. The performances of ReLU, tanh, sin, sin + cos, and Snake are shown in Figure 8. We do not have a testing set for this task, since it is quite unlikely that a model will predict correctly for this problem due to the large fluctuations in human body temperature; instead, we compare the results qualitatively. In fact, we know some basic facts about body temperature. For example, (1) it should fall within a reasonable range from 35.5 to 37.5 degrees Celsius [12], and, in fact, this is the range where all of the training points lie; (2) at a finer scale, the body temperature follows a periodic behavior, highest in the afternoon (with a peak at around 4pm) and lowest around the middle of the night (around 4am) [12]. At a bare minimum, a model needs to obey (1), and a reasonably well-trained model should also discover (2). However, tanh and ReLU fail to limit the temperature to the range of 35.5 to 37.5 degrees; both baselines extrapolate to above 39 degrees at 20 days beyond the training set. In contrast, the model with Snake as the activation function learns to obey the first rule. See Figure 8a. To test whether the model has also grasped the periodic behavior specified in (2), we plot the average hourly temperature predicted by the model over a 30-day period. See Figure 8b. We see that the model does capture the periodic oscillation as desired, with a peak around 4pm and a minimum around 4am. The successful identification of 4am is particularly important, because it falls in the range where no data point is present, yet the model correctly inferred the periodic behavior of the problem, showing that Snake captures the correct inductive bias for this problem.

Problem Setting. The global economy is another area where quasi-periodic behaviors might happen [19]. At the microscopic level, the economy oscillates in a complex, unpredictable manner; at the macroscopic level, the global economy follows an 8−10 year cycle that transitions between periods of growth and recession [4, 33].

Figure 9: Prediction of the Wilshire 5000 index, an indicator of the US and global economy.

In this section, we compare different models in predicting the total US market capitalization, as measured by the Wilshire 5000 Total Market Full Cap Index. (We also ran the same experiment on the well-known Buffett indicator, which is seen as a strong indicator for predicting national economic trends [21], and saw similar results.) For training, we take the daily data from 1995-1-1 to 2020-1-31, around 6300 points in total; the ending time is deliberately chosen such that it is before COVID-19 started to affect the global economy [3, 10]. We use the data from 2020-2-1 to 2020-5-31 as the test set. Noticeably, the test set differs from the training set in two ways: (1) a market crash, called Black Thursday, happens (see Figure 9); (2) the general trend is recessive (market cap moving downward on average). It is interesting to see whether the bearish trend in this period is predictable without the effect of COVID-19. For the neural-network-based methods, we use a 4-layer feedforward network with 1 → 64 → 64 → 1 hidden neurons with the specified activation function; we note that no activation function except Snake could optimize to vanishing training loss. The error is calculated over 5 runs.
Table 2: MSE on the test set for each method (ARIMA(2, 1, 1), ...).

Results and Discussion. See Table 2: the proposed method outperforms the competitors by a large margin in predicting the market value from 2020-2-1 onward. Qualitatively, we focus on a comparison with ARIMA, a traditional and standard method in economics and stock-price prediction [26, 2, 37]. See Figure 9. We note that while ARIMA predicts a growing economy, Snake predicts a recessive economy from 2020-2-1 onward. In fact, among all the methods in Table 2, the proposed method is the only one that predicts a recession in and beyond the testing period; we hypothesize that this is because the proposed method is the only one that learns to capture the long-term economic cycles in the trend. Also, it is interesting that the model predicts a recession without predicting the violent market crash. This might suggest that the market crash is due to the influence of COVID-19, while a simultaneous background recession also occurs, potentially due to the global business cycle. Purely for analysis purposes, we also extend the prediction until 2023 in Figure 9. Alarmingly, our method predicts a long-term global recession, starting from this May and lasting on average 1.5 years, only ending around early 2022. This also suggests that COVID-19 might not be the only or major cause of the current recessive economy.

In this work, we have identified the extrapolation properties as a key ingredient for understanding the optimization and generalization of neural networks. Our study of the extrapolation properties of neural networks with standard activation functions suggests a lack of capability to learn periodic functions: due to the mismatched inductive bias, optimization is hard and generalization beyond the range of observed data points fails. We think this example suggests that the extrapolation properties of learned neural networks deserve much more attention than they currently receive. We then propose a new activation function to solve this periodicity problem; its effectiveness is demonstrated through the universal extrapolation theorem and tested on standard and real-life application experiments. We also hope that our study will attract more attention to the problem of modeling periodic functions using deep learning.

It is interesting to study the behavior of the proposed method on different kinds of periodic functions (continuous, discontinuous, compound periodicity, etc.). See Figure 12. We see that using different a seems to bias the model towards different frequencies: larger a encourages learning with larger frequency and vice versa. For more complicated periodic functions, see Figures 10 and 11.

In this section, we try to fit a periodic dynamical system whose evolution is given by x(t) = cos(t/2) + 2 sin(t/3), and we use a simple recurrent neural network as the model, with the standard tanh activation replaced by the designated activation function. We use Adam as the optimizer. See Figure 13. The region within the dashed vertical lines is the range of the training set.
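A minimal sketch of this recurrent setup is given below: an Elman-style cell whose tanh nonlinearity is replaced by Snake, trained with Adam for one-step-ahead prediction. The hidden size, sequence construction, frequency a, learning rate, and number of steps are assumptions; the text specifies only the target trajectory (as reconstructed above), the use of a simple recurrent network, and the Adam optimizer.

```python
# Sketch of the Section-A.2 setup: a recurrent cell with Snake as the
# nonlinearity, trained to predict the next value of x(t) = cos(t/2) + 2*sin(t/3).
import torch
import torch.nn as nn

class SnakeRNN(nn.Module):
    def __init__(self, d_in, d_hidden, a=1.0):
        super().__init__()
        self.lin = nn.Linear(d_in + d_hidden, d_hidden)
        self.out = nn.Linear(d_hidden, 1)
        self.a = a                                   # frequency: an assumption

    def snake(self, x):
        return x + torch.sin(self.a * x) ** 2 / self.a

    def forward(self, seq):                          # seq: (T, 1)
        h = torch.zeros(self.lin.out_features)
        preds = []
        for x_t in seq:
            h = self.snake(self.lin(torch.cat([x_t, h])))
            preds.append(self.out(h))
        return torch.stack(preds)                    # one-step-ahead predictions

t = torch.arange(0.0, 60.0, 0.25)
x = torch.cos(t / 2) + 2 * torch.sin(t / 3)
seq, target = x[:-1].unsqueeze(1), x[1:].unsqueeze(1)

model = SnakeRNN(1, 32)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    nn.functional.mse_loss(model(seq), target).backward()
    opt.step()
```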
Figure 14: Comparison between Snake, tanh, and ReLU as activation functions to regress and predict the EUR-USD exchange rate.

We investigate how Snake performs on financial data with the goal of predicting the exchange rate between EUR and USD. We use a two-layer feedforward network with 256 neurons in the first and 64 neurons in the second layer. We train with SGD, a learning rate of 10^{-4}, weight decay of 10^{-4}, momentum of 0.99, and a mini-batch size of 16. For Snake, we make the frequency parameter a learnable. The result can be seen in Fig. 14. Only Snake can model the rate on the training range, and it makes the most realistic prediction for the exchange rate beyond the year 2015. The better optimization and generalization of Snake suggest that it offers the correct inductive bias to model this task.

Figure 15: Full training set for Section 6.3.

Figure 16: Learning trajectory of Snake. One notices that Snake first learns linear features, then low-frequency features, and then high-frequency features.

We take this chance to study the training trajectory of Snake, using the market index prediction task as an example. We set a = 20 in this task. See Figure 15 for the full training set for this section (and also for Section 6.3). See Figure 16 for how the learning proceeds. Interestingly, the model first learns an approximately linear function (at epoch 10), then learns low-frequency features, and then learns the high-frequency features. In many problems, such as image and signal processing [1], the high-frequency features are often associated with noise and are not indicative of the task at hand. This experiment explains in part the good generalization ability that Snake seems to offer. It also suggests that one can devise early-stopping techniques for Snake in order to prevent the learning of high-frequency features when they are considered undesirable to learn.

Figure 19: ResNet-101 on CIFAR-10. We see that the proposed method achieves comparable performance to the ReLU-style activation functions, significantly better than tanh and sin.

In this section, we show that the proposed activation function can achieve performance similar to ReLU, the standard activation function, both in terms of generalization performance and optimization speed. See Figure 17. Both activation functions achieve 93.5 ± 1.0% accuracy. To show that Snake can scale up to larger and deeper neural networks, we also repeat the experiment on CIFAR-10 with ResNet-101. See Figure 19. Again, we see that Snake achieves performance similar to ReLU and Leaky-ReLU.

In this section, we show that the variance correction is beneficial. Since its positive effect is most pronounced when the network is deep, we compare having such a correction and having no correction with ResNet-101 on CIFAR-10. See Figure 20; we note that using the correction leads to better training speed and better converged accuracy.

Figure 20: Effect of variance correction. (a) training loss vs. epoch; (b) testing accuracy vs. epoch.

We also restate the proposition here.

Proposition (restated). For x ~ N(0, 1), the variance of Snake_a(x) is σ_a^2 = 1 + (1 + e^{-8a^2} − 2e^{-4a^2})/(8a^2), which is maximized at a_max ≈ 0.56045.

Proof. The proof is straightforward algebra. The second moment of Snake is 1 + (3 + e^{-8a^2} − 4e^{-2a^2})/(8a^2), while the squared first moment is e^{-4a^2}(−1 + e^{2a^2})^2/(4a^2); subtracting the two, we obtain the desired variance. Solving for the maximum numerically (we used Mathematica) renders a_max ≈ 0.56045. ◻

B Proofs for Section 2.2

We reproduce the statements of the theorems for ease of reference.

Theorem 4. Consider a feedforward network f_ReLU(x), with arbitrary but fixed depth h and widths d_1, ..., d_{h+1}. Then lim_{z→∞} ||f_ReLU(zu) − (z W_u u + b_u)|| = 0, where z is a real scalar and u is any unit vector of dimension d_1.

We prove this through induction on h. We first prove the base case h = 2, i.e., a simple non-linear neural network with one hidden layer.

Lemma 1. Let h = 2. Then lim_{z→∞} ||f_ReLU(zu) − (z W_u u + b_u)|| = 0 for all unit vectors u.

Proof. In this case, f_ReLU(zu) = W_2 σ(W_1 zu + b_1) + b_2, where σ(x) = ReLU(x). Let 1_{x>0} denote the vector that is 1 where x > 0 and zero otherwise, and let M_{x>0} := diag(1_{x>0}). For any fixed u, the activation pattern M_{W_1 zu + b_1 > 0} becomes constant for all sufficiently large z; it is determined by the signs of the entries of W_1 u, up to coordinates where (W_1 u)_i = 0, which contribute only a constant to the output. Hence lim_{z→∞} ||f_ReLU(zu) − (z W_2 M_{W_1 u > 0} W_1 u + b_u)|| = 0, and W_u := W_2 M_{W_1 u > 0} W_1 is the desired linear transformation and b_u the desired bias; we are done. ◻

Apparently, due to the self-similar structure of a deep feedforward network, the above argument can be iterated over every layer, and this motivates a proof based on induction.

Proof of Theorem 4. Now we induce on h. Let the theorem hold for all h ≤ n; we want to show that it also holds for h = n + 1. Let h = n + 1; we note that any f_{ReLU, h=n+1} can be written as f_{ReLU, h=n+1}(x) = W_{n+1} σ(f_{ReLU, h=n}(x)) + b_{n+1}. Then, by the induction assumption, f_{ReLU, h=n}(zu) approaches z W_u u + b_u for some linear transformation W_u and bias b_u, so that f_{ReLU, h=n+1}(zu) asymptotically behaves like W_{n+1} σ(z W_u u + b_u) + b_{n+1}, and, by the lemma, this again converges to a linear transformation in z; we are done. ◻

Now we can prove the following theorem; this proof is much simpler and does not require induction.

Theorem 5. Consider a feedforward network f_tanh(x), with arbitrary but fixed depth h and widths d_1, ..., d_{h+1}. Then lim_{z→∞} ||f_tanh(zu) − v_u|| = 0, where z is a real scalar, u is any unit vector of dimension d_1, and v_u ∈ R^{d_{h+1}} is a constant vector only depending on u.

Proof. It suffices to consider a two-layer network. Likewise, f_tanh(zu) = W_2 σ(W_1 zu + b_1) + b_2, where σ(x) = tanh(x). As z → ∞, each element of W_1 zu + b_1 approaches either positive or negative infinity (or stays constant where the corresponding entry of W_1 u is zero), and so σ(W_1 zu + b_1) approaches a constant vector whose elements are 1, −1, or a constant, and W_2 σ(W_1 zu + b_1) + b_2 also approaches some constant vector v_u. Now any layer that is composed after the first hidden layer takes an asymptotically constant vector v_u as input, and since the activation function tanh is continuous, f_tanh is continuous, and so lim_{z→∞} f_{tanh, h=n}(zu) = f_{tanh, h=n−1}(v_u) = v'_u. We are done. ◻

C Universal Extrapolation Theorems

Theorem 6. Let f(x) be a piecewise C^1 periodic function with period L. Then a Snake neural network f_{w_N} with one hidden layer and width N can converge to f(x) as N → ∞, i.e., there exist parameters w_N for all N ∈ Z^+ such that lim_{N→∞} f_{w_N}(x) = f(x) for all x ∈ R, i.e., the convergence is pointwise. If f(x) is continuous, then the convergence is uniform.

Proof.
Now it suffices to show that a neural network with sin as the activation function can represent a Fourier series to arbitrary order; applying the Fourier convergence theorem then completes the proof. By the celebrated Fourier convergence theorem, we know that

f(x) = α_0/2 + Σ_{m=1}^{∞} [ α_m cos(2πmx/L) + β_m sin(2πmx/L) ]    (15)

for unique Fourier coefficients α_m, β_m, and we recall that our one-hidden-layer network is defined as

f_w(x) = Σ_{i=1}^{N} w_{2,i} sin(w_{1,i} x + b_i) + b_0;

then we can represent Eq. 15 order by order. For the m-th order term in the Fourier series, we let, for example, w_{1,2m−1} = w_{1,2m} = 2πm/L, w_{2,2m−1} = α_m, w_{2,2m} = β_m, b_{2m−1} = π/2, b_{2m} = 0 (so that the two neurons together give α_m cos(2πmx/L) + β_m sin(2πmx/L)), and let the unspecified biases b_i be 0: we have achieved an exact parametrization of the Fourier series of order m with a sin neural network with 2m hidden neurons, and we are done. ◻

The above proof is done for a sin(x) activation; we are still obliged to show that Snake can represent a sin (equivalently, cos) neuron.

Lemma 2. A finite number of Snake neurons can represent a single cos activation neuron.

Proof. Since the frequency factor a in Snake can be removed by a rescaling of the weight matrices, we set a = 1/2, for which Snake_{1/2}(x) = x + 2 sin^2(x/2) = x − cos(x) + 1. For notational simplicity, we reverse the sign in front of cos and remove the constant bias (both can be absorbed into the weights and biases of the neighboring layers), and prove this lemma for x + cos(x). We want to show that, for a finite D, there exist w_1 and w_2 such that

Σ_{i=1}^{D} w_{2,i} ( w_{1,i} x + cos(w_{1,i} x) ) = cos(x).

This is achievable for D = 2: let w_{1,1} = −w_{1,2} = 1 and let b_{1,i} = b_{2,i} = 0; we have

Σ_{i=1}^{2} w_{2,i} ( w_{1,i} x + cos(w_{1,i} x) ) = (w_{2,1} − w_{2,2}) x + Σ_{i=1}^{2} w_{2,i} cos(x),

and setting w_{2,1} = w_{2,2} = 1/2 achieves the desired result. ◻

Combining with the result above, this shows that a Snake neural network with 4m hidden neurons can represent exactly a Fourier series to m-th order.
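The constructions above can be checked numerically; the short NumPy script below verifies (i) the lemma's two-neuron representation of cos(x) with the stated weights and (ii) that pairs of sin neurons (using a π/2 phase shift for the cosine terms, as in the reconstruction of the proof above) reproduce a truncated Fourier series exactly. The coefficients are random and serve only as a check.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)

# (i) Lemma: with w_{1,1} = -w_{1,2} = 1 and w_{2,1} = w_{2,2} = 1/2,
#     sum_i w_{2,i} * (w_{1,i} x + cos(w_{1,i} x)) = cos(x).
g = lambda z: z + np.cos(z)
lemma_lhs = 0.5 * g(x) + 0.5 * g(-x)
print(np.max(np.abs(lemma_lhs - np.cos(x))))   # ~1e-16

# (ii) A one-hidden-layer sin network with 2m neurons matching an order-m
#      Fourier series: neuron pairs at frequency 2*pi*k/L, with a pi/2 phase
#      shift providing the cosine terms.
L, m = 2 * np.pi, 5
alpha = np.random.randn(m)
beta = np.random.randn(m)
series = sum(alpha[k - 1] * np.cos(2 * np.pi * k * x / L)
             + beta[k - 1] * np.sin(2 * np.pi * k * x / L) for k in range(1, m + 1))
network = sum(alpha[k - 1] * np.sin(2 * np.pi * k * x / L + np.pi / 2)
              + beta[k - 1] * np.sin(2 * np.pi * k * x / L) for k in range(1, m + 1))
print(np.max(np.abs(network - series)))        # ~1e-15
```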
Corollary (restated). Let f(x) be a two-layer neural network parameterized by two weight matrices W_1 and W_2, and let w be the width of the network. Then, for any bounded and continuous function g(x) on [a, b], there exists m such that for any w ≥ m we can find W_1 and W_2 such that f approximates g on [a, b] to any prescribed accuracy.

Proof. This follows immediately by setting the period L > b − a, extending g(x) to a continuous periodic function on R that agrees with g on [a, b], and applying the above theorem. ◻
References

[1] Digital signal processing.
[2] Stock price prediction using the ARIMA model.
[3] What will be the economic impact of COVID-19 in the US? Rough estimates of disease scenarios.
[4] Detrending and business cycle facts.
[5] The effects of temperature and relative humidity on the viability of the SARS coronavirus.
[6] Fast and accurate deep network learning by exponential linear units (ELUs).
[7] Approximation with artificial neural networks. Faculty of Sciences.
[8] Approximation by superpositions of a sigmoidal function.
[9] Common socio-economic cycle periods.
[10] Economic effects of coronavirus outbreak (COVID-19) on the world economy.
[11] Approximation of dynamical systems by continuous time recurrent neural networks.
[12] Normal body temperature: a systematic review.
[13] Clinical characteristics of coronavirus disease 2019 in China.
[14] Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
[15] Deep residual learning for image recognition.
[16] Deep learning - a technology with the potential to transform health care.
[17] Multilayer feedforward networks are universal approximators.
[18] Biological clocks.
[19] Global recessions.
[20] Kitchin, Juglar and Kuznetz business cycles revisited. Wroclaw: Institute of Economic Sciences.
[21] Market cap to GDP: An updated look at the Buffett valuation indicator.
[22] Rectified linear units improve restricted Boltzmann machines.
[23] Philosophiae naturalis principia mathematica.
[24] Temperature dependence of COVID-19 transmission.
[25] The prediction of business cycle phases: Financial variables and international linkages.
[26] A hybrid ARIMA and support vector machines model in stock price forecasting.
[27] Taming the waves: sine as activation function in deep neural networks.
[28] Approximation capabilities of neural networks on unbounded domains.
[29] Machine learning in medicine.
[30] Swish: a self-gated activation function.
[31] The circadian rhythm of body temperature.
[32] Temperature and latitude analysis to predict potential spread and seasonality for COVID-19.
[33] Sources of business cycle fluctuations.
[34] Fourier neural networks.
[35] A review of coronavirus disease-2019 (COVID-19).
[36] On the importance of initialization and momentum in deep learning.
[37] Stock market trend prediction using ARIMA-based neural networks.
[38] Empirical evaluation of rectified activations in convolutional network.
[39] Stock price prediction via discovering multi-frequency trading patterns.
[40] Association between ambient temperature and COVID-19 infection in 122 cities from China.
[41] Fourier neural networks: A comparative study.