ALReLU: A different approach on Leaky ReLU activation function to improve Neural Networks Performance

Stamatis Mastromichalakis

2020-12-11

Abstract

Despite the unresolved 'dying ReLU' problem, the classical ReLU activation function (AF) has been extensively applied in Deep Neural Networks (DNN), in particular Convolutional Neural Networks (CNN), for image classification. The common gradient issues of ReLU pose challenges in academic and industrial applications. Recent approaches for improvement move in a similar direction by merely proposing variations of the AF, such as Leaky ReLU (LReLU), while keeping the solution within the same unresolved gradient problems. In this paper, the Absolute Leaky ReLU (ALReLU) AF, a variation of LReLU, is proposed as an alternative method to resolve the common 'dying ReLU' problem in NN-based algorithms for supervised learning. The experimental results demonstrate that using the absolute value of LReLU's small negative gradient yields a significant improvement over LReLU and ReLU on image classification of diseases such as COVID-19, as well as on text and tabular data classification tasks, across five different datasets.

1. Introduction

Despite recent developments of AFs for shallow and deep learning Neural Networks (NN), such as the QReLU/m-QReLU (Parisi et al., 2020a) and m-arcsinh (Parisi et al., 2020b), repeatable and reproducible functions have remained very limited and confined to three activation functions regarded as the 'gold standard' (Parisi et al., 2020b). The sigmoid and tanh are well known for their common vanishing gradient issues, and only the ReLU function appears accurate and scalable enough for deep neural networks, despite its 'dying ReLU' problem, which only recently appears to have been solved (Parisi et al., 2020a). Earlier, many other methods were proposed to fix these known problems, several of them variations of the ReLU AF, such as the Leaky ReLU (LReLU), the Parametric ReLU (PReLU), the Randomised ReLU (RReLU) and the Concatenated ReLU (CReLU). In particular, LReLU was introduced (Maas et al., 2013) to provide a small negative gradient for negative inputs into a ReLU function, instead of zero. A constant α, with a default value of 0.01, is used to compute the output for negative inputs. With this modification, LReLU can lead to small improvements in classification performance compared to the ReLU AF. However, most of the above-mentioned AFs tend to lack robustness on classification tasks of varying degrees of complexity, e.g., slow convergence or lack of convergence (Vert and Vert, 2006; Jacot et al., 2018) caused by trapping at local minima (Parisi et al., 2020b). Amongst the mentioned AFs, only the ReLU is widely applicable in NNs, with its novel quantum variations (QReLU and m-QReLU) found to be more scalable than the traditional version only recently (Parisi et al., 2020a).

In this work, a new variant of LReLU is proposed to address the common vanishing gradient and 'dying ReLU' problems by using the absolute value of the small negative gradient used in LReLU. This method leads to a significant improvement in training and classification, as concluded from the results of the numerical evaluation performed.
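As a concrete illustration of the modification described above, the following minimal NumPy sketch (not part of the original paper; the function names are chosen for illustration, and α = 0.01 is the default mentioned in the text) contrasts LReLU with the proposed ALReLU, which simply takes the absolute value of LReLU's negative branch.

import numpy as np

def lrelu(x, alpha=0.01):
    # Leaky ReLU: small negative slope alpha for negative inputs
    return np.where(x < 0, alpha * x, x)

def alrelu(x, alpha=0.01):
    # ALReLU: absolute value of LReLU's negative branch,
    # so negative inputs map to small *positive* outputs
    return np.where(x < 0, np.abs(alpha * x), x)

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(lrelu(x))   # [-0.05 -0.01  0.    2.  ]
print(alrelu(x))  # [ 0.05  0.01  0.    2.  ]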
Quantitative evaluation metrics such as accuracy, AUC, recall, precision and F1-score have been computed to reveal the performance of the proposed technique and to provide the necessary criteria for a reliable, objective evaluation of the method.

The outline of this paper is as follows. Section 2 contains one of the main contributions of this work, the proposed AF and its implementation in Keras. Section 3 presents experimental results of the proposed AF, including an evaluation of training accuracy and a comparison with other well-established AFs in the field. Finally, the discussion and the main conclusions of the work are given in Section 4.

2. The Proposed ALReLU Activation Function and its Keras Implementation

Five different datasets for image, text and tabular data classification were used in the experiments described and discussed in this study.

The Rectified Linear Unit, or ReLU, is one of the most common AFs used in NNs today. It is commonly applied between layers to add non-linearity in order to handle more complex and non-linear datasets. It is defined, together with its derivative, as

f(x) = max(0, x),                          (1)
f'(x) = 0 for x < 0, and 1 for x > 0.      (2)

Despite its success, especially in DNNs, ReLU has a few issues. Firstly, ReLU is not continuously differentiable: at x = 0 the gradient cannot be computed. Although this is not a serious problem, it slightly impacts training performance. Moreover, ReLU sets all values below 0 to zero. This can be beneficial for sparse data, but since the gradient at 0 is 0, neurons arriving at large negative values cannot recover from being stuck at 0. The neuron effectively dies, hence the problem is known as the 'dying ReLU' problem. This can essentially stop the network from learning and cause it to underperform. Even with appropriate initialization of the weights to small random values, large weight updates can make the summed input to the traditional ReLU AF always negative, regardless of the input values fed to the NN.

Current improvements to the ReLU, such as the LReLU, allow for a more non-linear output to either account for small negative values or facilitate the transition from positive to small negative values, without eliminating the problem though. The LReLU tries to solve these problems by providing a small negative gradient for negative inputs into a ReLU function. Fig. 2 and Eqs. (3) and (4) show the LReLU and its derivative:

f(x) = αx for x < 0, and x for x ≥ 0,      (3)
f'(x) = α for x < 0, and 1 for x ≥ 0,      (4)

with α = 0.01 by default. The proposed ALReLU is obtained by taking the absolute value of LReLU's negative branch, i.e. f(x) = |αx| for x < 0 and f(x) = x for x ≥ 0, so that negative inputs produce small positive outputs and the derivative for x < 0 becomes -α instead of α.

On the other hand, Fig. 4 shows the QReLU (Parisi et al., 2020a). This AF and its derivative can be considered somewhat similar to the proposed one; the main difference is that ALReLU has a smaller value and derivative. From a theoretical perspective, the ALReLU also shares properties of the QReLU, such as the advantage of superposition. However, this claim is only theoretical and is not proven in this paper. In fact, this work demonstrates the advantages of the small modification of LReLU that leads to the proposed ALReLU AF. The experiments and results in Section 3 indicate the significant impact of this modification.

The following code snippet shows a Keras implementation of the proposed AF, as well as its usage after the convolution layers. The derivatives and gradients of AFs are automatically calculated in TensorFlow 2, so they do not have to be implemented manually.
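The original snippet is not reproduced in this extraction; the following is a minimal sketch of how ALReLU could be defined and used in tf.keras, assuming the definition f(x) = |αx| for x < 0 with α = 0.01. The function and layer names, as well as the toy model architecture, are illustrative assumptions rather than the author's exact code.

import tensorflow as tf
from tensorflow.keras import layers, models

def alrelu(x, alpha=0.01):
    # ALReLU: absolute value of LReLU's negative branch.
    # For x >= 0: |alpha*x| <= x, so the maximum returns x.
    # For x <  0: x < |alpha*x|, so the maximum returns |alpha*x| = -alpha*x.
    return tf.maximum(tf.abs(alpha * x), x)

# Illustrative usage after a convolution layer, as described in the text
model = models.Sequential([
    layers.Conv2D(32, (3, 3), input_shape=(64, 64, 3)),
    layers.Activation(alrelu),          # custom AF applied after the Conv layer
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

# Gradients of the custom AF are computed automatically by TensorFlow 2.

Because ALReLU is expressed purely with differentiable TensorFlow operations, no manual derivative is needed; automatic differentiation handles the backward pass.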
3. Experimental Results

Since the NN models are trained on the given datasets, a 9-fold cross-validation was used on the COVID-19 dataset and a 5-fold cross-validation on the other datasets for estimating performance and classification accuracy. Cross-validation is a statistical method for evaluating and comparing learning algorithms while avoiding overfitting. The K-fold validation procedure was executed 4 times for every NN model and dataset, and the average of these measures is computed and reported in this section.

The results support the theoretical superiority of the proposed ALReLU AF for image and text classification tasks, as well as for tabular data classification. The classification performance results are shown in Table 1 and described below. In the tabular data classification task, the proposed AF achieves slightly better Accuracy and AUC, while the other metrics are the same (Table 1, third row, "Microsoft Malware Prediction (Kaggle)").

4. Discussion and Conclusions

In this paper, a different approach to the LReLU function was demonstrated, employing the absolute value of its negative gradient. It was shown to be a more accurate and robust AF for Neural Networks used in image, text and tabular data classification. The proposed AF centres on improving classification accuracy in domains where high accuracy and reliability are very important, such as the medical sector. Numerous tests were performed to validate the theoretical framework developed in this paper. The proposed method was used in several training and classification experiments on different datasets, and it shows superiority over the well-established ReLU and LReLU in terms of ROC/AUC, recall, precision, F1-score and accuracy. The important conclusion of these results is the proposal of an alternative method for mitigating the common dying and vanishing gradient problems, a serious issue in NN training.

References

Jacot, A., Gabriel, F., Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems.
Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? Proceedings of the IEEE International Conference on Computer Vision.
Maas, A. L., Hannun, A. Y., Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. Proceedings of the 30th International Conference on Machine Learning.
Parisi, L., et al. (2020a). QReLU and m-QReLU: Two novel quantum activation functions to aid medical diagnostics. arXiv preprint.
Parisi, L., et al. (2020b). hyper-sinh: An Accurate and Reliable Function from Shallow to Deep Learning in TensorFlow and Keras. arXiv preprint.
Vert, R., Vert, J.-P. (2006). Consistency and convergence rates of one-class SVMs and related algorithms. Journal of Machine Learning Research.
Deep Learning.