key: cord-1012442-klw72zis
authors: Khalifa, Nour Eldeen M.; Taha, Mohamed Hamed N.; Manogaran, Gunasekaran; Loey, Mohamed
title: A deep learning model and machine learning methods for the classification of potential coronavirus treatments on a single human cell
date: 2020-10-17
journal: J Nanopart Res
DOI: 10.1007/s11051-020-05041-z
sha: c0b9bd43b1cf7d0b910d8e2e587047c040338fba
doc_id: 1012442
cord_uid: klw72zis

The coronavirus pandemic is straining healthcare systems around the world to the limit of their capacity. There is an overwhelming need to find a treatment for this virus as soon as possible. Computer algorithms and deep learning can contribute to finding a potential treatment for SARS-CoV-2. In this paper, a deep learning model and machine learning methods for the classification of potential coronavirus treatments on a single human cell are presented. The dataset selected in this work is a subset of the publicly available online datasets on RxRx.ai. The objective of this research is to automatically classify a single human cell according to the treatment type and the treatment concentration level. A deep convolutional neural network (DCNN) model and a methodology are proposed in this work. The idea of the methodology is to convert the numerical features from the original dataset to the image domain and then feed them into a DCNN model. The proposed DCNN model consists of three convolutional layers, three ReLU layers, three pooling layers, and two fully connected layers. The experimental results show that the proposed DCNN model for treatment classification (32 classes) achieved 98.05% testing accuracy, higher than classical machine learning methods such as support vector machine, decision tree, and ensemble. In treatment concentration level prediction, the classical machine learning (ensemble) algorithm achieved 98.5% testing accuracy while the proposed DCNN model achieved 98.2%.
The performance metrics support the results obtained from the conducted experiments for both treatment classification and treatment concentration level prediction.

The SARS virus spread around the world and caused widespread panic at the end of February 2003 (Chang et al. 2020; Chamola et al. 2020). This raised an alarm about viruses and their devastating impact in the new century. The latest coronavirus, which emerged in 2019, was named 2019-nCoV (COVID-19) by the World Health Organization (WHO) (Singhal 2020; Loey et al. 2020a). It was identified as SARS-CoV-2 by the International Committee on Taxonomy of Viruses (ICTV) in 2020 (Lai et al. 2020; Li et al. 2020; Sharfstein et al. 2020). Before the publication date of this article, the SARS-CoV-2 outbreak had affected 213 countries and territories and caused more than 500,000 fatalities (Worldometer 2020). Person-to-person transmission of the coronavirus spread quickly, for example, in Italy (Giovanetti et al. 2020), the US (Holshue et al. 2020), India (Khattar et al. 2020), and Germany (Rothe et al. 2020). As of 10 July 2020, there were more than 12 million confirmed cases of SARS-CoV-2, 6 million recovered cases, and 550,000 deaths. Figure 1 shows some statistics about recovered and death cases of COVID-19 (COVID-19 map 2020).

Most publications focus on the classification and detection of COVID-19 in X-ray and CT images (Civit-Masot et al. 2020; Waheed et al. 2020; Narayan Das et al. 2020; Ardakani et al. 2020). In this research, our focus is on recognizing and detecting a drug that helps in healing from COVID-19 and on studying a morphological effect of COVID-19. Today, deep learning (DL) is quickly becoming a crucial technology in image/video classification and detection (Loey et al. 2020b, c; Khalifa et al. 2019a). In this paper, a deep learning model and machine learning methods for the classification of potential coronavirus treatments on a single human cell are presented.
The objective of this research is to automatically classify a single human cell according to the treatment type and the treatment concentration level. The novelty of this research is a proposed classification model, based on deep learning and machine learning, for COVID-19 virus treatments. The remainder of the paper is structured as follows. "Datasets characteristics" summarizes the dataset characteristics. "The proposed model" provides a detailed description of the proposed model. In "Experimental results", the findings are reported and evaluated, and the conclusions and potential future research are presented in "Conclusion and future works".

The experiments in this research are based on the dataset presented in (Heiser et al. 2020). The dataset attributes are described in detail in Table 1. The data are publicly available at RxRx.ai under the name "RxRx19a Dataset". It is a high-dimensional dataset that analyzes more than 1660 FDA-approved drugs in a human cellular model of SARS-CoV-2 infection and includes more than 300,000 recorded experiments. Although the presented data is an in vitro screen that represents data from ...

In this research, a subset of the data is used in the experiments. The subset includes VERO cells, which are a continuous cell lineage derived from kidney epithelial cells of an African green monkey, and human renal cortical epithelial (HRCE) cells. Both cell types were selected along with the 10, 30, and 100 treatment concentration levels with active SARS-CoV-2. This subset includes 32 treatments and three treatment concentration levels with two classes of cell type. Only 3750 cell records are included in the experiments carried out in this research.

The introduced model consists of three phases. The first phase is the preprocessing phase, which converts the numerical values of the 1024 cell features to a digital image.
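As an illustration of this first phase, the conversion can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: it uses the feature range and the 32 × 32 layout that the paper reports for this phase, while the min-max form of the scaling, the rounding, the clamping, the row-major pixel order, and the names `scale_to_pixel` and `features_to_image` are hypothetical choices made only for this example.

```python
# Assumed min-max scaling of 1024 cell features to [0, 255], followed by a
# row-major reshape into a 32 x 32 grid of pixel values.

FEATURE_MIN = -0.00046466477  # minimum cell value reported in the paper
FEATURE_MAX = 4.508815065     # maximum cell value reported in the paper
SIDE = 32                     # 32 * 32 = 1024 features


def scale_to_pixel(value, lo=FEATURE_MIN, hi=FEATURE_MAX):
    """Map one feature value from [lo, hi] to an integer pixel in [0, 255]."""
    scaled = (value - lo) / (hi - lo) * 255.0
    # Rounding and clamping are assumptions; the source does not specify them.
    return max(0, min(255, int(round(scaled))))


def features_to_image(features, side=SIDE):
    """Convert a flat list of side*side features into a side x side pixel grid."""
    if len(features) != side * side:
        raise ValueError("expected %d features, got %d" % (side * side, len(features)))
    pixels = [scale_to_pixel(v) for v in features]
    # Row-major layout (an assumption): the first `side` features form row 0.
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


if __name__ == "__main__":
    # Synthetic feature vector spanning the reported range, for demonstration only.
    demo = [FEATURE_MIN + (FEATURE_MAX - FEATURE_MIN) * i / 1023 for i in range(1024)]
    image = features_to_image(demo)
    print(len(image), "rows x", len(image[0]), "columns")
```

Applied to each of the 3750 cell records, a routine of this kind would yield one 32 × 32 image per record.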
The second phase is the training phase, which is based on machine learning algorithms for the numerical features and deep convolutional neural networks for the converted image features. The third phase is the testing phase, which evaluates the accuracy of the proposed model for treatment classification and treatment concentration level prediction. Figure 2 presents the structure of the proposed model.

The pre-processing phase includes (1) loading the 1024 cell features into computer memory; (2) changing the original numerical domain of the cell features, which ranges over [−0.00046466477, 4.508815065], to the image range [0, 255] according to equation (1); and (3) constructing an image by converting the data vector of 1024 cell features into a 32 × 32 pixel image according to the pseudocode presented in Algorithm 1. The result of this phase is 3750 images. Figure 3 illustrates a set of images after the pre-processing phase. In equation (1), −0.00046466477 is the minimum cell value and 4.508815065 is the maximum cell value in the 1024 features of the cell data, and 255 is the maximum value of the image domain.

The training phase is conducted using two methodologies. The first methodology uses machine learning algorithms such as support vector machine (SVM), decision trees, and ensemble algorithms. The second methodology relies on deep convolutional neural networks.

SVM is one of the most common and effective machine learning techniques for classification and regression. The SVM function is shown in equation (2), where l is the label from 0 to 1, w · a − q is the output, w and q are the coefficients of the linear classifier, and a is the input vector. Equation (3) defines the loss function that is to be minimized (Çayir et al. 2018; Jogin et al. 2018).

Decision tree
The decision tree is a classification paradigm based on the entropy method and knowledge acquisition.
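The entropy method and knowledge acquisition named above can be illustrated with a short Python sketch. It is an illustration under assumptions rather than the paper's formulation: `information_gain` computes the entropy of a parent set minus the size-weighted entropy of its subsets, which the decision tree literature usually calls information gain, and reading the paper's knowledge acquisition as this quantity is our interpretation. The base-2 logarithm and the function names are likewise assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits, an assumed unit) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, subsets):
    """Entropy of the parent set minus the size-weighted entropy of its subsets."""
    total = len(labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(labels) - weighted


if __name__ == "__main__":
    parent = ["a", "a", "b", "b"]
    # A split that separates the two classes removes all uncertainty.
    print(information_gain(parent, [["a", "a"], ["b", "b"]]))
    # A split that leaves both classes mixed removes none.
    print(information_gain(parent, [["a", "b"], ["a", "b"]]))
```

A tree learner would evaluate candidate splits in this way and keep the one with the largest gain.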
Entropy measures the amount of uncertainty in the data, as shown in equation (4), where CD is the data, b is the class output, and p(x) is the proportion of q label. Knowledge acquisition (KA), the reduction in entropy obtained from a split, is calculated as illustrated in equation (5), where x is the subset of data (Navada et al. 2011; Tu and Chung 1992).

Ensemble methods
Ensemble methods are machine learning algorithms that build several classifiers, which are then used to classify new cases by combining their individual decisions (typically through weighted or unweighted votes) (Polikar 2012). The methods used are linear regression (Naseem et al. 2010), logistic regression (Kleinbaum and Klein 2002), and the K-nearest neighbors algorithm (k-NN) (Mangalova and Agafonov 2014). We improve our ensemble using equation (6) to achieve the best outcomes (Xiao et al. 2018).

Deep convolutional neural networks
The structure of the proposed deep convolutional neural network is presented in Fig. 4. The proposed DCNN consists of three main convolutional layers with a window size of 3 × 3 pixels, three ReLU layers, and three pooling layers. These layers are used for feature extraction, while two fully connected layers are used as classification layers. The proposed DCNN is the result of extensive architecture tuning based on the work presented in (Khalifa et al. 2018; Khalifa et al. 2019b; Khalifa et al. 2020; Loey et al. 2020d).

One problem that DCNNs face is overfitting. Overfitting can be mitigated by data augmentation (Shorten and Khoshgoftaar 2019; El-Sawy et al. 2017a, b). Data augmentation increases the number of images used for training by applying label-preserving transformations. Also, it is applied to the training set to make the ... The augmentation process raises the number of images from 3750 to 15,000, four times the size of the original dataset. This leads to a significant improvement in the neural network training phase.
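The bookkeeping behind such an augmentation step can be sketched in Python. The source does not state which label-preserving transformations were applied, so the three rotations used here are purely a placeholder assumption, chosen because one original plus three variants reproduces the reported growth from 3750 to 15,000 images. The function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2D pixel grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the original grid plus its 90, 180 and 270 degree rotations."""
    variants = [[list(row) for row in image]]
    for _ in range(3):
        variants.append(rotate90(variants[-1]))
    return variants


def augment_dataset(images, labels):
    """Apply `augment` to every image; each variant keeps the original label."""
    out_images, out_labels = [], []
    for image, label in zip(images, labels):
        for variant in augment(image):
            out_images.append(variant)
            out_labels.append(label)
    return out_images, out_labels


if __name__ == "__main__":
    tiny = [[1, 2], [3, 4]]
    images, labels = augment_dataset([tiny] * 5, list(range(5)))
    print(len(images), "images from 5 originals")
```

With any set of three transformations per image, 3750 originals become 3750 × 4 = 15,000 images.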
Additionally, it makes the proposed DCNN more robust and less prone to memorizing the data.

The testing phase is the phase in which the proposed model demonstrates its performance and efficiency. The main goals of the proposed model are to classify the treatments correctly from the numerical features using machine learning algorithms, and to classify the treatments correctly from the feature images using the DCNN. In addition, the treatment concentration of every cell is predicted from the numerical features and the image features using both machine learning and the DCNN. For machine learning, the performance evaluation includes testing accuracy along with the receiver operating characteristic (ROC) curve under 5-fold cross-validation. For the DCNN, testing accuracy, precision, recall, and F1 score (Goutte and Gaussier 2010) are calculated from the confusion matrix. The performance metrics are presented in equations (7) to (10), where TruePos is the count of true positive samples, TrueNeg is the count of true negative samples, FalsePos is the count of false positive samples, and FalseNeg is the count of false negative samples from the confusion matrix.

The experiments are implemented using MATLAB software on a computer server with 96 GB of RAM and an Intel Xeon processor (2 GHz).
• The dataset was divided into two sections (70% of the data for the training process and 30% for the testing process).
• Data augmentation is applied for treatment classification problems.
• Testing accuracy, precision, recall, and F1 score are selected as performance metrics.

There are 32 classes of treatment in the subset selected from the original dataset, and they are presented in Table 2. Treatment classification is performed by machine learning on the numerical format and by the DCNN on the digital image format. The first results are obtained using classical machine learning; three classical machine learning algorithms are selected: DT, SVM, and ensemble.
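The confusion-matrix metrics named earlier in this section (testing accuracy, precision, recall, and F1 score) can be computed as in the following Python sketch. It uses the standard textbook definitions in terms of TruePos, TrueNeg, FalsePos, and FalseNeg, which we assume match the paper's equations. The one-vs-rest counting for a multi-class matrix, the row/column convention, and the function names are assumptions made for this example.

```python
def binary_metrics(true_pos, true_neg, false_pos, false_neg):
    """Accuracy, precision, recall and F1 from the four confusion counts."""
    total = true_pos + true_neg + false_pos + false_neg
    accuracy = (true_pos + true_neg) / total if total else 0.0
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def per_class_counts(confusion, k):
    """One-vs-rest counts for class k.

    Assumed convention: confusion[i][j] is the number of samples whose true
    class is i and whose predicted class is j.
    """
    n = len(confusion)
    true_pos = confusion[k][k]
    false_neg = sum(confusion[k]) - true_pos
    false_pos = sum(confusion[i][k] for i in range(n)) - true_pos
    total = sum(sum(row) for row in confusion)
    true_neg = total - true_pos - false_neg - false_pos
    return true_pos, true_neg, false_pos, false_neg


if __name__ == "__main__":
    # Made-up 2 x 2 confusion matrix, for demonstration only.
    confusion = [[5, 1], [2, 4]]
    tp, tn, fp, fn = per_class_counts(confusion, 0)
    print(binary_metrics(tp, tn, fp, fn))
```

For a problem with many classes, the per-class values can then be averaged; whether the paper reports macro or weighted averages is not stated in the text.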
Table 3 presents the average testing accuracy for the selected machine learning algorithms using 5-fold cross-validation. The ROC curve is one of the performance metrics for the machine learning algorithms. An ROC curve is a graph showing the performance of a classification model at all classification thresholds using the true positive rate and the false positive rate. Figure 5 presents a set of ROC curves for the different machine learning algorithms for one treatment, oseltamivir-carboxylate. The area under the curve (AUC) provides an aggregate measure of performance across all possible classification thresholds. For the treatment oseltamivir-carboxylate, the AUC was 73% using DT, 84% using SVM, and 86% using ensemble. About 96 ROC curves can be produced by the experimental trials, but there is no need to repeat the figures for the different treatments, and the testing accuracy is a good indicator of the quality of the machine learning algorithm.

Using the deep learning architecture, the achieved results are better than those of the machine learning algorithms in terms of testing accuracy and performance metrics. The proposed DCNN model, together with the conversion to the image domain and augmentation, helped to achieve better results. The achieved testing accuracy was 98.05%, the recall was 95.03%, the precision was 96.52%, and the F1 score was 95.97%. The confusion matrix is presented in Fig. 6. Using a deep learning model with the conversion of features to the image domain improved the testing accuracy by 25.35 percentage points over the ensemble algorithm, which achieved 72.7% testing accuracy. The progress of the training phase of the proposed deep learning model is presented in Fig. 7, which shows how accuracy improves as training advances; the model was configured to stop training early if no better accuracy is achieved in 10 iterations.
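The early-stopping rule described above can be expressed as a short Python sketch. It is a sketch under assumptions: the text does not say whether the monitored accuracy is measured on training or validation data, nor whether an iteration is a mini-batch or an epoch, so the function simply scans a sequence of accuracy values. The name `early_stopping_point` and its return convention are hypothetical.

```python
def early_stopping_point(accuracies, patience=10):
    """Return (index, best_accuracy) at which training would stop.

    Training stops at the first iteration where `patience` consecutive
    iterations have passed without a new best accuracy; if that never
    happens, the last index is returned.
    """
    best = float("-inf")
    stale = 0  # consecutive iterations without improvement
    for step, acc in enumerate(accuracies):
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                return step, best
    return len(accuracies) - 1, best


if __name__ == "__main__":
    # Made-up accuracy trace: improves twice, then plateaus.
    trace = [0.5, 0.6] + [0.6] * 15
    print(early_stopping_point(trace, patience=10))
```

A larger patience value tolerates longer plateaus at the cost of more training time.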
The batch size was 32 with a learning rate of 0.0001. Examples of testing accuracy along with treatment classification are presented in Fig. 8.

Another goal of the proposed model is to predict the concentration of the treatment on the cell. The first direction is to use machine learning algorithms to predict the concentration level of the treatment. Three concentration levels are investigated: 10, 30, and 100% concentration level. Table 4 presents the testing accuracy of treatment concentration prediction using DT, SVM, and ensemble algorithms with 5-fold cross-validation. ROC curves and AUC are additional indicators of the quality of the classifier. Figure 9 presents the ROC curves for the different machine learning algorithms for the treatment concentration classes of 10, 30, and 100. The SVM and the ensemble algorithms achieved an AUC of 100%, which is a good indicator of the quality of the classifier. Also, according to Table 4, the two classifiers (SVM and ensemble) achieved testing accuracies of 97.3% and 98.5%, respectively, for a three-class problem.

The second direction is to use deep learning, applying the same proposed DCNN model to the feature images without augmentation. There was no need for the augmentation process, as the proposed model achieved a good testing accuracy of 98.2%. Figure 10 presents the confusion matrix for the concentration level of the potential treatment. The proposed model with the conversion of ... For the 10% concentration level, the achieved accuracy was 98.1%; for the 30% concentration level, the achieved accuracy was 100%; and for the 100% concentration level, the achieved accuracy was also 100%. The achieved accuracy for every class reflects the performance of the proposed DCNN model.
For treatment classification, which includes 32 classes, the proposed DCNN achieved a superior result compared with the machine learning algorithms in terms of testing accuracy. The proposed DCNN achieved 98.05%, while classical machine learning algorithms such as DT, SVM, and ensemble achieved 57.7%, 71.5%, and 72.7%, respectively. The performance metrics supported the results obtained for the proposed DCNN with feature image conversion. In treatment concentration level prediction, the classical machine learning algorithms DT and SVM achieved results close to that of the proposed DCNN: DT and SVM achieved 96.4% and 97.3%, respectively, while the DCNN achieved 98.2% testing accuracy. The ensemble algorithm achieved 98.5%, a higher testing accuracy than the DCNN. As a general notice, the classical machine learning algorithm for simple classification problems such as treatment concentration ...

The coronavirus pandemic is putting healthcare systems around the world into a critical situation. Until now, there is no cure for this virus. One of the methods that can help to defeat this virus is trying approved treatments on human cells as a primary step to shorten the gap between treatments and finding an actual cure. Computer algorithms and deep learning can close that gap and help in finding a cure. In this paper, a deep learning model and machine learning methods for the classification of potential coronavirus treatments on a single human cell were presented. The dataset selected in this work is a subset of the publicly available online dataset on RxRx.ai. The objective of this research is to automatically classify the human cell according to treatment and treatment concentration level. The proposed DCNN model and methodology are based on converting the numerical features from the original dataset to the image domain. The proposed model consists of three convolutional layers, three ReLU layers, three pooling layers, and two fully connected layers.
The experimental results showed that the proposed DCNN model for treatment classification (32 classes) achieved 98.05% testing accuracy, higher than classical machine learning methods such as support vector machine, decision tree, and ensemble. In treatment concentration level prediction, the classical machine learning (ensemble) algorithm achieved 98.5% testing accuracy while the proposed DCNN model achieved 98.2%. One direction for future work is to perform the same experiments with deep transfer models such as Alexnet and Resnet50, or even deeper neural networks, to investigate their performance on the dataset used in this research.

Funding This research received no external funding.

Conflict of interest The authors declare that they have no conflict of interest.

Fig. 10 Confusion matrix for the treatment concentration level prediction

Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: results of 10 convolutional neural networks
Comparison of cubic SVM with Gaussian SVM: classification of infarction for detecting ischemic stroke
A comparison of decision tree ensemble creation techniques
Feature extraction based on deep learning for some traditional machine learning methods
A comprehensive review of the COVID-19 pandemic and the role of IoT, drones, AI, blockchain, and 5G in managing its impact
Feature ranking using linear SVM. In: Causation and prediction challenge
Coronavirus disease 2019: coronaviruses and blood safety
Deep learning system for COVID-19 diagnosis aid using X-ray pulmonary images
Principal component analysis and relieff cascaded with decision tree for credit scoring
Arabic handwritten characters recognition using convolutional neural network
CNN for handwritten Arabic digits recognition based on LeNet-5. In: Proceedings of the International Conference on Advanced Intelligent Systems and Informatics
The first two cases of 2019-nCoV in Italy: where they come from
A probabilistic interpretation of precision, recall and F-score, with implication for evaluation
Matrix-based discriminant subspace ensemble for hyperspectral image spatial-spectral feature fusion
Identification of potential treatments for COVID-19 through artificial intelligence-enabled phenomic analysis of human cells infected with SARS-CoV-2
Washington State 2019-nCoV Case Investigation Team (2020) First case of 2019 novel coronavirus in the United States
Feature extraction using convolution neural networks (CNN) and deep learning
Aquarium family fish species identification system using deep neural networks
Deep transfer learning models for medical diabetic retinopathy detection
Deep bacteria: robust deep learning data augmentation design for limited bacterial colony dataset
Artificial intelligence technique for gene expression by tumor RNA-Seq data: a novel optimized deep learning approach
Effects of the disastrous pandemic COVID 19 on learning styles, activities and mental health of young Indian students - a machine learning approach
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): the epidemic and the challenges
Game consumption and the 2019 novel coronavirus
Within the lack of chest COVID-19 X-ray dataset: a novel detection model based on GAN and deep transfer learning
Deep learning in plant diseases detection for agricultural crops: a survey
A survey on blood image diseases detection using deep learning
Deep transfer learning in diagnosing leukemia in blood cells
Wind power forecasting using the k-nearest neighbors algorithm
Automated deep transfer learning-based approach for detection of COVID-19 infection in chest X-rays
Linear regression for face recognition
Overview of use of decision tree algorithms in machine learning
Ensemble machine learning: methods and applications
Transmission of 2019-nCoV infection from an asymptomatic contact in Germany
Diagnostic testing for the novel coronavirus
A survey on image data augmentation for deep learning
A review of coronavirus disease-2019
A new decision-tree classification algorithm for machine learning
CovidGAN: data augmentation using auxiliary classifier GAN for improved Covid-19 detection
Countries where Coronavirus has spread
A deep learning-based multimodel ensemble method for cancer prediction