key: cord-0912130-ymlee7jv authors: Renjini, A; Swapna, M S; Raj, Vimal; Sankararaman, S title: Graph-based feature extraction and classification of wet and dry cough signals: a machine learning approach date: 2021-11-12 journal: J Complex Netw DOI: 10.1093/comnet/cnab039 sha: f3f9e9fc4ab0def8840f49a123fc23523d724c83 doc_id: 912130 cord_uid: ymlee7jv This article proposes a unique approach to bring out the potential of graph-based features to reveal the hidden signatures of wet (WE) and dry (DE) cough signals, which are the suggestive symptoms of various respiratory ailments like COVID 19. The spectral and complex network analyses of 115 cough signals are employed for perceiving the airflow dynamics through the infected respiratory tract while coughing. The different phases of WE and DE are observed from their time-domain signals, indicating the operation of the glottis. The wavelet analysis of WE shows a frequency spread due to the turbulence in the respiratory tract. The complex network features namely degree centrality, eigenvector centrality, transitivity, graph density and graph entropy not only distinguish WE and DE but also reveal the associated airflow dynamics. A better distinguishability between WE and DE is obtained through the supervised machine learning techniques (MLTs)—quadratic support vector machine and neural net pattern recognition (NN), when compared to the unsupervised MLT, principal component analysis. The 93.90% classification accuracy with a precision of 97.00% suggests NN as a better classifier using complex network features. The study opens up the possibility of complex network analysis in remote auscultation. Eliciting the information regarding the airway inflammation and obstructions in the respiratory tract can point out various respiratory ailments, which are the major causes of morbidity and mortality globally. 
The recent COVID 19 pandemic, which mainly affects the respiratory system, has caused more than 4 million deaths worldwide [1]. One of the significant symptoms of this disease is a dry cough. Hence, a reliable diagnosis of cough events is significant for the effective treatment of lung pathologies like COVID 19, pneumonia, influenza, tuberculosis and many others [2, 3]. Cough is a natural defense mechanism that clears the mucosal secretions from the respiratory tract and prevents external irritants from entering the lungs. Analysis and classification of cough sounds are thus an essential part of detecting most lung ailments, as these diseases differ in the acoustical and dynamical characteristics of the spontaneous coughs they produce. A simple distinction based on acoustic quality classifies cough sounds as wet (WE) and dry (DE) coughs [4, 5]. Bacterial infections of the lower respiratory tract cause a build-up of phlegm, which is ejected during the expulsive phase of the cough; this is called a wet cough. If the patient does not produce sputum after coughing, it is termed a dry cough, which is associated with a viral infection in the upper respiratory airway [6]. The identification and prediction of these non-stationary cough sounds using mathematical tools and complex network analysis help pulmonologists in the detection, prognosis and treatment of several respiratory diseases. The digitized time-series cough sounds carry information about the functional changes associated with the respiratory tract while coughing. Transforming the time-domain cough signal to the time-frequency domain by the wavelet transform reveals not only the frequency components but also their times of occurrence [7].
The potential of complex networks in analysing the structural and functional systems of the brain [8], clinical traits and their interactions in the circulatory system [9], brain connectivity architecture [10], depressive disorder from electroencephalograms [11], effective disease-drug combinations [12] and others has been reported in the literature. Analysing the time-varying cough signal by constructing a complex network with nodes and edges, using various mapping techniques, helps in understanding the airflow dynamics and the condition of the respiratory tract. The topology of the complex network is revealed by its parameters: the number of edges (e), degree centrality (D), graph density (G), transitivity (T), eigenvector centrality (E_c) and graph entropy (E_n) [13-15]. Efficient disease detection and classification is possible in the healthcare sector because of advances in artificial neural networks and machine learning, which employ both unsupervised and supervised machine learning techniques (MLTs). They offer accurate prediction that leads to efficient classification and precise disease diagnosis, thereby reducing the rate of false positives. Unsupervised and supervised learning are two basic techniques in artificial intelligence and machine learning. In unsupervised learning the data are unlabelled, whereas in supervised learning they are labelled. Clustering, association and dimensionality reduction are the three major tasks for which unsupervised learning models are employed. Supervised tasks are designed to learn a mapping function from labelled data that maps each data point to its label and generalizes to unseen data points. The labelled datasets in supervised learning are used to train or 'supervise' the algorithms to categorize the data or forecast the outcomes adequately. The model can assess its accuracy and learn over time by using labelled inputs and outputs.
When it comes to data mining, supervised learning may be divided into two categories of problems: classification and regression [16]. For classification, data sets described by the same features can be transformed from a higher-dimensional to a lower-dimensional space, retaining much of their important information, through the unsupervised principal component analysis (PCA) [17]. Support vector machines (SVMs) are supervised MLTs that use a high-dimensional hypothesis feature space in which data points of one class are separated from those of the other classes. They are based on a set of mathematical functions (kernels) that transform the input data into the required form [18]. Polynomial kernels are very popular in signal processing, with the polynomial degree indicating the kernel function used. A kernel-based supervised machine learning algorithm like SVM classifies the given dataset into two or more classes. The quadratic support vector machine (QSVM) employs a quadratic kernel to maximize the separation between the two classes of data by constructing a separating hyperplane in a higher-dimensional space [19]. An efficient classification of cough signals can also be achieved by employing the neural network pattern recognition tool (NN), which clusters and classifies the data with the minimum possible number of epochs [20, 21]. This work studies the possibility of utilizing spectral and complex network features to distinguish and predict wet and dry cough sounds, employing the MLTs PCA, QSVM and NN. Cough signal inputs of 69 WE and 46 DE sourced from the 'COVID 19 SOUNDS APP' provided by the University of Cambridge [22], under a data access agreement, are studied in this work to inquire into the airflow dynamics in the respiratory tract during coughing.
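The idea of a quadratic kernel acting as an inner product in a higher-dimensional space can be illustrated with a short sketch. It assumes the common inhomogeneous form K(u, v) = (u·v + 1)^2; the exact kernel scaling used by the MATLAB tool in this study is not stated in the text, and the function names and sample vectors are hypothetical.

```python
import math

def quadratic_kernel(u, v):
    """Assumed quadratic kernel K(u, v) = (u . v + 1)^2."""
    dot = sum(a * b for a, b in zip(u, v))
    return (dot + 1.0) ** 2

def explicit_feature_map(u):
    """Explicit 6-dimensional feature map for a 2-dimensional input.

    The ordinary dot product of two mapped vectors equals the quadratic
    kernel of the original vectors, which is why a linear hyperplane in
    the mapped space corresponds to a quadratic boundary in the input space.
    """
    u1, u2 = u
    root2 = math.sqrt(2.0)
    return [u1 * u1, u2 * u2, root2 * u1 * u2, root2 * u1, root2 * u2, 1.0]

# Hypothetical two-feature samples, for illustration only.
sample_a = [1.0, 2.0]
sample_b = [3.0, 4.0]

kernel_value = quadratic_kernel(sample_a, sample_b)
mapped_value = sum(
    p * q
    for p, q in zip(explicit_feature_map(sample_a), explicit_feature_map(sample_b))
)
```

Here `kernel_value` and `mapped_value` agree up to floating-point rounding, showing that the kernel evaluates the higher-dimensional inner product without constructing the mapped vectors explicitly.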
All the signals (sampled at 44,100 Hz, with lengths ranging from 0.25 s to 0.5 s) are curated such that the leading and trailing portions containing silences and noises are trimmed out, leaving only the expulsive (XP), intermediate (IP) and voiced (VP) phases of WE and DE. As the time domain cough signal depicts the time-amplitude relation, the wavelet analysis is employed to obtain the temporal distribution of frequency components with their amplitudes. The Morse wavelet is used as the mother wavelet to generate wavelet scalograms of WE and DE using the MATLAB R2020a toolbox. Time-series signals can be modelled using complex networks to unravel the hidden information by investigating the interactions of a number of network parameters [23]. The characteristics of a time-series signal can be evaluated through a complex network-based approach utilizing various mapping algorithms like correlation matrix, recurrence plot, visibility graph and others [24]. A simpler method to construct a complex network from a time series is the correlation mapping approach, in which various correlation coefficients can be employed, like the Pearson (p), Spearman's rank, Kendall tau and polychoric coefficients. p is a simple measure of the degree of linear correlation between two data sets, whose value lies in the range [−1, +1]; p > 0.8 indicates a strong correlation between them [25]. According to the correlation mapping method, when the time-series data of length L is divided into segments of equal width w, each of the segments forms a node of the network, and the number of nodes (N) is given by (1). The Pearson correlation coefficient between the xth and yth nodes is obtained from an N × N correlation matrix of elements C_xy, given in (2). A set of links formed between the nodes is termed edges (e), and the strength of interaction between the nodes through the edges brings out the underlying hidden behaviour of the system.
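The correlation mapping described above can be sketched as follows. This is a minimal illustration in Python, not the R workflow used in the study. It assumes that (1) amounts to N = L/w with any incomplete trailing segment discarded, that self-loops are excluded, and that the threshold is applied to the signed coefficient. The function names and the toy signal are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant segment: coefficient undefined, treated as 0 here
    return cov / math.sqrt(var_a * var_b)

def correlation_network(signal, w, tau):
    """Adjacency matrix of the thresholded correlation network.

    Each segment of width w is a node; nodes x and y are linked when
    their Pearson coefficient exceeds the threshold tau.
    """
    n_nodes = len(signal) // w  # assumed N = L / w, remainder dropped
    segments = [signal[i * w:(i + 1) * w] for i in range(n_nodes)]
    adj = [[0] * n_nodes for _ in range(n_nodes)]
    for x in range(n_nodes):
        for y in range(x + 1, n_nodes):  # skip the diagonal (no self-loops)
            if pearson(segments[x], segments[y]) > tau:
                adj[x][y] = adj[y][x] = 1
    return adj

# Toy signal of three segments: two rising ramps and one falling ramp.
toy_signal = [0, 1, 2, 3, 1, 2, 3, 4, 3, 2, 1, 0]
adjacency = correlation_network(toy_signal, w=4, tau=0.8)
```

In this toy case only the two rising segments are linked, because the falling segment is anticorrelated with both.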
If an element of the correlation matrix, C_xy, exceeds the set threshold value (τ), the corresponding element of the adjacency matrix (A_xy) is one, and zero otherwise. If A_xy = 1, an edge is said to be formed between the nodes x and y, and the strength of correlation between them can be calculated in terms of the number of edges (e) given in (3) [26]. In graph theory, the central nodes have functional importance as they maintain strong interaction with the remaining nodes in the network. Hence centrality measures like degree centrality (D), betweenness centrality, closeness centrality, eigenvector centrality (E_c), Katz centrality and density centrality play a key role in delineating different topological properties of the complex network [8, 27]. The degree centrality indicates the number of edges emanating from a particular node and can be extended to a centrality measure for the whole network. The eigenvector centrality E_c(x) of a node is based on the eigenvector of the largest eigenvalue (λ) of the adjacency matrix, which accounts for the centrality measures of the connected nodes [28]. The influence of a node on the rest of the nodes in the network can be measured by quantifying E_c(x), which can be calculated as given in (4) and (5) [15]. The eigenvector centrality of the entire network, E_c, can be calculated using the Freeman centralization approach [13]. A strong correlation between the nodes in a network can also be estimated using another parameter called the graph density (G), given by (6), which is the ratio of the number of existing edges to the number of possible edges between the nodes [27]. Thus G indicates what fraction of the potential connections in the network are actually present. Transitivity, T, is a measure of the overall clustering capability of the nodes in a network, indicating their tendency to form clusters, and is given in (7). Networks with maximally connected nodes will have a higher value of transitivity [29].
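The measures discussed in this passage can be computed from an adjacency matrix as sketched below. The sketch uses the standard textbook definitions, which are assumed (not confirmed) to match equations (3) to (7); it gives node-level centralities only and does not perform the Freeman centralization step. The function names and the toy graph are hypothetical.

```python
def edge_count(adj):
    """Number of undirected edges e (upper triangle of the adjacency matrix)."""
    n = len(adj)
    return sum(adj[x][y] for x in range(n) for y in range(x + 1, n))

def degree_centrality(adj):
    """Degree of each node divided by its N - 1 possible neighbours."""
    n = len(adj)
    return [sum(row) / (n - 1) for row in adj]

def graph_density(adj):
    """Existing edges divided by the N(N - 1)/2 possible edges."""
    n = len(adj)
    return 2 * edge_count(adj) / (n * (n - 1))

def transitivity(adj):
    """3 x (number of triangles) / (number of connected triples)."""
    n = len(adj)
    triangles = sum(
        adj[x][y] * adj[y][z] * adj[x][z]
        for x in range(n) for y in range(x + 1, n) for z in range(y + 1, n)
    )
    degrees = [sum(row) for row in adj]
    triples = sum(k * (k - 1) // 2 for k in degrees)
    return 3 * triangles / triples if triples else 0.0

def eigenvector_centrality(adj, iterations=200):
    """Leading eigenvector of the adjacency matrix by power iteration.

    Iterating with (A + I) keeps the same leading eigenvector while
    avoiding oscillation; the result is scaled so the maximum is 1.
    """
    n = len(adj)
    vec = [1.0] * n
    for _ in range(iterations):
        nxt = [vec[x] + sum(adj[x][y] * vec[y] for y in range(n)) for x in range(n)]
        top = max(nxt)
        vec = [value / top for value in nxt]
    return vec

# Toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3 attached to node 2.
toy_adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
n_edges = edge_count(toy_adj)
degree = degree_centrality(toy_adj)
density = graph_density(toy_adj)
trans = transitivity(toy_adj)
eig = eigenvector_centrality(toy_adj)
```

For the toy graph, node 2 has the highest degree and eigenvector centrality, while the pendant node lowers both the density and the transitivity relative to a complete graph.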
The graph entropy (E_n) is a term derived from information theory that serves as a complexity measure of the given time-series data, similar to the Shannon entropy [30, 31]. If d(i) is the probability that a node in the network has degree k = i, the graph entropy E_n can be written as (9). The graph-based complex network analysis is carried out in the R software with the help of the 'igraph' package. An adjacency matrix is created to construct the complex network by taking the width of the time-series segments, w = 27, and the correlation threshold, τ = 0.8. The features related to the network topology (D, E_c, T, G and E_n) are quantified using R software and fed as input predictors to the MLTs. The open-source software Gephi 0.9.2 is used for the construction and visualization of complex networks. The nodes in the network are subdivided according to the modularity class and are indicated in different colours. The Fruchterman-Reingold algorithm, based on the force-directed placement principle, is employed for the construction of the complex network, in which the nodes of the network are moved to minimize the system energy by adjusting the forces between them [32]. When the graph is visualized with an emphasis on complementarities (classes of objects having complementary properties that cannot all be measured simultaneously), this layout algorithm can be used to define the final layout of the network. The PCA biplot is plotted using R software, and the 'classification learner app' along with the 'neural network pattern recognition' tool in the MATLAB R2020a software is employed for performing the classification using the supervised MLTs, QSVM (15-fold cross-validation) and NN with the scaled conjugate gradient backpropagation algorithm. The NN is a shallow feed-forward neural network with 5 predictors in the input layer and one hidden layer of 25 neurons.
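The graph entropy defined earlier in this passage, in terms of the degree probabilities d(i), can be sketched as the Shannon entropy of the degree distribution, E_n = −Σ d(i) log d(i). The base of the logarithm is not given in the text; base 2 is assumed here, and the function name and toy graph are hypothetical.

```python
import math
from collections import Counter

def graph_entropy(adj):
    """Shannon entropy (in bits, an assumed unit) of the degree distribution.

    d(i) is estimated as the fraction of nodes whose degree equals i.
    """
    degrees = [sum(row) for row in adj]
    n = len(degrees)
    counts = Counter(degrees)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3 attached to node 2.
toy_adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
# A plain triangle, in which every node has the same degree.
regular_adj = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
entropy_toy = graph_entropy(toy_adj)
entropy_regular = graph_entropy(regular_adj)
```

A network in which all nodes share one degree has zero entropy under this definition, whereas a more varied degree distribution gives a larger value, consistent with its use as a complexity measure.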
The performance of the MLTs is evaluated from the outcome metrics (sensitivity, specificity, precision and accuracy) of the corresponding confusion matrices [33]. The methodological workflow is shown in Fig. 1.
The development of robust mathematical techniques has paved the way for advancements in the field of biomedical engineering. The majority of lung diseases like COVID 19 produce cough as a significant symptom. Hence, it is necessary to monitor and characterize cough sounds using powerful mathematical tools to achieve non-invasive, non-contact techniques that aid physicians in accurate disease diagnosis. The representative time-domain cough signals, WE and DE, displayed in Fig. 2(a, b), respectively, are subjected to wavelet analysis (Fig. 2(c, d)) to unveil the characteristic features that point to various respiratory tract conditions. The expulsive (XP), intermediate (IP) and voiced (VP) phases of WE and DE are visible in their time-domain signals (Fig. 2(a, b)). The inflamed, infected respiratory tract carrying mucosal obstructions has a non-uniform cross-sectional area. This causes the airflow to be more turbulent, creating vortices along the tract during WE, unlike in DE. In both WE and DE (Fig. 2(a, b)), the glottis opens for the initial outburst of air, indicated by the region XP. The glottis narrows after the expulsive phase, causing a noisy intermediate phase (region IP), and the vocal folds approach each other to cause the second burst of air in the voiced phase (region VP). The time of occurrence of the frequency components, along with their intensity and the frequency spread of WE and DE, can be observed from the wavelet scalograms, which help unveil the airflow dynamics in the respiratory tract while coughing.
The turbulent movement of the high-velocity air molecules impinging on the walls of the infected respiratory tract, due to mucosal obstructions in WE, generates a larger number of low-intensity frequency components, unlike in DE, as depicted in Fig. 2(c, d), respectively. Multiple obstructions and the resultant narrowings are comparatively fewer in DE, causing a nearly streamlined flow of air through the tract, and hence a smaller spectral spread of high-intensity frequencies is observed in the wavelet scalogram of DE (Fig. 2(d)). Complex network analysis is a powerful tool that helps to unravel the underlying signature features of a dynamic system. The complex networks constructed from the time-series signals of WE and DE are depicted in Fig. 3, with fewer edges in WE than in DE. It is observed that the connections are of low clustering density in the network of WE (Fig. 3(a)), indicating a low correlation between different signal segments. The complex network of DE (Fig. 3(b)) shows a higher percentage of densely correlated segments, which is an indication of stronger interaction between them. This may be due to the
While the degree centrality, D, measures the total number of edges tied to a node, the influence of a node on the rest of the nodes in the network can be measured by estimating the eigenvector centrality, E_c. From the boxplot of D in Fig. 4(a), it can be seen that the value of D for WE (0.032) is less than that of DE (0.049). Thus it is evident that the correlation between the signal segments of WE is lower, which points to the random nature of the signal. This may be due to the presence of multiple low-intensity frequency components generated as a result of the turbulent movement of air within the infected respiratory tract in WE, unlike in DE. The value of eigenvector centrality, E_c, is higher for WE (0.922) than for DE (0.848), reflecting a weaker correlation among the nodes (Fig. 4(b)).
The reason for this may be the same as that stated for degree centrality for both WE and DE. Graph density, G, indicates the extent of correlation and thus is a measure of correlated node pairs. It is evident that WE has a low number of correlated node pairs as it has a low value of G (Fig. 4(c)). The lower value of G for WE is attributed to the persistence of a higher number of low-amplitude frequency components that cause a comparatively larger spectral width, arising from the blockages in the infected respiratory tract. This shows that WE is more complex than DE and hence G could be used to indicate the complexity of a signal. Transitivity, T, indicates the clustering capability of the nodes in the network. The lower value of T for WE (0.532) compared to DE (0.646) (Fig. 4(d)) reflects the low correlation between the nodes in the network of WE due to the persistence of multiple low-frequency components. Another topological parameter of a complex network that measures the complexity of a dynamic system is the graph entropy, E_n. It is evident from the boxplot of E_n depicted in Fig. 4 that WE is of a more complex nature than DE. From Fig. 5, it can be understood that the features cannot distinguish the signals through PCA. This suggests that the graph features are non-linearly correlated, necessitating supervised MLTs like QSVM and NN. Optimal values of hyperparameters are required to generate the best-trained model. Hence, the hyperparameter tuning process is necessary to achieve a balance between the underfitting and overfitting of the model. For that, 70% of the total number of cough signals, taken for training the network, are made to undergo 15-fold cross-validation, which prevents overfitting in the trained QSVM model. A prediction accuracy of 88.88% is obtained, and the confusion matrix displayed in Fig. 6(a) shows a high positive predictive value (PPV) and a low false discovery rate (FDR).
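The outcome metrics read from a two-class confusion matrix, including the PPV and FDR mentioned above, follow from four counts. The sketch below uses the standard definitions. The counts are hypothetical and are not the values of Fig. 6 or Table 1, and treating one class as "positive" is an assumption made only for illustration.

```python
def outcome_metrics(tp, fn, fp, tn):
    """Standard metrics from a 2 x 2 confusion matrix.

    tp, fn: positive-class samples predicted correctly / incorrectly
    fp, tn: negative-class samples predicted incorrectly / correctly
    """
    precision = tp / (tp + fp)
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": precision,          # positive predictive value (PPV)
        "fdr": 1.0 - precision,          # false discovery rate
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Hypothetical counts, chosen only to exercise the formulas.
metrics = outcome_metrics(tp=18, fn=2, fp=1, tn=9)
```

Because FDR is the complement of PPV, a high PPV and a low FDR in a confusion matrix are two views of the same quantity.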
Another supervised MLT, neural net pattern recognition, gives a prediction accuracy of 93.9% for WE and DE, with its confusion matrix shown in Fig. 6(b). Table 1 shows the outcome metrics of QSVM and NN, from which it is evident that NN classifies and predicts WE and DE more efficiently using the complex network features.
Innovative methods for non-invasive and non-contact detection have become a necessity of the hour due to the outbreak of the COVID 19 pandemic. The present work proposes a novel approach employing the potential of complex networks and their features in effectively distinguishing wet and dry coughs. Different stages of cough signals that indicate the glottal opening and closure (XP, IP and VP) are observed in the time domain signals of both WE and DE. Multiple low-intensity frequency components are observed in the wavelet scalogram of WE because of the reduced energy distribution due to the collision of air molecules within the infected respiratory tract. The graph features (D, E_c, T, G and E_n) are employed in the present study to gain a deeper understanding of the airflow dynamics within the tract. Lower values of (D, T, G) and higher values of (E_c, E_n) indicate the low correlation between the signal segments of WE due to its complex and random nature. The complexity is attributed to the occurrence of a higher number of low-amplitude frequency components in WE that cause a comparatively larger spectral width due to the blockages in the infected respiratory tract. The power of complex network features in the classification and prediction of WE and DE is studied using PCA, QSVM and NN. The inefficiency of the unsupervised MLT, PCA, is evident from the overlapping of the ellipses in the biplot. The supervised MLTs, QSVM and NN, show 88.88% and 93.9% classification accuracy with a precision of 95.65% and 97.00%, respectively, showcasing the capability of NN in efficiently identifying and predicting WE and DE.
Thus, the present study unveils the capability of complex network features for understanding the airflow dynamics in the infected respiratory tract and employing them as efficient predictors for distinguishing the cough sounds.

World Health Organization (2021) Weekly epidemiological update on COVID-19
Analysis of the cough sound: an overview
Cough and cold remedies for the treatment of acute respiratory infections in young children
A novel method for wet/dry cough classification in pediatric population
Automated algorithm for wet/dry cough sounds classification
Automatic identification of wet and dry cough in pediatric patients with respiratory diseases
The effectiveness of the wavelet transforms method in the heart sounds analysis
Complex brain networks: graph theoretical analysis of structural and functional systems
Cardiovascular networks
"Small World" architecture in brain connectivity and hippocampal volume in Alzheimer's disease: a study via graph theory from EEG data
Graph theory analysis of directed functional brain networks in major depressive disorder based on EEG signal
Network-based prediction of drug combinations
Centrality in social networks conceptual clarification
Statistical mechanics of complex networks
Study on centrality measures in social networks: a survey
Machine Learning: The Art and Science of Algorithms that Make Sense of Data
Principal component analysis: a review and recent developments
Automatic breast segmentation and cancer detection via SVM in mammograms
A comparative study of the SVM and k-NN machine learning algorithms for the diagnosis of respiratory pathologies using pulmonary acoustic signals
An alternative respiratory sounds classification system utilizing artificial neural networks
Neural net pattern recognition based auscultation of croup cough and pertussis using phase portrait features. Chin
Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data
From time-series to complex networks: application to the cerebrovascular flow patterns in atrial fibrillation
Recurrence networks: a novel paradigm for nonlinear time series analysis
Statistical methods for transport demand modeling. Modeling of Transport Demand
Identification of the epileptogenic zone from stereo-EEG signals: a connectivity-graph theory approach
Factoring and weighting approaches to status scores and clique identification
Comparison of different generalizations of clustering coefficient and local efficiency for weighted undirected graphs
A history of graph entropy measures
Graph drawing by force-directed placement
Unravelling the potential of phase portrait in the auscultation of mitral valve dysfunction

The authors are thankful to the Covid Cough Sound App database and its provider, The Chancellor, Masters and Scholars of the University of Cambridge of The Old Schools, Trinity Lane, Cambridge CB2 1TN, UK, and the Provider Scientist, Professor Cecilia Mascolo of the Department of Computer Science and Technology, University of Cambridge.