key: cord-0031178-78rl00re
authors: You, Yujie; Lai, Xin; Pan, Yi; Zheng, Huiru; Vera, Julio; Liu, Suran; Deng, Senyi; Zhang, Le
title: Artificial intelligence in cancer target identification and drug discovery
date: 2022-05-10
journal: Signal Transduct Target Ther
DOI: 10.1038/s41392-022-00994-0
sha: 8e08f45826991850f3077dc09206ae4a4f94d194
doc_id: 31178
cord_uid: 78rl00re

Artificial intelligence is an advanced method to identify novel anticancer targets and discover novel drugs from biological networks, because these networks effectively preserve and quantify the interactions between the components of cell systems that underlie human diseases such as cancer. Here, we review and discuss how to employ artificial intelligence approaches to identify novel anticancer targets and discover drugs. First, we describe the scope of artificial intelligence biology analysis for novel anticancer target investigations. Second, we review and discuss the basic principles and theory of commonly used network-based and machine learning-based artificial intelligence algorithms. Finally, we showcase the applications of artificial intelligence approaches in cancer target identification and drug discovery. Taken together, artificial intelligence models provide a quantitative framework to study the relationship between network characteristics and cancer, thereby leading to the identification of potential anticancer targets and the discovery of novel drug candidates.

As one of the cutting-edge cancer treatments, targeted drug therapy has the advantages of high efficiency, few side effects, and low drug resistance for patients [1]. However, existing targeted therapies have several drawbacks, such as the small number of druggable targets [2], incomplete coverage of the patient population, and the lack of alternative responses to drug resistance in patients [1].
Therefore, identifying novel therapeutic targets and evaluating their druggability [3, 4] has become the current focus of cancer research on targeted drug therapy. Because the complexity of cancer makes it difficult to understand its pathogenesis comprehensively [5], most current targeted drugs are developed from experimentally validated hypotheses that explain one possible mechanism underlying carcinogenesis but ignore other aspects of the disease [6]. As a result, these therapies can have undesired impacts on normal tissues and even provoke serious side effects in patients [7, 8].

To elucidate the molecular mechanisms underlying cancer genesis, interactome data can be compiled and modelled as network structures in which the components are biological entities (e.g., genes, proteins, mRNAs, and metabolites) and the edges are associations or interactions between them (e.g., gene co-expression, signalling transduction, gene regulation, and physical interaction between proteins) [9-14]. Artificial intelligence biology analysis algorithms are effective methods for processing biological network data: they build machines or programs that simulate human intelligence in order to perform classification, clustering, and prediction tasks on biological networks [15]. Therefore, artificial intelligence algorithms can effectively tackle the complexity of cancer that arises from interactions between genes and their products [16, 17] in biological network structures, improving our understanding of carcinogenesis [11, 12, 18-22] and helping to explore novel anticancer targets [23-29].

Over the past few decades, artificial intelligence biology analysis algorithms have developed rapidly.
To make this study easy to understand, we divide these artificial intelligence algorithms into network-based biology analysis algorithms and machine learning-based (ML-based) biology analysis algorithms according to the data of the biological network structure, and we use Fig. 1 to show the historical milestones of these algorithms. On the one hand, network-based biology analysis algorithms provide a variety of alternative network approaches to identify cancer targets. More importantly, different network-based biology analysis algorithms investigate network data from different perspectives, so they can complement each other to provide accurate biological explanations [30]. On the other hand, ML-based biology analysis [31-33] can not only efficiently handle high-throughput, heterogeneous, and complex molecular data but also mine the features or relationships in biological networks. Thus, we should develop more ML-based biology analysis algorithms to provide advanced biology analyses that allow precise target identification and drug discovery for cancer.

Although artificial intelligence biology analysis has been widely used to improve our understanding of carcinogenesis, to the best of our knowledge, there is no systematic review that introduces the scope of related research and explains the network-based and ML-based biology analysis algorithms used to identify novel anticancer targets and discover drugs. Therefore, in the next section, we describe the scope of artificial intelligence biology analysis for novel anticancer target investigation. In the third section, we introduce the basic principles and theory of commonly used artificial intelligence biology analysis algorithms. Then, we briefly review and discuss studies that use network-based and ML-based biology analysis for cancer target identification and drug discovery.
Finally, we summarize the content of the article, discuss the limitations and challenges faced by the community, and point out the potential of artificial intelligence biology analysis to identify therapeutic targets and discover drugs for cancer.

Recently, the rapid development of cancer-related multiomics technologies [34-36] has been one of the most important factors enabling artificial intelligence biology analysis to explore novel anticancer targets [37-39]. Figure 2 classifies these technologies into five aspects: epigenetics, genomics, proteomics, metabolomics, and multiomics integration analysis. Furthermore, Table 1 lists the related major diseases, drug targets, genomics, and network databases commonly used in multiomics integration analysis for these five aspects. Next, we detail these five aspects.

Epigenetics analyses the reversible modifications of DNA or DNA-related proteins [54]. These modifications affect gene expression without changing the DNA sequence [54]. Investigating epigenetic data through artificial intelligence is not only important for elucidating the fundamental mechanisms of cancer but also necessary for the design of targeted therapeutics. For example, Wilson et al. [55] took advantage of information-rich transcriptomic and epigenetic data to study the regulatory networks surrounding histone lysine demethylation and highlighted the importance of epigenetic regulators in mitogenic control and their potential as therapeutic targets. Their results showed that epigenetic regulators such as KDM1A, KDM3A, EZH2, and DOT1L [56] are critical in oncogenesis and drug resistance.

Genomics aims to characterize the function of every genomic element of an organism by using genome-scale assays such as genome sequencing [57].
Applications of genomics include finding associations between genotype and phenotype [58], discovering biomarkers for patient stratification [59], predicting the function of genes [60], and charting biochemically active genomic regions such as transcriptional enhancers [49]. Recent developments in network-based biology analysis methods, such as sequence-similarity networks, genome networks, and gene family networks, have significantly improved the usability of molecular datasets in comparative genomics analysis [61]. These network methods first collect expression and interaction data and then transform them into interpretable biological processes [62, 63], leading to the identification of tumour subtypes and the discovery of drug targets [64]. For example, Medi et al. [65] integrated gene expression profiles into genome-scale molecular networks to identify novel therapeutic targets for cervical cancer, including receptors, microRNAs (miRNAs), transcription factors (TFs), proteins (e.g., CRYAB, CDK1, PARP1, WNK1, GSK3B, and KAT2B), and metabolites (arachidonic acids). Laura et al. [66] developed a network-based biology analysis workflow that integrates different layers of genomic information, including transcription factor cotargeting, miRNA cotargeting, protein-protein interaction, and gene coexpression, into a biological network. The authors then applied a consensus clustering algorithm (an ML-based biology analysis algorithm that divides the network into sub-modules with different functions) [67-73] to the identified network communities to discover cancer driver genes, which demonstrated that F11R, HDGF, PRCC, ATF3, BTG2, and CD46 could be oncogenes and promising markers for pancreatic cancer.

In proteomics, experiments are performed for the annotation and correlation of genome sequences, quantitation of protein abundance, detection of posttranslational modifications, and identification of protein-protein interactions (PPIs) [74].
PPIs not only play fundamental roles in structuring and mediating biological processes but also have been widely used for proteomics data analysis [75]. For example, Vinayagam et al. [37] analysed the human PPI network to identify indispensable proteins that affect the controllability of the network. In control theory [76], a system is controllable if it can be driven from any initial state to any desired final state in finite time with a suitable choice of inputs. According to how the number of driver nodes in the network changes when a protein is removed, the protein can be classified as "indispensable", "neutral", or "dispensable", corresponding to an increase, no change, or a decrease in the number of driver nodes, respectively. The evidence shows that these indispensable proteins are primary targets of disease-causing mutations, viruses, and drugs. Furthermore, analysing data from 1,547 cancer patients revealed 56 indispensable genes in nine cancers. Of these, 46 genes were associated with cancer for the first time, demonstrating the ability of network controllability analysis to identify novel disease genes and potential drug targets [77]. Moreover, Valle et al. [78] developed a network-based biology analysis framework to compute the proximity between polyphenol targets and disease proteins. The results indicated that diseases whose proteins are proximal to polyphenol targets show significant gene expression changes, whereas diseases whose proteins are distal to polyphenol targets show no such change. The network relationship between disease proteins and polyphenol targets provides not only a computational method to reveal the effect of polyphenols on diseases but also a basis for identifying novel anticancer targets.

Metabolomics is routinely applied for biomarker discovery by profiling metabolites in biofluids, cells, and tissues [34].
Because of the inherent sensitivity of biotechnology, subtle alterations in metabolic pathways can be detected to provide insights into the mechanisms that underlie various physiological conditions and cancer processes [34]. Owing to innovative developments in network biology, researchers employ biological networks to perform metabolomic analyses and provide a systems-level understanding of the role that metabolites play in cancer. For example, Basler et al. [79] proposed an effective network-based biology analysis framework for the systematic study of flow control and the identification of driver reactions in large-scale metabolic networks. They found that the driver reactions were under complex cellular regulation in Escherichia coli, suggesting their preeminent role in facilitating cellular control. Correlation statistics indicate that the driven response plays an important role in inhibiting tumour growth and represents a potential therapeutic target.

For multiomics integration analysis, addressing the complexity of tumour-host interactions requires an approach that can handle integrative omics data [80]. Compared with single-omics studies, multiomics data provide researchers with diverse and interconnected molecular profiles to study carcinogenesis [80]. Thus, integrating multiomics datasets in a network structure for artificial intelligence biology analysis has emerged as a powerful tool to fully appreciate the complex interlayer regulatory interactions in cancer progression. Such an approach allows us to benefit from prior information that can be summarized and presented in networks, thereby providing insights into carcinogenesis from an overall perspective [81].

Fig. 2 Artificial intelligence to integrate multiomics data (e.g., epigenetics, genomics, proteomics, and metabolomics) for the identification of cancer therapeutic targets. (Created with BioRender.com)

For example, Gov et al.
[82] first performed comparative analyses of transcriptome data and then identified common and tissue-specific reporter biomolecules such as genes, receptors, membrane proteins, TFs, and miRNAs. Second, they used the interactions among receptors, TFs, miRNAs, and their targeted DEGs to reconstruct a tissue-specific network for ovarian cancer and used network-based biology methods to identify interaction hubs. Finally, GATA2 and miR-124-3p were identified as hub nodes, suggesting that they are potential biomarkers for ovarian cancer.

This study divides the commonly used artificial intelligence biology analysis algorithms into two categories. One is network-based biology analysis algorithms, including shortest path [83], module detection [84], and network centrality [85]; the other is ML-based biology analysis algorithms, including decision trees [86-88] and deep learning models [89-91].

The principles and theory of network-based biology analysis algorithms

Biological networks are efficient in integrating complicated biological data because they can capture the properties of biological entities and their relationships [92]. Mathematically, a network can be represented as a graph G = (V, E), where V and E are the sets of nodes (vertices) and edges, respectively. Nodes in biological networks can represent proteins, genes, diseases, and drugs, and edges represent various biochemical, physical, or functional interactions between nodes. Therefore, network-based biology analysis algorithms focus on identifying therapeutic targets and discovering novel drugs for cancer from molecular networks such as protein-protein interaction networks [75], gene regulatory networks [93], metabolic networks [94], and drug-drug interaction networks [95].
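The graph definition G = (V, E) above can be made concrete with a minimal sketch. The node names, node types, and edges below are hypothetical placeholders chosen for illustration (they are not data from any cited study), and an undirected, unweighted adjacency-set representation is assumed.

```python
from collections import defaultdict

# Hypothetical toy network: names and types are placeholders, not real biology.
node_types = {
    "geneA": "gene",
    "protB": "protein",
    "protC": "protein",
    "drugX": "drug",
}
edges = [("geneA", "protB"), ("protB", "protC"), ("drugX", "protC")]


def build_adjacency(edge_list):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


adjacency = build_adjacency(edges)

# V and E of G = (V, E); each undirected edge is stored as a frozenset.
V = set(adjacency)
E = {frozenset(edge) for edge in edges}

# Number of direct neighbours of each node.
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
```

In this sketch, `degree["protC"]` counts the two partners of the hypothetical protein `protC`; directed or weighted interactions would need a different structure, such as a map from node pairs to weights.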
Computational biologists have developed several network-based biology analysis algorithms to effectively process and analyse non-ordered or non-Euclidean data in biological networks. These algorithms can perform tasks such as link prediction [96], node ranking [85], network propagation [97], network modularization [98], and network control [99]. Here, we briefly review and discuss the shortest path algorithm, the module detection algorithm, and node prioritization methods based on node centrality for identifying cancer therapeutic targets and discovering drugs.

The shortest path algorithm.

The shortest path algorithm, one of the network link algorithms, is used to identify the shortest connection between two genes or proteins in a graphical model that represents a cellular network [100, 101]. The algorithm is illustrated in Fig. 3 and Algorithm 1. The shortest distance for a given network is calculated by Eq. (1). Here, S and T stand for the source and target node, respectively; d(S,T) is the length of the shortest path from node S to node T; V is the set of network nodes; K stands for a node in the network; and d(K,T) represents the lengths of the possible paths connecting nodes K and T.

The shortest path algorithm has been widely used to determine regulatory paths in cancer networks [103, 104] and then discover the key targets on those paths [105]. For example, Li et al. [106] first identified a set of six genes that can distinguish colorectal tumours from normal adjacent tissues using the maximum relevance minimum redundancy approach [107]. This method ranks genes according to their relevance to the class of samples concerned while accounting for the redundancy among genes. The genes with the best trade-off between maximum relevance to the sample class and minimum redundancy were considered "good" biomarkers. Then, the authors applied the shortest path algorithm to the six genes in a PPI network underlying cancer and identified 15 shortest paths between any two genes of the gene set.
Last, they found 35 genes on the identified shortest paths and ranked them according to their betweenness [108]. The results showed that the androgen receptor (AR), a ligand-dependent transcription factor, ranked as the top gene, suggesting its involvement in colon carcinogenesis through regulating the proliferation and differentiation of tumour cells [109]. Additionally, Chen et al. [105] used a network-based biology analysis method, SAM (Significance Analysis of Microarrays) [110], to analyse omics data and identified 153 differentially methylated CpG sites and differentially expressed molecules, including 42 miRNAs and 1,373 protein-coding genes. The authors first used the differentially expressed genes together with the STRING database [111] to construct a PPI network. Then, they searched all the shortest paths connecting dysfunctional genes to identify potential cancer driver genes. Next, they ranked the genes by a permutation test and by their network properties, such as betweenness and interaction scores. The top-ranking genes at different levels (i.e., methylation level, miRNA level, mutation level, and mRNA level) were regarded as driver genes of lung adenocarcinoma. Among these cancer driver genes, some appeared as top candidates at several levels, suggesting their multifaceted contribution to lung carcinogenesis.

Overall, shortest path algorithms [100, 101] can help us efficiently identify regulatory paths in networks, allowing us to identify potential genes that are proximate to known cancer genes and are therefore important for tumorigenesis. However, due to the complexity of the disease, potential cancer genes are not always on the identified shortest paths [106], which reveals a limitation of such algorithms. To resolve this issue, Lu et al. [112] proposed a random walk with restart method and identified 298 potential CRC-associated genes, which is more effective and accurate than the shortest path algorithm proposed by Li et al. [106].
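The general workflow described above (find the shortest paths between every pair of seed genes, then rank the intermediate genes that lie on them) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: the graph is unweighted and undirected, the gene names (`G1`, `H1`, and so on) are hypothetical, and the ranking score is a simple count of seed-to-seed shortest paths through each gene, which is only a rough stand-in for the betweenness used by the authors.

```python
from collections import Counter, deque
from itertools import combinations


def all_shortest_paths(adj, source, target):
    """Return every shortest path from source to target (unweighted graph, BFS)."""
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:                 # first time reached
                dist[nbr] = dist[node] + 1
                parents[nbr] = [node]
                queue.append(nbr)
            elif dist[nbr] == dist[node] + 1:   # another equally short route
                parents[nbr].append(node)
    if target not in dist:
        return []

    paths = []

    def backtrack(node, suffix):
        # Follow parent pointers from the target back to the source.
        if node == source:
            paths.append([source] + suffix)
            return
        for parent in parents[node]:
            backtrack(parent, [node] + suffix)

    backtrack(target, [])
    return paths


# Hypothetical toy PPI network; G* are seed genes, H* are candidate genes.
adj = {
    "G1": {"H1"},
    "G2": {"H1", "H2"},
    "G3": {"H1", "H2"},
    "H1": {"G1", "G2", "G3"},
    "H2": {"G2", "G3"},
}
seeds = ["G1", "G2", "G3"]

# Count how many seed-to-seed shortest paths pass through each non-seed gene.
counts = Counter()
for s, t in combinations(seeds, 2):
    for path in all_shortest_paths(adj, s, t):
        counts.update(node for node in path[1:-1] if node not in seeds)

ranking = counts.most_common()
```

In this toy network `H1` lies on more seed-to-seed shortest paths than `H2`, so it ranks first. A gene that is relevant to the disease but sits off every shortest path would receive no score at all, which is the limitation noted above.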
In addition, the computing efficacy of the shortest path algorithm can be compromised by large networks and their search strategies [112].

The module detection algorithm.

Cancers usually result from the disruption of interactions between key regulatory genes and their partners [81, 113]. Module detection algorithms [114], one class of network propagation algorithms, identify communities of cancer genes in complex networks [115] by analysing their topological structures (Fig. 4 and Algorithm 2). Here, we explain and illustrate the commonly used modularity maximization algorithm [116], which identifies network modules with the maximum modularity coefficient by Eq. (2):

$$Q = \frac{1}{2M} \sum_{i,j} \left( A_{ij} - P_{ij} \right) \delta_{C_i, C_j} \qquad (2)$$

where Q represents the modularity coefficient of an identified module, M is the total number of edges in the network, $A_{ij}$ is the adjacency matrix, and $P_{ij}$ represents the expected number of edges between nodes i and j. $C_i$ or $C_j$ represents the module to which node i or node j belongs. If i and j belong to the same module, $\delta_{C_i, C_j} = 1$; otherwise, $\delta_{C_i, C_j} = 0$. The identified modules are groups of genes that are supposed to have a similar biological function, such as promoting or inhibiting tumourigenesis.

Algorithm 2. Module detection algorithm.

Currently, many researchers employ module detection algorithms to identify potential therapeutic targets for cancer [117-119]. For example, Ghiassian et al. [120] used the DIseAse MOdule Detection (DIAMOnD) method [121] to identify the local modules within the interconnected map of molecular components. They found that disease-related genes were significantly enriched in highly overlapping modules, which indicated that the predicted modules may help identify new anticancer targets. Of note, since the results of module detection algorithms depend mainly on network structure, the identified modules may vary for the same disease network with slightly different topology [85, 117].
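To illustrate how the modularity coefficient Q scores a candidate partition, the following minimal sketch evaluates Q for a toy network. It is an illustration only, not the Algorithm 2 of the paper. Two assumptions are made that the text does not state: the expected number of edges is taken from the widely used degree-preserving null model, P_ij = k_i k_j / (2M), where k_i is the degree of node i, and the graph is simple (undirected, no self-loops or duplicate edges). The node labels are arbitrary integers.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Compute Q = 1/(2M) * sum_ij (A_ij - P_ij) * delta(C_i, C_j).

    Assumes a simple undirected graph and P_ij = k_i * k_j / (2M).
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    two_m = 2 * len(edges)
    degree = {node: len(nbrs) for node, nbrs in adj.items()}

    q = 0.0
    for i in adj:
        for j in adj:
            if community_of[i] != community_of[j]:
                continue  # delta(C_i, C_j) = 0
            a_ij = 1.0 if j in adj[i] else 0.0
            p_ij = degree[i] * degree[j] / two_m
            q += a_ij - p_ij
    return q / two_m


# Toy network: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Candidate partition 1: each triangle is its own module.
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
# Candidate partition 2: all nodes in one module.
merged = {node: "a" for node in range(6)}

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

For this toy network the two-triangle partition gives Q = 5/14, whereas placing all nodes in one module gives Q = 0, so a modularity maximization procedure would prefer the split. The double loop is quadratic in the number of nodes and is meant for clarity rather than for large networks.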
Since potential drug targets may exist in different network modules, we can make use of the correlation between modules to identify reliable cancer treatment targets [81]. Therefore, Wang et al. [122] proposed the seed connector algorithm, which adds a small number of extra hidden nodes to link disease proteins by considering the interactions among cancer-associated proteins. First, the algorithm starts with known seed proteins and induces a loosely connected subnetwork consisting of only seed proteins. Second, it sequentially selects as seed connectors those proteins that maximally increase the size of the largest connected component of the subnetwork, until no additional protein can be selected as a seed connector. Finally, the cancer modules are pinpointed. While the aforementioned algorithms [122-124] can identify meaningful functional modules from network topologies, they may have difficulty capturing disease modules [125]. One possible reason is that disease proteins do not constitute particularly densely connected subgraphs but agglomerate in specific large regions of the network. For this reason, Tripathi et al. [126] considered analysing the patterns of connectivity in a disease module to be an effective way to understand the properties of disease modules.

The node centrality.

Node centrality measures the importance of nodes and is suitable for locating key nodes with important biological functions in network biology [127].
Here, we list four commonly used types of node centrality: (1) Degree centrality, the simplest form of network centrality, is the number of nodes directly connected to a given node [127, 128]; (2) Coreness centrality considers both the degree of a node and its position in the network [129]; (3) Betweenness centrality of a node is the probability that the shortest path between two randomly chosen nodes goes through that node, and it identifies the actor that controls the flow of information among other nodes through connecting paths [130]; (4) Eigenvector centrality [131] considers not only the number of edges and the position of a node but also the impact of adjacent nodes in the interaction network. Table 2 shows the formulas for computing node centrality. Figure 5(a-d) illustrates the above four types of node centrality, and Algorithm 3 presents the pseudocode to compute them.

Degree centrality
Coreness centrality: $C_C(i) = \sum_{j \in N(i)} ks(j)$. Vertex j belongs to the neighbours N(i) of vertex i, and ks(j) is the k-shell index of vertex j.
Betweenness centrality C B ðiÞ ¼ P j