key: cord-0932002-bjjqxv52
authors: Ma, Lifei; Li, Huiyang; Lan, Jinping; Hao, Xiuqing; Liu, Huiying; Wang, Xiaoman; Huang, Yong
title: Comprehensive analyses of bioinformatics applications in the fight against COVID-19 pandemic
date: 2021-11-02
journal: Comput Biol Chem
DOI: 10.1016/j.compbiolchem.2021.107599
sha: 7c2d83dbcb71a5d87a38bb6e6c789dc97c48847f
doc_id: 932002
cord_uid: bjjqxv52

Novel coronavirus disease 2019 (COVID-19) is a global pandemic caused by severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2), which can be transmitted from person to person. As of September 21, 2021, over 228 million cases of COVID-19 had been diagnosed in more than 200 countries and regions worldwide. The death toll exceeds 4.69 million, the mortality rate has reached about 2.05%, and the numbers are still growing. It is therefore important to gain a deeper understanding of the genome and protein characteristics, clinical diagnostics, and pathogenic mechanisms of the novel coronavirus, and to develop antiviral drugs and vaccines against it, in order to deal with the COVID-19 pandemic. Traditional biological techniques alone are of limited use for understanding the pandemic. Bioinformatics is the application of computational methods and analytical tools to biological research; it has clear advantages in predicting the structure, product, function, and evolution of unknown genes and proteins, and in screening drugs and vaccines from large amounts of sequence information. Here, we comprehensively summarize several of the most important bioinformatics methods and applications relating to COVID-19, based on currently available reports, with a focus on future research for overcoming the pandemic. With next-generation sequencing (NGS) and third-generation sequencing (TGS) technology, not only can the virus be detected, but high-quality SARS-CoV-2 genomes can also be obtained quickly. The growing body of data on SARS-CoV-2 genome sequences, variants, and haplotypes helps us to understand genome and protein structure, variant calling, mutation, and other biological characteristics. Sequence alignment and phylogenetic analysis suggest that the bat may be the natural host of the novel coronavirus.
Single-cell RNA sequencing provides an abundant resource for discovering the mechanisms of the immune response induced by COVID-19. As an entry receptor, angiotensin-converting enzyme 2 (ACE2) can be used as a potential drug target to treat COVID-19. Bioinformatics methods such as molecular dynamics simulation, molecular docking, and artificial intelligence (AI), applied to drug databases for SARS-CoV-2, can accelerate drug development. Meanwhile, computational approaches help to identify suitable vaccines to prevent COVID-19 infection through reverse vaccinology, immunoinformatics, and structural vaccinology.

Since December 2019, several cases of acute respiratory tract infection have been confirmed in Wuhan, Hubei province, China (C. Zhu et al., 2020). The number of confirmed cases rose unpredictably and spread rapidly, posing a severe threat to public health (Mahase, 2020; J. T. Wu et al., 2020). The World Health Organization (WHO) declared the outbreak a Public Health Emergency of International Concern on January 30, 2020, and characterized it as a pandemic on March 11, 2020. The newly identified β-coronavirus (β-CoV) was named severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2) by the Coronavirus Study Group (CSG) of the International Committee on Taxonomy of Viruses. The disease caused by SARS-CoV-2 was recognized as an infectious disease and officially designated coronavirus disease 2019 (COVID-19) by WHO. According to the current data, SARS-CoV-2 is easily transmitted among people through respiratory droplets and survives in the air for about 2 h while remaining highly infectious (Y. Munster et al., 2020; wei Lu et al., 2020). The incubation period after infection is usually 4-8 days (Chen et al., 2020a). All ages are susceptible to SARS-CoV-2; however, older patients, smokers, and those with comorbidities are more likely to experience severe illness (Chen et al., 2020b). Importantly, the likelihood of local transmission tends to increase as case numbers grow, including through human-to-human transmission during the asymptomatic infection period (Rothe et al., 2020; Yuan et al., 2006).

The two other highly pathogenic β-CoVs, which cause severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS), are zoonotic pathogens that mainly infect the lower respiratory tract, result in fatal pneumonia in humans, and pose a serious threat to human health (Channappanavar and Perlman, 2017; Luk et al., 2019; Ramadan and Shaib, 2019). The overall mortality rate reported by WHO is 9.6% for SARS and 34.4% for MERS (Munster et al., 2020; World Health Organization, 2020). In comparison, SARS-CoV-2 has a global mortality rate of about 2.05%, whereas elderly patients with chronic diseases and intensive care unit (ICU) patients have reached mortality rates of 17-38% (Chen et al., 2020a). The global real-time COVID-19 outbreak map from WHO cannot by itself provide sufficiently timely and large-scale data, and has other shortcomings. Bioinformatics technology made great contributions to the rapid and in-depth study of SARS and MERS and will also bring a new dawn to the study of SARS-CoV-2 (Woo et al., 2010).

Sequencing is still the main method for diagnosing the novel coronavirus and plays an important role in functional research on the virus. In this scientific and technological battle, the analysis of viral sequencing data with bioinformatics tools provided an important basis for the discovery of SARS-CoV-2. SARS-CoV-2 databases provide genome and protein structure quality evaluation, variant calling, interactive visualization, and other biological characteristics.

Sequence alignment among different species helps us to understand the origin and evolutionary characteristics of the virus. Rapid, feasible, accurate, and sensitive point-of-care testing (POCT) approaches for detecting SARS-CoV-2 in different situations are urgently needed. Bioinformatics analysis of single-cell RNA sequencing data has revealed the features of the epithelial and immune cell types that carry SARS-CoV-2 RNA. In addition, bioinformatics can provide high-throughput computational support for prediction and screening to accelerate pandemic treatment and prevention, such as the development of specific drugs and vaccines. Small-molecule drugs, screened from drug databases for SARS-CoV-2 through high-performance computing approaches such as molecular dynamics simulation, molecular docking, and artificial intelligence, can competitively bind to the functional sites of viral proteins. This prevents the viral protein from binding to its true substrate and thereby inhibits viral activity; alternatively, the host can be induced to generate immune responses and produce neutralizing antibodies against the novel coronavirus (Yin et al., 2020).

Although we know comparatively little about SARS-CoV-2, it is a highly pathogenic human pathogen and possibly a zoonotic agent. Challenges will continue to be confronted in several key areas. In this review, we therefore summarize how bioinformatics analysis of the viral RNA genome sequence has been instrumental in clarifying the characteristics, origin, evolution, detection, pathogenesis, and function of SARS-CoV-2, and in developing antiviral drugs and vaccines, in order to provide data for public health decision-making.

The following sections describe several of the most important methods and applications of bioinformatics technologies in combating COVID-19, based on currently available reports; the workflow is shown in Figure 1.

Based on next-generation sequencing (NGS) and third-generation sequencing (TGS) technology, regularly updated and publicly available SARS-CoV-2 databases offer genome and protein structure quality evaluation, variant calling, and interactive visualization. SARS-CoV-2 genome sequence resources from around the world and their related metadata have been published by the Global Initiative on Sharing All Influenza Data (GISAID, https://www.gisaid.org) (Shu and McCauley, 2017), the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov) (O'Leary et al., 2016), the National Microbiology Data Center (NMDC, https://www.nmdc.cn), the China National GeneBank (CNGB, https://www.cngb.org) (Xiao et al., 2019), the China National Center for Bioinformation (CNCB, https://bigd.big.ac.cn/ncov), and several other sources (Table 1). Once the nucleic acid sequence had been obtained, comprehensive bioinformatics analysis demonstrated that COVID-19 is caused by the novel coronavirus SARS-CoV-2, which belongs to lineage B of the β-CoVs and is a positive-sense single-stranded RNA virus with a genome length of about 29.9 kb, with non-coding regions at both ends and non-structural and structural protein-coding regions in the middle (Chan et al., 2020; Roujian Lu et al., 2020; Zhu et al., 2020).

SARS-CoV and MERS-CoV have positive-sense RNA genomes of 27.9 kb and 30.1 kb, respectively (De Wit et al., 2016). The complete RNA genome of SARS-CoV-2 includes 29,870 nucleotides and also contains a variable number of open reading frames (ORFs) (Chen et al., 2020a; Munster et al., 2020; wei Lu et al., 2020). Excluding the poly(A) tail (GenBank no. MN908947), the ORFs encode 9860 amino acids, with five typical ORFs on a common coding strand in the order 5′ UTR - replicase polyprotein (orf1a/b, 7096 aa) - spike glycoprotein (S, 1273 aa) - envelope protein (E, 75 aa) - membrane protein (M, 222 aa) - nucleocapsid protein (N, 419 aa) - 3′ UTR, in which the four essential protein-coding regions S, E, M, and N encode structural proteins (Figure 2A, B) (Chan et al., 2020).
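The idea of an open reading frame used above can be illustrated with a minimal sketch. The function name `find_orfs`, the `min_codons` parameter, and the toy sequence are assumptions made for illustration only; published SARS-CoV-2 annotations were produced with dedicated annotation pipelines, not with a scanner like this, which ignores the reverse strand, ribosomal frameshifting, and nested ORFs.

```python
# Toy ORF scanner: forward strand only, three reading frames.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end, frame) tuples for ATG...stop ORFs.

    start is the 0-based index of the A in ATG; end is exclusive and
    includes the stop codon. min_codons counts the codons before the
    stop (ATG included). ATGs inside an already open ORF are skipped.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return sorted(orfs)


# Hypothetical toy sequence, not taken from any real genome.
toy = "CCATGGCTGCTTAACCCATGAAATTTGGGTGA"
print(find_orfs(toy))  # [(2, 14, 2), (17, 32, 2)]
```

For an ORF returned by this sketch, the encoded protein length in amino acids is `(end - start) // 3 - 1`, because the stop codon is not translated.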

In addition, there are several accessory genes that influence the host innate immune response, such as ORF3b and ORF8. ORF3b encodes a completely novel short protein without an exact function.

The new ORF8 probably encodes a secreted protein formed by an alpha-helix followed by a six-stranded beta-sheet, with no known functional domain or motif (Chan et al., 2020). Two-thirds of the viral RNA, mainly in the non-structural protein-coding region comprising the ORF1a and ORF1b genes, is translated into two replicase polyproteins (pp1a and pp1ab), which are proteolytically cleaved into 16 non-structural proteins (NSPs), namely NSP1-16, that are indispensable for viral transcription and replication (Chan et al., 2020). There are two main methods for predicting the virus structure through high-performance computing and bioinformatics. The first is to infer the virus structure from gene sequencing results. To ensure the accuracy of this method, the results require statistical verification.

Among these approaches, AlphaFold (Senior et al., 2020) de novo simulation uses a deep two-dimensional dilated convolutional residual network to predict the protein structure, whereas homology modeling is based on proteins of similar sequence. The higher the sequence similarity, the more reliable the structural simulation; common software includes Swiss-Model (https://swissmodel.expasy.org/) (Waterhouse et al., 2018), Robetta (http://robetta.bakerlab.org/) (Kim et al., 2004), the I-TASSER suite (Yang et al., 2014), and so on. The structure of the novel coronavirus can be predicted from the SARS virus structure, so de novo simulations using artificial intelligence programs are seldom needed. The second method is to reconstruct the three-dimensional structure of the virus through imaging techniques. In this process, high-performance computing, mainly using RELION software (Scheres, 2012), can accelerate the reconstruction of the three-dimensional structure. AlphaFold and RELION are recommended for calculating viral protein structures, although the computing power of AlphaFold may sometimes lead to overfitting. The molecular weight of the coronavirus is large and, therefore, cryo-electron microscopy (cryo-EM) combined with bioinformatics technologies can accelerate the reconstruction of coronavirus structures.

Currently, SARS-CoV-2 protein sequences and functional information are described in Universal Protein (UniProt, https://www.uniprot.org), and the world's first three-dimensional structure of a protein of the new coronavirus is available in the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB, https://www.rcsb.org, PDB ID: 6LU7). The functional polypeptides are released from the polyproteins (pp1a and pp1ab) by extensive proteolytic processing, predominantly by a 33.8-kDa main protease (Mpro), which is also referred to as the 3C-like protease. Mpro digests the polyproteins at no fewer than 11 conserved sites, starting with the autolytic cleavage of the enzyme itself from pp1a and pp1ab (Hegyi and Ziebuhr, 2002). The important role of Mpro in the viral life cycle, together with the lack of closely related homologs in humans, identifies it as a key drug target against the novel coronavirus (Pillaiyar et al., 2016). Through a high-resolution complex structure, academician Chen's team found that the antibody 4A8 has strong virus neutralization ability and can significantly inhibit the activity of the virus (Chi et al., 2020).

Bioinformatics technology has become a powerful tool for identifying, analyzing, and monitoring the novel coronavirus, and has laid an important foundation for controlling the spread of the virus. The rapid acquisition of the genome sequence and protein structure of SARS-CoV-2 by bioinformatics technology has opened a fast track for targeted drug screening and vaccine development.

The origin and evolution of SARS-CoV-2 remain largely unclear. Several studies have suggested that bats, which are known to host more than 30 CoVs (Wong et al., 2019), could be the natural reservoir of SARS-CoV-2 (Giovanetti et al., 2020; Paraskevis et al., 2020; Wong et al., 2019). On May 1, 2020, WHO stated that many scientists had studied the genome sequence of SARS-CoV-2 and were convinced that it originated from nature. However, there is no definitive evidence of the origin or intermediate host of the novel virus (Phan et al., 2020).

Viral genome sequencing, sequence alignment, and phylogenetic analysis with the bioinformatics tool MUSCLE (3.8.31) (Edgar, 2004) demonstrated that SARS-CoV-2 has 96.3% sequence identity to BatCoV RaTG13, which originated from Yunnan, China, in 2013. It is also closely related to two bat-derived SARS-like CoVs, bat-SL-CoVZC45 and bat-SL-CoVZXC21, which originated from Zhoushan, China, in 2018 and with which it shares 88% nucleotide similarity, whereas it shares 79.5% identity with SARS-CoV and 50% identity with MERS-CoV (Figure 3) (Roujian Lu et al., 2020; Paraskevis et al., 2020; Zhou et al., 2020). The results suggest that bat CoVs and human SARS-CoV-2 might share a common ancestor, although bats are not sold in the seafood market in Wuhan, China (F. Wu et al., 2020). Therefore, the natural host of SARS-CoV-2 may be bats, but its origin and intermediate host are not yet conclusively known. Protein sequence alignment and phylogenetic studies observed that the residues of the receptor were similar in many species, which opens up more possibilities for alternative intermediate hosts, such as turtles, pangolins, minks, and snakes (Table 2) (Ghorbani et al., 2021; Ji et al., 2020; Li et al., 2020; Xiao et al., 2020). A recent study suggests that the pangolin is a probable intermediate host of SARS-CoV-2 but may not be the only one. Genomic sequence similarity analysis of SARS-CoV-2 obtained from different patients shows a homology of more than 99.9%. Therefore, it is reasonable to consider that the SARS-CoV-2 strains originated from one source and were detected relatively quickly (Roujian Lu et al., 2020; Zhou et al., 2020). The S gene of SARS-CoV-2 has less than 75% sequence identity to those of the bat SARS-like CoVs (SL-CoVZXC21 and ZC45) and human SARS-CoV.

Similarly, the spike glycoprotein encoded by the S gene of SARS-CoV-2 is longer than that of SARS-CoV. The spike protein, composed of the S1 and S2 domains, is crucial in determining host tropism and transmission capacity through the mediation of receptor binding and membrane fusion (Roujian Lu et al., 2020). Of these, the S2 subunit of SARS-CoV-2 is highly conserved and has 99% identity with that of SARS-CoV (Roujian Lu et al., 2020). The receptor-binding domain is usually located in the C-terminal domain of S1 and directly associates with the human receptor. Although the S1 domain of SARS-CoV-2 has only approximately 70% identity with that of SARS-CoV, homology modeling revealed that SARS-CoV-2 has a receptor-binding domain structure similar to that of SARS-CoV (Roujian Lu et al., 2020).
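The percentage identities quoted in this section are computed from aligned sequences. A minimal sketch of one common definition is given below, assuming the two sequences have already been aligned (for example by MUSCLE) and that gap columns are excluded from the denominator. The function name `percent_identity` and the toy alignment are illustrative assumptions; the cited studies may use a different denominator (for example, the full alignment length), which changes the value.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy alignment: 9 ungapped columns, 8 of them identical.
print(round(percent_identity("ATGC-TTAGC", "ATGCATTTGC"), 1))  # 88.9
```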

Notably, Zhou et al. found that SARS-CoV-2, just like SARS-CoV, may use angiotensin-converting enzyme 2 (ACE2) as an entry receptor in ACE2-expressing cells, the majority of which are type II alveolar cells (AT2) in the human lung (Edgar, 2004; Ziegler et al., 2020). The similarities and differences among the S proteins of β-coronaviruses are shown in Table 3. Therefore, high-throughput bioinformatics technology is needed for large-scale screening to monitor whether ACE2-targeted drugs are effective in the treatment of SARS-CoV-2 infection, and individual mutations need to be constantly observed to determine whether they affect drug activity.

At this time, diagnostic testing for SARS-CoV-2 can be conducted only at Centers for Disease Table 4 .

Nucleic acid detection is the gold standard for COVID-19 diagnosis. Among the testing methods used to detect the causative virus, qRT-PCR is a specific, rapid, economical, robust, and widely applied technology for qualitative and quantitative diagnosis (Corman et al., 2020; Li et al., 2019a; Wan et al., 2016). However, qRT-PCR requires sophisticated equipment and in practice can exhibit high false-negative rates and low sensitivity, with a positive detection ratio of only 30-50%, so that some suspected infections and close contacts of confirmed cases go unidentified; patients may thus be mistakenly considered uninfected or cured, which impedes pandemic containment (Chu et al., 2020; Lan et al., 2020; Lo and Chiu, 2020). Loop-mediated isothermal amplification (LAMP) is a highly sensitive technology with simple operation and fast reaction speed (within 50 min); it is less dependent on complex equipment and has been widely used to develop POCT techniques for the detection of viruses, various pathogens, and even viral variants (Renfei Lu et al., 2020; Notomi et al., 2000; Thai et al., 2004).

Later, a mismatch-tolerant version of the LAMP method was shown to have higher sensitivity and faster reaction speed than the traditional method (Li et al., 2019b; Zhou et al., 2019). The limit of detection of the RT-LAMP assay was calculated to be 118.6 copies per reaction.

Next-generation and third-generation sequencing are widely applied for mutation identification, pathogen identification, and monitoring of virus evolution (Metsky et al., 2017; Wilson et al., 2014), including for SARS-CoV, but they require expensive equipment, operator expertise, and a turnaround time of >24 h, making them unsuitable for the current crisis. To effectively interrupt the transmission of SARS-CoV-2, promising POCT detection methods are needed. To break the cycle of isolation and spread, LAMP-Seq and nanopore target sequencing (NTS) are capable of population-scale processing.

Because traditional PCR detection needs a professional laboratory, its speed is greatly limited.

LAMP-Seq, a low-cost, rapid (within 40 min), and highly sensitive protocol, allows population-scale testing through massively parallel RT-LAMP with sequence-specific barcodes. It hours (Fauver et al., 2020). Combining the advantages of target amplification and long-read, real-time nanopore sequencing, NTS was established to detect SARS-CoV-2 and other respiratory viruses simultaneously within 6-10 h, with higher sensitivity than standard qPCR (Taiaroa et al., 2020). For the virulence regions, 11 fragments of 600-950 bp were designed as NTS targets, completely covering a 9,115 bp region, and 26 primers were used to develop the SARS-CoV-2 primer panel (Figure 2C). When more samples were tested, NTS could rapidly detect many positive samples within 10 min (for quick detection) and 4 h (for final evaluation), showing good concordance with qPCR results. Parallel testing with an approved SARS-CoV-2 qPCR assay and NTS on a variety of nucleic acid samples from suspected COVID-19 cases confirmed that NTS identified more infected patients as positive, and could also monitor mutations in the nucleic acid sequence and detect other respiratory viruses. NTS is a rapid, accurate, and comprehensive detection method, which makes it suitable for the diagnosis of COVID-19 and may help to reduce mortality. In addition, this approach has the potential to be extended to diagnose more viruses or pathogens (Fauver et al., 2020).
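The role of sequence-specific barcodes in pooled, population-scale testing can be sketched as follows. This is a simplified illustration, not the LAMP-Seq analysis pipeline: it assumes that each read begins with a fixed-length sample barcode, that barcodes are matched exactly, and that a sample is called positive once a minimum number of reads is seen. The barcodes, reads, sample names, and the `min_reads` threshold are all hypothetical.

```python
from collections import Counter


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Count reads per sample by exact match on the leading barcode."""
    counts = Counter()
    for read in reads:
        sample = barcode_to_sample.get(read[:barcode_len].upper())
        if sample is not None:
            counts[sample] += 1  # reads with unknown barcodes are ignored
    return counts


def call_positive(counts, samples, min_reads=2):
    """Flag a sample as positive when it has at least min_reads reads."""
    return {s: counts.get(s, 0) >= min_reads for s in samples}


# Hypothetical barcodes and pooled reads, for illustration only.
barcodes = {"ACGTAC": "sample_1", "TTGCAA": "sample_2", "GGATCC": "sample_3"}
reads = ["ACGTACGGTT", "ACGTACCCAA", "TTGCAAGGTT", "NNNNNNGGTT", "ACGTACGGTA"]

counts = demultiplex(reads, barcodes, barcode_len=6)
print(call_positive(counts, ["sample_1", "sample_2", "sample_3"]))
```

A production pipeline would additionally tolerate sequencing errors in the barcode and confirm that the remainder of each read matches the viral target before counting it.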

Oxford Nanopore MinION and GridION were used for NTS. MinION, the smallest Oxford Nanopore sequencer, is smaller than a cell phone and can be used in various environments with low equipment cost. Base-calling and quality assessment of MinION sequencing data use Guppy (v. 3.1.5) software, and GridION uses MinKNOW (v. 3.6.5). The reads of each sample are mapped against the virus genome reference database using BLASTn (v. 2.9.0+) (Altschul et al., 1990). Sequence correction of nanopore sequencing data was performed using medaka (v. 0.10.5) (Oxford Nanopore Technologies, 2018). The ARTIC bioinformatics workflow is used to obtain consensus sequences, SNPs, and mutation data with reference to published SARS-CoV-2 sequence information. Variants were identified in infected patients as single-nucleotide mutations at seven sites by NTS and qPCR.

Silent variants exist that have not been characterized by existing whole-genome sequencing methods. Finally, bioinformatics analysis may also be introduced at the data analysis stage to obtain high-quality genomes and variants and to support high-throughput detection (Charalampous et al., 2019; Greninger et al., 2015).
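The step from mapped reads to a per-sample sequence and a list of single-nucleotide differences can be illustrated with a toy majority-vote sketch. It is not the ARTIC or medaka algorithm: it assumes reads are already aligned to the reference without insertions or deletions, and the function name `consensus_and_snps`, the `min_depth` parameter, and the example data are hypothetical.

```python
from collections import Counter


def consensus_and_snps(reference, aligned_reads, min_depth=2):
    """Majority-vote consensus and SNP list from gap-free aligned reads.

    aligned_reads is a list of (start, sequence) pairs with 0-based start
    positions. Positions covered by fewer than min_depth reads are
    reported as N. SNPs are (1-based position, reference base, called base).
    """
    piles = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq.upper()):
            pos = start + offset
            if 0 <= pos < len(reference) and base in "ACGT":
                piles[pos][base] += 1

    consensus = []
    snps = []
    for pos, (ref_base, pile) in enumerate(zip(reference.upper(), piles)):
        if sum(pile.values()) < min_depth:
            consensus.append("N")  # too little coverage to call a base
            continue
        base = pile.most_common(1)[0][0]
        consensus.append(base)
        if base != ref_base:
            snps.append((pos + 1, ref_base, base))
    return "".join(consensus), snps


# Hypothetical reference and reads, for illustration only.
ref = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTATGT"), (4, "ATGTAC"), (0, "ACGT")]
print(consensus_and_snps(ref, reads))  # ('ACGTATGTNN', [(6, 'C', 'T')])
```

Real nanopore workflows also model indels, base quality, and strand bias, which is why dedicated tools are used in practice.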

Single-cell RNA sequencing (scRNA-seq) is another important technology; it provides a comprehensive cell atlas that shows how changes in different peripheral immune cell subtypes are associated with distinct clinical features of COVID-19. Bioinformatics analysis found SARS-CoV-2 RNA in a variety of epithelial and immune cell types, accompanied by significant changes in the transcriptome of virus-positive cells. Monocytes and lymphocytes do not express large amounts of pro-inflammatory cytokines in peripheral blood (Wilk et al., 2020), whereas upregulation of S100A8/A9 in peripheral megakaryocytes and monocytes is a main cause of the cytokine storm in severe patients (Ren et al., 2021). Collectively, single-cell RNA sequencing results represent a rich resource for understanding the complex multicellular immune response induced by COVID-19 and for developing effective therapeutic directions for COVID-19.

At present, bioinformatics based on high-performance computing is a standard scientific research tool and will certainly be used to accelerate the development of specific drugs. In general, modern drug development needs to understand the structure of the virus and then evaluate the small molecules that can bind to and inactivate it. ACE2, as an entry receptor and potential drug target, plays an important role in the discovery and optimization of inhibitors that block viral entry into cells. Small-molecule drugs can competitively bind to the functional sites of viral proteins, which prevents the viral protein from binding to its true substrate, thereby inhibiting viral activity (Abd El-Aziz and Stockand, 2020). Drug databases for SARS-CoV-2 (D3Targets-2019-nCoV, CoViLigands, CORDITE, DockCov2, DrugBank, TCMD), accelerated molecular dynamics simulation (vsREMD), molecular docking (AutoDock, D3Docking, D3Similarity, CDOCKER), and artificial intelligence (AI) technology (DNN, MolAICal) will assist in drug design and screening through high-throughput computing with bioinformatics analysis (Table 5). D3Targets-2019-nCoV, a molecular docking-based web server, is freely accessible to the public (https://www.d3pharma.com/D3Targets-2019-nCoV/index.php); it currently contains 42 viral proteins with 69 conformations from the Protein Data Bank, for predicting potential target proteins of active compounds and for screening drugs that may have potential inhibitory effects against these targets (Chen et al., 2021). The DrugBank database (https://www.drugbank.ca) contains comprehensive bioinformatics and chemoinformatics resources and provides detailed information on drugs, drug targets, and their mechanisms, including pharmacochemistry, pharmacology, pharmacokinetics, and drug-drug interactions. Traditional Chinese Medicines Database

can display the microscopic evolution of the system at the atomic level.

After the vsREMD simulation, a homology model of the novel coronavirus Mpro was constructed, and N3 could fit inside the substrate-binding pocket, indicating that it could target the Mpro of the current novel coronavirus. Kinetic studies of the equilibrium binding constant Ki (designated as K2/K1) and the inactivation rate constant K3 for covalent bond formation also showed that N3 has a strong inhibitory effect on the novel coronavirus Mpro. Moreover, Joshi et al. conducted phylogenetic and sequence similarity network analysis on the main drug target, Mpro, and identified molecules that were strongly active against SARS-CoV-2 Mpro, such as δ-viniferin, myricitrin, Taiwanhomoflavone A, lactucopicrin 15-oxalate, nympholide A, afzelin, biorobin, hesperidin, and phyllaemblicin B; these molecules were also shown to bind strongly to other potential targets of SARS-CoV-2 infection, such as the viral receptor human angiotensin-converting enzyme 2 (HACE-2) and RNA-dependent RNA polymerase (RdRp) (Joshi et al., 2020). In addition, AutoDock evaluates each small molecule in a small-molecule library and estimates, from various orientations, the energy released on binding to the protein. The more stable the binding, the more likely the small molecule is to be developed into a drug (Morris et al., 2009). D3Docking is a structure-based module embedded in the D3Targets-2019-nCoV web server, which uses molecular docking to explore drug targets for drugs or active compounds observed in previous clinical studies and to identify compounds against potential drug targets via protein structure-based virtual screening. D3Similarity is a complementary approach to docking-based methods for SARS-CoV-2 target prediction, virtual screening, and target identification of potential coronavirus antivirals according to two-dimensional (2D)/three-dimensional (3D) molecular similarity, which lays the foundation for the screening of new COVID-19 drugs.
For example, remdesivir is an RdRp inhibitor that is reported to effectively inhibit SARS-CoV and SARS-CoV-2 in D3Docking and in clinical practice. Recent research reported that a batch of existing drugs and traditional Chinese medicines that could be effective in the therapy of pneumonia were discovered by using high-performance computing and bioinformatics methods. Existing drugs and active natural products in traditional Chinese medicines that may have therapeutic effects on SARS-CoV-2 have been discovered in this way, which greatly shortens the development time (Beck et al., 2020; Jin et al., 2020).
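Similarity-based screening of the kind described for 2D molecular similarity can be sketched with the Tanimoto coefficient, a widely used measure for comparing binary molecular fingerprints. Whether D3Similarity uses this exact measure is not stated here, so the sketch is an assumption; the fingerprints are represented as sets of "on" bit indices, and the compound names and bit values are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(a & b) / len(union)


def rank_by_similarity(query_fp, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprints, for illustration only.
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 8},
    "cmpd_B": {2, 3},
    "cmpd_C": {1, 4, 7, 9},
}
for name, score in rank_by_similarity(query, library):
    print(name, round(score, 2))
```

In practice the fingerprints would be generated by a cheminformatics toolkit from the molecular structures, and a similarity ranking of this kind is usually followed by docking of the top-ranked compounds.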

efficient way to screen the potential drugs for the treatment of COVID-19 by using two different learning databases, one composed of compounds that have been reported or confirmed to have activity against SARS-CoV, SARS-CoV-2, human immunodeficiency virus, and influenza virus, and the other containing the known 3C-like protease inhibitors. After selection, 13 drugs (bedaquiline, brequinar, celecoxib, clofazimine, conivaptan, gemcitabine, tolcapone, vismodegib, boceprevir, chloroquine, homoharringtonine, tilorone, and salinomycin) were found that may prevent virus replication and show potential for development as antiviral drugs. Deep learning is another AI approach that has been applied to medicine and 2D/3D ligand design. MolAICal software supports 3D drug design for protein targets by combining artificial intelligence and classical algorithms.

At present, the SARS-CoV-2 outbreak is tending to expand. Screening and developing therapeutic drugs for viral diseases has become a focus and a difficulty of scientific research worldwide. Existing drugs and traditional Chinese medicines against SARS-CoV-2 are screened by many kinds of molecular dynamics simulation, molecular docking, and artificial intelligence technology combined with bioinformatics, which is considered more effective and shortens the development time. After screening, ledipasvir, velpatasvir, and remdesivir were reported to be effective as therapeutic drugs against SARS-CoV-2. In addition, the molecules δ-viniferin, myricitrin, Taiwanhomoflavone A, lactucopicrin 15-oxalate, nympholide A, afzelin, biorobin, hesperidin, and phyllaemblicin B bind strongly to other potential targets of SARS-CoV-2 infection. Furthermore, bedaquiline, brequinar, celecoxib, clofazimine, conivaptan, gemcitabine, tolcapone, vismodegib, boceprevir, chloroquine, homoharringtonine, tilorone, and salinomycin are potential antiviral drugs. One traditional Chinese medicine, rhubarb, can be used for pandemic virus obstruction of the lung. However, new drugs still need to be designed, and a large number of clinical trials are still needed.

Vaccination is the most effective means of preventing and controlling infectious diseases. Since the start of the COVID-19 pandemic, researchers around the world have been accelerating the development of effective candidate vaccines to combat this viral disease. At present, several applications of bioinformatics to vaccine research can improve the effectiveness and safety of vaccines through reverse vaccinology, immunoinformatics, and structural vaccinology (Chukwudozie et al., 2021a).

Reverse vaccinology (RV) is a promising approach that aims to find novel vaccine candidates from the genome or proteome of a pathogen. Systematic sequencing revealed that many isolates have deletions in their core genomes, which may have a significant impact on viral structure or function. RV relies on bioinformatics methods to identify target antigens from genomic information. The open reading frames (ORFs) of the organism's genome are analyzed with the VaxiJen server to determine candidate antigens and the physicochemical properties related to antigen epitopes, reducing the time spent on antigen selection. Using the existing SARS-CoV-2 genome sequence information, a few researchers have used RV in the design of novel COVID-19 vaccines. In addition, it has also been applied to develop a multi-epitope chimeric vaccine against SARS-CoV-2 (Enayatkhani et al., 2021; Ong et al., 2020).

The immunoinformatics approach involves the application of computational methods to epitope-based peptide vaccine design. Once the SARS-CoV-2 genome and 3D protein structures were available, the most antigenic polyprotein regions and inhibitor-binding sites, which can generate a specific immune response, were predicted using various bioinformatics tools. To predict immunogenic targets from the viral genome sequences, computational tools such as TEpredict, CTLPred, NetMHC, and Epitopemap were applied. Immunoinformatics tools can help researchers to develop potentially suitable vaccine candidates by providing a satisfactory understanding of the human immune response to an organism in a short time (Delany et al., 2013; María et al., 2017).
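A typical first step before running epitope predictors of this kind is to enumerate overlapping candidate peptides from a protein sequence. The sketch below assumes 9-residue windows, a length commonly used for MHC class I binding prediction, and skips windows containing non-standard residue codes. The function name `peptide_windows` and the toy sequence are hypothetical, and no binding prediction is performed; scoring would be done by a dedicated tool.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def peptide_windows(protein, k=9):
    """Yield (1-based start, peptide) for every k-mer made of standard residues."""
    protein = protein.upper()
    for i in range(len(protein) - k + 1):
        peptide = protein[i:i + k]
        if set(peptide) <= STANDARD_RESIDUES:
            yield i + 1, peptide


# Hypothetical toy sequence; X marks an unknown residue.
toy_protein = "MKTAYIAKQRXLS"
print(list(peptide_windows(toy_protein)))
# [(1, 'MKTAYIAKQ'), (2, 'KTAYIAKQR')]
```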

Structural vaccinology is a rational approach to developing effective vaccines by joining epitopes and immunogenic domains of the SARS-CoV-2 proteins. The conformational characteristics of viral epitopes can make them good candidate antigens (María et al., 2017). Structural properties (structural stability of peptides, solvent exposure, hydrophobicity, and codon optimization) are used to map antigen epitopes and to detect conformational features that may affect immunogenicity. In short, structural vaccinology has been used to predict candidate vaccines for SARS-CoV-2 via molecular docking, molecular dynamics simulations, and homology modeling.
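Of the structural properties listed above, hydrophobicity is the simplest to compute directly from sequence. The sketch below uses the Kyte-Doolittle hydropathy scale, which is an assumption of this example rather than a method named in the source; the function names and the window size are likewise illustrative.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide: str) -> float:
    """Mean hydropathy of a peptide (grand average of hydropathy)."""
    scores = [KYTE_DOOLITTLE[aa] for aa in peptide.upper()]
    return sum(scores) / len(scores)


def hydropathy_profile(peptide: str, window: int = 7):
    """Sliding-window mean hydropathy; one value per window position.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in peptide.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]
```

Strongly negative windows suggest hydrophilic stretches that are more likely to be solvent exposed, but this is only a coarse sequence-level proxy; solvent exposure in a folded protein is better assessed on a 3D structure or homology model.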

These techniques help to identify effective vaccines for the prevention of COVID-19 infection (Chukwudozie et al., 2021b).

Although many bioinformatics tools and algorithms have been used to develop safe candidate vaccines for COVID-19, there are still limitations in fully integrating these prediction methods into traditional vaccine design and development. Unlike DNA viruses with stable genomes, such as smallpox virus, the novel coronavirus is an RNA virus and is therefore more prone to mutation, which is likely to change the nature of the virus and can greatly reduce the effectiveness of a vaccine. One of the main deficiencies of bioinformatics in finding effective vaccines against COVID-19 is that it cannot solve the long-standing problem of selecting a suitable animal model to test vaccine candidates. In addition, reverse vaccinology can only target proteins and identify linear and discontinuous epitopes, so some predicted epitopes may be hidden within viral proteins and difficult to detect in vivo (Silva-Arrieta et al., 2020). For immunoinformatics screening, the polymorphism of MHC class I molecules may also limit the precision of vaccine design (Greenbaum et al., 2011). These may be the reasons for the failure of candidate vaccine screening. Ultimately, it is [...] of SARS-CoV-2 were 88% and 67%, respectively (Lopez Bernal et al., 2021). No matter how the virus mutates, the more we know about its variants, the better the vaccines that can be developed by combining bioinformatics methods and biological technologies. The pandemic is not over yet, and whether it will return in the future remains unknown. Even if the pandemic ends suddenly, we should continue to develop the most promising vaccine candidates.

SARS-CoV-2 had an 88% identity to bat-SL-CoVZC45 and bat-SL-CoVZXC21 originating from Zhoushan, China, in 2018, whereas it shared 79.5% identity to SARS-CoV and 50% identity to MERS-CoV.

Turtle (possible intermediate host): The interaction between the key amino acids of the S protein RBD and ACE2 indicated that turtles (Chrysemys picta bellii, Chelonia mydas, and Pelodiscus sinensis) may act as potential intermediate hosts.

Pangolin (possible intermediate host): A coronavirus closely related to SARS-CoV-2 was isolated from pangolins.

Mink (possible intermediate host): A high rate of variation within SARS-CoV-2 mink isolates implies that mink populations were infected before human populations.

Snake (possible intermediate host): SARS-CoV-2 has the most similar codon usage bias to snake.

BAT-CoV-RaTG13 S protein contains short insertion sequences in the N-terminal domain. Receptor: ACE2.

The S1 domain RBD is different from SARS-CoV, which contains a core structure and a subsidiary subregion that functions as RBM. Receptor: DPP4 (CD26).

HCoV-HKU: Three identical S1 domains form an interwoven cap at the top of the S2 stem. Receptor: unknown.

Table 4 Comparison of detection methods of SARS-CoV-2

qRT-PCR. Advantages: gold standard; highly specific, rapid, economical, the most robust and widely applied technology. Disadvantages: low sensitivity, limited speed, requires sophisticated equipment, and exhibits false-negative rates.

NGS. Advantages: highly accurate, high throughput, sequencing for a high-quality genome. Disadvantages: expensive equipment, requires operator expertise, and a turnaround time of >24 h.

NTS. Advantages: target amplification and long reads; rapid, highly accurate, higher sensitivity than standard qRT-PCR, and sequencing for a high-quality genome. Disadvantages: expensive equipment, and more expensive than qRT-PCR.

RT-LAMP. Advantages: high sensitivity, simple operation, fast reaction speed (within 50 min), less dependent on complex equipment, and can detect viruses, various pathogens, and even viral variants. Disadvantages: the limit of detection of the RT-LAMP assay was calculated to be 118.6 copies per reaction.

Advantages: low cost, rapid (within 40 min), and highly sensitive protocol; population-scale testing. Disadvantages: without clinical certification.

Antibody. Advantages: traditional, mature technology. Disadvantages: lower sensitivity, and antibody production takes a long time.

CT. Advantages: easy to operate and helpful in diagnosis. Disadvantages: asymptomatic infection cannot be diagnosed.

Bioinformatics is the application of computational methods and analytical tools in the field of SARS-CoV-2 research, which has obvious advantages in predicting the structure, product, function, and evolution of unknown genes and proteins, and in screening drugs and vaccines from a large amount of sequence information.

Highlights:

• We summarized several applications of bioinformatics technologies to COVID-19.

• Bioinformatics technology helps to reveal the genome and protein structure, variant calling, mutation, and other biological characteristics of SARS-CoV-2.

• Bioinformatics analysis helps to find evidence of the origin, host, and evolution of the virus.

• As an entry receptor, angiotensin-converting enzyme 2 (ACE2) can be used as a potential drug target to treat COVID-19.

• Molecular dynamics simulation, molecular docking, and artificial intelligence (AI) technologies can accelerate the development of drugs.

• Bioinformatics technologies can improve the effectiveness and safety of vaccines through reverse vaccinology, immunoinformatics, and structural vaccinology.


Recent progress and challenges in drug development against COVID-19 coronavirus (SARS-CoV-2) -an update on the status

Basic local alignment search tool

MolAICal: a soft tool for 3D drug design of protein targets by artificial intelligence and classical algorithm

Predicting commercially available antiviral drugs that may act on the novel coronavirus (SARS-CoV-2) through a drug-target interaction deep learning model

Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan

Pathogenic human coronavirus infections: causes and consequences of cytokine storm and immunopathology

Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection

Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study

Cases of 2019-Novel Coronavirus (2019-nCoV) Pneumonia in Wuhan, China

DockCoV2: A drug database against SARS-CoV-2

A potent neutralizing human antibody reveals the N-terminal domain of the Spike protein of SARS-CoV-2 as a site of vulnerability

Molecular Diagnosis of a Novel Coronavirus (2019-nCoV) Causing an Outbreak of Pneumonia

The Relevance of Bioinformatics Applications in the Discovery of Vaccine Candidates and Potential Drugs for COVID-19 Treatment

Immuno-informatics design of a multimeric epitope peptide based vaccine targeting SARS-CoV-2 spike glycoprotein

Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR

SARS and MERS: Recent insights into emerging coronaviruses

Vaccines, reverse vaccinology, and bacterial pathogenesis

MUSCLE: Multiple sequence alignment with high accuracy and high throughput

Reverse vaccinology approach to design a novel multi-epitope vaccine candidate against COVID-19: an in silico study

Coast-to-Coast Spread of SARS-CoV-2 during the Early Epidemic in the United States

Comparative phylogenetic analysis of SARS-CoV-2 spike protein-possibility effect on virus spillover

The first two cases of 2019-nCoV in Italy: Where they come from?

Functional classification of class II human leukocyte antigen (HLA) molecules reveals seven different supertypes and a surprising degree of repertoire sharing across supertypes

Rapid metagenomic identification of viral pathogens in

Conservation of substrate specificities among coronavirus main proteases

RT-LAMP for rapid diagnosis of coronavirus SARS-CoV-2

Cross-species transmission of the newly identified coronavirus 2019-nCoV

Structure of Mpro from COVID-19 virus and discovery of its inhibitors

Inhibition of SARS-CoV 3CL protease by flavonoids

Discovery of potential multi-target-directed ligands by targeting host-specific SARS-CoV-2 structurally conserved main protease

Crystal structure of SARS-CoV-2 nucleocapsid protein RNA binding domain reveals potential unique drug targeting sites

Artificial intelligence approach fighting COVID-19 with repurposing drugs

Protein structure prediction and analysis using the Robetta server

Positive RT-PCR Test Results in Patients Recovered from COVID-19

A mismatch-tolerant RT-quantitative PCR: Application to broad-spectrum detection of respiratory syncytial virus

A Mismatch-tolerant RT-LAMP Method for Molecular Diagnosis of Highly Variable Viruses

Aerodynamic analysis of SARS-CoV-2 in two Wuhan hospitals

Composition and divergence of coronavirus spike proteins and host ACE2 receptors predict potential intermediate hosts of SARS-CoV-2

Racing Towards the Development of Diagnostics for a Novel Coronavirus (2019-nCoV)

Effectiveness of Covid-19 Vaccines against the B.1.617.2 (Delta) Variant

Genomic epidemiology of SARS-CoV-2 in Guangdong Province

Development of a Novel Reverse Transcription Loop-Mediated Isothermal Amplification Method for Rapid

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

Molecular epidemiology, evolution and phylogeny of SARS coronavirus

Study on screening potential traditional Chinese medicines against 2019-nCoV based on Mpro and PLP

China coronavirus: what do we know so far?

The Impact of Bioinformatics on Vaccine Design and Development

CORDITE: The Curated CORona Drug InTERactions Database for SARS-CoV-2

Zika virus evolution and spread in the Americas

Software news and updates AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility

A Novel Coronavirus Emerging in China -Key Questions for Impact Assessment

Loop-mediated isothermal amplification of DNA

Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation

COVID-19 Coronavirus Vaccine Design Using Reverse Vaccinology and Machine Learning

Medaka: Sequence correction provided by ONT Research

Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event

Importation and Human-to-Human Transmission of a Novel Coronavirus in Vietnam

An overview of severe acute respiratory syndrome-coronavirus (SARS-CoV) 3CL protease inhibitors: Peptidomimetics and small molecule chemotherapy

Middle east respiratory syndrome coronavirus (MERS-COV): A review

Transmission of 2019-nCoV Infection from an Asymptomatic Contact in Germany

RELION: Implementation of a Bayesian approach to cryo-EM structure determination

LAMP-Seq: Population-scale COVID-19 diagnostics using combinatorial barcoding

Improved protein structure prediction using potentials from deep learning

2019 novel coronavirus of pneumonia in Wuhan, China: emerging attack and management strategies

GcMeta: A Global Catalogue of Metagenomics platform to support the archiving, standardization and analysis of microbiome data

D3Targets-2019-nCoV: a webserver for predicting drug targets and for multi-target and multi-site based virtual screening against COVID-19

GISAID: Global initiative on sharing all influenza data -from vision to reality

In silico veritas? Potential limitations for SARS-CoV-2 vaccine development based on T-cell epitope prediction

The Global Landscape of SARS-CoV-2 Genomes, Variants, and Haplotypes

Direct RNA sequencing and early evolution of SARS-CoV-2

Development and Evaluation of a Novel Loop-Mediated Isothermal Amplification Method for Rapid Detection of Severe Acute Respiratory Syndrome Coronavirus

A melting curve-based multiplex RT-qPCR assay for simultaneous detection of four human coronaviruses

Clinical Characteristics of 138 Hospitalized Patients with 2019 Novel Coronavirus-Infected Pneumonia in Wuhan

Exploring Conformational Change of Adenylate Kinase by Replica Exchange Molecular Dynamic Simulation

Nanopore target sequencing for accurate and comprehensive detection of SARS-CoV-2 and other respiratory viruses

Clinical characteristics and therapeutic procedure for four cases with 2019 novel coronavirus pneumonia receiving combined Chinese and Western medicine treatment

SWISS-MODEL: Homology modelling of protein structures and complexes

2019-nCoV transmission through the ocular surface must not be ignored

A single-cell atlas of the peripheral immune response in patients with severe COVID-19

Actionable Diagnosis of Neuroleptospirosis by Next-Generation Sequencing

Global epidemiology of bat coronaviruses

Coronavirus genomics and bioinformatics analysis

World Health Organization, 2020. WHO statement regarding cluster of pneumonia cases in Wuhan

A new coronavirus associated with human respiratory disease in China

Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study

Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins

Increased interactivity and improvements to the GigaScience database

The I-TASSER suite: Protein structure and function prediction

Ligand-based approach for predicting drug targets and for virtual screening against COVID-19

Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir

A climatologic investigation of the SARS-CoV outbreak in Beijing

Digestive system is a potential route of COVID-19: An analysis of single-cell coexpression pattern of key proteins in viral entry process

Changes in contact patterns shape the dynamics of the COVID-19 outbreak in China

Rapid molecular detection of SARS-CoV-2 (COVID-19) virus RNA using colorimetric LAMP

The 2019 novel coronavirus resource

Single-Cell RNA Expression Profiling of ACE2, the Receptor of SARS-CoV-2

A mismatch-tolerant reverse transcription loop-mediated isothermal amplification method and its application on simultaneous detection of all four serotype of dengue viruses

A Novel Coronavirus from Patients with Pneumonia in China

SARS-CoV-2 Receptor ACE2 is an Interferon-Stimulated Gene in Human

Table 5 Drug databases for SARS-CoV-2

of SARS-CoV-2. (A) Schematic representation of the structure of SARS-CoV-2. It has four structural proteins, S (spike), E (envelope), M (membrane), and N (nucleocapsid); the N protein holds the single-stranded, positive-sense RNA genome, and the S, E, and M proteins together create the viral envelope. (B) The SARS-CoV-2 genome comprises, in sequence, a 5′ untranslated region (5′ UTR) including the 5′ leader sequence, open reading frame (ORF) 1a/b, envelope, membrane, and nucleoprotein, accessory proteins such as ORF3, 6, 7a, 7b, 8, and 9b, and a 3′ untranslated region (3′ UTR). (C) SARS-CoV-2 structure is based upon amplification targets of the NTS method.

Education Institutions (QN2021012, ZD2021007).

The authors declare that they have no competing interests.

LFM and YH: conceptualization, data curation, resources, writing original draft, editing, approval of final article. LFM, YH, HYL, JPL, XQH, HYL and SQY: supervision, writing and editing. All authors read and approved the final manuscript.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.