key: cord-0919573-yozeyt5l authors: Maison, David P.; Ching, Lauren L.; Cleveland, Sean B.; Tseng, Alanna C.; Nakano, Eileen; Shikuma, Cecilia M.; Nerurkar, Vivek R. title: Algorithm for the Quantitation of Variants of Concern for Rationally Designed Vaccines Based on the Isolation of SARS-CoV-2 Hawaiʻi Lineage B.1.243 date: 2021-08-19 journal: bioRxiv DOI: 10.1101/2021.08.18.455536 sha: 840ccd231785547e5c5c3be486d028f68e151918 doc_id: 919573 cord_uid: yozeyt5l SARS-CoV-2 worldwide emergence and evolution has resulted in variants containing mutations resulting in immune evasive epitopes that decrease vaccine efficacy. We acquired clinical samples, analyzed SARS-CoV-2 genomes, used the most worldwide emerged spike mutations from Variants of Concern/Interest, and developed an algorithm for monitoring the SARS-CoV-2 vaccine platform. The algorithm partitions logarithmic-transformed prevalence data monthly and Pearson’s correlation determines exponential emergence. The SARS-CoV-2 genome evaluation indicated 49 mutations. Nine of the ten most worldwide prevalent (>70%) spike protein changes have r-values >0.9. The tenth, D614G, has a prevalence >99% and r-value of 0.67. The resulting algorithm is based on the patterns these ten substitutions elucidated. The strong positive correlation of the emerged spike protein changes and algorithmic predictive value can be harnessed in designing vaccines with relevant immunogenic epitopes. SARS-CoV-2 is predicted to remain endemic and continues to evolve, so must SARS-CoV-2 monitoring and next-generation vaccine design. Since the origin of the Coronavirus Disease 2019 (COVID-19) pandemic, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) has rapidly evolved into seven Variants of Interest (VOI) and four Variants of Concern (VOC). 1 Further, as of July 15, 2021, over 2,559,000 SARS-CoV-2 genomic sequences have been deposited in the publicly available GenBank and the Global Initiative on Sharing Avian Influenza Data (GISAID) databases. 2 From the establishment of the now universal D614G substitution 3 to the emergence of the VOC and VOI with dozens of different mutations across their respective genomes, 1 the SARS-CoV-2 evolution, and adaptations, are apparent and constant. To give nomenclature to these evolutionary events, previously. 14 Briefly, viral RNA was eluted in 60 µL of the elution buffer. RNA extraction was confirmed using the Takara RNA LA PCR Kit (CAT #RR012A) and previously reported primer sets. 14, 20 RNA was reverse transcribed into cDNA using the Takara RNA LA PCR Kit (Cat #RR012A) according to the manufacturer's protocol but with an extension time of 90 minutes. WGS was conducted by the ASGPB Core, UHM. Briefly, libraries prepared as per the manufacturer's protocol (Illumina Document #1000000025416 v09) using Illumina DNA Prep kit (Cat #20018704) and Nextera XT indexes were sequenced using the MiSeq Reagent Kit v3 (600 cycle) (Cat #MS-102-3003) and an Illumina MiSeq sequencer. WGS reads were compiled using the UHM MANA High-Performance Computing Cluster (HPC). Raw fastq sequence files were evaluated by the FASTQC program 21 for technical sequence error. After confirming a lack of technical errors, low-quality sequences were filtered and trimmed from each read with Trimmomatic 22 using paired-end adapter sequence NexteraPE-PE (ILLUMINACLIP:NexteraPE-PE:2:30:10) and a sliding 4 base window evaluating for quality with a PHRED score over 30. Trimmed result quality was confirmed with FASTQC. The trimmed-paired-end reads were then mapped to the NC_045512 reference genome using Bowtie2 23 and variants called with samtools mpileup 24 and transformed from VCF to FASTQ using bcftools and vcfutils 25 and finally converted to FASTA using seqtk. For comparison and validation, the fastq file was also inputted into Geneious Prime 2021.1.1 to produce FASTA files with the coronavirus assembly workflow. 26 The resultant consensus sequence was defined as Hawai'i Isolates and submitted to GenBank (SARS-CoV-2, Isolate USA-HI498 2020 (MZ664037) and SARS-CoV-2, Isolate USA-HI708 2020 ( MZ664038)). The lineage of each sequence was determined with the Phylogenetic Assignment of Named Global Outbreak (PANGO) Lineage nomenclature. [27] [28] [29] All Hawai'i sequences as of July 28, 2021, from both GenBank and GISAID were downloaded and searched for the presence of potential lineages of concern using PANGO lineage as described previously. 13 From the aforementioned variant comparison section, the selected S gene sequences underwent pairwise alignment with NC_045512 in SnapGene, and SNPs were identified. The SNPs were inputted into the SnapGene sequence feature and Nextclade 45 to determine amino acid substitutions (AAS). Non-synonymous substitutions were confirmed in GISAID using the metadata for each accession number. The PANGO Server was used to identify and confirm the lineage of each of the aforementioned thirteen strains described in the Variant Comparison section. 28, 29 The lineages and their collective AAS were identified and individually searched within GISAID for worldwide prevalence from March 2020 -April 2021. Each lineage was filtered separately, as were AAS. Parameters for selection were sequences that included a full month, day, and year of collection. Each month's prevalence for each lineage and AAS was logarithmically transformed and evaluated against month as an interval value with Pearson's correlation to determine an exponential increase in worldwide emergence as described previously. 14 Separately, the following search parameter was used in PubMed to locate in silico studies predicting vaccine epitopes to SARS-CoV-2: "((B-cell) OR (B cell)) AND ((T-cell) OR (T cell)) AND (peptide) AND (vaccine epitope) AND ((SARS-CoV-2) OR (COVID-19))." From this search on January 28, 2021, the three most recent articles 46-48 and the three best matching articles [49] [50] [51] were selected for further analysis by mapping to the Spike protein. All predicted epitopes able to be searched and defined with SnapGene's "Find Protein Sequence" feature were included. Article overlaps in the systematic review were only included once. The algorithm herein described was developed from the quantitation of the ten most emerged amino acid substitutions and deletions. The algorithm is as follows: This algorithm classifies AAS and deletions into two categories based on Pearson's r-value, r≥0.9 and r<0.9. For AAS and deletions to be called out as concerning, the r-value should be ≥0.9 and the previous month's worldwide prevalence of these AAS and deletions should be >30%. Further, these concerning AAS and deletions can be considered for inclusion in the next-generation vaccine design. If the r≥0.9 and the previous month's prevalence is between 2% and 30%, then the AAS or deletion is classified as interesting, and needs to be evaluated in a research setting. The same algorithm can also be applied for standardizing classifications of SARS-CoV-2 lineages as of interest or concerning. If the r<0.9, then the focus is on previous month's prevalence of AAS and deletions, as after a mutation is established, there is no longer need to evaluate emergence. If the previous month's prevalence is ≥90%, then the mutation is established in the SARS-CoV-2 genome and should be considered as concerning and be part of the next-generation vaccine. If the previous month's prevalence is >50%, then the mutation represents the majority, and needs to be considered as interesting and evaluated in a research setting. Again, the same algorithm can also be applied for standardizing classifications of SARS-CoV-2 lineages as of interest or concerning. Origin tracking was accomplished as described previously. 13 The sequences from Hawai'i were obtained through both GenBank and GISAID, along with SARS-CoV-2, Isolate USA-HI498 2020 and SARS-CoV-2, Isolate USA-HI708 2020 (from this study), to accomplish the origin determination. Virus was isolated from an OPS collected five days following symptom onset from an individual with PCR confirmed SARS-CoV-2 infection and propagated in Vero E6 cells. A stock of the SARS-CoV-2, Isolate USA-HI498 2020 was produced following three blind passages in Vero E6 cells and titered to 1.28 x 10 7 PFU/mL. Minimal CPE was observed at 12 hours, moderate CPE at 24 hours, and significant CPE at 48 hours ( RNA extraction was confirmed using RT-PCR as described previously. 14 There were 702,978 and 792,952 reads for SARS-CoV-2, Isolate USA-HI498 2020 and HI708, respectively (Table 1) . FastQC confirmed the quality of both the untrimmed and trimmed fastqc files. From GenBank, 317 full-genome SARS-CoV-2 sequences were obtained on July, 28, 2021. The output S gene alignment between the thirteen genomic sequences identified 49 SNPs. Pearson's correlation on logarithmically-transformed prevalence was calculated for the twelve SARS-CoV-2 variants in this study in order of highest to lowest r value as outlined in Table 2 and Figure 2 . Further, of the 49 identified SNPs in the S gene, 44 resulted in non-synonymous AAS and deletions in the protein, and 5 were synonymous as outlined in Figure 3A , 3B, 3Ci-xii and Tables 2 and 3 . Pearson's correlation on logarithmically-transformed prevalence was calculated for all identified SARS-CoV-2 AAS and deletions (Tables 2, 3, Figure 4 , and Supplementary Figure 1 ). The PubMed search for epitope predictions returned 42 publications. In total, 393 in silico predicted B and T cell epitopes corresponding to the spike protein were mapped from these publications ( Figure 3B ). Of these, 108 epitopes involved the N-terminal domain (NTD), 102 epitopes involved the receptor binding domain (RBD), 7 epitopes involved the S1/S2 furin cleavage site, 10 epitopes involved the fusion peptide, 20 involved heptad repeat 1, 12 involved heptad repeat 2, 12 involved the transmembrane region, and 8 involved the intracellular tail domain. The remaining 112 epitopes fell outside of these domains. Further, 239 and 151 epitopes were in the S1 and S2, respectively, with at least one predicted epitope covering 97% of the spike protein. The unique nucleotide mutations and resulting AAS and deletions for each of the twelve SARS-CoV-2 variants, in comparison to the reference sequence, are shown in Table 3 . sequences. Using this method, 13 we were able to define the origin of SARS-CoV-2, Isolate USA-HI498 2020 and HI708 ( Figure 5 ). In this report, we lay the foundation for an adaptive and rational algorithm for monitoring SARS-CoV-2 evolution, quantitating variants, substitutions, and deletions, in the context of the vaccine design. Further, we describe the isolation, genetic characterization, phylogenetic analysis, and immunogenetic epitopes of the spike protein based on the SARS-CoV-2 lineage B.1.243 from Hawai'i. We employed B.1.243 to establish and validate the algorithm, as well as analyze VOC and VOI in the context of emerging spike protein amino acid changes for surveillance and future vaccine design. Hawai'i has not been spared from this pandemic, and the Pacific Islander population here is disproportionately infected with SARS-CoV-2 as compared to other races and ethnicities. 12 Following isolation and identification of the B.1.243 lineage from the isolate SARS-CoV-2, Isolate USA-HI498 2020, and virus strain HI708, we found through curation and analysis of published sequences that the B.1.243 lineage was once the dominating lineage in Hawai'i, causing more than 40% of all cases. The Hawai'i B.1.243 lineage is not likely to escalate into a VOC, as the prevalence has decreased worldwide over the past several months. However, the disproportionate infection rates among the Pacific Islander population remains. 12 The B.1.243 lineage that once dominated in overall prevalence in Hawai'i was introduced from Washington, California, Pennsylvania, and New Mexico, with the majority of sequences arising from horizontal transfer within Hawai'i. SARS-CoV-2, Isolate USA-HI498 2020 was introduced in Hawai'i from New Mexico and the HI-708 SARS-CoV-2 strain originated from California. As this report is written, similar to the continental United States the SARS-CoV-2 Delta variant is rapidly spreading in Hawaii. 53 We were able to establish the algorithm described in this report based on the ten most emerged AAS and deletions. We evaluated nine of the AAS and deletions observed among the variants in this study, as of May 12, 2021, with r > 0.9 and >70% prevalence in April 2021. These AAS and deletions included, P681H, ΔV70, ΔH69, N501Y, S982A, D1118H, T716I, A570D, and ΔY144. The data show that the average time for an emerging substitution (r > 0.9) to go from >30% monthly prevalence percentage to >50% prevalence is 2.25 months. Extrapolating these findings, the timeframe for Pfizer to identify emerging changes and manufacture them into a new version of the BNT162b2 vaccine would be 60-110 days. This timeframe is roughly equivalent to the algorithm's predictive value. 58 Additionally, the tenth AAS used in the development of the algorithm was D614G that allowed us to discern the previous month prevalence, which is an important parameter, as once an emerging mutation is established in the genome the r value will decrease considerably. The nine substitutions and deletions also display an average of 4.75 months from >2% monthly prevalence to >50% monthly prevalence. Therefore, r > 0.9 and a prevalence of >2% is sufficient to establish a mutation as being of interest, whereas 30% prevalence escalates a mutation to the status of concern. From the evaluation of the ten total spike protein changes, the algorithm concludes that an r > 0.9 and a >30% prevalence percentage is an optimal time to classify a substitution as concerning, and consider the substitution for inclusion into vaccine primary structure for 60 day production time. These mutations of interest and concern can also serve to facilitate and focus research using infectious clone 59 and pseudoviruses 60 to determine the functional characteristics. Amino acid substitutions are responsible for, i) changing epitopes at a level to evade antibodies, 1,61,62 ii) allowing the virus to localize anatomically, 63 iii) increasing viral shedding, 62 iv) increasing binding affinity, 64 v) giving the virus binding protein the ability to change conformations more efficiently, 65 and vi) causing diagnostic false negatives. 66 Therefore, of great importance in vaccine development is identifying which spike protein changes are most prevalent worldwide across all sequenced genomes. Each substitution or deletion, or combination thereof, could potentially serve as an epitope as shown in the epitope map ( Figure 3C ). Booster vaccines using this algorithm can therefore prepare the vaccinated for any variant they are most likely to encounter by identifying and including the emerging and emerged amino acid changes representing the majority of SARS-CoV-2 worldwide. reactive, but need also to be preemptive and predictive. Preemptive and predictive is possible with the approach herein. S13I L18F T20N P26S Q52R A67V ΔH69 ΔV70 D80A T95I D138Y G142D Y144 ΔY145 W152C E154K P681H P681R I692V A701V T716I G838 A845S F888L S929 D950N S982A T1027I Q1071H D1118H D1146 V1176F Global initiative on sharing all influenza data -from vision to reality Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus Tracking SARS-CoV-2 variants. World Health Organization -variant-specific-vaccine-candidate-mrna-1273-351-to-nih-for-clinical-study Moderna Announces Positive Initial Booster Data Against SARS-CoV-2 Variants of Concern | Moderna Hawaii COVID-19 Data: Which Racial and Ethnic Groups Have Been Most Affected? State of Hawai'i -Department of Health Accessed Pacific Islanders in the Era of COVID-19: an Overlooked Community in Need. J Racial Ethn Health Disparities Genomic Analysis of SARS-CoV-2 Variants of Concern Circulating in Hawai'i to Facilitate Public-Health Policies. Res Sq Genetic Characteristics and Phylogeny of 969-bp S Gene Sequence of SARS-CoV-2 from Hawaii Reveals the Worldwide Emerging P681H Mutation The coronavirus is here to stay -here's what that means Collection of SARS-CoV-2 Virus from the Air of a Clinic within a University Student Health Care Center and Analyses of the Viral Genomic Sequence Serotype-Specific Detection of Dengue Viruses in a Fourplex Real-Time Reverse Transcriptase PCR Assay Protocol: Real-Time RT-PCR Assays for the Detection of SARS-CoV-2. World Health Organization Accessed Molecular epidemiology of SARS-CoV-2 clusters caused by asymptomatic cases in Anhui Province A Quality Control Tool for High Throughput Sequence Data Trimmomatic: a flexible trimmer for Illumina sequence data Immune and bioinformatics identification of T cell and B cell epitopes in the protein structure of SARS-CoV-2: A systematic review Bioinformatic prediction of potential T cell epitopes for SARS-Cov-2 Viral targets for vaccines against COVID-19 State of Hawaii D of H. Delta variant detected in all major counties Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England Multiple SARS-CoV-2 variants escape neutralization by vaccine-induced humoral immunity An Infectious cDNA Clone of SARS-CoV-2 A real-time and high-throughput neutralization test based on SARS-CoV-2 pseudovirus containing monomeric infrared fluorescent protein as reporter. Emerg Microbes Infect The E484K Mutation in the SARS-CoV-2 Spike Protein Reduces but Does Not Abolish Neutralizing Activity of Human Convalescent and Post-Vaccination Sera. Infectious Diseases (except HIV/AIDS) Transmission, infectivity, and neutralization of a spike L452R SARS-CoV-2 variant Spike mutation D614G alters SARS-CoV-2 fitness The new SARS-CoV-2 strain shows a stronger binding affinity to ACE2 due to N501Y mutant. Med Drug Discov The SARS-CoV-2 Spike variant D614G favors an open conformational state Solutions for surveillance of the S gene mutation in the B.1.1.7 (501Y.V1) SARS-CoV-2 strain lineage. Behind the Bench A Single Immunization with Nucleoside-Modified mRNA Vaccines Elicits Strong Cellular and Humoral Immune Responses against SARS-CoV-2 in Mice Immunogenicity and structures of a rationally designed prefusion MERS-CoV spike antigen Selecting Viruses for the Seasonal Flu Vaccine