34320483 SpacePHARER: Sensitive identification of phages from CRISPR spacers in prokaryotic hosts Zhang R.,1 Mirdita M.,1 Levy Karin E.,1 Norroy C.,1 Galiez C.,1, 2 and Söding J.1, 3 1Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany 2Univ. Grenoble Alpes, CNRS, Grenoble INP/Institute of Engineering Univ. Grenoble Alpes, Grenoble, France 3soeding@mpibpc.mpg.de Summary: SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete phage genomes. Availability and implementation: SpacePHARER is available as an open-source (GPLv3), user-friendly command-line software for Linux and macOS: spacepharer.soedinglab.org. I. INTRODUCTION Viruses of bacteria and archaea (phages) are the most abundant biological entities in nature. However, little is known about their roles in the microbial ecosystem and how they interact with their hosts, as cultivating most phages and hosts in the lab is challenging. Many prokaryotes (40% of bacteria and 81% of archaea) possess an adaptive immune system against phages, the Clus- tered Regularly Interspaced Short Palindromic Repeat (CRISPR) system [6]. After surviving a phage infec- tion, they can incorporate a short DNA fragment (28- 42 nt) as a spacer in a CRISPR array. The transcribed spacer will be used with other Cas components for a targeted destruction of future invaders. Some CRISPR- Cas systems require a 2-6 nucleotide long, highly con- served protospacer-adjacent motif (PAM) flanking the viral target to prevent autoimmunity. Multiple spac- ers targeting the same invader are not uncommon, due to either multiple infection events or the primed spacer acquisition mechanism identified in some CRISPR sub- types. CRISPR spacers have been previously exploited to identify phage-host relationship [3, 11, 12, 15]. These methods compare individual CRISPR spacers with phage genomes using BLASTN [1] and apply stringent filtering criteria, e.g. allowing only up to two mismatches. They are thus limited to identifying very close matches. How- ever, a higher sensitivity is crucial because phage refer- ence databases are very incomplete and often will not contain phages highly similar to those to be identified. To increase sensitivity, (1) we compare protein coding se- quences because phage genomes are mostly coding, and, to evade the CRISPR immune response, are under pres- sure to mutate their genome with minimal changes on the amino acid level; (2) we optimized a substitution matrix and gap penalties for short, highly similar protein frag- ments; (3) we combine evidence from multiple spacers matching to the same phage genome. II. METHODS Input. SpacePHARER accepts spacer sequences as multiple FASTA files each containing spacers from a sin- gle prokaryotic genome or as multiple output files from the CRISPR detection tools PILER-CR [7], CRT [5], MinCED [13] or CRISPRDetect [4]. Phage genomes are supplied as separate FASTA files or can be downloaded by SpacePHARER from NCBI GenBank [2]. Optionally, additional taxonomic labels can be provided for spacers or phages to be included in the final report. Algorithm. SpacePHARER is divided into five steps (Figure 1A, Supp. Materials). (0) Preprocess in- put: scan the phage genome and CRISPR spacers in six reading frames, extract and translate all putative coding fragments of at least 27 nt, with user-definable transla- tion tables. Each query set Q consists of the translated ORFs q of CRISPR spacers extracted from one prokary- otic genome, and each target set T comprises the puta- tive protein sequences t from a single phage. We refer to similar q and t as hit, and an identified host-phage relationship Q−T as match. (1) Search all q’s against all t’s using the fast, sensitive MMseqs2 protein search [14], with VTML40 substitution matrix [10], gap open cost of 16 and extension cost of 2 (Figure S1). We op- timized a short, spaced k-mer pattern for the prefilter stage (10111011) with six informative (‘1’) positions. In addition, align all q−t hits reported in previous search on nucleotide level and prioritize near-perfect nucleotide hits (Supp. Materials). (2) For each q−T pair, compute the P-value for the best hit pbh from first-order statis- tics. (3) Compute a combined score Scomb from best-hit P-values of multiple hits between Q and T using a modi- fied truncated-product method (Supp. Materials). (4) Compute the false discovery rate (FDR = FP /(TP + FP)) and only retain matches with FDR < 0.05. For that purpose, SpacePHARER is run on a null model database and the fraction of null matches with Scomb below a cut- off (empirical P-value) is used to estimate the FDR. (5) Scan 10 nt upstream and downstream of the phage’s pro- tospacer for a possible PAM. Output is a tab-separated text file. Each host-phage match spans two or more lines. The first starts with ‘#’: prokaryote accession, phage accession, Scomb, number of hits in the match. Each following line describes an indi- .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.05.15.090266doi: bioRxiv preprint https://spacepharer.soedinglab.org https://doi.org/10.1101/2020.05.15.090266 http://creativecommons.org/licenses/by-nc-nd/4.0/ 2 CRISPR Locus Second best hit Best hit pbh1 pbh2 pbh3 PAM Protospacer Threshold p0 X Search ... ... (1) MMseqs2 search of six-frame translated ORFs (2) P-value of best hit per q-T (3) Compute combined score Q T q t (5) Scan for possible PAMs (4) Select true matches by FDR Hit List q1 q2 q3 # P re di ct io ns TP FP FN TN -log(p) S ort Empirical P-values 0 10.5 Extracted Spacer Sets (FASTA, PILER-CR, CRT, MinCED, CRISPRDetect) Phage Genomes A 0% 5% 10% 15% 0 50,000 100,000 150,000 200,000 250,000 #True Positives F a ls e D is co ve ry R a te BLASTN Eukaryotic Viral Control BLASTN Inverted Phage Control SpacePHARER Eukaryotic Viral ORF Control SpacePHARER Inverted Phage ORF Control B B L A S T N S p a ce P H A R E R Species Genus Family Order Class Phylum 0 100 200 300 400 F re q u e n cy Incorrect Correct C FIG. 1. (A) SpacePHARER algorithm. A query set Q consists of 6-frame translated ORFs (q) from CRISPR spacers, and a target set T consists of 6-frame translated ORFs (t) of phage proteins. (1) Search all qs against all ts using MMseqs2. Align the q −t hits on nucleotide level and prioritize near-perfect nucleotide hits. (2) For each q −T pair, compute the P-value for the best hit from first-order statistics. (3) Compute score Scomb by combining the best-hit P-values from multiple hits between Q and T using a modified truncated-product method. (4) Estimate the FDR by searching a null database. (5) Scan for possible protospacer adjacent motif (PAM). (B) Performance comparison between SpacePHARER (blue) and BLASTN (red) using inverted phage sequences (solid lines) or eukaryotic viral ORFs as null set (dashed lines) demonstrated by expected number of true positive (TP) predictions at different false discovery rates (FDRs). (C) Performance comparison between BLASTN (left), SpacePHARER using the weighted lowest common ancestor procedure (LCA, right) at FDR = 0.02, evaluated by the number of correct (blue) and incorrect (red) predictions, for all the host predictions made at each taxonomic rank or below. vidual hit: spacer accession, phage accession, pbh, spacer start and end, phage start and end, possible 5’ PAM|3’ PAM, possible 5’ PAM|3’ PAM on the reverse strand. If requested, the spacer–phage alignments are included. If taxonomic labels are provided, taxonomic reports based on the weighted lowest common ancestor (LCA) procedure described in [9] are created for host LCAs of each phage genome or phage LCAs of each spacer as ad- ditional tab-separated text files. III. RESULTS Datasets. We split a previously published spacer dataset [12] of 363,460 unique spacers from 30,389 prokaryotic genomes randomly into an optimization set (20%, 6,067 genomes) and a test set (80%, 24,322 genomes). The performance of SpacePHARER was eval- uated on the spacer test set against a target database of 7,824 phage genomes. We used two null databases: 11,304 eukaryotic viral genomes and the inverted trans- lated sequences of the target database. Viral genomes were downloaded from GenBank in 09/2018. The performance of SpacePHARER in Figure 1C was evaluated on a validation dataset of spacers from 1,066 bacterial genomes against 809 phage genomes with anno- tated host taxonomy [8]. For each phage, we predicted the host based on the host LCA. Prediction quality. At FDR = 0.05, SpacePHARER predicted 3 to 4 times more prokaryote-phage matches than BLASTN (Figure 1B, Figure S2). SpacePHARER predicted the correct host for more phages than BLASTN at all taxonomic ranks, while including most of the BLASTN predictions, at better precision (Figure 1C, Figure S3). If the host or a close relative of a phage is absent in the database (either because the host is uniden- tified or the host lacks a CRISPR-Cas system), the pre- dicted host may be correct only at a higher rank than species. Run time. SpacePHARER took 12 minutes to pro- cess the test dataset on 2×6-core 2.40 GHz CPUs, 47 times faster than BLASTN (575 minutes). IV. CONCLUSION SpacePHARER is 1.4 to 4× more sensitive than BLASTN in detecting phage-host pairs, due to searching with protein sequences, optimizing short sequence com- parisons, and combining statistical evidence, and it is fast enough to analyze large-scale genomic and metagenomic datasets. FUNDING ELK is a FEBS long-term fellowship recipient. The work was supported by the ERC’s Horizon 2020 Frame- work Programme [‘Virus-X’, project no. 685778] and the BMBF CompLifeSci project horizontal4meta. Conflict of Interest: none declared .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.05.15.090266doi: bioRxiv preprint https://doi.org/10.1101/2020.05.15.090266 http://creativecommons.org/licenses/by-nc-nd/4.0/ 3 REFERENCES [1] Altschul, S.F. et al (1990). Basic local alignment search tool. J. Mol. Biol., 215(3), 403–410. [2] Benson, D.A. et al (2013). GenBank. Nucleic Acids Res., 41(D1), D36–D42. [3] Biswas, A. et al (2013). CRISPRTarget: bioinformatic prediction and analysis of crRNA targets. RNA Biol., 10(5), 817–827. [4] Biswas, A. et al (2016). CRISPRdetect: A flexible algorithm to define CRISPR arrays. BMC Genom., 17(1), 356. [5] Bland, C. et al (2007). CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinform., 8(1), 209. [6] Burstein, D. et al (2016). Major bacterial lineages are essentially devoid of crispr-cas viral defence systems. Nature Communica- tions, 7(1), 10613. [7] Edgar, R.C. (2007). PILER-CR: Fast and accurate identification of CRISPR repeats. BMC Bioinform., 8(1), 18. [8] Edwards, R.A. et al (2015). Computational approaches to predict bacteriophage–host relationships. FEMS Microbiol. Rev., 40(2), 258–272. [9] Mirdita, M. et al (2020). Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv. doi:10.1101/2020.11.27.401018. [10] Müller, T. et al (2002). Estimating amino acid substitution mod- els: A comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol., 19(1), 8–13. [11] Paez-Espino, D. et al (2016). Uncovering Earth’s virome. Nature, 536(7617), 425–430. [12] Shmakov, S.A. et al (2017). The CRISPR spacer space is domi- nated by sequences from species-specific mobilomes. mBio, 8(5), e01397–17. [13] Skennerton, C. (2016). Minced - mining CRISPRs in environmen- tal datasets. https://github.com/ctSkennerton/minced. [14] Steinegger, M. and Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol., 35(11), 1026–1028. [15] Stern, A. et al (2012). CRISPR targeting reveals a reservoir of common phages associated with the human gut microbiome. Genome Res., 22(10), 1985–1994. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.05.15.090266doi: bioRxiv preprint https://github.com/ctSkennerton/minced https://doi.org/10.1101/2020.05.15.090266 http://creativecommons.org/licenses/by-nc-nd/4.0/ .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.05.15.090266doi: bioRxiv preprint https://doi.org/10.1101/2020.05.15.090266 http://creativecommons.org/licenses/by-nc-nd/4.0/ .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.05.15.090266doi: bioRxiv preprint https://doi.org/10.1101/2020.05.15.090266 http://creativecommons.org/licenses/by-nc-nd/4.0/ .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted February 10, 2021. ; https://doi.org/10.1101/2020.05.15.090266doi: bioRxiv preprint https://doi.org/10.1101/2020.05.15.090266 http://creativecommons.org/licenses/by-nc-nd/4.0/