key: cord-0309583-z1l0hy7u
authors: Aptekmann, AA; Buongiorno, J; Giovannelli, D; Glamoclija, M; Ferreiro, DU; Bromberg, Y
title: mebipred: identifying metal-binding potential in protein sequence
date: 2021-08-13
journal: bioRxiv
DOI: 10.1101/2021.08.12.456141
sha: 98b250244809b8f62f526f2fdf945d4fca5bd269
doc_id: 309583
cord_uid: z1l0hy7u

Metal-binding proteins have a central role in maintaining life processes. Nearly one-third of known protein structures contain metal ions that are used for a variety of needs, such as catalysis, DNA/RNA binding, protein structure stability, etc. Identifying metal-binding proteins is thus crucial for understanding the mechanisms of cellular activity. However, experimental annotation of protein metal-binding potential is severely lacking, while computational techniques are often imprecise and of limited applicability. We developed a novel machine learning-based method, mebipred, for identifying metal-binding proteins from sequence-derived features. This method is nearly 90% accurate in recognizing proteins that bind metal ions and ion-containing ligands. Moreover, the identity of ten ubiquitously present metal ions and ion-containing ligands can be annotated. mebipred is reference-free, i.e. no sequence alignments are involved, and outperforms other prediction methods in both speed and accuracy. mebipred can also identify protein metal-binding capabilities from short sequence stretches and, thus, may be useful for annotating the metal requirements of metagenomic samples inferred from translated sequencing reads. We performed an analysis of microbiome data and found that ocean, hot spring sediments and soil microbiomes use a more diverse set of metals than human host-related ones. For human-hosted microbiomes, physiological conditions explain the observed metal preferences. Similarly, subtle changes in ocean sample ion concentration affect the abundance of relevant metal-binding proteins.
These results highlight mebipred’s utility in analyzing microbiome metal requirements. mebipred is available as a web server at services.bromberglab.org/mebipred and as a standalone package at https://pypi.org/project/mymetal/

Different levels of sequence redundancy in distinct databases may be an underlying cause for this discrepancy. However, another major reason is that we are still unable to accurately identify metal-binding proteins directly from their sequences and, in some cases, even from their high resolution structures [12]. Experiments, e.g. mass

ranging from 59% to 88%.
There are also structure-independent (purely sequence-based) methods to predict metal binding. Function transfer by homology, i.e. the assumption that similar sequences perform similar functions, is one of the simplest ways to infer metal binding for protein sequences. Similarity is often established by alignment methods.

all of the above methods report good performance, we were unable to validate these reports using our own data as the webserver/standalone versions (where applicable) were nonfunctional and downloadable scripts absent.

Here we present mebipred (metal-binding predictor), a computational method for the prediction of protein metal-binding potential based on sequence information alone. Our method is widely applicable because it does not depend on the existence of a high resolution structure, performs better (average precision/recall of 95/78% at the default cutoff) and runs faster (17,000 sequences/minute) than existing sequence-based tools, and can predict metal binding from whole protein sequences as well as short peptide fragments. The latter ability makes it potentially suitable for annotation of shotgun-sequenced unassembled metagenomic data/reads. mebipred is also alignment-free and thus useful for the analysis of newly identified proteins (with no known homologs). Finally, as mentioned previously, mebipred is the only currently publicly available method for sequence-based prediction of metal binding.

Datasets. We explored proteins binding Na, K, Ca, Mg, Mn, Fe, Cu, Ni, and Zn metal-containing ligands, regardless of their oxidation state (e.g. Fe2+ and Fe3+ are both in the Fe class) or context (e.g. Fe-containing hemes are in the same class as Fe ions). We retrieved all protein structures with these metal-containing ligands from the PDB (July 2019) and parsed them using the BioPython PDB module [51] (Supplementary Table 1).
One naive approach to identify a set of metal-binding proteins is to compile all structures that have a metal ion. However, in the case of heteromers, i.e. protein complexes that contain multiple nonidentical chains, it is possible that only one of the chains binds the metal. We thus considered as metal-binding only the amino acid sequences/chains with at least one heavy atom within 5Å of the metal ion (METAL set). All other chains were included in the NO_METAL set, along with all PDB structures that contained no metals at all. Note that this criterion for the differentiation of metal-binding/nonbinding chains could lead to disagreement with existing metal-binding annotations.

Feature extraction. To describe the proteins in our METAL and NO_METAL sets, we used only sequence-based features: 1) amino acid composition, 2) amino acid physicochemical properties, and 3) a count of the metal-binding amino acid 5mers.

and a default prediction (yes/no) cutoff set at 0.5.

We trained and tested our model for identifying metal-binding proteins using ten-fold cross-validation as follows: (1) we clustered sequences at 70% identity using CD-HIT [59] and used the representative sequences of each cluster for training; (2) we split the resulting sequences into positives (metal binding) and negatives (nonbinding), and further divided each set into ten equally populated groups; (3) we built ten models by rotating through the ten splits, testing on one positive and one negative group and training on the other nine positive and negative groups. Since negatives in our set are more frequent than positives, we balanced the sets by randomly down-sampling the negatives to the same number as the positives for each model training. Note that the ten models were used to estimate the performance of the method, while the final mebipred model was constructed using all sequences and established parameters.
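The chain-labelling rule and the balanced ten-fold procedure described above can be sketched as follows. This is a minimal illustration, not the mebipred implementation: the function names, the dictionary layout, and the choice of `<=` for "within 5Å" are assumptions, atom coordinates are assumed to have been extracted already (with hydrogens excluded by the caller), and the sequence identifiers are assumed to be cluster representatives obtained beforehand.

```python
import math
import random
from typing import Dict, List, Sequence, Set, Tuple

Coord = Tuple[float, float, float]


def metal_binding_chains(
    chain_atoms: Dict[str, List[Coord]],
    metal_coords: Sequence[Coord],
    cutoff: float = 5.0,
) -> Set[str]:
    """Return IDs of chains with at least one heavy atom within
    `cutoff` angstroms of any metal ion (the METAL set criterion)."""
    binding = set()
    for chain_id, atoms in chain_atoms.items():
        if any(math.dist(a, m) <= cutoff for a in atoms for m in metal_coords):
            binding.add(chain_id)
    return binding


def split_into_folds(
    items: Sequence[str], n_folds: int, rng: random.Random
) -> List[List[str]]:
    """Shuffle items and deal them into n_folds groups of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def balanced_cv_splits(
    positives: Sequence[str],
    negatives: Sequence[str],
    n_folds: int = 10,
    seed: int = 0,
) -> List[Dict[str, List[str]]]:
    """Build n_folds train/test splits; in each split the training
    negatives are randomly down-sampled to the number of training positives."""
    rng = random.Random(seed)
    pos_folds = split_into_folds(positives, n_folds, rng)
    neg_folds = split_into_folds(negatives, n_folds, rng)
    splits = []
    for k in range(n_folds):
        train_pos = [s for i, f in enumerate(pos_folds) if i != k for s in f]
        train_neg = [s for i, f in enumerate(neg_folds) if i != k for s in f]
        # Assumption: if negatives happen to be rarer, keep all of them.
        n_keep = min(len(train_pos), len(train_neg))
        train_neg = rng.sample(train_neg, n_keep)
        splits.append(
            {
                "train_pos": train_pos,
                "train_neg": train_neg,
                "test_pos": pos_folds[k],
                "test_neg": neg_folds[k],
            }
        )
    return splits
```

In this sketch only the training negatives are down-sampled; the held-out groups keep their original class ratio. Whether the test groups were also balanced is not stated in the text.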
We additionally trained individual models with the same set of features to predict the binding of specific metals. We used the same modeling procedure and parameters as described above, only adding one more feature: the score of the general metal-

Comparing model performance to existing tools. To compare our method to a simple alignment-based approach, we extracted all sequences from the PDB. We generated a database of these sequences using the makeblastdb (-blastdb_version=5 and

The second layer of mebipred predicts binding specifically to a ligand containing one of the ten ions under consideration. In cross-validation using our data set (Methods), mebipred was accurate in predicting ion specificity of individual proteins (Table 1).

when the first layer predicts the protein to not be able to bind metals, while the second identifies a specific ion preference. We evaluated the second layer's ability to

(Table 1). These observations suggest that in cases of disagreement between the layers, high scoring predictions of the second layer can be trusted to guide overall metal binding predictions.

Our evaluation of mebipred performance against that of other methods on our data was complicated by the absence of available web servers/standalone packages.

better agreement (vs 36%) than that for other designated metal non-binders. A closer inspection further informs the reasons for database annotation differences.

vectors of individual samples (Fig. 4). Sample metal-binding preferences appear more similar below 170m (lower absolute value of slope). The difference between consecutive depths until 1,000m is in line with the changes in the environment described by the pH chemocline, changes in reduction potential, and reduced light

Table 2).
We observed a significant correlation between the relative

Here we compiled a gold-standard experimentally-derived metal-binding protein set and built mebipred, a sequence-based neural network predictor of metal binding. mebipred significantly outperforms existing sequence-based methods for annotation of metal binding; it also detects specific metals bound by each protein. We expect that the growth in the number of metal-binding proteins with resolved structures will make these types of approaches even more powerful in the near future. To the best of our knowledge, mebipred is also the only reference-free sequence-based tool for identifying metal binding. Our method is faster than existing tools and can predict metal binding using short protein fragments, both characteristics that make it useful in the analysis of metagenomic data. In our evaluation of microbiome samples we found that differences in the number of predicted metal-binding proteins were related to the concentration of metal ions in the corresponding environments.

In this toy example of a protein structure that binds a metal, the red circle depicts a sphere of 5Å radius around the metal; the residues within the red circle are marked red. Each of these red residues and their neighboring residues (two on each side) make up a feature 5mer. In this example, the four 5mers are: "ISISA", "SISAP", "EQUEN" and "QUENC". The query sequence (right panel) is decomposed to count the number of metal-binding 5mers present. The final score for this feature is the sum of the counts of all 5mers in the sequence.

Figure 3. mebipred outperforms BLAST in identifying metal-binding proteins and peptides. (A) At all cutoffs, mebipred (MBP; filled circles) is more precise than BLAST (empty circles). For example, at the default cutoff (score=0.4; red dot) it achieves 80% precision for half of the sequences (50% recall), as compared to 40% precision attained by BLAST.
(B) mebipred also outperforms BLAST in identifying the metal-binding propensity of proteins from their 50 amino acid fragment sequences. For example, for half of the fragments, it attains 67% accuracy, as compared to 39% attained by BLAST.

Metal-mediated protein stabilization
Learning capability and storage capacity of two-hidden-layer feedforward networks
CD-HIT: accelerated for clustering the next-generation sequencing data
BLAST+: architecture and applications. BMC bioinformatics
Basic local alignment search tool
BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions
Trimmomatic: a flexible trimmer for Illumina sequence data
Base-calling of automated sequencer traces using phred. II. Error probabilities
Biopython: freely available Python tools for computational molecular biology and bioinformatics
metaSPAdes: a new versatile metagenomic assembler. Genome research
Prokka: rapid prokaryotic genome annotation
Highly accurate protein structure prediction with AlphaFold. Nature
Exploration of uncharted regions of the protein universe
MetalPDB: a database of metal sites in biological macromolecular structures. Nucleic acids research
MetalPDB in 2018: a database of metal sites in biological macromolecular structures
BLAT-the BLAST-like alignment tool
An experimental comparison of performance measures for classification
Updating benchtop sequencing performance
Electron transfer by domain movement in cytochrome bc1
Cabello-Yeves
Effect of tectonic processes on biosphere-geosphere feedbacks across a convergent margin
Predictive metabolomic profiling of microbial communities using amplicon or metagenomic sequences
Composition of the major elements and trace elements of 10 methanogenic bacteria determined by inductively coupled plasma emission spectrometry
The essential trace elements
Spectrochemical analysis of inorganic elements in bacteria
Iron in innate immunity: starve the invaders. Current opinion in immunology
Recent insights into iron import by bacteria. Current opinion in chemical biology
Nickel recognition by bacterial importer proteins
Competitive binding of chromium, cobalt and nickel to serum proteins
Mechanisms of nickel carcinogenesis. Scandinavian journal of work, environment & health
Copper toxicity and the origin of bacterial resistance-new insights and applications
Gut microbiota and iron: the crucial actors in health and disease
Embracing the unknown: disentangling the complexities of the soil microbiome
Dietary effects on human gut microbiome diversity
Chemical composition of sweat. Physiological reviews
The excretion of trace metals in human sweat
Sweat copper, zinc, iron, magnesium and chromium levels in national wrestler
The composition and stability of the vaginal microbiota of normal pregnant women is different from that of non-pregnant women