key: cord-0025179-qfg16efx
authors: Kozlowski, Lukasz Pawel
title: Proteome-pI 2.0: proteome isoelectric point database update
date: 2021-10-28
journal: Nucleic Acids Res
DOI: 10.1093/nar/gkab944
sha: 532c40e4fd2cb5ae0f78a8e8ab616499f4095129
doc_id: 25179
cord_uid: qfg16efx

Proteome-pI 2.0 is an update of an online database containing predicted isoelectric points and pKa dissociation constants of proteins and peptides. The isoelectric point, the pH at which a particular molecule carries no net electrical charge, is an important parameter for many analytical biochemistry and proteomics techniques. Additionally, it can be obtained directly from the pKa values of individual charged residues of the protein. The Proteome-pI 2.0 database includes data for over 61 million protein sequences from 20 115 proteomes (three to four times more than the previous release). The isoelectric point for proteins is predicted by 21 methods, whereas pKa values are inferred by one method. To facilitate bottom-up proteomics analysis, individual proteomes were digested in silico with the five most commonly used proteases (trypsin, chymotrypsin, trypsin + LysC, LysN, ArgC), and the peptides' isoelectric point and molecular weights were calculated. The database enables the retrieval of virtual 2D-PAGE plots and customized fractions of a proteome based on the isoelectric point and molecular weight. In addition, isoelectric points for proteins in NCBI non-redundant (nr), UniProt, SwissProt, and Protein Data Bank are available in both CSV and FASTA formats. The database can be accessed at http://isoelectricpointdb2.org.

The charge of a protein is one of its key physicochemical characteristics and is related to the pKa dissociation constant (pKa is a quantitative measure of the strength of an acid in solution).
For proteins and peptides, the ionizable groups of seven charged amino acids should be considered: glutamate (γ-carboxyl group), cysteine (thiol group), aspartate (β-carboxyl group), tyrosine (phenol group), lysine (ε-ammonium group), histidine (imidazole side chain), and arginine (guanidinium group) (1). Taken together, the pKa values of all charged groups can be used to calculate the overall charge of the molecule at any pH or to estimate the isoelectric point (pI, IEP), that is, the pH at which positive and negative charges are balanced and the total net charge of the molecule is therefore zero (2). Both pKa and isoelectric point estimates have been used in numerous techniques, such as two-dimensional gel electrophoresis (2D-PAGE) (3, 4), crystallization (5), capillary isoelectric focusing (6), and mass spectrometry (MS) (7, 8). It should be stressed that experimental measurements of pKa values [PKAD database (9)] and isoelectric points [SWISS-2DPAGE (10)] are very limited (a few thousand records at most), but many computational methods can be used to predict these features.

In this work, I present a major update of the original Proteome-pI database (Figure 1) (11). The following changes have been introduced:
- the number of proteomes included has been increased four-fold (from 5029 to 20 115);
- new algorithms for isoelectric point prediction have been added (21 algorithms in total);
- the prediction of pKa dissociation constants for over 61 million proteins has been included;
- the prediction of isoelectric points for in silico digests of proteomes with the five most commonly used proteases (trypsin, chymotrypsin, trypsin + LysC, LysN, ArgC) has been added.

Proteome-pI 2.0 is based on UniProt (12) reference proteomes (2021_03 release) and contains over 61 million protein sequences from 20 115 model organisms (Table 1 and Supplementary Table S1).
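The relationship between pKa values, net charge and the isoelectric point outlined in the introduction can be illustrated with a minimal Python sketch. It is not the implementation of any of the 21 methods in the database: the pKa values below are illustrative assumptions only (each published method uses its own pKa set), the terminal amino and carboxyl groups are included in addition to the seven side chains, and the function names `net_charge` and `isoelectric_point` are hypothetical.

```python
# Illustrative pKa values only (an assumption for this sketch); every
# prediction method in the database relies on its own pKa set.
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "N_term": 9.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_term": 2.3}


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of a sequence at a given pH (Henderson-Hasselbalch)."""
    counts = {aa: sequence.count(aa) for aa in "KRHDECY"}
    counts["N_term"] = 1
    counts["C_term"] = 1
    # Fraction of each basic group that is protonated (charge +1).
    positive = sum(
        counts[group] / (1.0 + 10.0 ** (ph - pka))
        for group, pka in PKA_POSITIVE.items()
    )
    # Fraction of each acidic group that is deprotonated (charge -1).
    negative = sum(
        counts[group] / (1.0 + 10.0 ** (pka - ph))
        for group, pka in PKA_NEGATIVE.items()
    )
    return positive - negative


def isoelectric_point(sequence: str, tol: float = 1e-4) -> float:
    """Find the pH at which the net charge is zero by bisection.

    The net charge decreases monotonically with pH, so the interval
    [0, 14] can be halved until it is narrower than `tol`.
    """
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid  # zero or negative: pI lies at lower pH
    return (low + high) / 2.0
```

For example, `isoelectric_point("KKKK")` returns a basic value and `isoelectric_point("DDDD")` an acidic one. Bisection is used here only because it is simple and robust for a monotonic charge curve.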
The data are divided according to the major kingdoms of the tree of life and include splicing variants for eukaryotic organisms. Additionally, the isoelectric point is predicted for the most commonly used protein sequence databases, such as the entire UniProt TrEMBL with 219 million sequences (12), SwissProt with 561 000 proteins (13, 14), NCBI nr (non-redundant) with 409 million sequences (15), and Protein Data Bank with 601 000 protein chains (16).

Each proteome is analysed by various methods. The prediction of the isoelectric point is currently performed using 21 methods (including four new ones), which can be grouped into two categories. The simplest methods of isoelectric point prediction are based on experimentally derived pKa sets and the Henderson-Hasselbalch equation [... (32, 33)]. Moreover, in Proteome-pI 2.0, a completely new category of predictions has been introduced, namely the prediction of pKa dissociation constants. In this case, only one algorithm is used [IPC2.pKa (33)], as other methods for pKa prediction are prohibitively slow and additionally require structural data (not available in Proteome-pI) (34-37).

To facilitate bottom-up mass spectrometry analysis, in silico proteolytic digestion of proteins by the five most commonly used proteases (trypsin, chymotrypsin, trypsin + LysC, LysN, ArgC) has been introduced (38). The proteolytic products (i.e. peptides) are treated as surrogates of the parent proteins for further qualitative or quantitative analysis. The proteases generally cleave proteins at specific amino acid residue sites, but digestion is frequently incomplete (missed cleavage sites are widespread). To predict proteolysis, the Rapid Peptides Generator (RPG) program was used (with a 1.4% miscleavage rate) (39). The resulting five datasets are further categorized according to the molecular mass of the peptides (Figure 1 and Supplementary Figure S1).
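The digestion and mass categorization described above can be sketched in a few lines of Python. This is not the RPG program. It assumes the commonly used simplified trypsin rule (cleave after K or R unless the next residue is P), and it enumerates peptides with up to a fixed number of missed cleavages rather than applying a percentage miscleavage rate. The function names are hypothetical, and the masses are standard average residue masses rounded to two decimals.

```python
# Average residue masses in Da; one water (18.015 Da) is added per peptide.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.015


def trypsin_digest(sequence, max_missed=0):
    """Return peptides from a simplified in silico trypsin digest.

    Cuts after K or R unless the next residue is P. Peptides that span
    up to `max_missed` uncut sites are also returned.
    """
    if not sequence:
        return []
    # Cut positions are indices just after a cleavable residue.
    cuts = [
        i + 1
        for i in range(len(sequence) - 1)
        if sequence[i] in "KR" and sequence[i + 1] != "P"
    ]
    bounds = [0] + cuts + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        last = min(start + 1 + max_missed, len(bounds) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


def peptide_mass(peptide):
    """Average mass of a peptide in Da (residues plus one water)."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


def filter_by_mass(peptides, low, high):
    """Keep peptides whose mass lies within [low, high] Da."""
    return [p for p in peptides if low <= peptide_mass(p) <= high]
```

A call such as `filter_by_mass(trypsin_digest(protein, max_missed=1), 600, 3500)` would then keep only the peptides inside a 600-3500 Da window, where `protein` stands for any amino acid sequence string.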
In the next panel, the user can find in silico digests of the whole proteome with trypsin, chymotrypsin, trypsin + LysC, LysN and ArgC proteases suitable for different mass spectrometry instruments, such as the ESI Ion Trap (600-3500 Da), LTQ Orbitrap (600-4000 Da), MALDI TOF/TOF (750-5500 Da), MS low (narrow range of mass, 800-3500 Da), and MS high (wide range of mass, 600-5500 Da). This can result in a huge number of potential peptides (e.g. for human proteins, trypsin digests can exceed two million peptides; Supplementary Table S2). At the bottom, general statistics such as amino acid and di-amino acid frequencies can be found. Additionally, each page is linked to external databases, such as UniProt and NCBI Taxonomy. Furthermore, Proteome-pI 2.0 provides global analyses related to the distribution of molecular weight and isoelectric points across kingdoms, or amino and di-amino acid statistics (Table 2 and Supplementary Table S3). Such data can be useful for high-throughput analysis of specific taxa, such as plants (40) and fungi (41), or groups of interacting proteins (42).

The Proteome-pI 2.0 database update is a significant improvement upon the previous version, both quantitatively (covering more proteomes and using more algorithms) and qualitatively (including peptide digests and pKa predictions). Nevertheless, apart from the technical extension of the database (analysing more organisms), it is always worth checking how the addition of new data may have affected global conclusions drawn from the data available at the time of evaluation. For instance, one of the scientifically important byproducts of creating Proteome-pI was the observation that the isoelectric points and molecular weights of proteins in different kingdoms vary considerably. For example, Archaea have the smallest proteins (except for viruses), but the isoelectric point of the proteome can differ greatly among individual species.
This may be because Archaea are known for living in extreme environments (e.g. low or high pH), which affects the range of isoelectric points in their proteomes. In 2016, when the first version of the database was created, only 135 Archaeal organisms were included, whereas the current version contains 331 such proteomes. Careful comparison of Figure 2 from Kozlowski (11) with Supplementary Figure S2 shows that the trend indeed holds when more Archaea are analysed, highlighting how unique and diverse these organisms can be in terms of their proteins' charge (see also Supplementary Figure S1).

Similarly, many statistics calculated previously have been repeated on the larger dataset, using a new version of a proteome or extending the statistical treatment. For instance, two auxiliary statistics that Proteome-pI provides are amino and di-amino acid frequencies for whole proteomes. In the current version, we added error estimates (with 100× bootstrapping at the protein level) to assess the possible variability of the calculations. This is not a purely technical aspect, as our knowledge about what constitutes the proteome of a given organism changes over time, and consequently we can draw conclusions different from those based on past data. This is a highly dynamic situation, even for intensively studied organisms. For example, the human proteome in 2016 constituted 21 006 proteins with 71 173 splicing isoforms (92 179 in total). Now, we have 20 600 protein annotations with 79 500 splicing isoforms (100 100 in total), and this does not take into account the recent T2T-CHM13 reference genome update (43). The situation may be even more dramatic for proteomes that have only recently been studied intensively in terms of proteomics.
For example, Xenopus tropicalis in 2016 had 18 252 annotated proteins, with an average isoelectric point of 6.70 and an average molecular mass of 60.1 kDa, accompanied by 5346 splicing isoforms (23 598 in total). Now, it has 22 514 proteins (average isoelectric point of 6.64 and average mass of 71.9 kDa), and 23 799 splicing isoforms have been identified. Accordingly, we decided to maintain the previous version of Proteome-pI (http://isoelectricpointdb.org) and present the new release as a completely new resource (http://isoelectricpointdb2.org).

The number of reference proteomes has increased 4-fold during the last five years (5029 in Proteome-pI 1.0 versus 20 115 in the current release); therefore, constant addition of new proteomes is of great interest. Furthermore, users frequently request data for proteomes of interest to them, such as a particular strain of bacteria or virus not included in the official release but relevant to their ongoing studies (44). In parallel, the addition of new algorithms for isoelectric point and pKa prediction is foreseen. The latter is especially worth consideration, as the database currently includes the prediction of pKa values by only one method. This limitation will not be easy to overcome, as most of the pKa predictors [e.g. Rosetta pKa (45), H++ (35), MCCE (36)] rely on protein structure information. However, the advance of the SWISS-MODEL Repository (46) and recently the AlphaFold Protein Structure Database (47) gives hope that Proteome-pI could also be extended by 3D-based protein predictions. It is worth mentioning here that there are already some efforts to predict isoelectric points and pKa values based on available protein structures [pKPDB database (48)].

Finally, one of the most important additions to the Proteome-pI database was introducing in silico proteome digests derived from the five most commonly used proteases.
Furthermore, the resulting datasets were categorized by molecular mass to facilitate analysis with specific mass spectrometry techniques. Such an approach could be seen as highly simplistic, and further refinement of the in silico digests is possible. Future plans in this respect include adding the prediction of peptides' hydrophobicity, retention time (49), and electrophoretic mobility (50), and the use of more sophisticated methods that can be used for the prediction of in silico digests [e.g. DeepDigest (51)]. Finally, adding information about the uniqueness of peptides versus coverage after digestion would also be valuable. We would be grateful for any contribution or ideas from the community with respect to future improvements to the database.

All data in the Proteome-pI 2.0 database are available for download free of charge. For more information see Supplementary Data. The database will be maintained for at least 10 years and can be accessed at http://isoelectricpointdb2.org or http://isoelectricpointdb2.mimuw.edu.pl (mirror).

Supplementary Data are available at NAR Online.

Nucleic Acids Research, 2022, Vol. 50, Database issue, D1539
1. Protein ionizable groups: pK values and their contribution to protein stability and solubility
2. The Henderson-Hasselbalch equation: its history and limitations
3. Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues
4. High resolution two-dimensional electrophoresis of proteins
5. Using isoelectric point to determine the pH for initial protein crystallization trials
6. Optimizing separation parameters in capillary isoelectric focusing
7. HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics
8. Combining isoelectric point-based fractionation, liquid chromatography and mass spectrometry to improve peptide detection and protein identification
9. PKAD: a database of experimentally measured pKa values of ionizable groups in proteins
10. SWISS-2DPAGE, ten years later
11. Proteome-pI: proteome isoelectric point database
12. UniProt: the universal protein knowledgebase in 2021
13. The SIB Swiss Institute of Bioinformatics' resources: focus on curated databases
14. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000
15. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation
16. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences
17. Polypeptide amino acid composition and isoelectric point. II. Comparison between experiment and theory
18. Solomons' Organic Chemistry
19. Lehninger Principles of Biochemistry
20. EMBOSS: the European Molecular Biology Open Software Suite
21. Data for Biochemical Research
22. PPD v1.0 - an integrated, web-accessible database of experimentally determined protein pKa values
23. Isoelectric points of proteins: theoretical determination
24. pK values of the ionizable groups of proteins
25. Heterogeneity of component bands in isoelectric focusing patterns
26. DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics
27. The solubility of amino acids and two glycine peptides in aqueous ethanol and dioxane solutions. Establishment of a hydrophobicity scale
28. A summary of the measured pK values of the ionizable groups in folded proteins
29. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions
30. Protein identification and analysis tools in the ExPASy server
31. ProMoST (Protein Modification Screening Tool): a web-based tool for mapping protein modifications on two-dimensional gels
32. IPC - Isoelectric Point Calculator
33. IPC 2.0: prediction of isoelectric point and pKa dissociation constants
34. DelPhiPKa: Including salt in the calculations and enabling polar residues to titrate
35. H++ 3.0: automating pK prediction and the preparation of biomolecular structures for atomistic molecular modeling and simulations
36. MCCE2: improving protein pKa calculations with extensive side chain rotamer sampling
37. PypKa: a flexible Python module for Poisson-Boltzmann-based pKa calculations
38. Six alternative proteases for mass spectrometry-based proteomics beyond trypsin
39. Rapid Peptides Generator: fast and efficient in silico protein digestion
40. The molecular mass and isoelectric point of plant proteomes
41. Virtual 2-D map of the fungal proteome
42. Protein isoelectric point distribution in the interactomes across the domains of life
43. The complete sequence of a human genome
44. Physicochemical properties of SARS-CoV-2 for drug targeting, virus inactivation and attenuation, vaccine formulation and quality control
45. Rapid calculation of protein pKa values using Rosetta
46. The SWISS-MODEL Repository--new features and functionality
47. Highly accurate protein structure prediction with AlphaFold
48. pKPDB: a protein data bank extension database of pKa and pI theoretical values
49. Sequence-specific retention calculator. A family of peptide retention time prediction algorithms in reversed-phase HPLC: applicability to various chromatographic conditions and columns
50. Predicting electrophoretic mobility of proteoforms for large-scale top-down proteomics
51. DeepDigest: prediction of protein proteolytic digestion with deep learning

I would like to thank all authors of the previous works related to isoelectric point and pKa set measurements and computational methods. Special acknowledgement is extended to the developers of the UniProt database, upon which Proteome-pI depends heavily.