key: cord-103320-2rpr7aph
authors: Bhandari, Bikash K.; Gardner, Paul P.; Lim, Chun Shen
title: Solubility-Weighted Index: fast and accurate prediction of protein solubility
date: 2020-03-26
journal: bioRxiv
DOI: 10.1101/2020.02.15.951012
sha: 
doc_id: 103320
cord_uid: 2rpr7aph

Motivation Recombinant protein production is a widely used technique in the biotechnology and biomedical industries, yet only a quarter of target proteins are soluble and can therefore be purified. Results We have discovered that global structural flexibility, which can be modeled by normalised B-factors, accurately predicts the solubility of 12,216 recombinant proteins expressed in Escherichia coli. We have optimised B-factors, and derived a new set of values for solubility scoring that further improves prediction accuracy. We call this new predictor the ‘Solubility-Weighted Index’ (SWI). Importantly, SWI outperforms many existing protein solubility prediction tools. Furthermore, we have developed ‘SoDoPE’ (Soluble Domain for Protein Expression), a web interface that allows users to choose a protein region of interest for predicting and maximising both protein expression and solubility. Availability The SoDoPE web server and source code are freely available at https://tisigner.com/sodope and https://github.com/Gardner-BinfLab/TISIGNER-ReactJS, respectively. The code and data for reproducing our analysis can be found at https://github.com/Gardner-BinfLab/SoDoPE_paper2020.

High levels of protein expression and solubility are two major requirements of successful recombinant protein production (Esposito and Chatterjee 2006). However, recombinant protein production is a challenging process. Almost half of recombinant proteins fail to be expressed and half of the successfully expressed proteins are insoluble (http://targetdb.rcsb.org/metrics/). These failures hamper protein research, with particular implications for structural, functional and pharmaceutical studies that require soluble and concentrated protein solutions (Kramer et al. 2012; Hou et al. 2018). Therefore, solubility prediction and protein engineering for enhanced solubility are active areas of research. Notable protein engineering approaches include mutagenesis, truncation (i.e., expression of partial protein sequences), or fusion with a solubility-enhancing tag (Waldo 2003; Esposito and Chatterjee 2006; Trevino, Martin Scholtz, and Nick Pace 2007; Chan et al. 2010; Kramer et al. 2012; Costa et al. 2014).

Protein solubility depends, at least in part, upon extrinsic factors such as ionic strength, temperature and pH, as well as intrinsic factors, namely the physicochemical properties of the protein sequence and structure, including molecular weight, amino acid composition, hydrophobicity, aromaticity, isoelectric point, structural propensities and the polarity of surface residues (Wilkinson and Harrison 1991; Chiti et al. 2003; Tartaglia et al. 2004; Diaz et al. 2010). Many solubility prediction tools have been developed around these features using statistical models (e.g., linear and logistic regression) or other machine learning models (e.g., support vector machines and neural networks) (Hirose and Noguchi 2013; Habibi et al. 2014; Hebditch et al. 2017; Sormanni et al. 2017; Heckmann et al. 2018; Z. Wu et al. 2019; Yang, Wu, and Arnold 2019).

In this study, we investigated the experimental outcomes of 12,216 recombinant proteins expressed in Escherichia coli from the 'Protein Structure Initiative:Biology' (PSI:Biology) (Chen et al. 2004; Acton et al. 2005) . We showed that protein structural flexibility is more accurate than other protein sequence properties in predicting solubility (Craveur et al. 2015; M. Vihinen, Torkkila, and Riikonen 1994) . Flexibility is a standard feature that appears to have been overlooked in previous solubility prediction attempts. On this basis, we derived a set of 20 values for the standard amino acid residues and used them to predict solubility. We call this new predictor the 'Solubility-Weighted Index' (SWI). SWI is a powerful predictor of solubility, and a good proxy for global structural flexibility. In addition, SWI outperforms many existing de novo protein solubility prediction tools.

We sought to understand what makes a protein soluble, and to develop a fast and accurate approach for solubility prediction. To determine which protein sequence properties accurately predict protein solubility, we analysed 12,216 target proteins from over 196 species that were expressed in E. coli (the PSI:Biology dataset; see Supplementary Fig S1 and Table S1A) (Chen et al. 2004; Acton et al. 2005). These proteins were expressed with either a C-terminal or an N-terminal polyhistidine fusion tag (pET21_NESG and pET15_NESG expression vectors, N = 8,780 and 3,436, respectively). They were previously curated and labeled as 'Protein_Soluble' or 'Tested_Not_Soluble' (Seiler et al. 2014), based on the solubility analysis of cell lysates using SDS-PAGE (R. Xiao et al. 2010). A total of 8,238 recombinant proteins were found to be soluble, of which 6,432 belong to the pET21_NESG dataset. Both the expression system and the solubility analysis method are commonly used (Costa et al. 2014). Therefore, this collection of data captures a broad range of protein solubility issues.

Protein structural flexibility, in particular the flexibility of local regions, is often associated with function (Craveur et al. 2015). The calculation of flexibility is usually performed by assigning a set of 20 normalised B-factors (a measure of vibration of C-alpha atoms; see Supplementary Notes) to a protein sequence and averaging the values by a sliding window approach (Ragone et al. 1989; Karplus and Schulz 1985; M. Vihinen, Torkkila, and Riikonen 1994; Smith et al. 2003). We reasoned that such a sliding window approach can be approximated by a more straightforward arithmetic mean for calculating global structural flexibility (see Supplementary Notes). We determined the correlation between flexibility (Vihinen et al.'s sliding window approach as implemented in Biopython) and solubility scores calculated as follows:

$$\text{Solubility score} = \frac{1}{L}\sum_{i=1}^{L} B_i \qquad (1)$$

where $B_i$ is the normalised B-factor of the amino acid residue at position $i$, and $L$ is the sequence length. We obtained a strong correlation for the PSI:Biology dataset (Spearman's rho = 0.98, P-value below the machine's underflow level). Therefore, we reasoned that the sliding window approach is not necessary for our purpose.
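The arithmetic-mean scoring described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `composition_score` is hypothetical, the toy weights are invented for the example and are not the published normalised B-factors, and silently skipping residues without a weight is an assumption rather than something the text specifies.

```python
from statistics import mean


def composition_score(sequence, weights):
    """Arithmetic mean of per-residue weights over a protein sequence.

    `weights` maps one-letter amino acid codes to values such as
    normalised B-factors. Residues without a weight are skipped
    (an assumption made for this sketch).
    """
    residues = [aa for aa in sequence.upper() if aa in weights]
    if not residues:
        raise ValueError("sequence has no residues with a known weight")
    return mean(weights[aa] for aa in residues)


# Toy weights for illustration only; NOT the published normalised B-factors.
toy_weights = {"A": 0.9, "K": 1.1, "W": 0.8}
print(composition_score("AKW", toy_weights))
```

Because the score is a plain mean, it depends only on amino acid composition and not on residue order or sequence length, which is what makes it cheap to compute for thousands of sequences.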

We applied this arithmetic mean approach (i.e., sequence composition scoring) to the PSI:Biology dataset and compared four sets of previously published, normalised B-factors (Bhaskaran and Ponnuswamy 1988; Ragone et al. 1989; M. Vihinen, Torkkila, and Riikonen 1994; Smith et al. 2003). Among these sets of B-factors, sequence composition scoring using the most recently published set of normalised B-factors produced the highest AUC score.

To improve the prediction accuracy of solubility, we iteratively refined the weights of amino acid residues using the Nelder-Mead optimisation algorithm (Nelder and Mead 1965). To avoid testing and training on similar sequences, we generated 10 cross-validation sets with maximised heterogeneity between these subsets (i.e., no similar sequences between subsets). We first clustered all 12,216 PSI:Biology protein sequences with USEARCH at a 40% similarity threshold to produce 5,050 clusters with remote similarity (see Methods and Supplementary Fig S4). The clusters were manually grouped into 10 cross-validation sets of approximately 1,200 sequences each. We did not select a representative sequence for each cluster as about 12% of clusters contain a mix of soluble and insoluble proteins (Supplementary Fig S4C). More importantly, to address the issues of sequence similarity and imbalanced classes, we performed 1,000 bootstrap resamplings for each cross-validation step (Fig 2A and Supplementary Fig S5). We calculated the solubility scores using the optimised weights as in Equation 1, and the AUC scores for each cross-validation step. Our training and test AUC scores were 0.72 ± 0.00 and 0.71 ± 0.01, respectively, showing an improvement over flexibility in solubility prediction (mean ± standard deviation; Fig 2B and Supplementary Table S3).
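The requirement that similar sequences never straddle two cross-validation sets can be illustrated with a small sketch. In the study the clusters were grouped manually; the greedy assignment below is a programmatic stand-in, not the authors' procedure, and the function name `assign_clusters_to_folds` and the toy cluster contents are hypothetical.

```python
import heapq


def assign_clusters_to_folds(clusters, n_folds=10):
    """Greedily place whole clusters into the currently smallest fold.

    Keeping each cluster intact means no fold shares similar sequences
    with another fold; processing large clusters first keeps fold sizes
    roughly balanced.
    """
    heap = [(0, fold) for fold in range(n_folds)]  # (current size, fold index)
    heapq.heapify(heap)
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        size, fold = heapq.heappop(heap)
        folds[fold].extend(cluster)
        heapq.heappush(heap, (size + len(cluster), fold))
    return folds


# Toy clusters of sequence identifiers (illustrative only).
clusters = [["p1", "p2", "p3"], ["p4"], ["p5", "p6"], ["p7"], ["p8"]]
folds = assign_clusters_to_folds(clusters, n_folds=2)
print([len(fold) for fold in folds])
```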

The final weights were derived from the arithmetic means of the weights for individual amino acid residues obtained from cross-validation (Supplementary Table S4). We observed over a 20% change in the weights for cysteine (C) and histidine (H) residues (Fig 2C and Supplementary Table S4). These results are in agreement with the contributions of cysteine and histidine residues as shown in Supplementary Fig S2B. We call the solubility score of a protein sequence calculated using the final weights the Solubility-Weighted Index (SWI).

(A) Flow chart shows an iterative refinement of the most recently published set of normalised B-factors for solubility prediction (Smith et al. 2003). The solubility score of a protein sequence was calculated using a sequence composition scoring approach (Equation 1, using optimised weights, W, instead of normalised B-factors, B). These scores were used to compute the AUC scores for training and test datasets. (B) Training and test performance of solubility prediction using optimised weights for 20 amino acid residues in a 10-fold cross-validation (mean AUC ± standard deviation). Related data and figures are available as Supplementary Table S3 and Supplementary Fig S4 and S5. (C) Comparison between the 20 initial and final weights for amino acid residues. The final weights are derived from the arithmetic mean of the optimised weights from cross-validation. These weights are used to calculate SWI, the solubility score of a protein sequence, in the subsequent analyses. Filled circles, which represent amino acid residues, are colored by hydrophobicity (Kyte and Doolittle 1982). Solid black circles denote the aromatic amino acid residues phenylalanine (F), tyrosine (Y) and tryptophan (W). The dotted diagonal line represents no change in weight. See also Supplementary Table S4 and Fig S4. AUC, Area Under the ROC Curve; ROC, Receiver Operating Characteristic; W, arithmetic mean of the weights of an amino acid residue optimised from 1,000 bootstrap samples in a cross-validation step.

To validate the cross-validation results, we used a dataset independent of the PSI:Biology data known as eSOL (Niwa et al. 2009). This dataset consists of the solubility percentages of E. coli proteins determined using an E. coli cell-free system (N = 3,198). Our solubility scoring using the final weights showed a significantly improved correlation with E. coli protein solubility over the initial weights (Smith et al.'s normalised B-factors) [Spearman's rho of 0.50 (P = 9.46 × 10^-206) versus 0.40 (P = 4.57 × 10^-120)]. We repeated the correlation analysis after removing extra amino acid residues, including His-tags, from the eSOL sequences (MRGSHHHHHHTDPALRA and GLCGR at the N- and C-termini, respectively). This artificial dataset was created based on the assumption that His-tags have little effect on solubility. We observed a slight decrease in correlation for this artificial dataset (Spearman's rho = 0.47, P = 3.67 × 10^-176), which may be due to the effects of the His-tag on solubility and/or the limitation(s) of our approach, which may overfit to His-tag fusion proteins.

We performed Spearman's correlation analysis for both the PSI:Biology and eSOL datasets. SWI shows the strongest correlation with solubility compared with the standard and 9,920 miscellaneous protein sequence properties (Fig 3 and Supplementary Fig S2, respectively). SWI also strongly correlates with flexibility, suggesting that SWI is also a good proxy for global structural flexibility. We asked whether protein solubility can be predicted by surface amino acid residues. To address this question, we examined a previously published dataset for the protein surface 'stickiness' of 397 E. coli proteins (Levy, De, and Teichmann 2012). This dataset has annotations for surface residues based on previously solved protein crystal structures. We observed little correlation between the protein surface 'stickiness' and the solubility data from eSOL (Spearman's rho = 0.05, P = 0.34, N = 348; Supplementary Fig S6A). Next, we reasoned that if amino acid composition scoring using surface residues is sufficient, optimising only the weights of surface residues should achieve similar or better results than SWI. As above, we iteratively refined the weights of surface residues using the Nelder-Mead optimisation algorithm. The method was initialised with Smith et al.'s normalised B-factors, and a maximised correlation coefficient was the target. However, a low correlation was obtained upon convergence (Spearman's rho = 0.18, P = 7.20 × 10^-4; Supplementary Fig S6B). In contrast, the SWI of the full-length sequences has a much stronger correlation with solubility (Spearman's rho = 0.46, P = 2.97 × 10^-19; Supplementary Fig S6C). These results suggest that the full length of a sequence contributes to protein solubility, not just the surface residues, in agreement with the view that solubility is modulated by cotranslational folding (Natan et al. 2018).

To understand the properties of soluble and insoluble proteins, we determined the enrichment of amino acid residues in the PSI:Biology targets relative to the eSOL sequences (see Methods). We observed that the PSI:Biology targets are enriched in the charged residues lysine (K), glutamate (E) and aspartate (D), and depleted in the aromatic residue tryptophan (W), albeit to a lesser extent for insoluble proteins (Supplementary Fig S7A). As expected, cysteine residues (C) are enriched in the PSI:Biology insoluble proteins, supporting previous findings that cysteine residues contribute to poor solubility in the E. coli expression system (Diaz et al. 2010; Wilkinson and Harrison 1991).

In addition, we compared the SWI of random sequences with the PSI:Biology and eSOL sequences. We included an analysis of random sequences to confirm whether SWI can distinguish between biological and random sequences. We found that the SWI scores of soluble proteins are higher than those of insoluble proteins (Supplementary Fig S7B) , and that true biological sequences also tend to have higher SWI scores than random sequences, highlighting a potential evolutionary selection for solubility.

To confirm the usefulness of SWI in solubility prediction, we compared it with the existing tools Protein-Sol (Hebditch et al. 2017), CamSol v2.1 (Sormanni, Aprile, and Vendruscolo 2015; Sormanni et al. 2017), PaRSnIP, DeepSol v0.3 (Khurana et al. 2018), the Wilkinson-Harrison model (Davis et al. 1999; Harrison 2000; Wilkinson and Harrison 1991), and ccSOL omics (Agostini et al. 2014). We did not include the specialised tools that model protein structural information such as surface geometry, surface charges and solvent accessibility because these tools require prior knowledge of protein tertiary structure. For example, Aggrescan3D and SOLart accept only PDB files, which can be downloaded from the Protein Data Bank or produced using a homology modeling program (Kuriata et al. 2019; Hou et al. 2019). SWI outperforms all other tools except Protein-Sol in predicting E. coli protein solubility (Table 1, Fig 4A). Our SWI C program is also the fastest solubility prediction algorithm (Table 1, Fig 4B and Supplementary Table S7).

Prediction accuracy of solubility prediction tools using the above cross-validation sets (Fig 2A) . For SWI, the test AUC scores were calculated from a 10-fold cross-validation (i.e., a boxplot representation of Fig 2B) 

Protein structural flexibility has been associated with conformational variations, functions, thermal stability, ligand binding and disordered regions (Mauno Vihinen 1987; Teague 2003; Ma 2005; Radivojac 2004; Schlessinger and Rost 2005; Yuan, Bailey, and Teasdale 2005; Yin, Li, and Li 2011). However, the use of flexibility in solubility prediction has been overlooked, although the relationship between the two has previously been noted (Tsumoto et al. 2003). In this study, we have shown that flexibility strongly correlates with solubility (Fig 3). Based on the normalised B-factors used to compute flexibility, we have derived a new set of position- and length-independent weights to score the solubility of a given protein sequence (i.e., a sequence composition based score). We call this protein solubility score SWI.

Upon further inspection, we observe some interesting properties in SWI. SWI anti-correlates with helix propensity, GRAVY, aromaticity and isoelectric point (Fig 2C and 3), suggesting that SWI incorporates the key propensities affecting solubility. Amino acid residues that are less aromatic or more hydrophilic are known to improve protein solubility (Trevino, Martin Scholtz, and Nick Pace 2007; Niwa et al. 2009; Kramer et al. 2012; Warwicker, Charonis, and Curtis 2014; Han et al. 2019; Wilkinson and Harrison 1991). Consistent with previous studies, the charged residues aspartate (D), glutamate (E) and lysine (K) are associated with high solubility, whereas the aromatic residues phenylalanine (F), tryptophan (W) and tyrosine (Y) are associated with low solubility (Fig 2C and Supplementary Fig S7A). Cysteine (C) has the lowest weight, probably because disulfide bonds cannot be properly formed in the E. coli expression hosts (Stewart, Aslund, and Beckwith 1998; Rosano and Ceccarelli 2014; Jia and Jeon 2016; Aslund and Beckwith 1999). The weights would likely be different if the solubility analysis were done using the reductase-deficient E. coli Origami host strains or eukaryotic hosts.

Higher helix propensity has been reported to increase solubility (Idicula-Thomas and Balaji 2005; Huang et al. 2012). However, our analysis has shown that helical and turn propensities anti-correlate with solubility, whereas sheet propensity lacks correlation with solubility, suggesting that disordered regions may tend to be more soluble (Fig 3). In accordance with these observations, SWI has stronger negative correlations with helix and turn propensities. These findings also suggest that protein solubility can be largely explained by overall amino acid composition, not just the surface amino acid residues. This idea aligns with our understanding that protein solubility and folding are closely linked, and that folding occurs cotranslationally, a complex process that is driven by various intrinsic and extrinsic factors (Wilkinson and Harrison 1991; Chiti et al. 2003; Tartaglia et al. 2004; Diaz et al. 2010). However, it is unclear why sheet propensity contributes little to solubility, because β-sheets have been shown to link closely with protein aggregation (Idicula-Thomas and Balaji 2005).

We conclude that SWI is a well-balanced index that is derived from a simple sequence composition scoring method. To demonstrate the usefulness of SWI, we developed a web server called SoDoPE (Soluble Domain for Protein Expression; https://tisigner.com/sodope ). SoDoPE calculates the probability of solubility of a user-selected region based on SWI; the region can be either a full-length or a partial sequence (see Methods and Supplementary Table S8). This implementation is based on our observation that some protein domains tend to be more soluble than others. To demonstrate this point, we have analysed three commercial monoclonal antibodies and the severe acute respiratory syndrome coronavirus proteomes (SARS-CoV and SARS-CoV-2) (Wang et al. 2009; Marra et al. 2003; F. Wu et al. 2020) (Supplementary Fig S8 and S9). These soluble domains may enhance the solubility of the protein as a whole. SoDoPE also provides options for solubility prediction in the presence of solubility fusion tags. Similarly, solubility tags may act as soluble 'protein domains' that can outweigh the aggregation propensity of insoluble proteins. However, some soluble fusion proteins may become insoluble after proteolytic cleavage of solubility tags (Lebendiker and Danieli 2014). In addition, SoDoPE is integrated with TIsigner, a gene optimisation web service for protein expression. This pipeline provides a holistic approach to improve the outcome of recombinant protein expression.

The standard protein sequence properties were calculated using the Bio.SeqUtils.ProtParam module of Biopython v1.73 (Cock et al. 2009 ) . All miscellaneous protein sequence properties were computed using the R package protr v1.6-2 (N. Xiao et al. 2015) .

We used the standard and miscellaneous protein sequence properties to predict the solubility of the PSI:Biology and eSOL targets (N=12,216 and 3,198 , respectively) (Seiler et al. 2014; Niwa et al. 2009 ) . For method comparison, we chose the protein solubility prediction tools that are scalable (Table 1) . Default configurations were used for running the command line tools.

To benchmark the wall time of solubility prediction tools, we selected 10 sequences that span a large range of lengths from the PSI:Biology and eSOL datasets (from 36 to 2389 residues). All the tools were run and timed using a single process without using GPUs on a high performance computer [ /usr/bin/time -f '%E' <command> ; CentOS Linux 7 (Core) operating system, 72 cores in 2× Broadwell nodes (E5-2695v4, 2.1 GHz, dual socket 18 cores per socket), 528 GiB memory]. Single sequence fasta files were used as input files.

To improve protein solubility prediction, we optimised the most recently published set of normalised B-factors using the PSI:Biology dataset (Smith et al. 2003) (Fig 2). To avoid including homologous sequences in the test and training sets, we clustered the PSI:Biology targets using USEARCH v11.0.667, 32-bit (Edgar 2010). His-tag sequences were removed from all sequences before clustering to avoid false cluster inclusions. We obtained 5,050 clusters using the parameters: -cluster_fast <input_file> -id 0.4 -msaout <output_file> -threads 4 . These clusters were manually divided into 10 subsets with approximately 1,200 sequences per subset. The subsequent steps were done with His-tag sequences. We used Smith et al.'s normalised B-factors as the initial weights to maximise AUC using these 10 subsets with a 10-fold cross-validation. Since AUC is non-differentiable, we used the Nelder-Mead optimisation method (implemented in SciPy v1.2.0), which is a derivative-free, heuristic, simplex-based optimisation (Oliphant 2007; Millman and Aivazis 2011; Nelder and Mead 1965). For each step in cross-validation, we used 1,000 bootstrap resamplings, each containing 1,000 soluble and 1,000 insoluble proteins. Optimisation was carried out for each sample, giving 1,000 sets of weights. The arithmetic mean of these weights was used to determine the training and test AUC for the cross-validation step (Fig 2A).
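The optimisation loop described above (balanced bootstrap samples, Nelder-Mead maximisation of AUC, then averaging of the fitted weights) might look roughly like the following sketch. It is not the authors' code: the helper names are hypothetical, the sequences are synthetic, the initial weights are a placeholder vector of ones rather than Smith et al.'s normalised B-factors, and the numbers of bootstrap samples and iterations are reduced far below the 1,000 used in the study so that the example runs quickly.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_matrix(sequences):
    """Rows are sequences; columns are fractional residue counts."""
    return np.array([[s.count(a) / len(s) for a in AMINO_ACIDS] for s in sequences])


def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney U) identity; labels are 0/1."""
    ranks = rankdata(scores)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def optimise_weights(comp, labels, init, n_boot=5, n_per_class=50, seed=0):
    """Average the Nelder-Mead solutions over balanced bootstrap samples."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(labels == 1), np.flatnonzero(labels == 0)
    fitted = []
    for _ in range(n_boot):
        # Sample with replacement, the same number from each class.
        idx = np.concatenate([rng.choice(pos, n_per_class), rng.choice(neg, n_per_class)])
        # comp @ w is the mean weight per sequence; minimise negative AUC.
        result = minimize(lambda w: -auc(comp[idx] @ w, labels[idx]), init,
                          method="Nelder-Mead", options={"maxiter": 200})
        fitted.append(result.x)
    return np.mean(fitted, axis=0)


# Synthetic data for illustration only: "soluble" sequences are enriched
# in K/E/D and "insoluble" ones in W/F/C.
rng = np.random.default_rng(1)


def random_seq(alphabet, n=60):
    return "".join(rng.choice(list(alphabet), n))


soluble = [random_seq(AMINO_ACIDS + "KKEEDD") for _ in range(80)]
insoluble = [random_seq(AMINO_ACIDS + "WWFFCC") for _ in range(80)]
comp = composition_matrix(soluble + insoluble)
labels = np.array([1] * len(soluble) + [0] * len(insoluble))
init = np.ones(len(AMINO_ACIDS))  # placeholder, not published B-factors
weights = optimise_weights(comp, labels, init)
print("training AUC:", round(auc(comp @ weights, labels), 2))
```

A derivative-free method is used here for the reason given in the text: AUC depends only on the ranking of scores, so it is piecewise constant in the weights and has no useful gradient.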

To examine the enrichment of amino acid residues in soluble and insoluble proteins, we computed the bit score for each amino acid residue in the PSI:Biology soluble and insoluble groups (Supplementary Fig S7A). We normalised the count of each residue (x) in each group by the total number of residues in that group. We used the normalised counts of amino acid residues in the eSOL E. coli sequences as the background. The bit score of residue x for the soluble or insoluble group is then given by the following equation:

$$\text{bit score}_i(x) = \log_2 \frac{f_i(x)}{f_{eSOL}(x)} \qquad (2)$$

where $f_i(x)$ is the normalised count of residue $x$ in the PSI:Biology soluble or insoluble group $i$, and $f_{eSOL}(x)$ is the normalised count in the eSOL sequences.

For a control, random protein sequences were generated by incrementing the sequence length from 50 to 6,000 residues with a step size of 50 residues. A hundred random sequences were generated for each length, giving a total of 12,000 unique random sequences.
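The enrichment calculation and the random-sequence control can be sketched as follows. The sketch assumes that the bit score is the base-2 logarithm of the ratio of normalised counts, and that random residues are drawn uniformly from the 20 standard amino acids; the text does not state the residue distribution, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalised_counts(sequences):
    """Residue counts divided by the total number of standard residues."""
    counts = Counter("".join(sequences))
    total = sum(counts[a] for a in AMINO_ACIDS)
    return {a: counts[a] / total for a in AMINO_ACIDS}


def bit_scores(group, background):
    """log2 enrichment of each residue in `group` relative to `background`.

    Residues absent from either set are omitted to avoid log(0).
    """
    f_group, f_bg = normalised_counts(group), normalised_counts(background)
    return {a: math.log2(f_group[a] / f_bg[a])
            for a in AMINO_ACIDS if f_group[a] > 0 and f_bg[a] > 0}


def random_sequences(min_len=50, max_len=6000, step=50, per_length=100, seed=0):
    """Yield `per_length` uniform random sequences for each length."""
    rng = random.Random(seed)
    for length in range(min_len, max_len + 1, step):
        for _ in range(per_length):
            yield "".join(rng.choices(AMINO_ACIDS, k=length))


# Toy example: K is enriched and A is depleted relative to the background.
scores = bit_scores(["KKKA"], ["KKAA"])
print(scores)

# 120 lengths (50, 100, ..., 6,000) times 100 sequences gives 12,000.
print(len(range(50, 6001, 50)) * 100)
```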

To estimate the probability of solubility using SWI, we fitted the following logistic regression to the PSI:Biology dataset:

$$\text{Probability of solubility} = \frac{1}{1 + e^{-(ax + b)}} \qquad (3)$$

where $x$ is the SWI of a given protein sequence, $a = 81.05812$ and $b = -62.7775$. The P-value of the log-likelihood ratio test was less than machine precision. Equation 3 can be used to predict the solubility of a protein sequence given that the protein is successfully expressed in E. coli (Supplementary Table S8).
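As a worked illustration, the fitted model can be evaluated with a few lines of Python. The sketch assumes the standard logistic form with coefficients a = 81.05812 and b = −62.7775; the function name and the example SWI values are hypothetical.

```python
import math

A = 81.05812   # fitted slope (assumed to multiply SWI)
B = -62.7775   # fitted intercept


def probability_of_solubility(swi, a=A, b=B):
    """Logistic transform of SWI into a probability of solubility."""
    return 1.0 / (1.0 + math.exp(-(a * swi + b)))


# Under these coefficients the probability crosses 0.5 at SWI = -b / a.
midpoint = -B / A
print(round(midpoint, 4), probability_of_solubility(midpoint))

# Illustrative SWI values only, not taken from any dataset.
for swi in (0.75, 0.78, 0.80):
    print(swi, round(probability_of_solubility(swi), 3))
```

Because the slope is large, small differences in SWI translate into large differences in predicted probability near the midpoint.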

On this basis, we developed a solubility prediction web service called the Soluble Domain for Protein Expression (SoDoPE). Our web server accepts either a nucleotide or an amino acid sequence. Upon sequence submission, a query is sent to the HMMER web server to annotate protein domains ( https://www.ebi.ac.uk/Tools/hmmer/ ) (Potter et al. 2018). Once the protein domains are identified, users can choose a domain or any custom region (including the full-length sequence) to examine the probability of solubility, flexibility and GRAVY. This functionality enables protein biochemists to plan their experiments and opt for the domains or regions with a high probability of solubility. Furthermore, we implemented a simulated annealing algorithm that maximises the probability of solubility for a given region by generating a list of regions with extended boundaries. Users can also predict the improvement in solubility by selecting a commonly used solubility tag or a custom tag.
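The idea of extending region boundaries by simulated annealing can be sketched as below. This is a generic illustration under stated assumptions, not the SoDoPE implementation: the function `anneal_region`, its parameters (maximum extension, step count, linear cooling schedule) and the toy scoring function are all hypothetical, and the sketch returns only the single best region rather than a list of candidates.

```python
import math
import random


def anneal_region(sequence, start, end, score, max_extend=50,
                  steps=2000, t0=0.05, seed=0):
    """Search for boundaries [s, e) that contain [start, end) and maximise score."""
    rng = random.Random(seed)
    lo = max(0, start - max_extend)
    hi = min(len(sequence), end + max_extend)
    s, e = start, end
    current = best = score(sequence[s:e])
    best_region = (s, e)
    for step in range(steps):
        temperature = t0 * (1 - step / steps) + 1e-9  # linear cooling
        # Propose moving one boundary by one residue; the proposal never
        # shrinks into the original region or leaves the allowed window.
        if rng.random() < 0.5:
            s_new, e_new = min(start, max(lo, s + rng.choice((-1, 1)))), e
        else:
            s_new, e_new = s, max(end, min(hi, e + rng.choice((-1, 1))))
        candidate = score(sequence[s_new:e_new])
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if candidate >= current or rng.random() < math.exp((candidate - current) / temperature):
            s, e, current = s_new, e_new, candidate
            if current > best:
                best, best_region = current, (s, e)
    return best_region, best


def toy_score(region):
    """Stand-in objective: fraction of charged residues (not SWI)."""
    return sum(c in "KED" for c in region) / len(region)


sequence = "WWFFCC" * 5 + "KKEEDD" * 5 + "AAGGSS" * 5 + "KEKEKE" * 5 + "WWFFCC" * 5
start, end = 60, 90  # the AAGGSS stretch
(best_start, best_end), best_value = anneal_region(sequence, start, end, toy_score)
print(best_start, best_end, round(best_value, 3))
```

In practice the objective would be the probability of solubility computed from SWI for each candidate region.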

We linked SoDoPE with TIsigner, which is our existing web server for maximising the accessibility of translation initiation sites (Bhandari, Lim, and Gardner 2019) . This pipeline allows users to predict and optimise both protein expression and solubility for a gene of interest. The SoDoPE web server is freely available at https://tisigner.com/sodope . 

The Jupyter notebook of our analysis can be found at https://github.com/Gardner-BinfLab/SoDoPE_paper_2020 . The source code for our solubility prediction server (SoDoPE) can be found at https://github.com/Gardner-BinfLab/TISIGNER-ReactJS .

Robotic Cloning and Protein Production Platform of the Northeast Structural Genomics Consortium

ccSOL Omics: A Webserver for Solubility Prediction of Endogenous and Heterologous Expression in Escherichia Coli

The Thioredoxin Superfamily: Redundancy, Specificity, and Gray-Area Genomics

Highly Accessible Translation Initiation Sites Are Predictive of Successful Heterologous Protein Expression

Positional Flexibilities of Amino Acid Residues in Globular Proteins

Learning to Predict Expression Efficacy of Vectors in Recombinant Protein Production

TargetDB: A Target Registration Database for Structural Genomics Projects

Rationalization of the Effects of Mutations on Peptide and Protein Aggregation Rates

Biopython: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics

Fusion Tags for Protein Solubility, Purification and Immunogenicity in Escherichia Coli: The Novel Fh8 System

Protein Flexibility in the Light of Structural Alphabets

New Fusion Protein Systems Designed to Give Soluble Expression in Escherichia Coli

Prediction of Protein Solubility in Escherichia Coli Using Logistic Regression

Search and Clustering Orders of Magnitude Faster than BLAST

Enhancement of Soluble Protein Expression through the Use of Fusion Tags

Prediction of Peptide and Protein Propensity for Amyloid Formation

A Review of Machine Learning Methods to Predict the Solubility of Overexpressed Recombinant Proteins in Escherichia Coli

Improve Protein Solubility and Activity Based on Machine Learning Models

Expression of Soluble Heterologous Proteins via Fusion with NusA Protein

Protein-Sol: A Web Tool for Predicting Protein Solubility from Sequence

Machine Learning Applied to Enzyme Turnover Numbers Reveals Protein Structural Correlates and Improves Metabolic Models

ESPRESSO: A System for Estimating Protein Expression and Solubility in Protein Expression Systems

Computational Analysis of the Amino Acid Interactions That Promote or Decrease Protein Solubility

SOLart: A Structure-Based Method to Predict Protein Solubility and Aggregation

Prediction and Analysis of Protein Solubility Using a Novel Scoring Card Method with Dipeptide Composition

Understanding the Relationship between the Primary Structure of Proteins and Its Propensity to Be Soluble on Overexpression in Escherichia Coli

High-Throughput Recombinant Protein Expression in Escherichia Coli: Current Status and Future Perspectives

Prediction of Chain Flexibility in Proteins

DeepSol: A Deep Learning Framework for Sequence-Based Protein Solubility Prediction

Toward a Molecular Understanding of Protein Solubility: Increased Negative Surface Charge Correlates with Increased Solubility

Aggrescan3D (A3D) 2.0: Prediction and Engineering of Protein Solubility

A Simple Method for Displaying the Hydropathic Character of a Protein

Production of Prone-to-Aggregate Proteins

Cellular Crowding Imposes Global Constraints on the Chemistry and Evolution of Proteomes

Usefulness and Limitations of Normal Mode Analysis in Modeling Dynamics of Biomolecular Complexes

The Genome Sequence of the SARS-Associated Coronavirus

Data Structures for Statistical Computing in Python

Python for Scientists and Engineers

Cotranslational Protein Assembly Imposes Evolutionary Constraints on Homomeric Proteins

A Simplex Method for Function Minimization

Bimodal Protein Solubility Distribution Revealed by an Aggregation Analysis of the Entire Ensemble of Escherichia Coli Proteins

Python for Scientific Computing

Scikit-Learn: Machine Learning in Python

Protein Flexibility and Intrinsic Disorder

Protein Engineering, Design and Selection

PaRSnIP: Sequence-Based Protein Solubility Prediction Using Gradient Boosting Machine

Recombinant Protein Expression in Escherichia Coli: Advances and Challenges

Protein Flexibility and Rigidity Predicted from Sequence

Statsmodels: Econometric and Statistical Modeling with Python

DNASU Plasmid and PSI:Biology-Materials Repositories: Resources to Accelerate Biological Research

Improved Amino Acid Flexibility Parameters

Rapid and Accurate in Silico Solubility Screening of a Monoclonal Antibody Library

The CamSol Method of Rational Design of Protein Mutants with Enhanced Solubility

Disulfide Bond Formation in the Escherichia Coli Cytoplasm: An in Vivo Role Reversal for the Thioredoxins

The Role of Aromaticity, Exposed Surface, and Dipole Moment in Determining Protein Aggregation Rates

Implications of Protein Flexibility for Drug Discovery

Amino Acid Contribution to Protein Solubility: Asp, Glu, and Ser Contribute More Favorably than the Other Hydrophilic Amino Acids in RNase Sa

Practical Considerations in Refolding Proteins from Inclusion Bodies

Relationship of Protein Flexibility to Thermostability

Accuracy of Protein Flexibility Predictions

Genetic Screens and Directed Evolution for Protein Solubility

The NumPy Array: A Structure for Efficient Numerical Computation

Potential Aggregation Prone Regions in Biotherapeutics: A Survey of Commercial Monoclonal Antibodies

Lysine and Arginine Content of Proteins: Computational Analysis Suggests a New Tool for Solubility Design

Predicting the Solubility of Recombinant Proteins in Escherichia Coli

Complete Genome Characterisation of a Novel Coronavirus Associated with Severe Human Respiratory Disease in Wuhan, China

Proceedings of the National Academy of Sciences of the United States of America

protr/ProtrWeb: R Package and Web Server for Generating Various Numerical Representation Schemes of Protein Sequences

The High-Throughput Protein Sample Production Platform of the Northeast Structural Genomics Consortium

Machine-Learning-Guided Directed Evolution for Protein Engineering

On the Relation between Residue Flexibility and Residue Interactions in Proteins

Prediction of Protein B-Factor Profiles

We evaluated nine standard and 9,920 miscellaneous protein sequence properties using Biopython's ProtParam module and the 'protr' R package, respectively (Cock et al. 2009; N. Xiao et al. 2015). For example, the standard properties include the Grand Average of Hydropathy (GRAVY), secondary structure propensities and protein structural flexibility, whereas the miscellaneous properties include amino acid composition and autocorrelation.

We thank New Zealand eScience Infrastructure for providing a high performance computing platform. We are grateful to Harry Biggs for proofreading our manuscript and providing feedback for the web server. This work was supported by the Ministry of Business, Innovation and Employment, New Zealand (MBIE grant: UOOX1709).